Resolving the issue where pipeline_parallel=2 crashes on V100

Resolving the issue where pipeline_parallel=2 crashes on V100

5 min read

 
Tech Blog
GPUStack
vLLM
Tesla V100
Multi-Node

GPU Cluster Operations Memo

Solving the pipeline_parallel=2 crash on V100 and
achieving stable operation with tensor_parallel=4 + 64K context #

While running Qwen2.5-14B-Instruct on multi-node (4× Tesla V100) with GPUStack v2.1.1 + vLLM 0.17.1, the engine crashed with an AttributeError in _TorchTensorAcceleratorChannel every time a chat message was sent. The root cause was incompatibility between pipeline_parallel and V100. By switching to tensor_parallel=4 / pipeline_parallel=1, we achieved a complete fix and further extended context length to 64K. This is a record of that process.

AI-GPUST01: GPU-Worker-A / Tesla V100×2  |  AI-GPUST02: GPU-Worker-B / Tesla V100×2  |  GPUStack v2.1.1 / vLLM 0.17.1 / Ray 2.54.0 / CUDA 12.9

Symptoms Observed #

The model started normally, but the engine invariably crashed several minutes to tens of minutes after sending a chat message. The following errors were repeatedly logged.

AttributeError: '_TorchTensorAcceleratorChannel' object
has no attribute '_accelerator_group'.
Did you mean: '_accelerator_group_id'?

ray.exceptions.ActorUnavailableError: The actor is temporarily unavailable:
  RpcError: RPC error: Socket closed rpc_code: 14.

This error is a Ray 2.54.0 bug that occurs in cleanup (teardown) after a crash, not the direct cause. The true root cause lies upstream: initialization failure of the GPU accelerator communication channel between pipeline_parallel stages.

Root Cause: V100 Does Not Support pipeline_parallel P2P GPU Transfers #

vLLM’s pipeline_parallel (pipeline parallelism) uses _TorchTensorAcceleratorChannel (direct P2P transfer between GPUs) for activation transfer between stages. This requires NVLink or high-speed GPU P2P, which is not available on Tesla V100-PCIE (compute capability 7.0).

pipeline_parallel=2 Problem #

  • Uses P2P GPU channel for inter-stage activation transfer
  • V100 lacks NVLink/P2P support, so initialization terminates incompletely
  • Crashes with AttributeError during inference or cleanup
  • Becomes easy to reproduce when combined with Ray 2.54.0 bug

tensor_parallel=4 Operation #

  • Model weights horizontally split across 4 GPUs (attention head basis)
  • GPU-to-GPU communication uses only NCCL AllReduce
  • NCCL operates normally on V100 (nccl==2.27.5 confirmed)
  • Completely avoids the bug by not using P2P channels at all
Note: This issue is specific to V100 (compute 7.0). Ampere and later (compute 8.0+) do not experience this problem because P2P works with pipeline_parallel.

Solution: Complete Fix with Configuration Changes Only #

Changing the model configuration in the GPUStack UI as follows solved the problem completely. No infrastructure changes required.

Before Change (Crashing Configuration) #

--tensor-parallel-size 2
--pipeline-parallel-size 2
--distributed-executor-backend ray
--max-model-len 32768
--generation-config vllm

After Change (Stable Configuration) #

--tensor-parallel-size 4
--pipeline-parallel-size 1
--distributed-executor-backend ray
--max-model-len 65536
--generation-config vllm
--enable-auto-tool-choice
--tool-call-parser hermes

Additionally, RAY_CGRAPH_get_timeout is a configuration for inter-stage communication timeout in pipeline_parallel, so it becomes unnecessary when pipeline_parallel=1. It can be safely removed from environment variables.

Comparison of Effects from Changes #

VRAM Usage #

Model distributed evenly across 4 GPUs, significantly reducing consumption per GPU.

# Before change (handled by 2 GPUs)
Model loading took 15-17 GiB / GPU

# After change (handled by 4 GPUs)
Model loading took 6.95 GiB / GPU

KV Cache #

VRAM surplus converts directly into cache.

# Before change
KV cache: Limited (VRAM constrained)

# After change
Available KV cache memory: 20.97 GiB
GPU KV cache size: 458,

Stability Change #

Before the change, crashes occurred with every chat submission. After the change, confirmed that multiple long-text requests submitted simultaneously do not cause crashes. Generation speed reduction during concurrent processing is normal queuing behavior and not problematic.

Extending Context Length to 64K #

With tensor_parallel=4 providing VRAM headroom, we attempted to extend context length.

About Qwen2.5-14B Context Length #

The max_position_embeddings in config.json is 32768, but this is the default setting value. By enabling YaRN scaling (rope_scaling), it can be extended to 128K. However, vLLM rejects values exceeding 32768 by default, so explicit permission via environment variable is required.

Add to Environment Variables:

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

Backend Parameters:

--max-model-len 65536

Startup verification:

INFO: Using max model len 65536
INFO: Available KV cache memory: 20.97 GiB
INFO: GPU KV cache size: 458,
INFO: Maximum concurrency for 32, per request: 13.98x
INFO: Starting vLLM API server ← Normal startup
Note: RoPE positional encoding beyond exceeds the architectural design range. Short to medium contexts operate without issue, but when handling ultra-long text beyond 32,, there is a risk of inference quality degradation (nan occurrence). Use with understanding of this limitation.

Final Stable Operation Configuration Summary #

Backend Parameters #

--distributed-executor-backend ray
--tensor-parallel-size 4
--pipeline-parallel-size 1
--max-model-len 65536
--generation-config vllm
--enable-auto-tool-choice
--tool-call-parser hermes

Environment Variables #

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

gpustack-worker Startup Options (Additional) #

--ipc=host
--shm-size=64g

Operation Verification Results #

Qwen2.5-14B-Instruct (float16) confirmed to operate stably on 4× Tesla V100-PCIE-32GB (2-node configuration) with context length 65, and high-load concurrent requests. VRAM usage of 6.95GiB per GPU and KV cache of 20.97GiB secured.

Checklist #

  • In V100 environments, use --pipeline-parallel-size 1 (pipeline_parallel uses P2P GPU communication, which V100 does not support)
  • When using 4 GPUs in multi-node configuration, unify to --tensor-parallel-size 4
  • RAY_CGRAPH_get_timeout is unnecessary with pipeline_parallel=1 and can be removed
  • For 64K context, add VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 to Environment Variables
  • Start the gpustack-worker container with --ipc=host --shm-size=64g options
  • Perform load testing with long text and concurrent requests before production deployment to verify stability

Related Articles #

This article is a troubleshooting record from a production environment. Internal information such as IP and tokens has been abstracted. Behavior may vary depending on GPU environment, driver, and vLLM version.

Updated on 2026年6月9日

What are your feelings

  • Happy
  • Normal
  • Sad