GPU Cluster Operations Memo
Solving the pipeline_parallel=2 crash on V100 and achieving stable operation with tensor_parallel=4 + 64K context #
While running Qwen2.5-14B-Instruct on a multi-node cluster (4× Tesla V100) with GPUStack v2.1.1 + vLLM 0.17.1, the engine crashed with an AttributeError in _TorchTensorAcceleratorChannel every time a chat message was sent. The root cause was an incompatibility between pipeline_parallel and the V100. Switching to tensor_parallel=4 / pipeline_parallel=1 fixed the crash completely and also let us extend the context length to 64K. This is a record of that process.
Symptoms Observed #
The model started normally, but the engine invariably crashed several minutes to tens of minutes after sending a chat message. The following errors were repeatedly logged.
AttributeError: '_TorchTensorAcceleratorChannel' object
has no attribute '_accelerator_group'.
Did you mean: '_accelerator_group_id'?
ray.exceptions.ActorUnavailableError: The actor is temporarily unavailable:
RpcError: RPC error: Socket closed rpc_code: 14.
This error is a Ray 2.54.0 bug that occurs in cleanup (teardown) after a crash, not the direct cause. The true root cause lies upstream: initialization failure of the GPU accelerator communication channel between pipeline_parallel stages.
Root Cause: V100 Does Not Support pipeline_parallel P2P GPU Transfers #
vLLM’s pipeline_parallel (pipeline parallelism) uses _TorchTensorAcceleratorChannel (direct P2P transfer between GPUs) for activation transfer between stages. This requires NVLink or high-speed GPU P2P, which is not available on Tesla V100-PCIE (compute capability 7.0).
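As a rough way to check the P2P claim on a given node, the sketch below builds a peer-access matrix. It assumes PyTorch is installed on the worker and uses `torch.cuda.can_device_access_peer`; the helper names `peer_access_matrix` and `all_peers_reachable` are ours, not part of vLLM or Ray. It only sees the GPUs local to one node, and a negative result is a hint rather than proof that this is what breaks the channel.

```python
from typing import Callable, List, Optional


def peer_access_matrix(
    num_gpus: int,
    can_access: Optional[Callable[[int, int], bool]] = None,
) -> List[List[bool]]:
    """Return matrix[i][j] = True when GPU i reports direct peer access to GPU j."""
    if can_access is None:
        # Imported lazily so the helper can be exercised without PyTorch installed.
        import torch

        can_access = torch.cuda.can_device_access_peer
    return [
        [i != j and bool(can_access(i, j)) for j in range(num_gpus)]
        for i in range(num_gpus)
    ]


def all_peers_reachable(matrix: List[List[bool]]) -> bool:
    """True only if every ordered pair of distinct GPUs reports peer access."""
    n = len(matrix)
    return all(matrix[i][j] for i in range(n) for j in range(n) if i != j)


# Usage on a GPU node (hypothetical; requires PyTorch with CUDA):
#   import torch
#   matrix = peer_access_matrix(torch.cuda.device_count())
#   print(matrix, all_peers_reachable(matrix))
```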
pipeline_parallel=2 Problem #
- Uses P2P GPU channel for inter-stage activation transfer
- V100 lacks NVLink/P2P support, so initialization terminates incompletely
- Crashes with AttributeError during inference or cleanup
- Becomes easy to reproduce when combined with the Ray 2.54.0 bug
tensor_parallel=4 Operation #
- Model weights horizontally split across 4 GPUs (attention head basis)
- GPU-to-GPU communication uses only NCCL AllReduce
- NCCL operates normally on V100 (nccl==2.27.5 confirmed)
- Completely avoids the bug by not using P2P channels at all
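The attention-head split mentioned above can be sanity-checked with a few lines. This is a minimal sketch: the head counts (40 query heads, 8 KV heads) are our assumption for Qwen2.5-14B and should be confirmed against the model's config.json, and `heads_per_gpu` is a hypothetical helper, not a vLLM function.

```python
def heads_per_gpu(num_heads: int, tensor_parallel_size: int) -> int:
    """Number of attention heads each GPU owns under tensor parallelism."""
    if tensor_parallel_size <= 0:
        raise ValueError("tensor_parallel_size must be positive")
    if num_heads % tensor_parallel_size != 0:
        # An uneven split means this tensor-parallel size is not usable.
        raise ValueError(
            f"{num_heads} heads cannot be split evenly across "
            f"{tensor_parallel_size} GPUs"
        )
    return num_heads // tensor_parallel_size


# Assumed values for Qwen2.5-14B (verify in config.json).
NUM_ATTENTION_HEADS = 40
NUM_KV_HEADS = 8

query_heads_local = heads_per_gpu(NUM_ATTENTION_HEADS, 4)
kv_heads_local = heads_per_gpu(NUM_KV_HEADS, 4)
print(query_heads_local, kv_heads_local)
```

Under these assumed values, both counts divide evenly by 4, which is why tensor_parallel=4 is a valid split for this model.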
Solution: Complete Fix with Configuration Changes Only #
Changing the model configuration in the GPUStack UI as follows solved the problem completely. No infrastructure changes required.
Before Change (Crashing Configuration) #
--tensor-parallel-size 2
--pipeline-parallel-size 2
--distributed-executor-backend ray
--max-model-len 32768
--generation-config vllm
After Change (Stable Configuration) #
--tensor-parallel-size 4
--pipeline-parallel-size 1
--distributed-executor-backend ray
--max-model-len 65536
--generation-config vllm
--enable-auto-tool-choice
--tool-call-parser hermes
Additionally, RAY_CGRAPH_get_timeout sets the timeout for inter-stage communication in pipeline_parallel, so it is no longer needed when pipeline_parallel=1. It can be safely removed from the environment variables.
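To make the change easier to review, the sketch below parses the two flag sets and confirms that both occupy the same four GPUs (tensor-parallel size × pipeline-parallel size), so only the partitioning differs. The helpers `parse_flags`, `gpus_required`, and `changed_flags` are illustrative names of ours, not part of GPUStack or vLLM.

```python
from typing import Dict, List, Optional, Tuple

BEFORE = [
    "--tensor-parallel-size 2",
    "--pipeline-parallel-size 2",
    "--distributed-executor-backend ray",
    "--max-model-len 32768",
    "--generation-config vllm",
]
AFTER = [
    "--tensor-parallel-size 4",
    "--pipeline-parallel-size 1",
    "--distributed-executor-backend ray",
    "--max-model-len 65536",
    "--generation-config vllm",
    "--enable-auto-tool-choice",
    "--tool-call-parser hermes",
]


def parse_flags(lines: List[str]) -> Dict[str, str]:
    """Map each flag name to its value ('' for boolean switches)."""
    flags: Dict[str, str] = {}
    for line in lines:
        parts = line.split(maxsplit=1)
        if parts:
            flags[parts[0]] = parts[1] if len(parts) > 1 else ""
    return flags


def gpus_required(flags: Dict[str, str]) -> int:
    """GPUs used by a deployment: tensor-parallel size times pipeline-parallel size."""
    tp = int(flags.get("--tensor-parallel-size", "1"))
    pp = int(flags.get("--pipeline-parallel-size", "1"))
    return tp * pp


def changed_flags(
    before: Dict[str, str], after: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Flags whose value differs, as name -> (old, new); None means absent."""
    names = sorted(set(before) | set(after))
    return {
        n: (before.get(n), after.get(n))
        for n in names
        if before.get(n) != after.get(n)
    }


before, after = parse_flags(BEFORE), parse_flags(AFTER)
print(gpus_required(before), gpus_required(after))
for name, (old, new) in changed_flags(before, after).items():
    print(f"{name}: {old!r} -> {new!r}")
```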
Comparison of Effects from Changes #
VRAM Usage #
The model is now distributed evenly across 4 GPUs, which significantly reduces per-GPU consumption.
# Before change (handled by 2 GPUs)
Model loading took 15-17 GiB / GPU
# After change (handled by 4 GPUs)
Model loading took 6.95 GiB / GPU
KV Cache #
The VRAM freed on each GPU goes directly to the KV cache.
# Before change
KV cache: Limited (VRAM constrained)
# After change
Available KV cache memory: 20.97 GiB
GPU KV cache size: 458,
Stability Change #
Before the change, a crash occurred with every chat submission. After the change, we confirmed that submitting multiple long-text requests simultaneously does not cause crashes. The drop in generation speed during concurrent processing is normal queuing behavior, not a problem.
Extending Context Length to 64K #
With tensor_parallel=4 providing VRAM headroom, we attempted to extend context length.
About Qwen2.5-14B Context Length #
The max_position_embeddings in config.json is 32768, but this is only the default. By enabling YaRN scaling (rope_scaling), the context can be extended to 128K. However, vLLM by default rejects a --max-model-len above 32768 for this model, so it must be explicitly allowed via an environment variable.
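The relationship between the native 32768 limit, the target length, and the YaRN factor can be sketched as follows. The dictionary mirrors the rope_scaling form that Qwen documents for config.json; whether this deployment actually set it is not stated in our notes, so treat it as an assumption. Both helper names are hypothetical.

```python
from typing import Dict, Union

NATIVE_LEN = 32768  # max_position_embeddings from config.json


def needs_long_len_override(max_model_len: int, native_len: int = NATIVE_LEN) -> bool:
    """True when the requested length exceeds the model's native limit."""
    return max_model_len > native_len


def yarn_rope_scaling(
    target_len: int, native_len: int = NATIVE_LEN
) -> Dict[str, Union[str, float, int]]:
    """Build a YaRN rope_scaling entry that stretches native_len to target_len."""
    factor = target_len / native_len
    if factor <= 1:
        raise ValueError("target_len must exceed native_len for YaRN scaling")
    return {
        "type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": native_len,
    }


print(needs_long_len_override(65536))
print(yarn_rope_scaling(65536))
```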
Add to Environment Variables:
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
Backend Parameters:
--max-model-len 65536
Startup verification:
INFO: Using max model len 65536
INFO: Available KV cache memory: 20.97 GiB
INFO: GPU KV cache size: 458,
INFO: Maximum concurrency for 32, per request: 13.98x
INFO: Starting vLLM API server ← Normal startup
Final Stable Operation Configuration Summary #
Backend Parameters #
--distributed-executor-backend ray
--tensor-parallel-size 4
--pipeline-parallel-size 1
--max-model-len 65536
--generation-config vllm
--enable-auto-tool-choice
--tool-call-parser hermes
Operation Verification Results #
We confirmed that Qwen2.5-14B-Instruct (float16) runs stably on 4× Tesla V100-PCIE-32GB (2-node configuration) with a context length of 65,536 under high-load concurrent requests. VRAM usage for model weights is 6.95 GiB per GPU, and 20.97 GiB is available for KV cache.
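The KV cache figure can be cross-checked with back-of-the-envelope arithmetic. The sketch below assumes Qwen2.5-14B has 48 layers, 8 KV heads, and a head dimension of 128 (values to verify in config.json), that the cache is stored in float16 (2 bytes), and that the logged 20.97 GiB is a per-GPU figure. The function names are ours.

```python
GIB = 1024 ** 3


def kv_bytes_per_token_per_gpu(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int,
    tensor_parallel_size: int,
) -> int:
    """Bytes of KV cache one token occupies on a single GPU."""
    if num_kv_heads % tensor_parallel_size != 0:
        raise ValueError("KV heads must divide evenly across GPUs")
    local_kv_heads = num_kv_heads // tensor_parallel_size
    # Factor 2 covers the separate key and value tensors in every layer.
    return 2 * num_layers * local_kv_heads * head_dim * dtype_bytes


def kv_cache_tokens(cache_gib: float, bytes_per_token: int) -> int:
    """Whole tokens that fit into cache_gib GiB of KV cache."""
    return int(cache_gib * GIB // bytes_per_token)


# Assumed architecture values for Qwen2.5-14B; float16 cache; TP=4.
per_token = kv_bytes_per_token_per_gpu(48, 8, 128, 2, 4)
tokens = kv_cache_tokens(20.97, per_token)
print(per_token, tokens)
print(round(tokens / 32768, 2), round(tokens / 65536, 2))
```

Under these assumptions the capacity comes out at roughly 458 thousand tokens, which agrees with the leading digits of the logged cache size, and the ratio for 32,768-token requests is about 13.98, matching the concurrency figure in the startup log if that line refers to 32,768-token requests.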
Checklist #
- In V100 environments, use --pipeline-parallel-size 1 (pipeline_parallel uses P2P GPU communication, which V100 does not support)
- When using 4 GPUs in multi-node configuration, unify to --tensor-parallel-size 4
- RAY_CGRAPH_get_timeout is unnecessary with pipeline_parallel=1 and can be removed
- For 64K context, add VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 to Environment Variables
- Start the gpustack-worker container with --ipc=host --shm-size=64g options
- Perform load testing with long text and concurrent requests before production deployment to verify stability
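The configuration items in this checklist can be turned into a small preflight check. This is a minimal sketch under our own assumptions: `check_v100_config` is a hypothetical helper, it takes the backend flags and environment variables as plain dictionaries, and it does not inspect container options such as --ipc=host or --shm-size.

```python
from typing import List, Mapping


def check_v100_config(
    flags: Mapping[str, str],
    env: Mapping[str, str],
    total_gpus: int,
    native_len: int = 32768,
) -> List[str]:
    """Return a list of checklist violations (empty when the config passes)."""
    problems: List[str] = []
    pp = int(flags.get("--pipeline-parallel-size", "1"))
    tp = int(flags.get("--tensor-parallel-size", "1"))
    max_len = int(flags.get("--max-model-len", str(native_len)))

    if pp != 1:
        problems.append("--pipeline-parallel-size should be 1 on V100")
    if tp != total_gpus:
        problems.append(f"--tensor-parallel-size should be {total_gpus}")
    if pp == 1 and "RAY_CGRAPH_get_timeout" in env:
        problems.append("RAY_CGRAPH_get_timeout is unused and can be removed")
    if max_len > native_len and env.get("VLLM_ALLOW_LONG_MAX_MODEL_LEN") != "1":
        problems.append("VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 is required")
    return problems


stable_flags = {
    "--tensor-parallel-size": "4",
    "--pipeline-parallel-size": "1",
    "--max-model-len": "65536",
}
stable_env = {"VLLM_ALLOW_LONG_MAX_MODEL_LEN": "1"}
print(check_v100_config(stable_flags, stable_env, total_gpus=4))
```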
This article is a troubleshooting record from a production environment. Internal information such as IP and tokens has been abstracted. Behavior may vary depending on GPU environment, driver, and vLLM version.