GPUStack v2.1.1: Resolving Every Issue That Kept Multi-Node Inference from Starting

Tech Blog
GPUStack
Multi-Node vLLM
Tesla V100
Troubleshooting

GPU Cluster Operations Memo

Resolved All Issues with GPUStack v2.1.1 Multi-Node Inference Not Starting
Complete Record of Running Qwen2.5-32B on 2-Node 4×V100 #

We attempted multi-node distributed inference with GPUStack v2.1.1 and the vLLM backend on 2 GPU workers, each with 2 × Tesla V100-PCIE-32GB, and hit a succession of unrelated problems: Ray cluster connection failures, Gloo communication errors, a /dev/shm shortage, and V100-specific CUDA kernel errors. This memo documents each problem and its workaround in chronological order.

GPU-Worker-A: Tesla V100×2 | GPU-Worker-B: Tesla V100×2 | GPU-Server: GPUStack Server | GPUStack v2.1.1 / vLLM 0.17.1 / CUDA 12.9

Configuration Overview #

The management node (GPU-Server) handles GPUStack Server, with inference distributed across 2 GPU workers. Models are stored in NFS-shared /models. Starting from Qwen3.5-35B-A3B (VLM), we eventually confirmed stable operation with Qwen2.5-32B-Instruct.

Target Model #

Qwen2.5-32B-Instruct (float16)
tensor-parallel×2 / pipeline-parallel×2

GPU Configuration #

Tesla V100-PCIE-32GB × 4 GPUs
(2 nodes × 2 GPUs)
compute capability 7.0

Final Results #

Stable operation at 16./s
max-model-len: 8192
Workarounds applied

Complete Overview of Issues and Solutions #

Issue 1

Ray Placement Group Waits Indefinitely #

When the model is started, the log message "Waiting for creating a placement group" repeats at ever longer intervals (10 seconds → 30 seconds → 150 seconds → 630 seconds), and the model never comes up.

WARNING: Tensor parallel size (4) exceeds available GPUs (2).
INFO: Waiting for creating a placement group of specs for 630 seconds.
specs=[{'node:': 0.001, 'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}]

Cause: Ray only runs within the container on the head node side; the GPUs on the other worker node cannot join the Ray cluster.

Verification Command: Run ray status --address=<HEAD_IP>:41000 inside the vllm runner container and confirm that the GPUs of both nodes are listed.
If a Ray version mismatch error (2.54.0 vs 2.48.0) appears, the nodes are probably running different GPUStack image versions.
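
The placement group in the log above asks for four single-GPU bundles, which matches tensor-parallel × pipeline-parallel = 2 × 2. As a rough sketch of that arithmetic, the helper functions below (hypothetical names, not part of GPUStack, Ray, or vLLM) compare the number of GPUs a job needs with the number Ray currently reports. The visible GPU count is assumed to be read manually from ray status.

```shell
# Hypothetical helpers, for illustration only.
# gpus_needed TP PP -> prints the number of single-GPU bundles required.
gpus_needed() {
  echo $(( $1 * $2 ))
}

# placement_fits TP PP VISIBLE -> succeeds when Ray sees enough GPUs.
placement_fits() {
  [ "$3" -ge "$(gpus_needed "$1" "$2")" ]
}

# Values from this article: TP=2, PP=2, only the head node's 2 GPUs visible.
if placement_fits 2 2 2; then
  echo "placement group can be scheduled"
else
  echo "need $(gpus_needed 2 2) GPUs, Ray sees 2: placement group will wait"
fi
```

With only the head node's GPUs in the cluster, the check fails, which is consistent with the indefinite wait seen here.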

Issue 2

Gloo Communication Failure Due to 127.0.1.1 Entry in /etc/hosts #

Ray nodes can connect to each other, but after model loading, distributed environment initialization (Gloo’s connectFullMesh) fails.

(RayWorkerWrapper pid=1722) ERROR failed to connect, retry=4, retryLimit=3,
  local=[127.0.0.1]:35509, remote=[127.0.1.1]:51638$1,
  error=SO_ERROR: Connection refused

RuntimeError: Gloo connectFullMesh failed with timed out connecting:
  SO_ERROR: Connection refused, remote=[127.0.1.1]:51638$1

Cause: On Ubuntu/Debian, cloud-init by default writes a "127.0.1.1 hostname" entry to /etc/hosts. Gloo resolves the node's hostname through this entry and treats 127.0.1.1 (a loopback address) as the node's own IP. The other nodes then try to connect to that loopback address and are refused.

Solution: On each node, correct the 127.0.1.1 entry in /etc/hosts to the actual IP address.

# <WORKER_A_IP> and <WORKER_B_IP> stand for each node's actual IP address (abstracted in this article)

# On GPU-Worker-A
sudo sed -i 's/127\.0\.1\.1\s*GPU-Worker-A/<WORKER_A_IP> GPU-Worker-A/' /etc/hosts

# On GPU-Worker-B
sudo sed -i 's/127\.0\.1\.1\s*GPU-Worker-B/<WORKER_B_IP> GPU-Worker-B/' /etc/hosts

Note: In this environment, cloud-init (manage_etc_hosts=True) had already written the actual IP, so no correction was necessary. Check the runner container’s /etc/hosts directly to confirm that no 127.0.1.1 entry exists.
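
The check described above can be sketched as a small function. The name find_loopback_alias is made up for this example; it uses only standard awk and prints any hosts-file entry whose address is 127.0.1.1.

```shell
# Hypothetical helper, for illustration only.
# find_loopback_alias [HOSTS_FILE]
# Prints every entry whose address field is 127.0.1.1 (comment lines are
# skipped because their first field starts with "#").
# Exit status 0 means at least one such entry exists.
find_loopback_alias() {
  awk '$1 == "127.0.1.1" { print; found = 1 } END { exit !found }' "${1:-/etc/hosts}"
}

# Example usage on a node:
#   if find_loopback_alias /etc/hosts; then echo "fix /etc/hosts first"; fi
```

To inspect the runner container as well, its hosts file could be copied out first, for instance with docker exec <RUNNER_CONTAINER> cat /etc/hosts, where <RUNNER_CONTAINER> is a placeholder for whatever name GPUStack gave the container.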

Issue 3

Qwen3.5-35B-A3B Visual Encoder Does Not Work on V100 #

After resolving the Gloo issue, model loading succeeds but immediately crashes during the profile_run phase.

File "vllm/model_executor/layers/conv.py", line 236, in _forward_conv
    x = F.conv3d(
        ^^^^^^^^^
RuntimeError: GET was unable to find an engine to execute this computation

Cause: Qwen3.5-35B-A3B is a Vision Language Model, and the Conv3D in its Visual Encoder has no cuDNN algorithm compiled for V100 (compute capability 7.0). Environment variables such as VLLM_DISABLE_MULTIMODAL=1 do not work around this: vLLM always initializes the Visual Encoder during profile_run, so the limitation is structural.

Solution: Switch the model to Qwen2.5-32B-Instruct (text-only). Since it has no Visual Encoder, this issue does not occur.

Issue 4

Engine Crashes Due to /dev/shm Shortage During Chat #

The model starts up, but the engine crashes the moment a chat message is sent.

WARNING (raylet) store_runner.cc:83: System memory request exceeds memory available in /dev/shm.
  The request is for 10200547328 bytes, and the amount available is 9663676416 bytes.
  If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size'

CUDA error: no kernel image is available for execution on the device

Cause: Docker’s default /dev/shm size is 64MB. Ray’s pipeline_parallel uses large amounts of shared memory for inter-node communication, which is exhausted during inference. The gpustack-worker container was not launched with --shm-size specified.
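
To put the raylet warning into more readable units, the sketch below converts the two byte counts from the log into GiB. The function name bytes_to_gib is an assumption made for this example.

```shell
# Hypothetical helper, for illustration only.
# bytes_to_gib BYTES -> prints the value in GiB with two decimals.
bytes_to_gib() {
  awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 * 1024 * 1024) }'
}

requested=10200547328   # "The request is for ..." in the raylet warning
available=9663676416    # "the amount available is ..." in the raylet warning

echo "requested: $(bytes_to_gib "$requested") GiB"                    # 9.50
echo "available: $(bytes_to_gib "$available") GiB"                    # 9.00
echo "shortfall: $(bytes_to_gib $(( requested - available ))) GiB"    # 0.50
```

The request exceeds the available shared memory by 0.5 GiB, so the 16g value used below leaves some headroom over the logged request.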

Solution: Restart the gpustack-worker container with --shm-size=16g.

docker stop gpustack-worker && docker rm gpustack-worker

docker run -d --name gpustack-worker \
  --restart=unless-stopped \
  --privileged \
  --network=host \
  --shm-size=16g \
  --volume /var/run/docker.sock:/var/run/docker.sock \
  --volume /models:/models \
  --volume /var/lib/gpustack-data:/var/lib/gpustack \
  --runtime nvidia \
  gpustack/gpustack:v2.1.1 \
  --server-url http://<SERVER_IP> \
  --token <TOKEN> \
  --cache-dir /models

Note: GPUStack v2.1.1’s “Mirrored Deployment” feature carries the gpustack-worker settings over to the runner container, but --shm-size is not among them. This is a limitation on the GPUStack side and has to be compensated for with the environment variables described under Issue 5.

Issue 5

Runner Container’s shm Is Not Inherited; RayChannelTimeoutError Continues #

Even after restarting gpustack-worker, the engine still crashes during chat.

ray.exceptions.RayChannelTimeoutError: System error:
  If the execution is expected to take a long time,
  increase RAY_CGRAPH_get_timeout which is currently 300 seconds.
  Otherwise, this may indicate that the execution is hanging.

Cause: The vllm/ray runner container is created fresh by GPUStack each time, so the --shm-size setting from gpustack-worker is not inherited across containers. The runner container’s own shm remains at 64MB.
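
One way to confirm this is to look at the size of /dev/shm from inside the runner container. The function below is a minimal sketch with a made-up name; it relies only on the POSIX output format of df.

```shell
# Hypothetical helper, for illustration only.
# shm_total_kib [PATH] -> prints the total size, in KiB, of the filesystem
# mounted at PATH (default: /dev/shm), taken from the second line of df -Pk.
shm_total_kib() {
  df -Pk "${1:-/dev/shm}" | awk 'NR == 2 { print $2 }'
}

# Example usage inside a container:
#   shm_total_kib /dev/shm
# A value of 65536 would correspond to Docker's 64MB default.
```

From the host, the same information should be visible with docker exec <RUNNER_CONTAINER> df -h /dev/shm, where <RUNNER_CONTAINER> is a placeholder for the container name GPUStack assigned.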

Solution: Add the following to Environment Variables in GPUStack UI model settings.

RAY_CGRAPH_get_timeout=600
RAY_memory_store_capacity=17179869184
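
The capacity value is 16 GiB expressed in bytes, the same size as the --shm-size=16g given to gpustack-worker. A small sketch of the conversion, using a hypothetical helper name, in case a different size is needed:

```shell
# Hypothetical helper, for illustration only.
# gib_to_bytes GIB -> prints the size in bytes.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 16   # prints 17179869184
```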

Issue 6

generation_config repetition_penalty Kernel Is Not V100-Compatible; Loop Output #

Chat either produces abnormal output in which a symbol repeats in a loop (♡♡♡…), or inference crashes with a CUDA kernel error.

torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
# ↑ Occurs during apply_penalties (repetition_penalty) execution

WARNING: Default vLLM sampling parameters have been overridden by the model's
  generation_config.json: {'repetition_penalty': 1.05, 'temperature': 0.7, ...}

Cause: The CUDA kernel for repetition_penalty set in Qwen2.5’s generation_config.json is not compiled for V100.

Solution: In GPUStack UI model settings → Backend Parameters, add:

--generation-config vllm

This ignores the model’s generation_config.json and uses vLLM’s default settings instead.
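
Before applying the flag, it can help to see whether a model's generation_config.json sets repetition_penalty at all. The sketch below uses plain grep; the function name and the example model path are assumptions made for illustration.

```shell
# Hypothetical helper, for illustration only.
# show_repetition_penalty MODEL_DIR
# Prints the "repetition_penalty" entry from MODEL_DIR/generation_config.json.
# Exit status is non-zero when the key is absent.
show_repetition_penalty() {
  grep -o '"repetition_penalty"[^,}]*' "$1/generation_config.json"
}

# Example usage (the path is an assumed location under the NFS-shared /models):
#   show_repetition_penalty /models/Qwen2.5-32B-Instruct
```

For the configuration quoted in the warning above, this would print "repetition_penalty": 1.05.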

Final Stable Operation Configuration Summary #

gpustack-worker Startup Command (Common to Both Nodes) #

docker run -d --name gpustack-worker \
  --restart=unless-stopped \
  --privileged \
  --network=host \
  --shm-size=16g \
  --volume /var/run/docker.sock:/var/run/docker.sock \
  --volume /models:/models \
  --volume /var/lib/gpustack-data:/var/lib/gpustack \
  --runtime nvidia \
  gpustack/gpustack:v2.1.1 \
  --server-url http://<SERVER_IP> \
  --token <TOKEN> \
  --cache-dir /models

GPUStack UI Model Settings #

Backend Parameters:

--tensor-parallel-size 2
--pipeline-parallel-size 2
--distributed-executor-backend ray
--max-model-len 8192
--enforce-eager
--generation-config vllm

Environment Variables:

RAY_CGRAPH_get_timeout=600
RAY_memory_store_capacity=17179869184

Operation Verification Results #

Qwen2.5-32B-Instruct (float16) runs stably on 4×Tesla V100-PCIE-32GB (2-node configuration) at 16./s. VRAM usage is 93–96% per GPU. max-model-len is 8192.

V100 (compute capability 7.0) Specific Limitations #

❌ Does Not Work #

  • Flash Attention 2 (compute 8.0 or higher required) → Falls back to Triton ATTN
  • bfloat16 (→ automatically converts to float16)
  • Custom AllReduce (NVLink P2P not supported)
  • SymmMemCommunicator
  • VLM Conv3D (Qwen3.5-35B-A3B, etc.)
  • generation_config repetition_penalty kernel

✅ Works #

  • Triton ATTN backend (Flash Attention alternative)
  • NCCL communication (nccl==2.27.5)
  • Ray distributed execution (pipeline + tensor parallel)
  • torch.compile / CUDA Graphs (recommended to disable with enforce-eager)
  • Text-only model inference such as Qwen2.5

Troubleshooting Checklist #

  • Verify that all node GPUs are recognized with ray status (inside container)
  • Verify that hostname in each node’s /etc/hosts resolves to actual IP, not 127.0.1.1
  • Specify --shm-size=16g or larger for gpustack-worker container
  • Add RAY_CGRAPH_get_timeout=600 and RAY_memory_store_capacity=17179869184 to Environment Variables in model settings
  • For V100 environments, add --generation-config vllm to Backend Parameters to ignore generation_config.json
  • VLM (Vision Language Model) does not work on V100. Use text-only models
  • Consider adding --enforce-eager on V100 (avoids CUDA Graph-related instability)
  • If VRAM is tight, adjust KV cache allocation with --gpu-memory-utilization 0.85

References #

This article is a record of troubleshooting in a production environment. IPs, tokens, and similar items are abstracted. Some issues may not occur on GPU environments other than V100. The situation may change with updates to vLLM and GPUStack versions.

Updated on June 9, 2026
