Not Good Luck

#2
by websterdav - opened

Our Ubuntu 24.04 Linux VM runs on a Windows Server 2025 host. We are installing this model on two H200s with NVLink. The model comes up and starts with no problem. However, after running for a little while it takes out the card, with nvidia-smi showing ERR. Come to find out, it takes the Windows Server down with it; the only option is a reboot. Reminds me of the old days when GPU drivers would blue-screen you more than anything else. (Too bad I didn't mine coins back then.) We tried this many times with the same result over and over, so we are finding this unstable at this point.

Also, the vllm:nightly from 2/1 (like 9.0 rc1) worked to bring this and other models up. The vllm:nightly from 2/15 will not load this model at all; a lot of nasty errors. Just don't do it.
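If you want to keep a later nightly pull from breaking a working setup, one option (a sketch, not part of the original setup) is to pin the image by digest instead of the floating nightly tag:

# While the known-good image is still local, grab its digest:
docker images --digests vllm/vllm-openai

# Then reference that digest in the compose file instead of the tag,
# so "docker compose pull" can never swap in a newer, broken nightly.
# The value below is a placeholder for the digest printed above:
#   image: vllm/vllm-openai@sha256:<digest-from-above>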

Below is the compose file that works (with the 2/1 nightly image).

version: "3.9"

services:
vllm-minimax-m2.5:
image: vllm/vllm-openai:nightly
container_name: vllm-minimax-m2.5
restart: unless-stopped
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ["2", "3"]
capabilities: [gpu]

ports:
  - "8003:8000"

environment:
  NVIDIA_DRIVER_CAPABILITIES: "compute,utility"
  LD_LIBRARY_PATH: "/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64"
  HF_HUB_OFFLINE: "1"
  SAFETENSORS_FAST_GPU: 1
  VLLM_USE_DEEP_GEMM: 0
  VLLM_USE_FLASHINFER_MOE_FP16: 1
  VLLM_USE_FLASHINFER_SAMPLER: 0
  OMP_NUM_THREADS: 2


shm_size: "16gb"

volumes:
  - /opt/models/MiniMax-M2.5-AWQ-QT:/QuantTrio/MiniMax-M2.5-AWQ
  - /opt/vllm_cache:/root/.cache/huggingface
  - /opt/vllm_pip_cache:/root/.cache/pip
  - /etc/encodings:/encodings:ro

entrypoint: 
  - bash
  - -c
  - |
    nvidia-smi
    python3 -c "import torch; print('cuda:', torch.cuda.is_available()); print('count:', torch.cuda.device_count())"
    exec vllm serve /QuantTrio/MiniMax-M2.5-AWQ \
      --served-model-name MiniMax-M2.5-AWQ \
      --swap-space 8 \
      --max-model-len 131072  \
      --gpu-memory-utilization 0.9 \
      --tensor-parallel-size 2 \
      --enable-auto-tool-choice \
      --tool-call-parser minimax_m2 \
      --reasoning-parser minimax_m2_append_think \
      --trust-remote-code \
      --max-num-batched-tokens 8196 \
      --host 0.0.0.0 \
      --port 8000
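
Once the container is up, a quick smoke test against the OpenAI-compatible endpoint (just a sketch; the model name comes from --served-model-name and the host port from the "8003:8000" mapping above):

# List the models the server is exposing
curl -s http://localhost:8003/v1/models

# Minimal chat completion request
curl -s http://localhost:8003/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MiniMax-M2.5-AWQ", "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 32}'
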
websterdav changed discussion status to closed
