Mingke977 committed · Commit 7f3e6ce · verified · 1 Parent(s): af7188b

add tp1 deployment

Files changed (1): docs/deploy_guidance.md (+16 -3)

docs/deploy_guidance.md CHANGED
@@ -7,7 +7,7 @@

## vLLM Deployment

- Here is the example to serve this model on a H200 single node with TP8 via vLLM:
+ Here is an example of serving this model on a single H200 node via vLLM:

1. pull the Docker image.
```bash
@@ -15,6 +15,12 @@ docker pull jdopensource/joyai-llm-vllm:v0.13.0-joyai_llm_flash
```
2. launch JoyAI-LLM Flash model with dense MTP.
```bash
+ # TP1 for memory efficiency
+ vllm serve ${MODEL_PATH} --tp 1 --trust-remote-code \
+ --tool-call-parser qwen3_coder --enable-auto-tool-choice \
+ --speculative-config $'{"method": "mtp", "num_speculative_tokens": 3}'
+
+ # TP8 for extreme speed and long context
vllm serve ${MODEL_PATH} --tp 8 --trust-remote-code \
--tool-call-parser qwen3_coder --enable-auto-tool-choice \
--speculative-config $'{"method": "mtp", "num_speculative_tokens": 3}'
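
As a quick check that either launch variant is serving correctly, the deployment can be exercised through vLLM's OpenAI-compatible API. A minimal smoke-test sketch, assuming vLLM's default port 8000 and that the served model name defaults to the value of ${MODEL_PATH}:

```bash
# Smoke test against the OpenAI-compatible endpoint (vLLM default port: 8000).
# Assumes the server above is running; vLLM names the model after the model
# argument (here ${MODEL_PATH}) unless --served-model-name overrides it.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "'"${MODEL_PATH}"'",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64
      }'
```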
@@ -24,7 +30,7 @@ vllm serve ${MODEL_PATH} --tp 8 --trust-remote-code \

## SGLang Deployment

- Similarly, here is the example to run with TP8 on H200 in a single node via SGLang:
+ Similarly, here is an example of running it on a single H200 node via SGLang:

1. pull the Docker image.
```bash
@@ -33,10 +39,17 @@ docker pull jdopensource/joyai-llm-sglang:v0.5.8-joyai_llm_flash
2. launch JoyAI-LLM Flash model with dense MTP.

```bash
- python3 -m sglang.launch_server --model-path ${MODEL_PATH} --tp-size 8 --trust-remote-code \
+ # TP1 for memory efficiency
+ python3 -m sglang.launch_server --model-path ${MODEL_PATH} --tp-size 1 --trust-remote-code \
--tool-call-parser qwen3_coder \
--speculative-algorithm EAGLE --speculative-draft-model-path ${MTP_MODEL_PATH} \
--speculative-num-steps 2 --speculative-eagle-topk 2 --speculative-num-draft-tokens 3
+
+ # TP8 for extreme speed and long context
+ python3 -m sglang.launch_server --model-path ${MODEL_PATH} --tp-size 8 --trust-remote-code \
+ --tool-call-parser qwen3_coder \
+ --speculative-algorithm EAGLE --speculative-draft-model-path ${MTP_MODEL_PATH} \
+ --speculative-num-steps 2 --speculative-eagle-topk 2 --speculative-num-draft-tokens 3
```
**Key notes:**
- `--tool-call-parser qwen3_coder`: Required when enabling tool usage.
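
The same smoke test applies to the SGLang deployment, which also exposes an OpenAI-compatible API. A minimal sketch, assuming SGLang's default port 30000 and the model name following ${MODEL_PATH}:

```bash
# Smoke test against SGLang's OpenAI-compatible endpoint (default port: 30000).
# Assumes one of the launch_server commands above is running.
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "'"${MODEL_PATH}"'",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64
      }'
```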
 
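Because the key notes require `--tool-call-parser qwen3_coder` whenever tools are enabled, a tool-call request in the standard OpenAI `tools` format is the natural follow-up test. A hedged sketch follows; the `get_weather` function and its schema are hypothetical, and the vLLM port 8000 is assumed (use 30000 for the SGLang server):

```bash
# Hypothetical tool-call request; get_weather is illustrative only.
# With --enable-auto-tool-choice (vLLM), the model decides when to call it.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "'"${MODEL_PATH}"'",
        "messages": [{"role": "user", "content": "What is the weather in Beijing?"}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city",
            "parameters": {
              "type": "object",
              "properties": {"city": {"type": "string"}},
              "required": ["city"]
            }
          }
        }]
      }'
```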