zwpride committed on
Commit c18e85b · verified · 1 Parent(s): 7356634

Update README.md

Files changed (1)
  1. README.md +8 -0
README.md CHANGED
@@ -68,6 +68,14 @@ The paper's main finding is that PLT loop-count scaling is non-monotonic. The tw
 
  This checkpoint uses a custom PLT model architecture. Load it in an environment that provides support for `IQuestPLTCoderForCausalLM` and the custom tokenizer/configuration files in this repository.
 
+ For vLLM inference, install vLLM from [yxing-bj/vllm](https://github.com/yxing-bj/vllm) and use `transformers==4.57.1`, then start the server with the following command:
+
+ ```bash
+ vllm serve $MODEL --port 8080 \
+ --max-num-batched-tokens 8192 --max-num-seqs 512 -tp 1 -dp 1 --trust-remote-code \
+ --cudagraph-capture-sizes 1 2 4 8 12 16 24 32
+ ```
+
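Once the server from the `vllm serve` command above is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The following is a minimal sketch and is not part of the README change. It assumes the server is reachable on `localhost:8080` (the port in the command above) and that the plain `/v1/completions` endpoint is used, which avoids assuming anything about a chat template. `BASE_URL`, `build_completion_request`, and `complete` are hypothetical names introduced only for illustration.

```python
import json
import urllib.request

# Assumption: the vLLM server runs locally on the port used in the serve command.
BASE_URL = "http://localhost:8080/v1"


def build_completion_request(
    model: str, prompt: str, max_tokens: int = 128
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,  # should match the model name or path passed to `vllm serve`
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding, so repeated calls are comparable
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(model: str, prompt: str, max_tokens: int = 128) -> str:
    """Send the request and return the text of the first completion."""
    request = build_completion_request(model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server up, a call such as `complete(os.environ["MODEL"], "def fibonacci(n):")` (after `import os`) should return the generated continuation. Building the request separately from sending it keeps the payload easy to inspect without a running server.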
  ```python
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer