Update code snippet to use sentence-level embeddings; update title
#2
by tomaarsen HF Staff - opened
README.md
CHANGED
    @@ -7,7 +7,7 @@ language:
     - code
     ---
     
    -## CodeSage-
    +## CodeSage-Base
     
     ### Model description
     CodeSage is a new family of open code embedding models with an encoder architecture that support a wide range of source code understanding tasks. It is introduced in the paper:
    @@ -24,7 +24,7 @@ This checkpoint is first trained on code data via masked language modeling (MLM)
     ### How to use
     This checkpoint consists of an encoder (356M model), which can be used to extract code embeddings of 1024 dimension. It can be easily loaded using the AutoModel functionality and employs the Starcoder tokenizer (https://arxiv.org/pdf/2305.06161.pdf).
     
    -```
    +```python
     from transformers import AutoModel, AutoTokenizer
     
     checkpoint = "codesage/codesage-base"
    @@ -33,10 +33,10 @@ device = "cuda" # for GPU usage or "cpu" for CPU usage
     tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
     model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)
     
    -inputs = tokenizer
    -embedding = model(inputs)
    -print(f'Dimension of the embedding: {embedding
    -# Dimension of the embedding: torch.Size([
    +inputs = tokenizer("def print_hello_world():\tprint('Hello World!')", return_tensors="pt").to(device)
    +embedding = model(**inputs).pooler_output
    +print(f'Dimension of the embedding: {embedding.size()}')
    +# Dimension of the embedding: torch.Size([1, 1024])
     ```
     
     ### BibTeX entry and citation info
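To illustrate what the sentence-level `pooler_output` from the updated snippet can be used for, here is a minimal sketch that embeds two code snippets one at a time and compares them. The helper names `cosine_similarity`, `embed`, and `main` are hypothetical and are not part of CodeSage or of this change, and cosine similarity is only an assumed choice of comparison metric. The model and tokenizer calls mirror the ones shown in the README diff; the heavy imports are deferred so the similarity helper can be used without downloading the checkpoint.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def embed(code, tokenizer, model, device):
    """Hypothetical helper: return one snippet's pooled embedding as a list of floats."""
    import torch  # deferred so cosine_similarity works without torch installed

    # Same calls as in the updated README snippet, applied to a single input.
    inputs = tokenizer(code, return_tensors="pt").to(device)
    with torch.no_grad():
        embedding = model(**inputs).pooler_output  # shape [1, 1024] per the README
    return embedding[0].cpu().tolist()


def main():
    import torch
    from transformers import AutoModel, AutoTokenizer

    checkpoint = "codesage/codesage-base"
    device = "cuda" if torch.cuda.is_available() else "cpu"

    tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
    model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)
    model.eval()

    first = embed("def print_hello_world():\tprint('Hello World!')", tokenizer, model, device)
    second = embed("def greet():\tprint('Hi there!')", tokenizer, model, device)
    print(f"Cosine similarity: {cosine_similarity(first, second):.4f}")


if __name__ == "__main__":
    main()
```

Each snippet is embedded separately here to stay close to the single-input call in the README; batching several snippets would additionally require padding, which this sketch does not cover.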