---
library_name: transformers
tags:
- Code
- Vulnerability
- Detection
- C/C++
datasets:
- DetectVul/devign
language:
- en
base_model:
- microsoft/graphcodebert-base
license: mit
metrics:
- accuracy
- precision
- f1
- recall
---
## GraphCodeBERT for Code Vulnerability Detection

## Model Summary

This model is a fine-tuned version of **microsoft/graphcodebert-base**, optimized for detecting vulnerabilities in code. It is trained on the **DetectVul/devign** dataset.

The model takes in a code snippet and classifies it as either **safe (0)** or **vulnerable (1)**.
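To make the 0/1 output concrete: the classification head produces two logits, and the label is the argmax of their softmax. A minimal pure-Python sketch of that decision step (the logit values below are illustrative, not actual model output):

```python
import math

def classify_from_logits(logits):
    """Map a pair of raw logits [safe, vulnerable] to a label via softmax + argmax."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]  # softmax probabilities, summing to 1
    label = probs.index(max(probs))    # 0 = safe, 1 = vulnerable
    return label, probs

# Illustrative logits only -- real values come from the model's classification head
label, probs = classify_from_logits([0.2, 1.5])
print(label)  # 1 -> "Vulnerable Code"
```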
## Model Details

- **Developed by:** Mukit Mahdin
- **Finetuned from:** `microsoft/graphcodebert-base`
- **Language(s):** English (for code comments & metadata), C/C++
- **License:** MIT
- **Task:** Code vulnerability detection
- **Dataset Used:** `DetectVul/devign`
- **Architecture:** Transformer-based sequence classification
## Uses

### Direct Use

This model can be used for **static code analysis**, security audits, and automatic vulnerability detection in software repositories. It is useful for:

- **Developers**: To analyze their code for potential security flaws.
- **Security Teams**: To scan repositories for known vulnerabilities.
- **Researchers**: To study vulnerability detection in AI-powered systems.

### Downstream Use

This model can be integrated into **IDE plugins**, **CI/CD pipelines**, or **security scanners** to provide real-time vulnerability detection.
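As an illustration of the CI/CD use case, one could wrap the classifier in a small repository scanner. This is a hypothetical sketch, not part of the model's release: the `scan_repository` helper and the `classify` callable are assumptions, with the actual model wiring shown only as a comment (it downloads the checkpoint).

```python
from pathlib import Path

def scan_repository(root, classify, extensions=(".c", ".cpp", ".h")):
    """Run a classifier over every C/C++ source file under `root`.

    `classify` is any callable mapping source text to 0 (safe) or
    1 (vulnerable), e.g. a wrapper around a text-classification pipeline.
    Returns the paths of files flagged as vulnerable.
    """
    findings = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in extensions:
            code = path.read_text(errors="ignore")
            if classify(code) == 1:
                findings.append(str(path))
    return findings

# Hypothetical wiring with the transformers pipeline (downloads the model;
# the label-string parsing below assumes labels like "LABEL_0"/"LABEL_1"):
# from transformers import pipeline
# pipe = pipeline("text-classification",
#                 model="mahdin70/graphcodebert-devign-code-vulnerability-detector")
# flagged = scan_repository("src/", lambda code: int(pipe(code)[0]["label"][-1]))
```

In a CI job, a nonzero number of flagged files could fail the build or post a review comment; as noted below, such flags should be treated as hints for human review, not verdicts.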
### Out-of-Scope Use

- The model is **not meant to replace human security experts**.
- It may not generalize well to **languages other than C/C++**.
- False positives/negatives may occur due to dataset limitations.

## Bias, Risks, and Limitations

- **False Positives & False Negatives:** The model may flag safe code as vulnerable or miss actual vulnerabilities.
- **Limited to C/C++:** The model was trained on a dataset primarily composed of **C and C++ code**. It may not perform well on other languages.
- **Dataset Bias:** The training data may not cover all possible vulnerabilities.

### Recommendations

Users should **not rely solely on the model** for security assessments. Instead, it should be used alongside **manual code review and static analysis tools**.
## How to Get Started with the Model

Use the code below to load the model and run inference on a sample code snippet:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("mahdin70/graphcodebert-devign-code-vulnerability-detector")
model = AutoModelForSequenceClassification.from_pretrained("mahdin70/graphcodebert-devign-code-vulnerability-detector")

# Sample code snippet
code_snippet = '''
void process(char *input) {
    char buffer[50];
    strcpy(buffer, input);  // Potential buffer overflow
}
'''

# Tokenize the input
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_label = torch.argmax(predictions, dim=1).item()

# Output the result
print("Vulnerable Code" if predicted_label == 1 else "Safe Code")
```
## Training Details

### Training Data

- **Dataset:** `DetectVul/devign`
- **Classes:** `0 (Safe)`, `1 (Vulnerable)`
- **Size:** 21,800 code snippets

### Training Procedure

- **Optimizer:** AdamW
- **Loss Function:** CrossEntropyLoss
- **Batch Size:** 16
- **Learning Rate:** 2e-05
- **Epochs:** 3
- **Hardware Used:** 2x T4 GPU
### Metrics

| Metric              | Score    |
|---------------------|----------|
| **Train Loss**      | 0.6112   |
| **Evaluation Loss** | 0.605983 |
| **Accuracy**        | 64.27%   |
| **F1 Score**        | 51.8%    |
| **Precision**       | 68.04%   |
| **Recall**          | 41.9%    |
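As a sanity check, the reported F1 score is consistent with the precision and recall above under the standard formula F1 = 2PR / (P + R), up to rounding of the table values:

```python
# Precision and recall taken from the metrics table above
precision, recall = 0.6804, 0.419

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1 * 100, 1))  # 51.9 -- matches the reported 51.8% within rounding
```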
## Environmental Impact

| Factor            | Value     |
|-------------------|-----------|
| **GPU Used**      | 2x T4 GPU |
| **Training Time** | ~1 hour   |