scthornton committed on
Commit
dc8bf5e
·
verified ·
1 Parent(s): 5cbdf56

Upload README.md with huggingface_hub

  1. README.md +132 -309
README.md CHANGED
@@ -1,384 +1,207 @@
1
  ---
2
- license: apache-2.0
3
  base_model: bigcode/starcoder2-15b-instruct-v0.1
4
  tags:
5
- - code
6
- - security
7
- - starcoder
8
- - bigcode
9
- - securecode
10
- - owasp
11
- - vulnerability-detection
12
  datasets:
13
- - scthornton/securecode-v2
14
- language:
15
- - en
16
- library_name: transformers
17
  pipeline_tag: text-generation
18
- arxiv: 2512.18542
 
 
19
  ---
20
 
21
- # StarCoder2 15B - SecureCode Edition
22
 
23
  <div align="center">
24
 
25
- [![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
26
- [![Training Dataset](https://img.shields.io/badge/dataset-SecureCode%20v2.0-green.svg)](https://huggingface.co/datasets/scthornton/securecode-v2)
27
- [![Base Model](https://img.shields.io/badge/base-StarCoder2%2015B-orange.svg)](https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1)
28
- [![perfecXion.ai](https://img.shields.io/badge/by-perfecXion.ai-purple.svg)](https://perfecxion.ai)
29
 
30
- **The most powerful multi-language security model - 600+ programming languages**
31
 
32
- [📄 Paper](https://arxiv.org/abs/2512.18542) | [🤗 Model Card](https://huggingface.co/scthornton/starcoder2-15b-securecode) | [📊 Dataset](https://huggingface.co/datasets/scthornton/securecode-v2) | [💻 perfecXion.ai](https://perfecxion.ai)
33
 
34
  </div>
35
 
36
  ---
37
 
38
- ## 🎯 What is This?
39
-
40
- This is **StarCoder2 15B Instruct** fine-tuned on the **SecureCode v2.0 dataset** - the most comprehensive multi-language code model available, trained on **4 trillion tokens** across **600+ programming languages**, now enhanced with production-grade security knowledge.
41
-
42
- StarCoder2 represents the cutting edge of open-source code generation, developed by BigCode (ServiceNow + Hugging Face). Combined with SecureCode training, this model delivers:
43
-
44
- ✅ **Unprecedented language coverage** - Security awareness across 600+ languages
45
- ✅ **State-of-the-art code generation** - Best open-source model performance
46
- ✅ **Complex security reasoning** - 15B parameters for sophisticated vulnerability analysis
47
- ✅ **Production-ready quality** - Trained on The Stack v2 with rigorous data curation
48
-
49
- **The Result:** The most powerful and versatile security-aware code model in the SecureCode collection.
50
-
51
- **Why StarCoder2 15B?** This model offers:
52
- - 🌍 **600+ languages** - From mainstream to niche (Solidity, Kotlin, Swift, Haskell, etc.)
53
- - 🏆 **SOTA performance** - Best open-source code model
54
- - 🧠 **Complex reasoning** - 15B parameters for sophisticated security analysis
55
- - 🔬 **Research-grade** - Built on The Stack v2 with extensive curation
56
- - 🌟 **Community-driven** - BigCode initiative backed by ServiceNow + HuggingFace
57
-
58
- ---
59
-
60
- ## 🚨 The Problem This Solves
61
-
62
- **AI coding assistants produce vulnerable code in 45% of security-relevant scenarios** (Veracode 2025). For organizations using diverse tech stacks, this problem multiplies across dozens of languages and frameworks.
63
 
64
- **Multi-language security challenges:**
65
- - Solidity smart contracts: **$3+ billion** stolen in Web3 exploits (2021-2024)
66
- - Mobile apps (Kotlin/Swift): Frequent authentication bypass vulnerabilities
67
- - Legacy systems (COBOL/Fortran): Undocumented security flaws
68
- - Emerging languages (Rust/Zig): New security patterns needed
69
 
70
- StarCoder2 SecureCode Edition addresses security across the entire programming language spectrum.
71
 
72
- ---
73
 
74
- ## 💡 Key Features
75
 
76
- ### 🌍 Unmatched Language Coverage
77
 
78
- StarCoder2 15B trained on **600+ programming languages**:
79
- - **Mainstream:** Python, JavaScript, Java, C++, Go, Rust
80
- - **Web3:** Solidity, Vyper, Cairo, Move
81
- - **Mobile:** Kotlin, Swift, Dart
82
- - **Systems:** C, Rust, Zig, Assembly
83
- - **Functional:** Haskell, OCaml, Scala, Elixir
84
- - **Legacy:** COBOL, Fortran, Pascal
85
- - **And 580+ more...**
86
 
87
- Now enhanced with **1,209 security-focused examples** covering OWASP Top 10:2025.
88
-
89
- ### 🏆 State-of-the-Art Performance
90
-
91
- StarCoder2 15B delivers cutting-edge results:
92
- - HumanEval: **72.6%** pass@1 (best open-source at release)
93
- - MultiPL-E: **52.3%** average across languages
94
- - Leading performance on long-context code tasks
95
- - Trained on The Stack v2 (4T tokens)
96
-
97
- ### 🔐 Comprehensive Security Training
98
-
99
- Trained on real-world security incidents:
100
- - **224 examples** of Broken Access Control
101
- - **199 examples** of Authentication Failures
102
- - **125 examples** of Injection attacks
103
- - **115 examples** of Cryptographic Failures
104
- - Complete **OWASP Top 10:2025** coverage
105
-
106
- ### 📋 Advanced Security Analysis
107
-
108
- Every response includes:
109
- 1. **Multi-language vulnerability patterns**
110
- 2. **Secure implementations** with language-specific best practices
111
- 3. **Attack demonstrations** with realistic exploits
112
- 4. **Cross-language security guidance** - patterns that apply across languages
113
-
114
- ---
115
-
116
- ## 📊 Training Details
117
-
118
- | Parameter | Value |
119
- |-----------|-------|
120
- | **Base Model** | bigcode/starcoder2-15b-instruct-v0.1 |
121
- | **Fine-tuning Method** | LoRA (Low-Rank Adaptation) |
122
- | **Training Dataset** | [SecureCode v2.0](https://huggingface.co/datasets/scthornton/securecode-v2) |
123
- | **Dataset Size** | 841 training examples |
124
- | **Training Epochs** | 3 |
125
- | **LoRA Rank (r)** | 16 |
126
- | **LoRA Alpha** | 32 |
127
- | **Learning Rate** | 2e-4 |
128
- | **Quantization** | 4-bit (bitsandbytes) |
129
- | **Trainable Parameters** | ~78M (0.52% of 15B total) |
130
- | **Total Parameters** | 15B |
131
- | **Context Window** | 16K tokens |
132
- | **GPU Used** | NVIDIA A100 40GB |
133
- | **Training Time** | ~125 minutes (estimated) |
134
-
135
- ### Training Methodology
136
-
137
- **LoRA fine-tuning** preserves StarCoder2's exceptional multi-language capabilities:
138
- - Trains only 0.52% of parameters
139
- - Maintains SOTA code generation quality
140
- - Adds cross-language security understanding
141
- - Efficient deployment for 15B model
142
-
143
- **4-bit quantization** enables deployment on 24GB+ GPUs while maintaining quality.
144
-
145
- ---
146
-
147
- ## 🚀 Usage
148
-
149
- ### Quick Start
150
 
151
  ```python
152
- from transformers import AutoModelForCausalLM, AutoTokenizer
153
  from peft import PeftModel
154
-
155
- # Load base model
156
- base_model = "bigcode/starcoder2-15b-instruct-v0.1"
157
- model = AutoModelForCausalLM.from_pretrained(
158
- base_model,
159
- device_map="auto",
160
- torch_dtype="auto",
161
- trust_remote_code=True
162
- )
163
- tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
164
-
165
- # Load SecureCode adapter
166
- model = PeftModel.from_pretrained(model, "scthornton/starcoder2-15b-securecode")
167
-
168
- # Generate secure Solidity smart contract
169
- prompt = """### User:
170
- Write a secure ERC-20 token contract with protection against reentrancy, integer overflow, and access control vulnerabilities.
171
-
172
- ### Assistant:
173
- """
174
-
175
- inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
176
- outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.7)
177
- response = tokenizer.decode(outputs[0], skip_special_tokens=True)
178
- print(response)
179
- ```
180
-
181
- ### Multi-Language Security Analysis
182
-
183
- ```python
184
- # Analyze Rust code for memory safety issues
185
- rust_prompt = """### User:
186
- Review this Rust web server code for security vulnerabilities:
187
-
188
- ```rust
189
- use actix_web::{web, App, HttpResponse, HttpServer};
190
-
191
- async fn user_profile(user_id: web::Path<String>) -> HttpResponse {
192
- let query = format!("SELECT * FROM users WHERE id = '{}'", user_id);
193
- let result = execute_query(&query).await;
194
- HttpResponse::Ok().json(result)
195
- }
196
- ```
197
-
198
- ### Assistant:
199
- """
200
-
201
- # Analyze Kotlin Android code
202
- kotlin_prompt = """### User:
203
- Identify authentication vulnerabilities in this Kotlin Android app:
204
-
205
- ```kotlin
206
- class LoginActivity : AppCompatActivity() {
207
- fun login(username: String, password: String) {
208
- val prefs = getSharedPreferences("auth", MODE_PRIVATE)
209
- prefs.edit().putString("token", generateToken(username, password)).apply()
210
- }
211
- }
212
- ```
213
-
214
- ### Assistant:
215
- """
216
- ```
217
-
218
- ### Production Deployment (4-bit Quantization)
219
-
220
- ```python
221
  from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
222
- from peft import PeftModel
223
 
224
- # 4-bit quantization - runs on 24GB+ GPU
225
  bnb_config = BitsAndBytesConfig(
226
  load_in_4bit=True,
227
- bnb_4bit_use_double_quant=True,
228
  bnb_4bit_quant_type="nf4",
229
- bnb_4bit_compute_dtype="bfloat16"
230
  )
231
 
232
- model = AutoModelForCausalLM.from_pretrained(
233
  "bigcode/starcoder2-15b-instruct-v0.1",
234
  quantization_config=bnb_config,
235
  device_map="auto",
236
- trust_remote_code=True
237
  )
 
 
238
 
239
- model = PeftModel.from_pretrained(model, "scthornton/starcoder2-15b-securecode")
240
- tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-15b-instruct-v0.1", trust_remote_code=True)
241
- ```
242
-
243
- ---
244
-
245
- ## 🎯 Use Cases
246
-
247
- ### 1. **Web3/Blockchain Security**
248
- Analyze smart contracts across multiple chains:
249
- ```
250
- Audit this Solidity DeFi protocol for reentrancy, flash loan attacks, and access control issues
251
- ```
252
 
253
- ### 2. **Multi-Language Codebase Security**
254
- Review polyglot applications:
255
- ```
256
- Analyze this microservices app (Go backend, TypeScript frontend, Rust services) for security vulnerabilities
257
  ```
258
 
259
- ### 3. **Mobile App Security**
260
- Secure iOS and Android apps:
261
- ```
262
- Review this Swift iOS app for authentication bypass and data exposure vulnerabilities
263
- ```
264
-
265
- ### 4. **Legacy System Modernization**
266
- Secure legacy code:
267
- ```
268
- Identify security flaws in this COBOL mainframe application and provide modernization guidance
269
- ```
270
 
271
- ### 5. **Emerging Language Security**
272
- Security for new languages:
273
- ```
274
- Write a secure Zig HTTP server with memory safety and input validation
275
- ```
276
 
277
- ---
278
 
279
- ## ⚠️ Limitations
280
 
281
- ### What This Model Does Well
282
- ✅ Multi-language security analysis (600+ languages)
283
- ✅ State-of-the-art code generation
284
- ✅ Complex security reasoning
285
- ✅ Cross-language pattern recognition
286
 
287
- ### What This Model Doesn't Do
288
- ❌ Not a smart contract auditing firm
289
- Cannot guarantee bug-free code
290
- Not legal/compliance advice
291
- Not a replacement for security experts
292
 
293
- ### Resource Requirements
294
- - **Larger model** - Requires 24GB+ GPU for optimal performance
295
- - **Higher memory** - 40GB+ RAM recommended
296
- - **Longer inference** - Slower than smaller models
297
 
298
- ---
299
 
300
- ## 📈 Performance Benchmarks
301
 
302
- ### Hardware Requirements
303
 
304
- **Minimum:**
305
- - 40GB RAM
306
- - 24GB GPU VRAM (with 4-bit quantization)
307
 
308
- **Recommended:**
309
- - 64GB RAM
310
- - 40GB+ GPU (A100, RTX 6000 Ada)
311
 
312
- **Inference Speed (on A100 40GB):**
313
- - ~60 tokens/second (4-bit quantization)
314
- - ~85 tokens/second (bfloat16)
315
 
316
- ### Code Generation (Base Model Scores)
317
 
318
- | Benchmark | Score | Rank |
319
- |-----------|-------|------|
320
- | HumanEval | 72.6% | Best open-source |
321
- | MultiPL-E | 52.3% | Top 3 overall |
322
- | Long context | SOTA | #1 |
323
 
324
- ---
325
 
326
- ## 🔬 Dataset Information
327
 
328
- Trained on **[SecureCode v2.0](https://huggingface.co/datasets/scthornton/securecode-v2)**:
329
- - **1,209 examples** with real CVE grounding
330
- - **100% incident validation**
331
- - **OWASP Top 10:2025** complete coverage
332
- - **Multi-language security patterns**
333
 
334
- ---
335
 
336
- ## 📄 License
337
 
338
- **Model:** Apache 2.0 | **Dataset:** CC BY-NC-SA 4.0
339
 
340
- Powered by the **BigCode OpenRAIL-M** license commitment.
341
 
342
- ---
343
 
344
- ## 📚 Citation
345
 
346
  ```bibtex
347
- @misc{thornton2025securecode-starcoder2,
348
- title={StarCoder2 15B - SecureCode Edition},
349
  author={Thornton, Scott},
350
- year={2025},
351
  publisher={perfecXion.ai},
352
- url={https://huggingface.co/scthornton/starcoder2-15b-securecode}
 
353
  }
354
  ```
355
 
356
- ---
357
-
358
- ## 🙏 Acknowledgments
359
 
360
- - **BigCode Project** (ServiceNow + Hugging Face) for StarCoder2
361
- - **The Stack v2** contributors for dataset curation
362
- - **OWASP Foundation** for vulnerability taxonomy
363
- - **Web3 security community** for blockchain vulnerability research
364
 
365
- ---
366
-
367
- ## 🔗 Related Models
368
-
369
- - **[llama-3.2-3b-securecode](https://huggingface.co/scthornton/llama-3.2-3b-securecode)** - Most accessible (3B)
370
- - **[qwen-coder-7b-securecode](https://huggingface.co/scthornton/qwen-coder-7b-securecode)** - Best code model (7B)
371
- - **[deepseek-coder-6.7b-securecode](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode)** - Security-optimized (6.7B)
372
- - **[codellama-13b-securecode](https://huggingface.co/scthornton/codellama-13b-securecode)** - Enterprise trusted (13B)
373
-
374
- [View Collection](https://huggingface.co/collections/scthornton/securecode)
375
-
376
- ---
377
 
378
- <div align="center">
379
-
380
- **Built with ❤️ for secure multi-language software development**
381
-
382
- [perfecXion.ai](https://perfecxion.ai) | [Contact](mailto:scott@perfecxion.ai)
383
-
384
- </div>
 
1
  ---
2
+ license: bigcode-openrail-m
3
  base_model: bigcode/starcoder2-15b-instruct-v0.1
4
  tags:
5
+ - security
6
+ - cybersecurity
7
+ - secure-coding
8
+ - ai-security
9
+ - owasp
10
+ - code-generation
11
+ - qlora
12
+ - lora
13
+ - fine-tuned
14
+ - securecode
15
  datasets:
16
+ - scthornton/securecode
17
+ library_name: peft
 
 
18
  pipeline_tag: text-generation
19
+ language:
20
+ - code
21
+ - en
22
  ---
23
 
24
+ # StarCoder2 15B SecureCode
25
 
26
  <div align="center">
27
 
28
+ ![Parameters](https://img.shields.io/badge/params-15B-blue.svg)
29
+ ![Dataset](https://img.shields.io/badge/dataset-2,185_examples-green.svg)
30
+ ![OWASP](https://img.shields.io/badge/OWASP-Top_10_2021_+_LLM_Top_10_2025-orange.svg)
31
+ ![Method](https://img.shields.io/badge/method-QLoRA_4--bit-purple.svg)
32
 
33
+ **Security-specialized code model fine-tuned on the [SecureCode](https://huggingface.co/datasets/scthornton/securecode) dataset**
34
 
35
+ [Dataset](https://huggingface.co/datasets/scthornton/securecode) | [Paper (arXiv:2512.18542)](https://arxiv.org/abs/2512.18542) | [Model Collection](https://huggingface.co/collections/scthornton/securecode) | [perfecXion.ai](https://perfecxion.ai)
36
 
37
  </div>
38
 
39
  ---
40
 
41
+ ## What This Model Does
42
 
43
+ This model generates **secure code** when developers ask how to build a feature. AI coding assistants produce vulnerable code in about 45% of security-relevant scenarios (Veracode 2025); instead of repeating those patterns, this model:
44
 
45
+ - Identifies the security risks in common coding patterns
46
+ - Provides vulnerable *and* secure implementations side by side
47
+ - Explains how attackers would exploit the vulnerability
48
+ - Includes defense-in-depth guidance: logging, monitoring, SIEM integration, infrastructure hardening
49
 
50
+ The model was fine-tuned on **2,185 security training examples** covering both traditional web security (OWASP Top 10 2021) and AI/ML security (OWASP LLM Top 10 2025).
51
 
52
+ ## Model Details
53
 
54
+ | | |
55
+ |---|---|
56
+ | **Base Model** | [StarCoder2 15B Instruct](https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1) |
57
+ | **Parameters** | 15B |
58
+ | **Architecture** | StarCoder2 |
59
+ | **Tier** | Tier 3: Large Model |
60
+ | **Method** | QLoRA (4-bit NormalFloat quantization) |
61
+ | **LoRA Rank** | 16 (alpha=32) |
62
+ | **Target Modules** | `q_proj, k_proj, v_proj, o_proj` (4 modules) |
63
+ | **Training Data** | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) (2,185 examples) |
64
+ | **Hardware** | NVIDIA A100 40GB |
65
 
66
+ The base model is BigCode's flagship code model, trained on The Stack v2. It offers broad language coverage and strong code understanding.
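
As a rough sanity check on why 4-bit loading matters for a 15B model, the sketch below estimates the memory taken by the weights alone at two precisions. The parameter count (15B) and the 4-bit setting come from the table above; the helper name `weight_memory_gb` is hypothetical, and the estimate ignores quantization constants, activations, the KV cache, and framework overhead, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal size from the model card; the exact parameter count may differ.
NUM_PARAMS = 15e9

for label, bits in [("bf16", 16), ("nf4", 4)]:
    print(f"{label}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the bf16 footprint, which leaves headroom on a 40GB card for adapter gradients and optimizer state.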
67
 
68
+ ## Quick Start
69
 
70
  ```python
 
71
  from peft import PeftModel
72
  from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
73
+ import torch
74
 
75
+ # Load with 4-bit quantization (matches training)
76
  bnb_config = BitsAndBytesConfig(
77
  load_in_4bit=True,
 
78
  bnb_4bit_quant_type="nf4",
79
+ bnb_4bit_compute_dtype=torch.bfloat16,
80
  )
81
 
82
+ base_model = AutoModelForCausalLM.from_pretrained(
83
  "bigcode/starcoder2-15b-instruct-v0.1",
84
  quantization_config=bnb_config,
85
  device_map="auto",
 
86
  )
87
+ tokenizer = AutoTokenizer.from_pretrained("scthornton/starcoder2-15b-securecode")
88
+ model = PeftModel.from_pretrained(base_model, "scthornton/starcoder2-15b-securecode")
89
 
90
+ # Ask a security-relevant coding question
91
+ messages = [
92
+ {"role": "user", "content": "How do I implement JWT authentication with refresh tokens in Python?"}
93
+ ]
94
 
95
+ inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
96
+ outputs = model.generate(inputs, max_new_tokens=2048, do_sample=True, temperature=0.7)
97
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 
98
  ```
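
`generate` returns the prompt tokens followed by the newly generated ones, so decoding `outputs[0]` as above prints the question along with the answer. A minimal sketch of trimming the prompt is shown below on plain Python lists so it runs without a GPU. `strip_prompt` is a hypothetical helper; with real tensors the equivalent slice would be `outputs[0][inputs.shape[-1]:]`, assuming `inputs` is the tensor of prompt token IDs.

```python
from typing import List, Sequence


def strip_prompt(output_ids: Sequence[int], prompt_len: int) -> List[int]:
    """Return only the token IDs generated after the prompt."""
    if prompt_len < 0 or prompt_len > len(output_ids):
        raise ValueError("prompt_len must be between 0 and len(output_ids)")
    return list(output_ids[prompt_len:])


# Toy IDs standing in for real token IDs: 3 prompt tokens, 2 generated tokens.
prompt_ids = [101, 7592, 102]
output_ids = prompt_ids + [2023, 2003]
print(strip_prompt(output_ids, len(prompt_ids)))
```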
99
 
100
+ ## Training Details
101
 
102
+ ### Dataset
103
 
104
+ Trained on the full **[SecureCode](https://huggingface.co/datasets/scthornton/securecode)** unified dataset:
105
 
106
+ - **2,185 total examples** (1,435 web security + 750 AI/ML security)
107
+ - **20 vulnerability categories** across OWASP Top 10 2021 and OWASP LLM Top 10 2025
108
+ - **12+ programming languages** and **49+ frameworks**
109
+ - **4-turn conversational structure**: feature request, vulnerable/secure implementations, advanced probing, operational guidance
110
+ - **100% incident grounding**: every example tied to real CVEs, vendor advisories, or published attack research
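
To make the 4-turn structure concrete, the sketch below models one example as a list of role/content messages and checks that it alternates user and assistant turns in the order described above. The field names (`role`, `content`), the placeholder texts, and the helper `is_four_turn` are illustrative assumptions, not the dataset's actual schema; check the dataset card for the real format.

```python
from typing import Dict, List

EXPECTED_ROLES = ["user", "assistant", "user", "assistant"]


def is_four_turn(messages: List[Dict[str, str]]) -> bool:
    """True if there are exactly four non-empty turns alternating user/assistant."""
    if len(messages) != len(EXPECTED_ROLES):
        return False
    return all(
        m.get("role") == role and bool(m.get("content", "").strip())
        for m, role in zip(messages, EXPECTED_ROLES)
    )


# Placeholder content only; each string names the purpose of that turn.
example = [
    {"role": "user", "content": "Feature request: how do I add a login endpoint?"},
    {"role": "assistant", "content": "Vulnerable and secure implementations, side by side."},
    {"role": "user", "content": "Advanced probing: what if an attacker replays tokens?"},
    {"role": "assistant", "content": "Operational guidance: logging, monitoring, hardening."},
]
print(is_four_turn(example))
```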
111
 
112
+ ### Hyperparameters
113
 
114
+ | Parameter | Value |
115
+ |-----------|-------|
116
+ | LoRA rank | 16 |
117
+ | LoRA alpha | 32 |
118
+ | LoRA dropout | 0.05 |
119
+ | Target modules | 4 linear layers |
120
+ | Quantization | 4-bit NormalFloat (NF4) |
121
+ | Learning rate | 2e-4 |
122
+ | LR scheduler | Cosine with 100-step warmup |
123
+ | Epochs | 3 |
124
+ | Per-device batch size | 1 |
125
+ | Gradient accumulation | 16x |
126
+ | Effective batch size | 16 |
127
+ | Max sequence length | 4096 tokens |
128
+ | Optimizer | paged_adamw_8bit |
129
+ | Precision | bf16 |
130
 
131
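+ **Notes:** The LoRA adapter is kept compact by targeting only the four attention projection layers (`q_proj`, `k_proj`, `v_proj`, `o_proj`), which keeps training within the tight A100 40GB memory budget.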
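
The batch-size and LoRA rows in the hyperparameter table above are related by simple arithmetic, sketched below. The per-device batch size, accumulation factor, epoch count, warmup length, rank, and alpha come from the table. The single-GPU setup and the use of all 2,185 examples for training (no held-out split) are assumptions, and trainers differ in how they round a partial final batch, so treat the step counts as estimates.

```python
import math

per_device_batch = 1
grad_accum = 16
num_gpus = 1          # assumption: one A100
num_examples = 2185   # assumption: the full dataset is used for training
epochs = 3
warmup_steps = 100
lora_rank = 16
lora_alpha = 32

# Effective batch size = per-device batch x accumulation steps x devices.
effective_batch = per_device_batch * grad_accum * num_gpus

# Optimizer steps, counting a partial final batch as one step per epoch.
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs

# Standard LoRA scales the low-rank update by alpha / rank.
lora_scaling = lora_alpha / lora_rank

print(f"effective batch size: {effective_batch}")
print(f"estimated optimizer steps: {total_steps}")
print(f"warmup share of training: {warmup_steps / total_steps:.0%}")
print(f"LoRA scaling factor: {lora_scaling}")
```

Under these assumptions the 100-step warmup covers a sizeable share of the run, which is worth keeping in mind if the schedule is reused on a smaller dataset.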
132
 
133
+ ## Security Coverage
134
 
135
+ ### Web Security (1,435 examples)
136
 
137
+ OWASP Top 10 2021: Broken Access Control, Cryptographic Failures, Injection, Insecure Design, Security Misconfiguration, Vulnerable Components, Authentication Failures, Software Integrity Failures, Logging/Monitoring Failures, SSRF.
138
 
139
+ Languages: Python, JavaScript, Java, Go, PHP, C#, TypeScript, Ruby, Rust, Kotlin, YAML.
 
 
140
 
141
+ ### AI/ML Security (750 examples)
 
 
142
 
143
+ OWASP LLM Top 10 2025: Prompt Injection, Sensitive Information Disclosure, Supply Chain Vulnerabilities, Data/Model Poisoning, Improper Output Handling, Excessive Agency, System Prompt Leakage, Vector/Embedding Weaknesses, Misinformation, Unbounded Consumption.
 
 
144
 
145
+ Frameworks: LangChain, OpenAI, Anthropic, HuggingFace, LlamaIndex, ChromaDB, Pinecone, FastAPI, Flask, vLLM, CrewAI, and 30+ more.
146
 
147
+ ## SecureCode Model Collection
148
 
149
+ This model is part of the **SecureCode** collection of 8 security-specialized models:
150
 
151
+ | Model | Base | Size | Tier | HuggingFace |
152
+ |-------|------|------|------|-------------|
153
+ | Llama 3.2 SecureCode | meta-llama/Llama-3.2-3B-Instruct | 3B | Accessible | [`llama-3.2-3b-securecode`](https://huggingface.co/scthornton/llama-3.2-3b-securecode) |
154
+ | Qwen2.5 Coder SecureCode | Qwen/Qwen2.5-Coder-7B-Instruct | 7B | Mid-size | [`qwen2.5-coder-7b-securecode`](https://huggingface.co/scthornton/qwen2.5-coder-7b-securecode) |
155
+ | DeepSeek Coder SecureCode | deepseek-ai/deepseek-coder-6.7b-instruct | 6.7B | Mid-size | [`deepseek-coder-6.7b-securecode`](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode) |
156
+ | CodeGemma SecureCode | google/codegemma-7b-it | 7B | Mid-size | [`codegemma-7b-securecode`](https://huggingface.co/scthornton/codegemma-7b-securecode) |
157
+ | CodeLlama SecureCode | codellama/CodeLlama-13b-Instruct-hf | 13B | Large | [`codellama-13b-securecode`](https://huggingface.co/scthornton/codellama-13b-securecode) |
158
+ | Qwen2.5 Coder 14B SecureCode | Qwen/Qwen2.5-Coder-14B-Instruct | 14B | Large | [`qwen2.5-coder-14b-securecode`](https://huggingface.co/scthornton/qwen2.5-coder-14b-securecode) |
159
+ | StarCoder2 SecureCode | bigcode/starcoder2-15b-instruct-v0.1 | 15B | Large | [`starcoder2-15b-securecode`](https://huggingface.co/scthornton/starcoder2-15b-securecode) |
160
+ | Granite 20B Code SecureCode | ibm-granite/granite-20b-code-instruct-8k | 20B | XL | [`granite-20b-code-securecode`](https://huggingface.co/scthornton/granite-20b-code-securecode) |
161
 
162
+ Choose based on your deployment constraints: **3B** for edge/mobile, **7B** for general use, **13B-15B** for deeper reasoning, **20B** for maximum capability.
163
 
164
+ ## SecureCode Dataset Family
165
 
166
+ | Dataset | Examples | Focus | Link |
167
+ |---------|----------|-------|------|
168
+ | **SecureCode** | 2,185 | Unified (web + AI/ML) | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) |
169
+ | SecureCode Web | 1,435 | Web security (OWASP Top 10 2021) | [scthornton/securecode-web](https://huggingface.co/datasets/scthornton/securecode-web) |
170
+ | SecureCode AI/ML | 750 | AI/ML security (OWASP LLM Top 10 2025) | [scthornton/securecode-aiml](https://huggingface.co/datasets/scthornton/securecode-aiml) |
171
 
172
+ ## Intended Use
173
 
174
+ **Use this model for:**
175
+ - Training AI coding assistants to write secure code
176
+ - Security education and training
177
+ - Vulnerability research and secure code review
178
+ - Building security-aware development tools
179
 
180
+ **Do not use this model for:**
181
+ - Offensive exploitation or automated attack generation
182
+ - Circumventing security controls
183
+ - Any activity that violates the base model's license
184
 
185
+ ## Citation
186
 
187
  ```bibtex
188
+ @misc{thornton2026securecode,
189
+ title={SecureCode: A Production-Grade Multi-Turn Dataset for Training Security-Aware Code Generation Models},
190
  author={Thornton, Scott},
191
+ year={2026},
192
  publisher={perfecXion.ai},
193
+ url={https://huggingface.co/datasets/scthornton/securecode},
194
+ note={arXiv:2512.18542}
195
  }
196
  ```
197
 
198
+ ## Links
 
 
199
 
200
+ - **Dataset**: [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode)
201
+ - **Research Paper**: [arXiv:2512.18542](https://arxiv.org/abs/2512.18542)
202
+ - **Model Collection**: [huggingface.co/collections/scthornton/securecode](https://huggingface.co/collections/scthornton/securecode)
203
+ - **Author**: [perfecXion.ai](https://perfecxion.ai)
204
 
205
+ ## License
206
 
207
+ This model is released under the **bigcode-openrail-m** license (inherited from the base model). The training dataset ([SecureCode](https://huggingface.co/datasets/scthornton/securecode)) is licensed under **CC BY-NC-SA 4.0**.