	---
	license: apache-2.0
	base_model: bigcode/starcoder2-15b-instruct-v0.1
	tags:
	- code
	- security
	- starcoder
	- bigcode
	- securecode
	- owasp
	- vulnerability-detection
	datasets:
	- scthornton/securecode-v2
	language:
	- en
	library_name: transformers
	pipeline_tag: text-generation
	arxiv: 2512.18542
	---

	# StarCoder2 15B - SecureCode Edition

	<div align="center">

	[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
	[![Training Dataset](https://img.shields.io/badge/dataset-SecureCode%20v2.0-green.svg)](https://huggingface.co/datasets/scthornton/securecode-v2)
	[![Base Model](https://img.shields.io/badge/base-StarCoder2%2015B-orange.svg)](https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1)
	[![perfecXion.ai](https://img.shields.io/badge/by-perfecXion.ai-purple.svg)](https://perfecxion.ai)

	The most powerful multi-language security model - 600+ programming languages

	[📄 Paper](https://arxiv.org/abs/2512.18542) | [🤗 Model Card](https://huggingface.co/scthornton/starcoder2-15b-securecode) | [📊 Dataset](https://huggingface.co/datasets/scthornton/securecode-v2) | [💻 perfecXion.ai](https://perfecxion.ai)

	</div>

	---

	## 🎯 What is This?

	This is StarCoder2 15B Instruct fine-tuned on the SecureCode v2.0 dataset. The base model was trained on 4 trillion tokens across 600+ programming languages; this fine-tune adds production-grade security knowledge on top of that coverage.

	StarCoder2 represents the cutting edge of open-source code generation, developed by BigCode (ServiceNow + Hugging Face). Combined with SecureCode training, this model delivers:

	- ✅ Unprecedented language coverage - Security awareness across 600+ languages
	- ✅ Strong code generation - 72.6% pass@1 on HumanEval for the base model
	- ✅ Complex security reasoning - 15B parameters for sophisticated vulnerability analysis
	- ✅ Production-ready quality - Trained on The Stack v2 with rigorous data curation

	The Result: The most powerful and versatile security-aware code model in the SecureCode collection.

	Why StarCoder2 15B? This model offers:
	- 🌍 600+ languages - From mainstream to niche (Solidity, Kotlin, Swift, Haskell, etc.)
	- 🏆 SOTA performance - 72.6% pass@1 on HumanEval, the best open-source result at release
	- 🧠 Complex reasoning - 15B parameters for sophisticated security analysis
	- 🔬 Research-grade - Built on The Stack v2 with extensive curation
	- 🌟 Community-driven - BigCode initiative backed by ServiceNow + HuggingFace

	---

	## 🚨 The Problem This Solves

	AI coding assistants produce vulnerable code in 45% of security-relevant scenarios (Veracode 2025). For organizations using diverse tech stacks, this problem multiplies across dozens of languages and frameworks.

	Multi-language security challenges:
	- Solidity smart contracts: $3+ billion stolen in Web3 exploits (2021-2024)
	- Mobile apps (Kotlin/Swift): Frequent authentication bypass vulnerabilities
	- Legacy systems (COBOL/Fortran): Undocumented security flaws
	- Emerging languages (Rust/Zig): New security patterns needed

	StarCoder2 SecureCode Edition addresses security across the entire programming language spectrum.

	---

	## 💡 Key Features

	### 🌍 Unmatched Language Coverage

	StarCoder2 15B was trained on 600+ programming languages:
	- Mainstream: Python, JavaScript, Java, C++, Go, Rust
	- Web3: Solidity, Vyper, Cairo, Move
	- Mobile: Kotlin, Swift, Dart
	- Systems: C, Rust, Zig, Assembly
	- Functional: Haskell, OCaml, Scala, Elixir
	- Legacy: COBOL, Fortran, Pascal
	- And 580+ more...

	Now fine-tuned on SecureCode v2.0, a dataset of 1,209 security-focused examples (841 of them used for training) covering OWASP Top 10:2025.

	### 🏆 State-of-the-Art Performance

	StarCoder2 15B delivers cutting-edge results:
	- HumanEval: 72.6% pass@1 (best open-source at release)
	- MultiPL-E: 52.3% average across languages
	- Leading performance on long-context code tasks
	- Trained on The Stack v2 (4T tokens)

	### 🔐 Comprehensive Security Training

	Trained on real-world security incidents:
	- 224 examples of Broken Access Control
	- 199 examples of Authentication Failures
	- 125 examples of Injection attacks
	- 115 examples of Cryptographic Failures
	- Complete OWASP Top 10:2025 coverage
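
To put the category counts above in proportion, the short sketch below computes each listed category's share of the dataset. It assumes the four counts refer to the full 1,209-example dataset rather than the 841-example training split, which the card does not state explicitly; the variable and function names are illustrative.

```python
# Category counts as listed in this model card.
LISTED_COUNTS = {
    "Broken Access Control": 224,
    "Authentication Failures": 199,
    "Injection": 125,
    "Cryptographic Failures": 115,
}

# Assumption: the counts are relative to the full dataset size quoted in the card.
TOTAL_EXAMPLES = 1209


def category_shares(counts, total):
    """Return each category's share of the dataset, in percent (one decimal)."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


shares = category_shares(LISTED_COUNTS, TOTAL_EXAMPLES)
listed_total = sum(LISTED_COUNTS.values())
remaining = TOTAL_EXAMPLES - listed_total

for name, pct in shares.items():
    print(f"{name}: {pct}%")
print(f"Listed categories: {listed_total} examples; other categories: {remaining}")
```

Under that assumption, the four listed categories account for a little over half of the dataset, with the rest spread across the remaining OWASP Top 10:2025 categories.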

	### 📋 Advanced Security Analysis

	The model is trained to structure its responses around:
	1. Multi-language vulnerability patterns
	2. Secure implementations with language-specific best practices
	3. Attack demonstrations with realistic exploits
	4. Cross-language security guidance - patterns that apply across languages

	---

	## 📊 Training Details

	| Parameter | Value |
	|-----------|-------|
	| Base Model | bigcode/starcoder2-15b-instruct-v0.1 |
	| Fine-tuning Method | LoRA (Low-Rank Adaptation) |
	| Training Dataset | [SecureCode v2.0](https://huggingface.co/datasets/scthornton/securecode-v2) |
	| Dataset Size | 841 training examples |
	| Training Epochs | 3 |
	| LoRA Rank (r) | 16 |
	| LoRA Alpha | 32 |
	| Learning Rate | 2e-4 |
	| Quantization | 4-bit (bitsandbytes) |
	| Trainable Parameters | ~78M (0.52% of 15B total) |
	| Total Parameters | 15B |
	| Context Window | 16K tokens |
	| GPU Used | NVIDIA A100 40GB |
	| Training Time | ~125 minutes (estimated) |

	### Training Methodology

	LoRA fine-tuning preserves StarCoder2's exceptional multi-language capabilities:
	- Trains only 0.52% of parameters
	- Maintains SOTA code generation quality
	- Adds cross-language security understanding
	- Efficient deployment for 15B model

	4-bit quantization enables deployment on 24GB+ GPUs while maintaining quality.
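
The LoRA figures in the table can be checked with a few lines of arithmetic. The sketch below reproduces the 0.52% trainable fraction and the `alpha / r` scaling factor from the values quoted above. The 4096 x 4096 layer shape is a hypothetical example chosen only to show how the adapter size scales, not the actual StarCoder2 layer dimensions.

```python
def lora_matrix_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_scaling(alpha, rank):
    """Factor applied to the low-rank update before it is added to the frozen weight."""
    return alpha / rank


def trainable_percent(trainable, total):
    """Share of parameters that receive gradient updates, in percent."""
    return 100 * trainable / total


# Values taken from the training table in this card.
scaling = lora_scaling(alpha=32, rank=16)            # 2.0
pct = trainable_percent(trainable=78e6, total=15e9)  # about 0.52

# Hypothetical 4096 x 4096 projection, used purely for illustration.
adapter = lora_matrix_params(4096, 4096, rank=16)
full = 4096 * 4096

print(f"LoRA scaling factor: {scaling}")
print(f"Trainable share: {pct:.2f}%")
print(f"Adapter params for one hypothetical layer: {adapter} vs {full} full-rank")
```

The rank-16 adapter for that example layer holds well under 1% of the layer's full parameter count, which is why the overall trainable share stays so small.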

	---

	## 🚀 Usage

	### Quick Start

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from peft import PeftModel

	# Load base model
	base_model = "bigcode/starcoder2-15b-instruct-v0.1"
	model = AutoModelForCausalLM.from_pretrained(
	    base_model,
	    device_map="auto",
	    torch_dtype="auto",
	    trust_remote_code=True,
	)
	tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

	# Load SecureCode adapter
	model = PeftModel.from_pretrained(model, "scthornton/starcoder2-15b-securecode")

	# Generate secure Solidity smart contract
	prompt = """### User:
	Write a secure ERC-20 token contract with protection against reentrancy, integer overflow, and access control vulnerabilities.

	### Assistant:
	"""

	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	# do_sample=True is required for temperature to take effect
	outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.7)
	response = tokenizer.decode(outputs[0], skip_special_tokens=True)
	print(response)
	```
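
The Quick Start prompt follows a simple `### User:` / `### Assistant:` layout, and the decoded output contains the prompt followed by the completion. A minimal sketch of two helpers for that pattern is shown below. `build_prompt` and `extract_response` are hypothetical names, not part of `transformers` or `peft`, and the template is inferred from the example prompt above rather than from a published chat template.

```python
# Template inferred from the Quick Start example; treat it as an assumption.
PROMPT_TEMPLATE = "### User:\n{message}\n\n### Assistant:\n"
ASSISTANT_MARKER = "### Assistant:"


def build_prompt(message: str) -> str:
    """Wrap a user request in the prompt layout used in the Quick Start."""
    return PROMPT_TEMPLATE.format(message=message.strip())


def extract_response(decoded: str, prompt: str) -> str:
    """Return only the generated text from a decoded sequence that includes the prompt."""
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    # Fallback: decoding may alter whitespace, so split on the last assistant marker.
    _, found, tail = decoded.rpartition(ASSISTANT_MARKER)
    return tail.strip() if found else decoded.strip()


prompt = build_prompt("Write a secure password hashing function in Go.")
print(prompt)
```

With the Quick Start variables in scope, `extract_response(response, prompt)` would strip the echoed prompt before the answer is displayed or logged.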

	### Multi-Language Security Analysis

	````python
	# Review Rust code for security vulnerabilities (the handler builds SQL by string formatting)
	rust_prompt = """### User:
	Review this Rust web server code for security vulnerabilities:

	```rust
	use actix_web::{web, App, HttpResponse, HttpServer};

	async fn user_profile(user_id: web::Path<String>) -> HttpResponse {
	    let query = format!("SELECT * FROM users WHERE id = '{}'", user_id);
	    let result = execute_query(&query).await;
	    HttpResponse::Ok().json(result)
	}
	```

	### Assistant:
	"""

	# Analyze Kotlin Android code
	kotlin_prompt = """### User:
	Identify authentication vulnerabilities in this Kotlin Android app:

	```kotlin
	class LoginActivity : AppCompatActivity() {
	    fun login(username: String, password: String) {
	        val prefs = getSharedPreferences("auth", MODE_PRIVATE)
	        prefs.edit().putString("token", generateToken(username, password)).apply()
	    }
	}
	```

	### Assistant:
	"""

	# Pass either prompt through tokenizer and model.generate as in the Quick Start
	````
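
The Rust handler in the first prompt builds its SQL statement with `format!`, which is the injection pattern a reviewer would be expected to flag. As a language-neutral illustration (not the model's output), the same flaw and the usual fix look like this in Python with the standard-library `sqlite3` module. The `users` table, the helper names, and the payload are hypothetical.

```python
import sqlite3


def find_user_unsafe(conn, user_id):
    """Vulnerable: user input is spliced directly into the SQL text."""
    query = f"SELECT id, name FROM users WHERE id = '{user_id}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, user_id):
    """Parameterized: the driver passes user input as data, never as SQL."""
    return conn.execute(
        "SELECT id, name FROM users WHERE id = ?", (user_id,)
    ).fetchall()


# In-memory database with two hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("1", "alice"), ("2", "bob")])

payload = "x' OR '1'='1"
leaked = find_user_unsafe(conn, payload)   # the OR clause matches every row
blocked = find_user_safe(conn, payload)    # no user has this literal id

print(f"unsafe query returned {len(leaked)} rows, safe query returned {len(blocked)}")
```

The equivalent fix in the Rust handler would be to bind `user_id` as a query parameter instead of formatting it into the string.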

	### Production Deployment (4-bit Quantization)

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
	from peft import PeftModel

	# 4-bit quantization - runs on 24GB+ GPU
	bnb_config = BitsAndBytesConfig(
	    load_in_4bit=True,
	    bnb_4bit_use_double_quant=True,
	    bnb_4bit_quant_type="nf4",
	    bnb_4bit_compute_dtype="bfloat16",
	)

	model = AutoModelForCausalLM.from_pretrained(
	    "bigcode/starcoder2-15b-instruct-v0.1",
	    quantization_config=bnb_config,
	    device_map="auto",
	    trust_remote_code=True,
	)

	model = PeftModel.from_pretrained(model, "scthornton/starcoder2-15b-securecode")
	tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-15b-instruct-v0.1", trust_remote_code=True)
	```

	---

	## 🎯 Use Cases

	### 1. Web3/Blockchain Security
	Analyze smart contracts across multiple chains:
	```
	Audit this Solidity DeFi protocol for reentrancy, flash loan attacks, and access control issues
	```

	### 2. Multi-Language Codebase Security
	Review polyglot applications:
	```
	Analyze this microservices app (Go backend, TypeScript frontend, Rust services) for security vulnerabilities
	```

	### 3. Mobile App Security
	Secure iOS and Android apps:
	```
	Review this Swift iOS app for authentication bypass and data exposure vulnerabilities
	```

	### 4. Legacy System Modernization
	Secure legacy code:
	```
	Identify security flaws in this COBOL mainframe application and provide modernization guidance
	```

	### 5. Emerging Language Security
	Security for new languages:
	```
	Write a secure Zig HTTP server with memory safety and input validation
	```

	---

	## ⚠️ Limitations

	### What This Model Does Well
	- ✅ Multi-language security analysis (600+ languages)
	- ✅ State-of-the-art code generation
	- ✅ Complex security reasoning
	- ✅ Cross-language pattern recognition

	### What This Model Doesn't Do
	- ❌ Not a smart contract auditing firm
	- ❌ Cannot guarantee bug-free code
	- ❌ Not legal/compliance advice
	- ❌ Not a replacement for security experts

	### Resource Requirements
	- Larger model - Requires 24GB+ GPU for optimal performance
	- Higher memory - 40GB+ RAM recommended
	- Longer inference - Slower than smaller models

	---

	## 📈 Performance Benchmarks

	### Hardware Requirements

	Minimum:
	- 40GB RAM
	- 24GB GPU VRAM (with 4-bit quantization)

	Recommended:
	- 64GB RAM
	- 40GB+ GPU (A100, RTX 6000 Ada)

	Inference Speed (on A100 40GB):
	- ~60 tokens/second (4-bit quantization)
	- ~85 tokens/second (bfloat16)
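
The figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates weight memory for a 15B-parameter model at different precisions and the time to generate a full 2,048-token response at the quoted throughputs. It counts weights only: activations, the KV cache, quantization constants, and framework overhead come on top, which is presumably why 24GB+ is recommended even for 4-bit. The function names are illustrative.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


def generation_seconds(new_tokens, tokens_per_second):
    """Time to generate a response at a steady decoding throughput."""
    return new_tokens / tokens_per_second


NUM_PARAMS = 15e9  # total parameters quoted in this card

weights_4bit = weight_memory_gb(NUM_PARAMS, 4)    # 7.5 GB
weights_bf16 = weight_memory_gb(NUM_PARAMS, 16)   # 30.0 GB

# Throughput figures quoted above for an A100 40GB.
secs_4bit = generation_seconds(2048, 60)
secs_bf16 = generation_seconds(2048, 85)

print(f"4-bit weights: {weights_4bit:.1f} GB, bfloat16 weights: {weights_bf16:.1f} GB")
print(f"2048 new tokens: {secs_4bit:.0f} s at 60 tok/s, {secs_bf16:.0f} s at 85 tok/s")
```

At these rates a maximum-length response takes roughly half a minute, so shorter `max_new_tokens` values are worth considering for interactive use.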

	### Code Generation (Base Model Scores)

	| Benchmark | Score | Rank |
	|-----------|-------|------|
	| HumanEval | 72.6% | Best open-source |
	| MultiPL-E | 52.3% | Top 3 overall |
	| Long context | SOTA | #1 |

	---

	## 🔬 Dataset Information

	Trained on [SecureCode v2.0](https://huggingface.co/datasets/scthornton/securecode-v2):
	- 1,209 examples with real CVE grounding
	- 100% incident validation
	- OWASP Top 10:2025 complete coverage
	- Multi-language security patterns

	---

	## 📄 License

	Model: Apache 2.0 | Dataset: CC BY-NC-SA 4.0

	The StarCoder2 base model is distributed under the BigCode OpenRAIL-M license.

	---

	## 📚 Citation

	```bibtex
	@misc{thornton2025securecode-starcoder2,
	  title={StarCoder2 15B - SecureCode Edition},
	  author={Thornton, Scott},
	  year={2025},
	  publisher={perfecXion.ai},
	  url={https://huggingface.co/scthornton/starcoder2-15b-securecode}
	}
	```

	---

	## 🙏 Acknowledgments

	- BigCode Project (ServiceNow + Hugging Face) for StarCoder2
	- The Stack v2 contributors for dataset curation
	- OWASP Foundation for vulnerability taxonomy
	- Web3 security community for blockchain vulnerability research

	---

	## 🔗 Related Models

	- [llama-3.2-3b-securecode](https://huggingface.co/scthornton/llama-3.2-3b-securecode) - Most accessible (3B)
	- [qwen-coder-7b-securecode](https://huggingface.co/scthornton/qwen-coder-7b-securecode) - Best code model (7B)
	- [deepseek-coder-6.7b-securecode](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode) - Security-optimized (6.7B)
	- [codellama-13b-securecode](https://huggingface.co/scthornton/codellama-13b-securecode) - Enterprise trusted (13B)

	[View Collection](https://huggingface.co/collections/scthornton/securecode)

	---

	<div align="center">

	Built with ❤️ for secure multi-language software development

	[perfecXion.ai](https://perfecxion.ai) | [Contact](mailto:scott@perfecxion.ai)

	</div>