# Cloud Log Classifier using CodeBERT
|
|
This project implements a cloud platform log classifier using a fine-tuned CodeBERT model. It can classify logs from **AWS**, **Azure**, and **GCP** with high accuracy.
|
|
The model was fine-tuned on a dataset of simulated cloud logs using `microsoft/codebert-base-mlm` as the base model.
|
|
## Project Structure
|
|
```
.
├── Cloud_Classifier_using_codebert.ipynb   # Jupyter Notebook containing training and evaluation code
├── cloud-log-classifier-final/             # Saved model directory (generated after training)
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── tokenizer_config.json
│   └── ...
├── cloud-log-classifier-final.zip          # Zipped model for distribution
└── README.md                               # This file
```
|
|
## Usage
|
|
You can use the fine-tuned model in your own projects via the `CloudLogClassifier` class below.
|
|
### Prerequisites
|
|
```bash
pip install torch transformers scikit-learn numpy
```
|
|
### Python Inference Code
|
|
Save the following code as `classifier.py`, or use it directly in your Python scripts. Make sure the unzipped `cloud-log-classifier-final` folder is in your working directory.
|
|
```python
import torch
import json
from transformers import RobertaForSequenceClassification, RobertaTokenizer

class CloudLogClassifier:
    """
    Reusable classifier for cloud platform detection from logs.
    """

    def __init__(self, model_path):
        """
        Load the fine-tuned model and tokenizer.

        Args:
            model_path (str): Path to the directory containing the saved model files
        """
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        try:
            self.model = RobertaForSequenceClassification.from_pretrained(model_path)
            self.tokenizer = RobertaTokenizer.from_pretrained(model_path)
            self.model.to(self.device)
            self.model.eval()

            # Load label mapping (JSON keys are strings, so convert them back to ints)
            with open(f'{model_path}/label_mapping.json', 'r') as f:
                mappings = json.load(f)
            self.id2label = {int(k): v for k, v in mappings['id2label'].items()}

        except Exception as e:
            raise RuntimeError(f"Failed to load model from {model_path}: {e}")

    def predict(self, log_text):
        """
        Predict the cloud platform from log text.

        Args:
            log_text (str): Log text to classify

        Returns:
            dict: Prediction results with label and confidence
        """
        # Tokenize input
        inputs = self.tokenizer(
            log_text,
            return_tensors='pt',
            truncation=True,
            padding='max_length',
            max_length=128
        ).to(self.device)

        # Get prediction
        with torch.no_grad():
            outputs = self.model(**inputs)
            logits = outputs.logits
            probabilities = torch.nn.functional.softmax(logits, dim=-1)
            predicted_class = torch.argmax(probabilities, dim=-1).item()
            confidence = probabilities[0][predicted_class].item()

        return {
            'platform': self.id2label[predicted_class],
            'confidence': confidence,
            'all_probabilities': {
                self.id2label[i]: prob.item()
                for i, prob in enumerate(probabilities[0])
            }
        }

# Usage Example
if __name__ == "__main__":
    # Path to your unzipped model folder
    model_path = './cloud-log-classifier-final'

    try:
        classifier = CloudLogClassifier(model_path)

        test_logs = [
            "[ 3.6936] ena 0000:00:05.0: Elastic Network Adapter (ENA)",
            "AzureLinuxAgent: INFO Starting Azure Linux Agent",
            "google_guest_agent INFO GCE Agent running"
        ]

        print("Predictions:")
        for log in test_logs:
            result = classifier.predict(log)
            print(f"\nLog: {log}")
            print(f"Predicted Platform: {result['platform'].upper()}")
            print(f"Confidence: {result['confidence']:.2%}")

    except Exception as e:
        print(f"Error: {e}")
```
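The classifier above reads a `label_mapping.json` file from the model directory. Its assumed structure is sketched below; the platform-to-id assignment shown here is illustrative (the real mapping is written out during training in the notebook):

```python
import json

# Assumed shape of label_mapping.json; the id order is illustrative.
mapping = {
    "id2label": {"0": "aws", "1": "azure", "2": "gcp"},
    "label2id": {"aws": 0, "azure": 1, "gcp": 2},
}

# JSON object keys are always strings, so convert them back to ints on load,
# mirroring what CloudLogClassifier.__init__ does.
loaded = json.loads(json.dumps(mapping))
id2label = {int(k): v for k, v in loaded["id2label"].items()}
print(id2label)  # {0: 'aws', 1: 'azure', 2: 'gcp'}
```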
|
|
## Model Performance
|
|
The model achieves high accuracy on the test set, effectively distinguishing between different cloud provider log formats (AWS, Azure, GCP).
|
|
| Metric | Score |
| :--- | :--- |
| **Accuracy** | ~98% |
| **Precision** | >95% |
| **Recall** | >95% |
| **F1-Score** | >95% |
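The exact evaluation code lives in the notebook; a typical way to compute metrics like these with scikit-learn looks as follows (the labels here are a toy example, not the real test set):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy ground-truth and predicted platform labels (illustrative only).
y_true = ["aws", "azure", "gcp", "aws", "azure", "gcp"]
y_pred = ["aws", "azure", "gcp", "aws", "gcp", "gcp"]

accuracy = accuracy_score(y_true, y_pred)
# Macro-averaging treats the three platforms equally regardless of support.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Accuracy: {accuracy:.2%}  F1: {f1:.2%}")  # Accuracy: 83.33%  F1: 82.22%
```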
|
|
## Training
|
|
The model was trained using the `Trainer` API from Hugging Face Transformers.
|
|
- **Base Model**: `microsoft/codebert-base-mlm`
- **Epochs**: 5
- **Batch Size**: 16
- **Learning Rate**: Default (5e-5)
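The notebook's exact training cells are not reproduced here, but the hyperparameters above map onto a `Trainer` configuration roughly like the following sketch. It will not run as-is: `train_ds` and `eval_ds` stand in for the notebook's tokenized datasets, and the output paths are illustrative.

```python
from transformers import (
    RobertaForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Three labels: AWS, Azure, GCP.
model = RobertaForSequenceClassification.from_pretrained(
    "microsoft/codebert-base-mlm", num_labels=3
)

args = TrainingArguments(
    output_dir="./cloud-log-classifier-checkpoints",  # illustrative path
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=5e-5,  # the Trainer default
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # placeholder: tokenized training split
    eval_dataset=eval_ds,    # placeholder: tokenized evaluation split
)
trainer.train()
trainer.save_model("./cloud-log-classifier-final")
```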
|
|