# Cloud Log Classifier using CodeBERT
|
|
This project implements a cloud platform log classifier using a fine-tuned CodeBERT model. It can classify logs from **AWS**, **Azure**, and **GCP** with high accuracy.
|
|
The model was fine-tuned on a dataset of simulated cloud logs using `microsoft/codebert-base-mlm` as the base model.
|
|
## Project Structure
|
|
```
.
├── Cloud_Classifier_using_codebert.ipynb   # Jupyter Notebook containing training and evaluation code
├── cloud-log-classifier-final/             # Saved model directory (generated after training)
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── tokenizer_config.json
│   └── ...
├── cloud-log-classifier-final.zip          # Zipped model for distribution
└── README.md                               # This file
```
|
|
## Usage
|
|
You can use the fine-tuned model in your own projects via the `CloudLogClassifier` class below.
|
|
### Prerequisites
|
|
```bash
pip install torch transformers scikit-learn numpy
```
|
|
### Python Inference Code
|
|
Save the following code as `classifier.py`, or use it directly in your Python scripts. Make sure the unzipped `cloud-log-classifier-final` folder is in your working directory.
|
|
```python
import torch
import json
from transformers import RobertaForSequenceClassification, RobertaTokenizer

class CloudLogClassifier:
    """
    Reusable classifier for cloud platform detection from logs.
    """

    def __init__(self, model_path):
        """
        Load the fine-tuned model and tokenizer.

        Args:
            model_path (str): Path to the directory containing the saved model files
        """
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        try:
            self.model = RobertaForSequenceClassification.from_pretrained(model_path)
            self.tokenizer = RobertaTokenizer.from_pretrained(model_path)
            self.model.to(self.device)
            self.model.eval()

            # Load label mapping (JSON keys are strings, so convert them back to ints)
            with open(f'{model_path}/label_mapping.json', 'r') as f:
                mappings = json.load(f)
            self.id2label = {int(k): v for k, v in mappings['id2label'].items()}

        except Exception as e:
            raise RuntimeError(f"Failed to load model from {model_path}: {e}")

    def predict(self, log_text):
        """
        Predict the cloud platform from log text.

        Args:
            log_text (str): Log text to classify

        Returns:
            dict: Prediction results with label and confidence
        """
        # Tokenize input
        inputs = self.tokenizer(
            log_text,
            return_tensors='pt',
            truncation=True,
            padding='max_length',
            max_length=128
        ).to(self.device)

        # Get prediction
        with torch.no_grad():
            outputs = self.model(**inputs)
            logits = outputs.logits
            probabilities = torch.nn.functional.softmax(logits, dim=-1)
            predicted_class = torch.argmax(probabilities, dim=-1).item()
            confidence = probabilities[0][predicted_class].item()

        return {
            'platform': self.id2label[predicted_class],
            'confidence': confidence,
            'all_probabilities': {
                self.id2label[i]: prob.item()
                for i, prob in enumerate(probabilities[0])
            }
        }

# Usage Example
if __name__ == "__main__":
    # Path to your unzipped model folder
    model_path = './cloud-log-classifier-final'

    try:
        classifier = CloudLogClassifier(model_path)

        test_logs = [
            "[ 3.6936] ena 0000:00:05.0: Elastic Network Adapter (ENA)",
            "AzureLinuxAgent: INFO Starting Azure Linux Agent",
            "google_guest_agent INFO GCE Agent running"
        ]

        print("Predictions:")
        for log in test_logs:
            result = classifier.predict(log)
            print(f"\nLog: {log}")
            print(f"Predicted Platform: {result['platform'].upper()}")
            print(f"Confidence: {result['confidence']:.2%}")

    except Exception as e:
        print(f"Error: {e}")
```
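The classifier above reads a `label_mapping.json` file from the model directory. Its assumed structure is sketched below; the platform-to-id assignment shown here is illustrative (the real mapping is written out during training in the notebook):

```python
import json

# Assumed shape of label_mapping.json; the id order is illustrative.
mapping = {
    "id2label": {"0": "aws", "1": "azure", "2": "gcp"},
    "label2id": {"aws": 0, "azure": 1, "gcp": 2},
}

# JSON object keys are always strings, so convert them back to ints on load,
# mirroring what CloudLogClassifier.__init__ does.
loaded = json.loads(json.dumps(mapping))
id2label = {int(k): v for k, v in loaded["id2label"].items()}
print(id2label)  # {0: 'aws', 1: 'azure', 2: 'gcp'}
```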
|
|
## Model Performance
|
|
The model achieves high accuracy on the test set, effectively distinguishing between different cloud provider log formats (AWS, Azure, GCP).
|
|
| Metric | Score |
| :--- | :--- |
| **Accuracy** | ~98% |
| **Precision** | >95% |
| **Recall** | >95% |
| **F1-Score** | >95% |
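The exact evaluation code lives in the notebook; a typical way to compute metrics like these with scikit-learn looks as follows (the labels here are a toy example, not the real test set):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy ground-truth and predicted platform labels (illustrative only).
y_true = ["aws", "azure", "gcp", "aws", "azure", "gcp"]
y_pred = ["aws", "azure", "gcp", "aws", "gcp", "gcp"]

accuracy = accuracy_score(y_true, y_pred)
# Macro-averaging treats the three platforms equally regardless of support.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Accuracy: {accuracy:.2%}  F1: {f1:.2%}")  # Accuracy: 83.33%  F1: 82.22%
```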
|
|
## Training
|
|
The model was trained using the `Trainer` API from Hugging Face Transformers.
|
|
- **Base Model**: `microsoft/codebert-base-mlm`
- **Epochs**: 5
- **Batch Size**: 16
- **Learning Rate**: Default (5e-5)
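The notebook's exact training cells are not reproduced here, but the hyperparameters above map onto a `Trainer` configuration roughly like the following sketch. It will not run as-is: `train_ds` and `eval_ds` stand in for the notebook's tokenized datasets, and the output paths are illustrative.

```python
from transformers import (
    RobertaForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Three labels: AWS, Azure, GCP.
model = RobertaForSequenceClassification.from_pretrained(
    "microsoft/codebert-base-mlm", num_labels=3
)

args = TrainingArguments(
    output_dir="./cloud-log-classifier-checkpoints",  # illustrative path
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=5e-5,  # the Trainer default
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # placeholder: tokenized training split
    eval_dataset=eval_ds,    # placeholder: tokenized evaluation split
)
trainer.train()
trainer.save_model("./cloud-log-classifier-final")
```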
|
|