---
license: cc-by-4.0
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- code
metrics:
- accuracy
- f1
---
# CodeBERT-SO
Repository for CodeBERT, fine-tuned on Stack Overflow snippets comprising NL-PL pairs in six languages (Python, Java, JavaScript, PHP, Ruby, and Go).
## Training Objective
This model is initialized with [CodeBERT-base](https://huggingface.co/microsoft/codebert-base) and trained to classify whether a user will drop out, given their posts and code snippets.
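A minimal sketch of this initialization with the `transformers` library. The two-label head (will drop out vs. will not) is inferred from the task description above; to use the fine-tuned weights, substitute this repository's model id for `microsoft/codebert-base`.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base",
    num_labels=2,  # assumed binary head: will drop out vs. will not
)
```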
## Training Regime
Preprocessing of the input texts includes Unicode normalisation (NFC form), removal of extraneous whitespace, removal of punctuation (except within links), lowercasing, and stopword removal.
In-line comments and docstrings were also stripped from the code snippets (cf. the main manuscript). The RoBERTa tokenizer was used, as it is the built-in tokenizer of the original CodeBERT implementation.
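The text-side preprocessing steps above can be sketched as follows. This is an illustrative approximation, not the authors' script: the stopword list used in the paper is not specified (a small assumed subset appears here), and the link-protection strategy is an assumption.

```python
import re
import unicodedata

# Assumed, minimal stopword list; the card does not specify which list was used.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in"}
URL_RE = re.compile(r"https?://\S+")

def preprocess(text: str) -> str:
    text = unicodedata.normalize("NFC", text)   # Unicode normalisation (NFC form)
    links = iter(URL_RE.findall(text))          # set links aside so their punctuation survives
    text = URL_RE.sub(" __LINK__ ", text)
    text = re.sub(r"[^\w\s]", "", text)         # remove punctuation outside links
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return " ".join(next(links) if t == "__link__" else t for t in tokens)
```

Splitting on whitespace and rejoining also collapses the extraneous whitespace mentioned above.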
|
|
Training ran for 8 epochs with a batch size of 8, a learning rate of 1e-5, and an epsilon (the denominator term in the weight update) of 1e-8.
A random 20% sample of the entire dataset was held out as the validation set.
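The split and optimizer settings above could be set up as in this sketch, assuming PyTorch's AdamW (the card does not name the optimizer) and scikit-learn's `train_test_split`; the dataset and model below are placeholders.

```python
import torch
from sklearn.model_selection import train_test_split

examples = list(range(100))                     # placeholder for the real dataset
train_set, val_set = train_test_split(examples, test_size=0.2, random_state=0)

model_stub = torch.nn.Linear(4, 2)              # stand-in for the fine-tuned classifier
optimizer = torch.optim.AdamW(model_stub.parameters(), lr=1e-5, eps=1e-8)
```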
## Performance
* Final validation accuracy: 0.822
* Final validation F1: 0.809
* Final validation loss: 0.5
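For reference, accuracy and F1 as reported above can be computed with scikit-learn; the labels below are made up for illustration and do not reproduce the reported scores.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0]   # toy ground-truth labels
y_pred = [1, 0, 1, 0, 0]   # toy predictions
acc = accuracy_score(y_true, y_pred)   # 4 of 5 correct -> 0.8
f1 = f1_score(y_true, y_pred)          # precision 1.0, recall 2/3 -> 0.8
```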