---
license: cc-by-4.0
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- code
metrics:
- accuracy
- f1
---
# CodeBERT-SO
Repository for CodeBERT, fine-tuned on Stack Overflow snippets comprising NL-PL pairs in six languages (Python, Java, JavaScript, PHP, Ruby, and Go).
## Training Objective
This model is initialized with [CodeBERT-base](https://huggingface.co/microsoft/codebert-base) and trained to classify whether a user will drop out, given their posts and code snippets.
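A minimal sketch of this initialization with the `transformers` library. The two-label head (will drop out vs. will not) is inferred from the task description above; to use the fine-tuned weights, substitute this repository's model id for `microsoft/codebert-base`.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base",
    num_labels=2,  # assumed binary head: will drop out vs. will not
)
```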
## Training Regime
Preprocessing of the input texts includes Unicode normalisation (NFC form), removal of extraneous whitespace, removal of punctuation (except within links), lowercasing, and stopword removal.
In-line comments and docstrings were also stripped from the code snippets (cf. the main manuscript). The RoBERTa tokenizer was used, as it is the built-in tokenizer of the original CodeBERT implementation.
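The text-side preprocessing steps above can be sketched as follows. This is an illustrative approximation, not the authors' script: the stopword list used in the paper is not specified (a small assumed subset appears here), and the link-protection strategy is an assumption.

```python
import re
import unicodedata

# Assumed, minimal stopword list; the card does not specify which list was used.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in"}
URL_RE = re.compile(r"https?://\S+")

def preprocess(text: str) -> str:
    text = unicodedata.normalize("NFC", text)   # Unicode normalisation (NFC form)
    links = iter(URL_RE.findall(text))          # set links aside so their punctuation survives
    text = URL_RE.sub(" __LINK__ ", text)
    text = re.sub(r"[^\w\s]", "", text)         # remove punctuation outside links
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return " ".join(next(links) if t == "__link__" else t for t in tokens)
```

Splitting on whitespace and rejoining also collapses the extraneous whitespace mentioned above.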
|
|
Training ran for 8 epochs with a batch size of 8, a learning rate of 1e-5, and an epsilon (the denominator term in the weight update) of 1e-8.
A random 20% sample of the entire dataset was held out as the validation set.
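The split and optimizer settings above could be set up as in this sketch, assuming PyTorch's AdamW (the card does not name the optimizer) and scikit-learn's `train_test_split`; the dataset and model below are placeholders.

```python
import torch
from sklearn.model_selection import train_test_split

examples = list(range(100))                     # placeholder for the real dataset
train_set, val_set = train_test_split(examples, test_size=0.2, random_state=0)

model_stub = torch.nn.Linear(4, 2)              # stand-in for the fine-tuned classifier
optimizer = torch.optim.AdamW(model_stub.parameters(), lr=1e-5, eps=1e-8)
```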
## Performance
* Final validation accuracy: 0.822
* Final validation F1: 0.809
* Final validation loss: 0.5
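For reference, accuracy and F1 as reported above can be computed with scikit-learn; the labels below are made up for illustration and do not reproduce the reported scores.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0]   # toy ground-truth labels
y_pred = [1, 0, 1, 0, 0]   # toy predictions
acc = accuracy_score(y_true, y_pred)   # 4 of 5 correct -> 0.8
f1 = f1_score(y_true, y_pred)          # precision 1.0, recall 2/3 -> 0.8
```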