We pre-trained our custom $\text{BERT}$ models using data from the CodeSearchNet challenge. The datasets are freely available.
The following table reports the size of each pre-training dataset. TOP denotes the combination of the three largest datasets (JavaScript, Java, and Python), while ALL combines all six languages. We used these language datasets, individually or in combination, to produce the pre-trained models.
Language | Number of functions | Number of tokens
---|---|---
JavaScript | 1,857,835 | 128,430,003
Java | 1,569,889 | 75,654,447
Python | 1,156,085 | 50,551,794
PHP | 977,821 | 53,352,522
Go | 726,768 | 37,075,579
Ruby | 164,048 | 5,495,442
TOP | 4,583,809 | 254,636,244
ALL | 6,452,446 | 350,559,787
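
To illustrate how such a corpus can be assembled, the sketch below loads the per-language function datasets and builds the TOP and ALL combinations. It assumes the Hugging Face `datasets` mirror of CodeSearchNet (the `code_search_net` dataset and its `func_code_string` column); this mirror is our assumption, not part of the original pipeline, and its per-split counts will not match the table's totals exactly, since the table covers the full datasets.

```python
# Minimal sketch (assumption: the Hugging Face mirror of CodeSearchNet):
# load the per-language function datasets and build the TOP and ALL
# combinations used for pre-training.
from datasets import load_dataset, concatenate_datasets

LANGUAGES = ["javascript", "java", "python", "php", "go", "ruby"]
TOP = ["javascript", "java", "python"]  # the three largest datasets

def load_functions(lang: str):
    # Each record holds one function; `func_code_string` carries the raw
    # source code. Recent `datasets` versions require trust_remote_code=True
    # for script-based datasets such as this one.
    return load_dataset("code_search_net", lang, split="train",
                        trust_remote_code=True)

per_language = {lang: load_functions(lang) for lang in LANGUAGES}

# TOP = JavaScript + Java + Python; ALL = all six languages.
top_corpus = concatenate_datasets([per_language[l] for l in TOP])
all_corpus = concatenate_datasets([per_language[l] for l in LANGUAGES])

print(f"TOP: {len(top_corpus):,} functions (train split only)")
print(f"ALL: {len(all_corpus):,} functions (train split only)")
```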