We pre-trained our custom $\text{BERT}$ models using data from the CodeSearchNet challenge. The datasets are freely available.
The following table reports the size of each pre-training dataset. TOP denotes the combination of the three largest datasets (JavaScript, Java, and Python), while ALL combines all six languages. We used these language datasets, individually or in combination, to produce the pre-trained models.
Language | Number of functions | Number of tokens
---|---|---
JavaScript | 1,857,835 | 128,430,003
Java | 1,569,889 | 75,654,447
Python | 1,156,085 | 50,551,794
PHP | 977,821 | 53,352,522
Go | 726,768 | 37,075,579
Ruby | 164,048 | 5,495,442
TOP | 4,583,809 | 254,636,244
ALL | 6,452,446 | 350,559,787
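
To illustrate how such a corpus can be assembled, the sketch below loads the per-language function datasets and builds the TOP and ALL combinations. It assumes the Hugging Face `datasets` mirror of CodeSearchNet (the `code_search_net` dataset and its `func_code_string` column); this mirror is our assumption, not part of the original pipeline, and its per-split counts will not match the table's totals exactly, since the table covers the full datasets.

```python
# Minimal sketch (assumption: the Hugging Face mirror of CodeSearchNet):
# load the per-language function datasets and build the TOP and ALL
# combinations used for pre-training.
from datasets import load_dataset, concatenate_datasets

LANGUAGES = ["javascript", "java", "python", "php", "go", "ruby"]
TOP = ["javascript", "java", "python"]  # the three largest datasets

def load_functions(lang: str):
    # Each record holds one function; `func_code_string` carries the raw
    # source code. Recent `datasets` versions require trust_remote_code=True
    # for script-based datasets such as this one.
    return load_dataset("code_search_net", lang, split="train",
                        trust_remote_code=True)

per_language = {lang: load_functions(lang) for lang in LANGUAGES}

# TOP = JavaScript + Java + Python; ALL = all six languages.
top_corpus = concatenate_datasets([per_language[l] for l in TOP])
all_corpus = concatenate_datasets([per_language[l] for l in LANGUAGES])

print(f"TOP: {len(top_corpus):,} functions (train split only)")
print(f"ALL: {len(all_corpus):,} functions (train split only)")
```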