We pre-trained our custom BERT models using data from the CodeSearchNet challenge project. The datasets are freely available.
The following table reports the size of each pre-training dataset and the link to download it. We report in bold the language datasets, or their combinations, that we used to produce the pre-trained models.
| Language | Number of functions | Number of tokens | Data |
|---|---|---|---|
| JavaScript | 1,857,835 | 128,430,003 | |
| Java | 1,569,889 | 75,654,447 | |
| Python | 1,156,085 | 50,551,794 | |
| PHP | 977,821 | 53,352,522 | |
| Go | 726,768 | 37,075,579 | |
| Ruby | 164,048 | 5,495,442 | |
| TOP | 4,583,809 | 254,636,244 | |
| ALL | 6,452,446 | 350,559,787 | |

TOP denotes the three largest datasets combined (JavaScript, Java, and Python); ALL denotes all six language datasets combined.
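For reference, below is a minimal sketch of how the function and token counts above could be reproduced from the released data. It assumes the CodeSearchNet `.jsonl.gz` layout, where each line is a JSON object with a `code_tokens` field; the `count_functions_and_tokens` helper and the example directory path are hypothetical, not part of the original pipeline.

```python
import gzip
import json
from pathlib import Path


def count_functions_and_tokens(data_dir: str) -> tuple[int, int]:
    """Count functions and code tokens across CodeSearchNet .jsonl.gz files.

    Assumes each line is a JSON object with a `code_tokens` field,
    as in the CodeSearchNet data release.
    """
    n_functions = 0
    n_tokens = 0
    for path in Path(data_dir).glob("**/*.jsonl.gz"):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                record = json.loads(line)
                n_functions += 1
                n_tokens += len(record["code_tokens"])
    return n_functions, n_tokens


if __name__ == "__main__":
    # Example: count over the Java split (hypothetical local path).
    funcs, tokens = count_functions_and_tokens("java/final/jsonl")
    print(f"{funcs:,} functions, {tokens:,} tokens")
```

Counting `code_tokens` rather than whitespace-split source text matches how CodeSearchNet tokenizes functions, which is the most likely origin of the token counts in the table.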