Data

We pre-trained our custom $\text{BERT}$ models on data from the CodeSearchNet challenge. The datasets are freely available.

The following table reports the size of each pre-training dataset and the links to download it. The language datasets, or combinations of them, that we used to produce the pre-trained models are reported in bold.

| Language | Number of functions | Number of tokens | Data |
|---|---|---|---|
| JavaScript | 1,857,835 | 128,430,003 | javascript.zip |
| Java | 1,569,889 | 75,654,447 | java.zip |
| Python | 1,156,085 | 50,551,794 | python.zip |
| PHP | 977,821 | 53,352,522 | php.zip |
| Go | 726,768 | 37,075,579 | go.zip |
| Ruby | 164,048 | 5,495,442 | ruby.zip |
| TOP | 4,583,809 | 254,636,244 | javascript.zip, java.zip, python.zip |
| ALL | 6,452,446 | 350,559,787 | javascript.zip, java.zip, python.zip, php.zip, go.zip, ruby.zip |
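
The TOP and ALL rows list the archives of several languages, and their counts appear to be the sums of the corresponding per-language rows. The following minimal sketch checks that arithmetic using the numbers reported above. The names `COUNTS` and `combine` are hypothetical and used only for illustration; they are not part of the project.

```python
# Per-language counts copied from the table above: (functions, tokens).
# COUNTS and combine() are illustrative names, not project code.
COUNTS = {
    "JavaScript": (1_857_835, 128_430_003),
    "Java": (1_569_889, 75_654_447),
    "Python": (1_156_085, 50_551_794),
    "PHP": (977_821, 53_352_522),
    "Go": (726_768, 37_075_579),
    "Ruby": (164_048, 5_495_442),
}


def combine(languages):
    """Sum the function and token counts over the given languages."""
    functions = sum(COUNTS[lang][0] for lang in languages)
    tokens = sum(COUNTS[lang][1] for lang in languages)
    return functions, tokens


# TOP combines the three largest datasets; ALL combines all six.
top = combine(["JavaScript", "Java", "Python"])
all_languages = combine(COUNTS)

print(top)            # (4583809, 254636244)
print(all_languages)  # (6452446, 350559787)
```

Both results match the TOP and ALL rows, which is consistent with those rows being plain concatenations of the listed archives.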