Data

We built our own dataset of question-answer pairs from StackOverflow for the fine-tuning stage. We use each question's title as the natural language query and the code snippets of the accepted answer as the source code document to be retrieved from the search corpus.

We publicly release the source code:

src-sodatamining.tar.xz

The release includes the Google BigQuery queries used to mine the data and the scripts used to preprocess it.

Description

The following table shows the number of StackOverflow questions after each filtering step. The numbers in the last row represent our final dataset sizes.

| Step | JavaScript | Java | Python |
| --- | --- | --- | --- |
| Questions | 2,045,114 | 1,841,296 | 1,884,571 |
| Questions with accepted answer | 1,105,690 | 934,062 | 984,989 |
| Accepted answer contains a code snippet | 861,273 | 533,217 | 655,430 |
| 3+ upvotes and 3+ lines of code | 85,049 | 71,194 | 87,231 |
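The filtering steps in the table can be sketched as a small funnel that counts how many questions survive each step. This is only an illustration under stated assumptions, not the released preprocessing scripts: the record layout (`accepted_answer`, `score`, `snippets`) is hypothetical, the upvote threshold is assumed to apply to the accepted answer's score (the table does not say which score is meant), and lines of code are assumed to be non-blank lines summed over all snippets.

```python
from collections import Counter

MIN_UPVOTES = 3      # "3+ upvotes" (assumed to refer to the accepted answer's score)
MIN_CODE_LINES = 3   # "3+ lines of code" (assumed: non-blank lines over all snippets)


def count_code_lines(snippets):
    """Count non-blank lines across all code snippets of an answer."""
    return sum(1 for snippet in snippets
               for line in snippet.splitlines() if line.strip())


def filter_funnel(questions):
    """Apply the filtering steps in order.

    Each question is a dict with a hypothetical layout:
      {"title": str, "accepted_answer": None or {"score": int, "snippets": [str]}}
    Returns the kept questions and the number of questions left after each step.
    """
    counts = Counter()
    kept = []
    for question in questions:
        counts["questions"] += 1
        answer = question.get("accepted_answer")
        if answer is None:
            continue
        counts["with_accepted_answer"] += 1
        snippets = answer.get("snippets", [])
        if not snippets:
            continue
        counts["with_code_snippet"] += 1
        if (answer["score"] >= MIN_UPVOTES
                and count_code_lines(snippets) >= MIN_CODE_LINES):
            counts["final"] += 1
            kept.append(question)
    return kept, counts
```

Calling `filter_funnel` on an iterable of such records returns the surviving questions together with a counter whose keys mirror the rows of the table.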

We also report average dataset quality statistics before and after filtering.

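| Language | Filtering | Question upvotes | Question length (tokens) | Answer upvotes | Answer length (tokens) | Answer length (lines) |
| --- | --- | --- | --- | --- | --- | --- |
| JavaScript | Before | 2.94 | 8.74 | 4.75 | 175.61 | 29.73 |
| JavaScript | After | 21.16 | 8.48 | 28.96 | 207.39 | 34.43 |
| Java | Before | 3.18 | 8.62 | 5.14 | 203.50 | 29.64 |
| Java | After | 16.64 | 8.49 | 22.61 | 262.99 | 37.73 |
| Python | Before | 3.51 | 9.08 | 5.34 | 165.15 | 25.89 |
| Python | After | 18.37 | 8.66 | 24.13 | 205.90 | 32.63 |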

Mined data

We release the mined data in the form of JSON Lines files.

Each line contains fields named fold_x, where x is the fold number. The value of each such field is one of train, valid, or test, and indicates which set the line belongs to in that fold.

| Language | Size | Data |
| --- | --- | --- |
| JavaScript | 85,049 | sodata-js.jsonl.xz |
| Java | 71,194 | sodata-ja.jsonl.xz |
| Python | 87,231 | sodata-py.jsonl.xz |
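As an illustration of how the fold_x fields might be used, the following sketch reads one of the compressed JSON Lines files and groups its records into train, valid, and test sets for a chosen fold. It is a minimal example under assumptions: the helper names `split_by_fold` and `load_fold` are hypothetical, the fold numbering (for example whether it starts at 0 or 1) is not stated here and must be checked against the data, and no fields other than fold_x are assumed.

```python
import json
import lzma
from collections import defaultdict


def split_by_fold(lines, fold):
    """Group JSON Lines records by the value of their fold_<fold> field.

    `lines` is any iterable of JSON strings; blank lines are skipped.
    Returns a dict mapping "train" / "valid" / "test" to lists of records.
    """
    key = f"fold_{fold}"
    splits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        splits[record[key]].append(record)
    return splits


def load_fold(path, fold):
    """Read an .xz-compressed JSON Lines file and split it for one fold."""
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        return split_by_fold(handle, fold)
```

For example, `load_fold("sodata-py.jsonl.xz", 0)["train"]` would return the Python training records for fold 0, assuming a fold with that number exists in the file.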