Configuration

We customized the original $\text{BERT}_\text{base}$ configuration released by Google (source code available at https://github.com/google-research/bert) so that it can handle source code as its input context.

We publicly release our pre-training source code:

src-pretraining.tar.xz

Hyperparameters

We refer to the resulting configuration as $\text{BERT}_\text{custom}$.

The following table compares its pre-training hyperparameters with those of $\text{BERT}_\text{base}$.

| Parameter | $\text{BERT}_\text{base}$ | $\text{BERT}_\text{custom}$ |
| --- | --- | --- |
| Optimizer | Adam | Adam |
| Learning rate | 0.0001 | 0.0001 |
| $\beta_1$ | 0.9 | 0.9 |
| $\beta_2$ | 0.999 | 0.999 |
| L2 weight decay | 0.01 | 0.01 |
| Learning rate decay | linear | linear |
| Dropout probability | 0.1 | 0.1 |
| Activation function | gelu | gelu |
| Masking rate | 0.15 | 0.15 |
| Hidden size | 768 | 768 |
| Intermediate size | 3,072 | 3,072 |
| Attention heads | 12 | 12 |
| Hidden layers | 12 | 12 |
| Vocabulary size | 30,522 | 30,522 |
| Maximum sequence length | 512 | 256 |
| Batch size | 256 | 62 |
| Learning rate warmup steps | 10,000 | 1,000 |
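To make the configuration concrete, the sketch below (Python) shows how the architectural hyperparameters of $\text{BERT}_\text{custom}$ from the table map onto a `bert_config.json` file in the format used by the Google BERT code base. The output file name is illustrative, and the fields not listed in the table (`initializer_range`, `type_vocab_size`) are assumed to keep the standard $\text{BERT}_\text{base}$ defaults.

```python
import json

# BERT_custom architecture hyperparameters from the table above, expressed in the
# bert_config.json format of https://github.com/google-research/bert.
# Fields not covered by the table (initializer_range, type_vocab_size) are assumed
# to keep the BERT_base defaults; the output path is illustrative.
bert_custom_config = {
    "attention_probs_dropout_prob": 0.1,  # dropout probability
    "hidden_dropout_prob": 0.1,           # dropout probability
    "hidden_act": "gelu",                 # activation function
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "vocab_size": 30522,
    "max_position_embeddings": 256,       # maximum sequence length
    "initializer_range": 0.02,            # BERT_base default (assumption)
    "type_vocab_size": 2,                 # BERT_base default (assumption)
}

with open("bert_custom_config.json", "w") as f:
    json.dump(bert_custom_config, f, indent=2)
```

The remaining training-time hyperparameters are not part of the config file: in the Google code base the learning rate, batch size, sequence length, and warmup steps are passed to `run_pretraining.py` via flags such as `--learning_rate`, `--train_batch_size`, `--max_seq_length`, and `--num_warmup_steps`, while the masking rate is applied when the pre-training data is created (`create_pretraining_data.py`, flag `--masked_lm_prob`).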