We customized Google's original BERT configuration, whose source code is available at https://github.com/google-research/bert, to handle source code as input.
We publicly release the source code (src-pretraining.tar.xz). The resulting configuration is summarized in the following table, which compares our pre-training hyperparameters with those of the original BERT model.
| Parameter | Original BERT | Our configuration |
|---|---|---|
| Optimizer | Adam | Adam |
| Learning rate | 0.0001 | 0.0001 |
| Adam β₁ | 0.9 | 0.9 |
| Adam β₂ | 0.999 | 0.999 |
| L2 weight decay | 0.01 | 0.01 |
| Learning rate decay | linear | linear |
| Dropout probability | 0.1 | 0.1 |
| Activation function | gelu | gelu |
| Masking rate | 0.15 | 0.15 |
| Hidden size | 768 | 768 |
| Intermediate size | 3,072 | 3,072 |
| Attention heads | 12 | 12 |
| Hidden layers | 12 | 12 |
| Vocabulary size | 30,522 | 30,522 |
| Maximum sequence length | 512 | 256 |
| Batch size | 256 | 62 |
| Learning rate warmup steps | 10,000 | 1,000 |
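For concreteness, the sketch below lays out our column of the table as Python dictionaries. The `model_config` keys follow the `bert_config.json` schema used by the google-research/bert repository; the `training_config` dictionary and its key names are an illustrative grouping of the remaining table entries, not the exact interface of our released scripts.

```python
import json

# Model architecture, mirroring the bert_config.json format of
# https://github.com/google-research/bert (values from the table above).
model_config = {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "vocab_size": 30522,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "attention_probs_dropout_prob": 0.1,
    # Maximum sequence length reduced from 512 to 256.
    "max_position_embeddings": 256,
}

# Optimization hyperparameters; key names here are illustrative.
training_config = {
    "optimizer": "adam",
    "learning_rate": 1e-4,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "l2_weight_decay": 0.01,
    "lr_schedule": "linear_decay",
    "warmup_steps": 1000,   # 10,000 in the original BERT setup
    "batch_size": 62,       # 256 in the original BERT setup
    "masking_rate": 0.15,
}

# Serializing model_config yields a file in the style of bert_config.json.
print(json.dumps(model_config, indent=2))
```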