LOLA Large
- Model size: 1.3B dense / 7.46B sparse parameters
- Model form: GPT-based Mixture of Experts (16 experts)
- Training dataset: CulturaX (167 languages)
- Training steps: 296,000
- Training cluster: noctua2
- Training hardware: 96x NVIDIA A100 (40 GB) GPUs
- Training script: https://github.com/dice-group/LOLA-Megatron-DeepSpeed/blob/main/lola_ws/gpt/gpt3-moe-pretrain.sh
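Since the checkpoint is a GPT-style Mixture-of-Experts model, the sketch below shows one way to load and query it with the Hugging Face `transformers` library. The repository id `dice-research/lola_v1` and the need for `trust_remote_code=True` are assumptions not stated in this section; adjust them to the actual published checkpoint.

```python
# Minimal loading sketch, assuming the checkpoint is published on the
# Hugging Face Hub under "dice-research/lola_v1" (assumed id) and ships
# custom MoE modelling code (hence trust_remote_code=True).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dice-research/lola_v1"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "Berlin is the capital of"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding of a short continuation. Although the checkpoint holds
# 7.46B sparse parameters, only the 1.3B dense parameters plus the experts
# routed for each token are active during generation.
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```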