LOLA Large
- Model size: 1.3B dense / 7.46B sparse parameters
- Model form: GPT-based Mixture of Experts (16 experts)
- Training dataset: CulturaX (167 languages)
- Training steps: 296,000
- Training cluster: noctua2
- Training hardware: 96x NVIDIA A100 (40 GB) GPUs
- Training script: https://github.com/dice-group/LOLA-Megatron-DeepSpeed/blob/main/lola_ws/gpt/gpt3-moe-pretrain.sh
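Since the checkpoint is a GPT-style Mixture-of-Experts model, the sketch below shows one way to load and query it with the Hugging Face `transformers` library. The repository id `dice-research/lola_v1` and the need for `trust_remote_code=True` are assumptions not stated in this section; adjust them to the actual published checkpoint.

```python
# Minimal loading sketch, assuming the checkpoint is published on the
# Hugging Face Hub under "dice-research/lola_v1" (assumed id) and ships
# custom MoE modelling code (hence trust_remote_code=True).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dice-research/lola_v1"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "Berlin is the capital of"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding of a short continuation. Although the checkpoint holds
# 7.46B sparse parameters, only the 1.3B dense parameters plus the experts
# routed for each token are active during generation.
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```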