LOLA Small

- Model size: 1.3B dense / 2.52B sparse
- Model form: GPT-based Mixture of Experts (4 experts); see the sketch after this list
- Training dataset: mC4 (en, de, es, fr, hi, zh, ja, pt, ar, ru)
- Training steps: 56341
- Training servers: Enexa1 and Enexa2
- Training hardware: 4x NVIDIA A100 (80GB) GPUs
- Train script: https://github.com/dice-group/LOLA-Megatron-DeepSpeed/blob/main/lola_ws/gpt/gpt3-moe-pretrain.sh
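
To make the dense/sparse distinction concrete, below is a minimal sketch of a top-1 routed 4-expert MoE feed-forward block in plain PyTorch. It is not the LOLA training code (that lives in the Megatron-DeepSpeed script linked above, which uses DeepSpeed's MoE layers); the hidden and FFN sizes here are illustrative placeholders. The point it shows is that all experts contribute to the total ("sparse") parameter count, while each token only activates one expert, so the per-token ("dense") parameter count stays close to a single-expert model.

```python
# Hedged sketch: a minimal top-1 routed Mixture-of-Experts feed-forward block.
# NOT the LOLA implementation; sizes (512/2048) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    def __init__(self, hidden_size: int = 512, ffn_size: int = 2048, num_experts: int = 4):
        super().__init__()
        # Each expert is an independent GPT-style feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, ffn_size),
                nn.GELU(),
                nn.Linear(ffn_size, hidden_size),
            )
            for _ in range(num_experts)
        )
        # The gate scores experts per token; top-1 routing picks one of them.
        self.gate = nn.Linear(hidden_size, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, hidden_size)
        scores = F.softmax(self.gate(x), dim=-1)   # routing probabilities
        top1 = scores.argmax(dim=-1)               # chosen expert per token
        out = torch.zeros_like(x)
        for idx, expert in enumerate(self.experts):
            mask = top1 == idx
            if mask.any():
                # Only the selected expert's weights are active for these tokens,
                # so per-token compute matches a dense single-FFN layer.
                out[mask] = expert(x[mask]) * scores[mask, idx].unsqueeze(-1)
        return out


if __name__ == "__main__":
    layer = MoEFeedForward()
    total = sum(p.numel() for p in layer.parameters())
    active = sum(p.numel() for p in layer.experts[0].parameters()) \
        + sum(p.numel() for p in layer.gate.parameters())
    print(f"total (sparse) params: {total:,}; active (dense) per token: {active:,}")
    print(layer(torch.randn(8, 512)).shape)  # torch.Size([8, 512])
```

The same accounting, applied to the full model, is why the sparse parameter count (2.52B) is larger than the dense count (1.3B) even though each forward pass only exercises one expert per MoE layer.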