LOLA Small

- Model size: 1.3B dense / 2.52B sparse
- Model form: GPT-based Mixture of Experts (4 experts); see the sketch after this list
- Training dataset: mC4 (en, de, es, fr, hi, zh, ja, pt, ar, ru)
- Training steps: 56341
- Training servers: Enexa1 and Enexa2
- Training hardware: 4x NVIDIA A100 (80GB) GPUs
- Train script: https://github.com/dice-group/LOLA-Megatron-DeepSpeed/blob/main/lola_ws/gpt/gpt3-moe-pretrain.sh
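
To make the dense/sparse distinction concrete, below is a minimal sketch of a top-1 routed 4-expert MoE feed-forward block in plain PyTorch. It is not the LOLA training code (that lives in the Megatron-DeepSpeed script linked above, which uses DeepSpeed's MoE layers); the hidden and FFN sizes here are illustrative placeholders. The point it shows is that all experts contribute to the total ("sparse") parameter count, while each token only activates one expert, so the per-token ("dense") parameter count stays close to a single-expert model.

```python
# Hedged sketch: a minimal top-1 routed Mixture-of-Experts feed-forward block.
# NOT the LOLA implementation; sizes (512/2048) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    def __init__(self, hidden_size: int = 512, ffn_size: int = 2048, num_experts: int = 4):
        super().__init__()
        # Each expert is an independent GPT-style feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, ffn_size),
                nn.GELU(),
                nn.Linear(ffn_size, hidden_size),
            )
            for _ in range(num_experts)
        )
        # The gate scores experts per token; top-1 routing picks one of them.
        self.gate = nn.Linear(hidden_size, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, hidden_size)
        scores = F.softmax(self.gate(x), dim=-1)   # routing probabilities
        top1 = scores.argmax(dim=-1)               # chosen expert per token
        out = torch.zeros_like(x)
        for idx, expert in enumerate(self.experts):
            mask = top1 == idx
            if mask.any():
                # Only the selected expert's weights are active for these tokens,
                # so per-token compute matches a dense single-FFN layer.
                out[mask] = expert(x[mask]) * scores[mask, idx].unsqueeze(-1)
        return out


if __name__ == "__main__":
    layer = MoEFeedForward()
    total = sum(p.numel() for p in layer.parameters())
    active = sum(p.numel() for p in layer.experts[0].parameters()) \
        + sum(p.numel() for p in layer.gate.parameters())
    print(f"total (sparse) params: {total:,}; active (dense) per token: {active:,}")
    print(layer(torch.randn(8, 512)).shape)  # torch.Size([8, 512])
```

The same accounting, applied to the full model, is why the sparse parameter count (2.52B) is larger than the dense count (1.3B) even though each forward pass only exercises one expert per MoE layer.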