Introduction to SmartMoE

Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization

Posted by Kylin on August 20, 2023

[TOC]

Motivation: Prior work executes training according to a single static parallelization plan, which is typically chosen by considering only the model architecture and the hardware specification. The goal here is a workload-aware approach instead.

SmartMoE splits the process of automatic parallelization into two stages, performed offline and online, respectively.
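To make the two-stage split concrete, below is a minimal Python sketch (the function names, plan space, and cost model are all hypothetical, not SmartMoE's actual implementation): the offline stage enumerates a pool of candidate parallel plans from static model and hardware information, and the online stage cheaply picks a plan from that pool as the MoE expert workload shifts during training.

```python
# Hypothetical sketch of a two-stage (offline + online) plan search; the
# names below are illustrative and not SmartMoE's actual API.
from itertools import product
from typing import Callable, Dict, List


def offline_stage(model_cfg: Dict, hardware_cfg: Dict) -> List[Dict]:
    """Offline: build a small pool of candidate parallel plans using only
    static information (model architecture + hardware specification)."""
    pool = []
    for ep, dp in product([1, 2, 4, 8], repeat=2):
        if ep * dp == hardware_cfg["num_gpus"] and ep <= model_cfg["num_experts"]:
            pool.append({"expert_parallel": ep, "data_parallel": dp})
    return pool


def online_stage(pool: List[Dict],
                 expert_load: Dict[int, int],
                 cost_fn: Callable[[Dict, Dict[int, int]], float]) -> Dict:
    """Online: at run time, pick the cheapest plan in the pool for the
    current workload (e.g. per-expert token counts from the gating network)."""
    return min(pool, key=lambda plan: cost_fn(plan, expert_load))


def toy_cost(plan: Dict, expert_load: Dict[int, int]) -> float:
    """Illustrative cost model: penalize load imbalance, reward expert parallelism."""
    tokens = list(expert_load.values())
    imbalance = max(tokens) / (sum(tokens) / len(tokens))
    return imbalance / plan["expert_parallel"]


pool = offline_stage({"num_experts": 8}, {"num_gpus": 8})
best = online_stage(pool, {0: 900, 1: 100, 2: 100, 3: 100}, toy_cost)
print(best)  # e.g. {'expert_parallel': 8, 'data_parallel': 1}
```

The point of the split, as the sketch suggests, is that the expensive search over plan structures happens once offline, while the per-iteration online decision reduces to a cheap selection over a small pool.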

Evaluation

Setup

Three clusters with different interconnect topologies are used:

(Figure: the three cluster configurations)

Models: GPT-MoE and Swin-MoE

Baselines

  • DeepSpeed-MoE
  • Tutel
  • FasterMoE
  • Alpa

Metrics

Training latency