FlashMoe: Running a 397B Parameter Model on a Mac with 48GB RAM

News Source: GitHub.com

News Summary

  • Pure C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397 billion parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second.
  • Model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention.
  • Each layer has 512 experts, of which K=4 are activated per token (plus one shared expert).
  • Hidden dimension is 4096.
Read the paper: full technical details, 90+ experiments, and the story of how an AI and a human built this in 24 hours.
