Deepseek, MoE, and building bridges on the Moat

Smaller models and MoE architectures have greatly reduced the moat

Hey everyone,

Lately, I’ve been getting a lot of questions from you all about these “Mixture-of-Experts” (MoE) models, especially after the buzz around Deepseek V3 and similar AI breakthroughs.

You’re right to be curious! These models aren’t just another incremental improvement – they’re subtly but profoundly changing the game in AI, and in ways that might not be obvious at first glance.

Think of it this way: MoEs are helping us build incredibly powerful AI while also making it potentially more efficient and accessible in the long run. That’s a big deal!

This post is for you – my friends who’ve been asking – to break down what MoEs are all about, especially how they impact the inference part of using AI (that’s when you’re actually asking the AI questions or using it to generate stuff). We’ll also touch on what this all means for the cloud services that power AI and where AI development might be headed.

What’s the MoE Magic? Thinking Beyond the “Dense” Box

To really get why MoEs are interesting, let’s first quickly understand how “traditional” AI models, often called “dense” models, work:

Dense Models: All Hands on Deck, All the Time

Imagine a restaurant where every single chef and waiter has to participate in preparing every single dish, no matter what you order. That’s kind of like a dense AI model.

  • Every part of the model is active for every single request. Whether you ask it a simple question or a complex one, the entire network of “neurons” and connections is engaged.
  • Think of it as a monolithic block of computation. It’s straightforward to build and train, but it can get incredibly expensive, especially as you make the models bigger and more powerful. It’s like having to pay all those chefs and waiters, even if some are just standing around sometimes.

The Problem with “Dense” Models at Scale: As we want AI to handle more complex tasks and more information, we need bigger models with more “parameters” (think of these as the model’s range of knowledge and skills). But with dense models, bigger means far more computation for every single request. This becomes incredibly costly and resource-intensive, especially for training (where we have to figure out values for all those parameters through sheer brute-force computation).

MoE Models: Specialized Experts for the Job

Now, picture a different kind of restaurant. Instead of everyone doing everything, you have specialized chefs: a pasta expert, a sushi master, a grill specialist, etc.

When you order pasta, only the pasta expert and a few supporting staff get really busy. For sushi, it’s the sushi master’s turn to shine.

That’s the core idea behind Mixture-of-Experts (MoE) models.

  • Intelligent Routing: When you give an MoE model a request, a “router” figures out which “experts” within the model are best suited to handle that specific task. It’s like the restaurant manager directing your order to the right chef.
  • Selective Activation: Only a small, select set of these “experts” is activated for each request. This is the game-changer! Instead of waking up the whole giant model for every little thing, you only engage the parts that are truly needed.
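
If you like seeing ideas as code, here’s a toy sketch of that router-plus-experts structure. It’s a bare-bones PyTorch illustration with made-up names and sizes (SimpleMoE, 8 experts, top-2 routing), not how Deepseek V3 or any production model is actually implemented, but the shape is the same: score the experts, keep the top few, and only run those.

```python
# Toy top-k MoE layer in PyTorch. Names and sizes are made up for illustration;
# real MoE layers (Deepseek V3, Mixtral, etc.) are considerably more involved.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    def __init__(self, dim=512, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The "router" (our restaurant manager): a small linear layer that scores experts.
        self.router = nn.Linear(dim, num_experts)
        # The "experts": here just independent feed-forward blocks.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                   # x: (tokens, dim)
        scores = self.router(x)                             # (tokens, num_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)   # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)                # normalize the chosen experts' weights
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e                 # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

# moe = SimpleMoE(); moe(torch.randn(4, 512))  -> only 2 of the 8 experts run per token
```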

Pros of MoE Summarized:

Let’s quickly compare dense and MoE models side-by-side:

| Feature | Dense Models | MoE Models |
| --- | --- | --- |
| Computation | Every part is active for every request | Only the selected “experts” are active |
| Parameters | All parameters engaged for each request | Many more total parameters, but only a fraction used per request |
| Inference Speed | Predictable, but every request pays the full-model cost | Potential for higher throughput overall, though routing can add a little latency |
| Model Capacity | Limited by computational cost | Much higher capacity without linearly increasing inference cost |

Inference Latency: The Routing “Traffic Jam” & Load Balancing?

You might be wondering about that “latency” thing: what exactly routing is, and how it can introduce a tiny bit of overhead.1

Think of it like the restaurant manager taking a moment to read your order and decide who should cook it.

How MoEs Change the Game:

MoE models, because they only put a fraction of their parameters to work for each request, can significantly reduce how much of a very large model needs to be actively computing at any moment, and (with the offloading tricks we’ll get to below) how much has to sit in precious VRAM at once.

  • You can have a model with hundreds of billions or even trillions of parameters, but only a fraction of them do any work for a single request. It’s like having a giant library but only needing to pull a few books off the shelf at a time. (As we’ll see below, though, you still need room to keep the whole library.)

The Deployment Story

We’ve talked about how MoE models are structured and their potential advantages. But when it comes to actually running these models and making them available for everyone to use (what we call “deployment”), things get a bit more interesting and complex.

Let’s look at some of the key consequences:

The Mystery of the “Quiet Experts”: Expert Sparsity

Think back to our restaurant analogy. Imagine you have a sushi expert who is amazing, but your restaurant mostly serves pasta and burgers.

The sushi expert might not get used as much as the pasta chef. That’s kind of what happens in MoE models with “expert sparsity.”

  • Some experts are less active than others. In an MoE model, certain “experts” might be really good at handling specific types of questions or tasks, while others are used less frequently. It’s like having specialized chefs, some of whom are in high demand depending on the menu.
  • “Inactive” doesn’t mean useless! You might think, “Hey, let’s just remove those rarely used experts to save resources!” But here’s the catch: removing them can actually hurt the model’s overall performance. Even experts that seem “quiet” contribute to the model’s overall knowledge and ability to handle a wide range of inputs. Think of it like having specialized knowledge – even if you don’t use it every day, it’s valuable to have when you do need it.
  • Hard to just “prune” them away. So, unlike trimming fat from a dense model, you can’t just easily get rid of experts that seem less busy. They are part of the model’s overall intelligence, even if they are specialized and only activated for certain specific requests.
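
If you wanted to see this sparsity for yourself, one rough way (assuming the toy SimpleMoE router sketched earlier, not any particular real model) is simply to count how often the router picks each expert over a pile of tokens:

```python
# Rough sketch: count how often each expert actually gets picked, assuming the
# toy SimpleMoE from the earlier snippet.
import torch

@torch.no_grad()
def expert_usage(moe, x):
    scores = moe.router(x)                              # (tokens, num_experts)
    _, chosen = scores.topk(moe.top_k, dim=-1)          # indices of the selected experts
    counts = torch.bincount(chosen.flatten(), minlength=len(moe.experts))
    return counts / counts.sum()                        # fraction of routing slots each expert won

# expert_usage(moe, torch.randn(10_000, 512)) will often show a few "hot" experts
# taking most of the traffic while others stay nearly idle.
```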

Good News: Compute Power Gets a Break!

Remember how dense models need all their “neurons” firing all the time? MoEs are much more selective, and this has a fantastic upside:

  • Significantly Lower Compute Requirements. Because only a portion of the model (the chosen experts) is used for each request, the amount of actual computation needed goes down dramatically compared to a dense model of similar size. It’s like our restaurant – if you only order pasta, you only need to use the pasta station, not the entire kitchen at full blast.
  • Efficiency is the name of the game. This reduced compute is a huge win for efficiency and cost-effectiveness. It means you can potentially run larger, more powerful models without needing exponentially more computing power.
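
To put a rough number on it: Deepseek V3 reports about 671B total parameters but only around 37B activated per token. Using the usual back-of-envelope rule of ~2 FLOPs per active parameter per token (a rough estimate, not an exact accounting), the saving versus a same-sized dense model falls straight out of the ratio:

```python
# Back-of-envelope: per-token compute scales with *active* parameters, not total.
# The parameter counts below are Deepseek V3's reported figures (approximate);
# "2 FLOPs per active parameter per token" is the usual rough forward-pass estimate.
total_params  = 671e9    # total parameters across all experts
active_params = 37e9     # parameters actually doing work for a single token

active_fraction  = active_params / total_params
flops_per_token  = 2 * active_params     # ~7.4e10 FLOPs per token for the MoE
dense_equivalent = 2 * total_params      # what a dense model of the same total size would cost

print(f"active fraction:   {active_fraction:.1%}")    # ~5.5%
print(f"MoE FLOPs/token:   {flops_per_token:.2e}")    # ~7.40e+10
print(f"dense FLOPs/token: {dense_equivalent:.2e}")   # ~1.34e+12, roughly 18x more
```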

The Flip Side: Memory Demands Skyrocket! (VRAM, We Meet Again!)

Now for the slightly less rosy part. While compute gets easier, memory gets trickier (oof, couldn’t stop myself):

  • We still need to load the entire model into memory. Even though we only use parts of it at a time, we still need to have all the experts loaded onto our GPUs (those specialized processors for AI). Think of our restaurant again – you still need to have all the chefs and all the stations, even if you’re not using them all at once.
  • “Sparse but Big” means massive memory footprint. MoE models are designed to be sparse in activation but massive in overall size (lots of experts!). This translates to needing a lot more memory (VRAM) on our GPUs to hold the whole thing. This can be a significant challenge, especially as we build even larger MoE models.
  • The Cache Trick: Offloading “Cold” Experts. Clever engineers are working on solutions, like using “caching” techniques. Imagine moving the sushi expert to a less expensive part of the kitchen if sushi orders are rare, and only bringing them back to the main station when needed. In AI, this means offloading experts that are used less frequently to the CPU’s main memory (which is slower but cheaper than GPU memory) and quickly loading them back onto the GPU when they are needed. This helps manage the VRAM crunch.
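
Here’s a deliberately simplified sketch of that offloading idea, assuming PyTorch modules and a single GPU (real serving stacks are far cleverer about prefetching and overlapping transfers): keep a few hot experts resident on the GPU, and copy cold ones over from CPU RAM only when the router asks for them.

```python
# Toy LRU-style cache for expert weights: a few "hot" experts stay resident on the
# GPU, everything else waits in CPU RAM and is copied over on demand.
# A simplified sketch with made-up names, not how any particular serving stack does it.
import copy
from collections import OrderedDict

class ExpertCache:
    def __init__(self, experts, max_on_gpu=2, device="cuda"):
        self.cpu_experts = [e.to("cpu") for e in experts]   # master copies live in CPU RAM
        self.device = device
        self.max_on_gpu = max_on_gpu
        self.gpu_experts = OrderedDict()                    # expert_id -> GPU copy, in LRU order

    def get(self, expert_id):
        if expert_id in self.gpu_experts:                   # cache hit: already on the GPU
            self.gpu_experts.move_to_end(expert_id)
        else:                                               # cache miss: evict coldest, then copy over
            if len(self.gpu_experts) >= self.max_on_gpu:
                self.gpu_experts.popitem(last=False)        # drop the least-recently-used GPU copy
            gpu_copy = copy.deepcopy(self.cpu_experts[expert_id]).to(self.device)
            self.gpu_experts[expert_id] = gpu_copy
        return self.gpu_experts[expert_id]

# Usage idea: cache = ExpertCache(moe.experts, max_on_gpu=2); expert = cache.get(chosen_id)
```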

Expert Parallelism and Big Clusters

To handle these massive MoE models efficiently, especially when you have tons of users making requests at the same time, we often use large clusters of computers working together. MoEs are well-suited for this thanks to “expert parallelism”:

  • Dividing the Experts Across Many GPUs. Imagine distributing our restaurant’s chefs across multiple kitchens in different locations. “Expert parallelism” means we can split the MoE model’s experts and put them on different GPUs (and even different machines) within a large cluster.
  • Scale Out, Not Just Up. This allows for incredible scalability. Instead of just trying to get bigger and bigger individual GPUs (scaling up), we can scale out by adding more GPUs and machines to the cluster. This is often a more cost-effective way to handle massive workloads.
  • Optimized for Large-Scale AI. Expert parallelism is a key technique for making MoE models practical for large-scale AI deployments in cloud environments.
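
To make expert parallelism concrete, here’s a tiny placement-and-dispatch sketch with made-up numbers: spread experts round-robin across devices, then group each token’s work by whichever device owns its chosen expert. Only the bookkeeping is shown; a real system also has to actually move the token data between GPUs (the infamous all-to-all communication), which is where most of the engineering pain lives.

```python
# Minimal sketch of expert-parallel placement: spread experts round-robin across
# devices, then group each token's work by the device that owns its chosen expert.
# Only the bookkeeping is shown; the all-to-all token shuffling a real system needs
# (e.g. via torch.distributed) is left out. All names and numbers are made up.
from collections import defaultdict

def place_experts(num_experts, num_devices):
    return {e: e % num_devices for e in range(num_experts)}   # expert_id -> device_id

def dispatch(token_ids, chosen_experts, placement):
    per_device = defaultdict(list)
    for tok, exp in zip(token_ids, chosen_experts):
        per_device[placement[exp]].append((tok, exp))          # (token, expert) pairs per device
    return dict(per_device)

placement = place_experts(num_experts=8, num_devices=4)        # {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, ...}
work = dispatch(token_ids=[0, 1, 2, 3],
                chosen_experts=[5, 1, 5, 2],                   # pretend router outputs
                placement=placement)
# work == {1: [(0, 5), (1, 1), (2, 5)], 2: [(3, 2)]}  -> device 1 already busier than device 2
```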

The Load Balancing Tightrope: Hot Experts and Cold Experts

Here’s where things get really interesting (and a bit tricky) for cloud providers:

  • Uneven Workload Distribution: “Hot” vs. “Cold” Experts. Remember how some experts are used more than others? Well, when requests come in and get routed, some experts (the “hot” ones) might get slammed with tons of work, while others (the “cold” ones) might sit idle. Think of the pasta chef on a busy Italian festa night compared to the sushi chef that same night.
  • Bottlenecks and Overload. If a few experts are constantly in high demand, the GPUs hosting those “hot” experts can become bottlenecks. They can get overloaded, run out of memory (Out-of-Memory errors - OOM!), and slow everything down.
  • Wasted Resources. Meanwhile, the GPUs hosting the “cold” experts are sitting there, underutilized and wasting resources.
  • Load Balancing is CRITICAL. To make MoE models work smoothly and efficiently in the real world, load balancing is absolutely essential. We need smart systems that can scale expert replicas up and down against the incoming mix of requests and spread the work across the hardware, making sure no single expert or GPU gets overwhelmed while others sit idle. This is a complex orchestration challenge!
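
One very simple way to picture that orchestration: watch how much traffic each expert gets and give the hot ones more replicas (more GPU copies to spread requests across) than the cold ones. The sketch below is just that sizing heuristic with made-up restaurant numbers; production schedulers layer a lot more on top (migration costs, latency targets, predicting future load, and so on).

```python
# Toy load-balancing heuristic: hand out expert replicas in proportion to observed
# traffic, so "hot" experts get spread across more GPUs than "cold" ones.
# Numbers are made up; real schedulers also weigh migration cost, latency targets, etc.

def plan_replicas(requests_per_expert, total_replica_slots):
    total = sum(requests_per_expert.values())
    plan = {}
    for expert_id, reqs in requests_per_expert.items():
        share = reqs / total
        plan[expert_id] = max(1, round(share * total_replica_slots))  # every expert keeps >= 1 copy
    return plan   # (the rounding can over- or under-shoot the slot budget slightly)

observed = {"pasta": 9_000, "burgers": 5_500, "sushi": 300, "dessert": 200}  # requests in the last hour
print(plan_replicas(observed, total_replica_slots=12))
# -> {'pasta': 7, 'burgers': 4, 'sushi': 1, 'dessert': 1}
```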

This “May” Have HUGE Implications for Cloud Infra

So, what does all this mean for the cloud infrastructure that powers AI? Potentially, a lot! MoE models could drive some significant shifts in how cloud providers design and operate their AI services:

  • Smarter Infrastructure: Model-Aware Cluster Management. Cloud providers might need to become much more “model-aware.” Instead of just treating all AI workloads the same, they might need to understand the specific architecture of the models being deployed, especially MoEs.

    • Routing Logic Integration. Imagine the cloud infrastructure needing to understand the “routing logic” of the MoE model itself! The way requests are routed to experts within the model could actually influence how the cloud system schedules tasks and allocates resources across its hardware.
    • Deeper AI Integration = Competitive Edge. Cloud providers who can deeply integrate their infrastructure with the intricacies of MoE models (and other advanced AI architectures) could gain a significant advantage. They can offer more efficient, performant, and cost-effective AI services.
  • Democratizing AI Hardware: Scalability on “Lower-End” Hardware. While memory is a challenge, the reduced compute requirements of MoEs open up some exciting possibilities:

    • Cheaper Hardware Becomes Viable. Because you don’t need massive compute power for every single request, it might become feasible to run large MoE models on less expensive hardware than what’s currently used for giant dense models.
    • Consumer Hardware in the Game? Could we even see scenarios where consumer-grade hardware becomes more relevant for running parts of large AI models, especially with expert parallelism? It’s a bit futuristic, but MoEs could shift the balance.
  • Opening Doors for Non-CUDA Players? A More Diverse AI Hardware Market. The AI hardware market is currently dominated by NVIDIA and their CUDA platform. MoEs might create opportunities for other hardware vendors:

    • Relaxed High-End Compute Pressure. If the extreme compute demands of dense models are somewhat lessened by MoEs, it could reduce the pressure to only use the absolute highest-end, most expensive GPUs.
    • Memory and Scalability Focus. With MoEs, memory capacity and efficient scaling across many devices become even more critical. This could create niches for hardware that excels in these areas, even if it’s not the absolute top performer in raw compute for dense models.
    • Inference on Diverse Hardware. Especially for inference (using the model to answer questions or generate outputs), where volume and memory might be more important than raw compute speed, we could see more diverse hardware options become competitive, potentially including non-CUDA architectures.

Conclusion

The MoE Era: More Complex, Potentially More Accessible, Definitely Transformative

MoE models are not just a tweak to existing AI – they represent a fundamental shift in how we build, deploy, and utilize large AI systems. They bring incredible potential for efficiency and scalability, but also introduce new challenges in memory management, load balancing, and infrastructure design.

As MoEs become more prevalent, expect to see some exciting changes in the cloud AI landscape, potentially leading to more democratized access, more diverse hardware options, and even smarter and more specialized AI services.

Let me know what you think, and if you have more questions!

Note

There have already been MoE-based models from several large companies, e.g. Grok (from xAI), DBRX (from Mosaic), and Mixtral (from Mistral).

So Deepseek isn’t using a totally new kind of MoE model, but it has brought along several other improvements that bring down costs. That’s what makes it interesting.

Useful Papers - if you want to look into technical details

Footnotes

  1. Mixture of Experts LLMs technical details