I'm looking forward to comparing this to Inception 2 (the text diffusion model) which in my experience is very fast and reasonably high quality.
by nl
|
Mar 21, 2026, 12:14:20 PM
I'm not sure that I buy their conclusion that more compute during inference is good.

Yes, batch=1 inference is mostly memory-bandwidth bound, not GPU compute bound. But no provider does batch=1 inference. Everyone groups requests together into a batch, and the GPU computes them at once.

With a fused kernel, that means the GPU streams the weight tensors from VRAM once and does a bunch of compute on the different conversations in the batch at the same time.

If they increase the amount of compute required per token, that just reduces the maximum batch size a GPU can handle. In practice, yes, this does mean each GPU can serve fewer users. Providers aren't normally leaving GPU cores idle during inference.
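A rough roofline sketch of that trade-off (all numbers below are illustrative assumptions, not vendor specs): with a fused decode kernel the weights are streamed from VRAM once per step while compute scales linearly with batch size, so the batch size at which decoding stops being memory-bound is inversely proportional to per-token FLOPs.

```python
# Hypothetical accelerator figures (assumptions for illustration only)
PEAK_FLOPS = 1.0e15   # ~1 PFLOP/s dense compute
MEM_BW     = 3.3e12   # ~3.3 TB/s HBM bandwidth

def max_batch_before_compute_bound(weight_bytes, flops_per_token):
    """Largest batch size for which a decode step is still memory-bound:
    weights are streamed once per step regardless of batch size, while
    compute grows linearly with it."""
    load_time = weight_bytes / MEM_BW               # seconds to stream weights
    return load_time * PEAK_FLOPS / flops_per_token

# Toy 70B-parameter model at 1 byte/weight, ~2 FLOPs per parameter per token
base = max_batch_before_compute_bound(70e9, 140e9)
# Doubling per-token compute halves the break-even batch size
heavier = max_batch_before_compute_bound(70e9, 280e9)
print(base, heavier)   # break-even batch shrinks from ~151 to ~76
```

So under these assumptions, extra per-token compute doesn't come for free once you batch: it directly cuts how many concurrent requests a GPU can serve before compute, not bandwidth, becomes the bottleneck.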
by jychang
|
Mar 21, 2026, 12:14:20 PM
> Mamba-3 is a new state space model (SSM) designed with inference efficiency as the primary goal — a departure from Mamba-2, which optimized for training speed. The key upgrades are a more expressive recurrence formula, complex-valued state tracking, and a MIMO (multi-input, multi-output) variant that boosts accuracy without slowing down decoding.

Why can't they simply say:

Mamba-3 focuses on being faster and more efficient when making predictions, rather than just being fast to train like Mamba-2.
by robofanatic
|
Mar 21, 2026, 12:14:20 PM