Move Over Groq, Cerebras Now Has the World’s Fastest AI Inference

Cerebras has finally opened access to its Wafer-Scale Engine (WSE), and it is achieving 1,800 tokens per second while running inference on the Llama 3.1 8B model. As for the larger Llama 3.1 70B model, Cerebras clocks up to 450 tokens per second. Until now, Groq was the fastest AI inference provider, but Cerebras has taken that crown.

Introducing Cerebras Inference
‣ Llama 3.1-70B at 450 tokens/s – 20x faster than GPUs
‣ 60c per M tokens – a fifth the price of hyperscalers
‣ Full 16-bit precision for full model accuracy
‣ Generous rate limits for devs
Try now: https://t.co/50vsHCl8LM pic.twitter.com/hD2TBmzAkw
— Cerebras (@CerebrasSystems) August 27, 2024

Cerebras has developed its own wafer-scale processor that integrates close to 900,000 AI-optimized cores and packs 44GB of on-chip memory (SRAM). As a result, the AI model is stored directly on the chip itself, unlocking groundbreaking bandwidth. Not to mention, Cerebras is running Meta’s full 16-bit precision weights, meaning there is no compromise on accuracy.

Cerebras has set a new record for AI inference speed, serving Llama 3.1 8B at 1,850 output tokens/s and 70B at 446 output tokens/s. @CerebrasSystems has just launched their API inference offering, powered by their custom wafer-scale AI accelerator chips. Cerebras Inference is… pic.twitter.com/WWkTGy1qpE
— Artificial Analysis (@ArtificialAnlys) August 27, 2024

I tested Cerebras’ claim myself, and it generated responses at a breakneck pace. While running the smaller Llama 3.1 8B model, it achieved 1,830 tokens per second, and on the 70B model, Cerebras managed 446 tokens per second. In comparison, Groq pulled 750 T/s and 250 T/s on the 8B and 70B models, respectively.
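The tokens-per-second figures quoted above come down to counting streamed output tokens against wall-clock generation time. A minimal timing harness might look like the sketch below; the `token_stream` argument stands in for whichever provider's streaming API you are benchmarking, and no specific client library is assumed.

```python
import time

def measure_throughput(token_stream):
    """Consume a stream of generated tokens and return
    (token_count, tokens_per_second) based on wall-clock time."""
    start = time.perf_counter()
    count = sum(1 for _ in token_stream)  # count tokens as they arrive
    elapsed = time.perf_counter() - start
    return count, (count / elapsed if elapsed > 0 else float("inf"))
```

In practice you would pass the iterator returned by a streaming completion call, then compare the resulting tokens/s across providers, which is essentially what benchmarks like Artificial Analysis report.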

Artificial Analysis independently reviewed the Cerebras WSE and found that it does deliver unparalleled AI inference speed. You can click here to check out Cerebras Inference yourself.
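For developers who want to try the API offering mentioned in the tweets, requests are typically assembled in the familiar chat-completions shape. The sketch below only builds the request body; the endpoint URL and model identifier are assumptions for illustration, not details confirmed by this article.

```python
# Hypothetical request for an OpenAI-compatible chat-completions endpoint.
# The URL and model name below are assumptions, not confirmed specifics.
API_URL = "https://api.cerebras.ai/v1/chat/completions"  # assumed endpoint

def build_request(model, prompt, max_tokens=256):
    """Assemble a chat-completion request body with streaming enabled,
    so output tokens arrive incrementally and generation speed is visible."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,
    }

body = build_request("llama3.1-70b", "Explain wafer-scale integration.")
```

You would POST this body with your API key in the `Authorization` header; enabling `stream` is what lets you observe the per-token generation rate discussed above.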

Arjun Sha

Passionate about Windows, ChromeOS, Android, and security and privacy issues, with a penchant for solving everyday computing problems.
