Arthur Hanson
Well-known member
Any thoughts on the WSE-3 chip from Cerebras that has four trillion transistors with 900,000 optimized cores? Is this a game changer or a curiosity?
It's a game changer for AI. Nothing touches it, performance-wise, especially for inference. There are numerous challenges for a wafer-scale processor, especially in applications. They need to keep the cores small, so that a single small defect doesn't write off a large, expensive core's worth of die area. But AI-specific cores are naturally small. The packaging needs to be completely custom, and the cooling system is a work of art; there's no standard high-volume rack chassis for Cerebras. Clustering the modules is a networking challenge too, but Cerebras uses its own internal networking architecture, with 100Gb Ethernet for external communications and a custom gateway for translation to their internal message-passing architecture.
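As a back-of-envelope illustration of the small-core point, here is a rough Poisson yield sketch; every number in it is an assumption for illustration, not a Cerebras figure.

```python
import math

# Back-of-envelope Poisson yield model for a wafer-scale array of tiny cores.
# All numbers are illustrative assumptions, not Cerebras specifications.
defect_density = 0.1                        # random defects per cm^2 (assumed)
usable_wafer_area = 462.0                   # cm^2, roughly the patterned area of a 300 mm wafer
num_cores = 900_000                         # small AI cores tiled across the wafer
core_area = usable_wafer_area / num_cores   # ~0.0005 cm^2 (~0.05 mm^2) per core

# Probability a single tiny core is defect-free, and the expected number of
# cores the whole wafer loses to random defects.
p_core_good = math.exp(-defect_density * core_area)
expected_bad_cores = num_cores * (1.0 - p_core_good)
print(f"expected defective cores: {expected_bad_cores:.0f} of {num_cores}")

# Contrast with one large ~8 cm^2 monolithic die: the chance it escapes any
# defect is far lower, so a single defect writes off a whole expensive chip
# instead of one tiny, easily spared core.
p_big_die_good = math.exp(-defect_density * 8.0)
print(f"probability an 8 cm^2 die is defect-free: {p_big_die_good:.2f}")
```

With those assumed inputs only a few dozen of the 900,000 cores would be lost, which is easy to cover with spare cores and routing; that is the sense in which small cores make wafer scale survivable.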
Does outright performance really matter here though?
Of course, it depends on the application, but performance matters in the usual metrics: latency (time to first token) and throughput (tokens per second). With thousands or tens of thousands of concurrent users, inference performance can be critical to customer application success.
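To make those two metrics concrete, here is a rough sketch of how they are usually computed from a single inference run; the helper and all the numbers are illustrative, not measurements of any particular system.

```python
from dataclasses import dataclass

# Illustrative only -- not measurements of any real system.
@dataclass
class InferenceRun:
    prefill_seconds: float   # time spent processing the prompt; sets time to first token
    output_tokens: int       # tokens generated for the response
    decode_seconds: float    # wall-clock time spent generating them

def time_to_first_token(run: InferenceRun) -> float:
    """Latency metric: how long the user waits before anything appears."""
    return run.prefill_seconds

def tokens_per_second(run: InferenceRun) -> float:
    """Throughput metric: how fast the response streams once it starts."""
    return run.output_tokens / run.decode_seconds

run = InferenceRun(prefill_seconds=0.25, output_tokens=500, decode_seconds=0.5)
print(time_to_first_token(run), tokens_per_second(run))   # 0.25 s, 1000.0 tok/s
```

With many concurrent users, the aggregate tokens per second across all sessions matters as much as the per-user figures.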
> AI is already a highly risky investment for any major firm, and if you're spending billions on something already highly risky -- would you further add to the risk by trying Cerebras's products instead of just buying from Nvidia?

I think Cerebras products are largely still in the investigation, research, and proof-of-concept phases for user applications. I agree, Nvidia clusters are the low-risk approach. I think, but don't have data, that the cloud companies' own AI chips and systems are mostly targeted at internal applications.
> I agree the tech is pretty fantastic, but I think they're just waiting for a buy out from an AMD, Nvidia, or similar at this point?

I doubt it. Cerebras was a unicorn a few years ago, and they have had a plan to go public. As with so many unicorns, I think they're just waiting for exactly the right time to saddle themselves with the complexity of being a public company in exchange for the huge payoff of going public.
> I think they're amazing.

I agree - they have found a way to build wafer-scale chips, an endeavor that has killed many companies before them, and to harness the WSE-3 / CS-3 on one of the most relevant problems of the times, LLM inference. Their on-chip static memory and its associated bandwidth give them pretty much unrivaled performance on LLM benchmarks, plus the short on-chip interconnect gives them a power advantage on a per-compute basis. Where I think they fall down a bit is on their capital cost per million tokens in a many-user environment.
> Where I think they fall down a bit is on their capital cost per million tokens in a many-user environment.

I don't know how to answer this. Cerebras offers their own cloud services for their systems, and the pricing is publicly advertised. I'm not ambitious enough to do the rigorous analysis, but it is claimed to be competitive with competing cloud services. If you want to actually buy a Cerebras system, their business terms are confidential and pricing isn't published.
In theory their tech should give them the best perf/watt -- which should let them scale down cost per token in a large environment further than Nvidia's offerings.
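For what it's worth, the cost-per-million-tokens comparison boils down to a simple amortization. A sketch with placeholder inputs (none of these are real Cerebras or Nvidia numbers; the hard part is getting the real ones):

```python
# Rough cost-per-million-tokens model. Every input below is a placeholder;
# neither vendor publishes these figures in directly comparable form.
def cost_per_million_tokens(system_price_usd: float,
                            amortization_years: float,
                            power_kw: float,
                            electricity_usd_per_kwh: float,
                            aggregate_tokens_per_second: float) -> float:
    seconds_owned = amortization_years * 365 * 24 * 3600
    capital_per_second = system_price_usd / seconds_owned            # $/s of ownership
    energy_per_second = power_kw * electricity_usd_per_kwh / 3600.0  # $/s of electricity
    usd_per_token = (capital_per_second + energy_per_second) / aggregate_tokens_per_second
    return usd_per_token * 1_000_000

# Placeholder inputs, purely to show the shape of the calculation.
print(cost_per_million_tokens(system_price_usd=2_000_000,
                              amortization_years=4,
                              power_kw=25,
                              electricity_usd_per_kwh=0.10,
                              aggregate_tokens_per_second=20_000))   # ~0.83 USD per million tokens
```

The perf/watt advantage shows up in the energy term, but the "capital cost per million tokens" concern above lives in the first term: system price divided by the aggregate throughput you can actually sustain.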
Iirc their largest stakeholder (70+%?) is in the Middle East.
I like their approach; I think it's a clever way to work with what we have today. From what I glean of their documents, they take the whole wafer without dicing and route redistribution (?) layers over the reticle boundaries. Please correct me if I'm wrong. If true, this technique would make for fascinating reading all by itself.
A question: is all the memory in the wafer itself, or do they add more atop the wafer, somewhat like AMD's 3D V-Cache products?

They have external memory units called MemoryX, which are connected via their proprietary inter-WSE interconnect called SwarmX. This article describes the off-chip distributed DRAM memory at a high level:
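My loose mental model of how that off-chip memory fits in is weight streaming: weights live in the external MemoryX units and get streamed onto the wafer a layer at a time while activations stay on-chip. The classes below are toy stand-ins to illustrate that flow, not Cerebras's actual software interface.

```python
from typing import Dict, List

# Toy illustration of the weight-streaming idea: the model's weights sit in an
# external memory appliance and are streamed to the wafer one layer at a time,
# while activations remain resident on the wafer. Stand-ins only -- not
# Cerebras's real API.
class ExternalWeightStore:
    def __init__(self, weights_by_layer: Dict[str, List[int]]):
        self.weights_by_layer = weights_by_layer

    def fetch(self, layer: str) -> List[int]:
        return self.weights_by_layer[layer]        # streamed over the interconnect

class Wafer:
    def load_weights(self, weights: List[int]) -> None:
        self.current_weights = weights             # placed alongside the cores

    def compute(self, activations: List[int]) -> List[int]:
        # stand-in for one layer's worth of on-wafer compute
        return [a + w for a, w in zip(activations, self.current_weights)]

def run_layers(layers: List[str], store: ExternalWeightStore,
               wafer: Wafer, activations: List[int]) -> List[int]:
    for layer in layers:
        wafer.load_weights(store.fetch(layer))     # weights stream in per layer
        activations = wafer.compute(activations)   # activations never leave the wafer
    return activations

store = ExternalWeightStore({"layer1": [1, 2, 3], "layer2": [4, 5, 6]})
print(run_layers(["layer1", "layer2"], store, Wafer(), [0, 0, 0]))   # [5, 7, 9]
```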
> I don't know how to answer this. Cerebras offers their own cloud services for their systems, and the pricing is publicly advertised. I'm not ambitious enough to do the rigorous analysis, but it is claimed to be competitive with competing cloud services. If you want to actually buy a Cerebras system, their business terms are confidential and pricing isn't published.

Ah, but they are benchmarked for performance on specific models in terms of speed (MTokens/s), time to first token, and cost per million tokens by guys like Artificial Analysis.
I admit, I'm biased. I hate benchmarks. I used to be a DBMS development guy, and given the complexity of systems, software architecture, and data structures, you could often design a specific benchmark that your DBMS was better at, or cheaper at, than the competition. The sales and marketing people always wanted to win at something, especially something the customers thought was important. I hated those discussions. What really mattered was the whole weighted portfolio of capabilities. But I digress.
But benchmarks vary greatly based on other parameters like batch size and number of concurrent end users. Nvidia was trolling Cerebras a week or so ago with one particular case where Cerebras endpoint performance was inferior to one of their smallish DGX8 boxes (and hence gave Nvidia a huge cost advantage).
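That is the crux: the same hardware looks very different depending on how the load is sliced. A toy illustration (all numbers invented) of why batch size and concurrency move the headline figures so much:

```python
# Toy illustration (invented numbers): batching typically raises aggregate
# throughput sub-linearly, so per-user tokens/s falls as concurrency rises.
def per_user_tokens_per_second(aggregate_tps: float, concurrent_users: int) -> float:
    return aggregate_tps / concurrent_users

for users, aggregate_tps in [(1, 1800.0), (8, 6000.0), (64, 14000.0)]:
    print(f"{users:>3} users: {aggregate_tps:>8.0f} tok/s aggregate, "
          f"{per_user_tokens_per_second(aggregate_tps, users):>7.1f} tok/s per user")
```

A single-user benchmark rewards raw speed, while a many-user benchmark rewards aggregate throughput per dollar, so each vendor can pick a case it wins.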