Amazon Web Services (AWS) is partnering with hardware startup Cerebras Systems to pair Amazon’s custom Trainium processors with Cerebras’ wafer-scale chips. The collaboration aims to dramatically accelerate artificial intelligence (AI) inference workloads for cloud computing customers worldwide, and to challenge Nvidia’s dominance of the AI infrastructure and hardware market.
The new integrated service will be deployed inside AWS data centers and offered through Amazon Bedrock. Reports conflict on the launch timeline: Bloomberg says the new cloud computing service is expected to roll out in the second half of 2026, while official press statements from AWS and Cerebras indicate it will launch within the next few months. Financial terms of the agreement were not disclosed, but AWS Vice President Nafea Bshara noted that the two companies have been working toward this partnership for several years and that AWS intends to install as many Cerebras chips as market demand dictates.
Tackling the Speed Bottleneck
According to AWS, inference, the phase in which a trained model generates responses, is where AI delivers tangible value to end users. Processing speed, however, remains a critical bottleneck for demanding workloads such as real-time coding assistance and interactive AI applications. As reasoning models come to represent the majority of AI inference, systems must compute and generate far more tokens per request as they “think” through complex problems, sharply increasing the industry-wide pressure to accelerate the AI workflow.
Prominent AI companies including OpenAI, Cognition, and Mistral already use Cerebras hardware to accelerate their most demanding workloads, and Cerebras has shown it can run models from OpenAI, Cognition, and Meta at up to 3,000 tokens per second. That speed matters most for tasks like agentic coding, where a software developer’s productivity is directly constrained by how fast the model can generate output.
The Disaggregated Inference Strategy
To achieve these processing speeds, the partners are deploying a hardware strategy called disaggregated inference. Instead of relying on a single type of graphics processing unit (GPU) for the entire AI pipeline, the workload is split into two specialized computing stages, each running on distinct hardware. The two systems are connected within the AWS cloud using Amazon’s high-bandwidth, low-latency Elastic Fabric Adapter (EFA) networking stack.
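In code terms, the split looks roughly like the sketch below, a minimal Python illustration with entirely hypothetical names: one worker handles the prompt (the “prefill” stage described next), a second worker generates the response (the “decode” stage), and an ordinary function call stands in for the EFA handoff between machines.

```python
from dataclasses import dataclass

@dataclass
class KVCache:
    """Per-request attention state produced by prefill and consumed by decode."""
    prompt_tokens: list[int]
    layers: dict  # placeholder for per-layer key/value tensors

def prefill_worker(prompt: str) -> KVCache:
    # Compute-bound stage: process every prompt token in parallel
    # (Trainium's role in the announced design).
    tokens = [ord(c) for c in prompt]  # stand-in for a real tokenizer
    return KVCache(prompt_tokens=tokens, layers={})

def decode_worker(cache: KVCache, max_new_tokens: int = 8) -> list[int]:
    # Memory-bound stage: emit one token at a time, re-reading model
    # weights on every step (the CS-3's role).
    out: list[int] = []
    for _ in range(max_new_tokens):
        next_token = (sum(cache.prompt_tokens) + len(out)) % 256  # dummy "model"
        out.append(next_token)
    return out

# In production the two stages run on separate machines linked by EFA;
# here a plain function call stands in for that network handoff.
cache = prefill_worker("Explain disaggregated inference.")
print(decode_worker(cache))
```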
The first stage of the inference process is called “prefill”: the model ingests the tokens of the user’s prompt in parallel and builds the internal state it needs before generation can begin. Amazon’s custom Trainium 3 chips, which feature dense compute cores designed for scalable performance, will exclusively handle this highly compute-intensive phase.
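A back-of-the-envelope calculation, using illustrative numbers rather than figures from either company, shows why prefill is compute-bound: a dense forward pass costs roughly two floating-point operations per parameter per token, and prefill applies that pass to every prompt token at once.

```python
# Rough prefill cost; illustrative numbers, not AWS or Cerebras figures.
params = 70e9       # hypothetical 70-billion-parameter model
prompt_len = 4096   # tokens in the user's prompt

# A dense forward pass costs roughly 2 FLOPs per parameter per token,
# and prefill runs that pass over every prompt token at once.
prefill_flops = 2 * params * prompt_len
print(f"~{prefill_flops:.1e} FLOPs to ingest the prompt")  # ~5.7e+14
```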
The second stage, known as “decode,” is a highly memory-intensive process in which the AI model generates its final response token by token. Cerebras’ CS-3 system, built around the company’s wafer-sized Wafer Scale Engine chip, will exclusively manage this decode stage. The chip is designed to store all of a model’s weights directly on-chip in static random-access memory (SRAM), an architecture that gives the CS-3 thousands of times more memory bandwidth than the fastest traditional GPUs on the market.
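The same sort of estimate, again with illustrative numbers rather than vendor specifications, shows why decode is bandwidth-bound and why on-chip SRAM helps: generating each token requires streaming essentially all of the model’s weights, so the per-sequence token rate is capped near memory bandwidth divided by model size.

```python
# Why decode is memory-bound: each new token requires streaming essentially
# all model weights from memory once. Bandwidth figures below are
# illustrative orders of magnitude, not vendor specifications.
weight_bytes = 70e9 * 2   # hypothetical 70B model at 16-bit precision
hbm_bw = 8e12             # ~8 TB/s, roughly a current high-end GPU's HBM
sram_bw = 2.1e16          # ~21 PB/s, the order Cerebras cites for on-chip SRAM

print(f"HBM ceiling:  ~{hbm_bw / weight_bytes:,.0f} tokens/s per sequence")
print(f"SRAM ceiling: ~{sram_bw / weight_bytes:,.0f} tokens/s per sequence")
# ~57 vs ~150,000: the gap is why on-chip weights speed up decode so much.
```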
Industry Impact and Future Outlook
David Brown, Vice President of Compute and Machine Learning Services at AWS, said that separating the inference workload allows each piece of hardware to focus entirely on what it does best, delivering inference speeds an order of magnitude faster than currently available options. Cerebras CEO Andrew Feldman described the hybrid architecture as a “divide and conquer” strategy that will bring the fastest possible inference to a global enterprise customer base.
The hybrid model is also designed for cost efficiency, aiming to deliver five times more high-speed token capacity within the same physical hardware footprint. Later this year, AWS plans to begin offering leading open-source large language models (LLMs) and its proprietary Amazon Nova models running on the new Cerebras hardware.
For Cerebras, a startup preparing for an initial public offering, securing AWS as a client marks a major corporate milestone: AWS is the first major hyperscale data center operator to commit to the company’s technology. While Amazon remains a significant customer of market leader Nvidia, the cloud provider continues to expand its own silicon roadmap, and as inference workloads balloon, cloud providers are increasingly experimenting with heterogeneous hardware architectures as a way around Nvidia’s entrenched CUDA software ecosystem and mature tooling.
