The newly launched Nvidia Nemotron 3 Super is a 120-billion-parameter open artificial intelligence model designed to run complex, autonomous agentic AI systems at scale. By combining advanced reasoning capabilities with rapid processing, the model aims to solve major bottlenecks that enterprises face when shifting from basic chatbots to multi-agent applications.
The model delivers up to five times higher throughput and twice the accuracy of the previous-generation Nemotron Super. It utilizes a hybrid mixture-of-experts architecture, activating only 12 billion of its total parameters during operation. This efficiency allows businesses to run massive AI workflows without facing crushing computational costs.
Solving the Bottlenecks of Agentic AI Systems
As companies build multi-agent AI applications, they run into two significant hurdles: context explosion and the thinking tax.
Context explosion happens because multi-agent workflows generate up to 15 times more tokens than standard chat interactions. Every time a user interacts with an agent, the system must resend the entire workflow history, including tool outputs and intermediate reasoning steps. Over long tasks, this massive volume of data drives up costs and causes goal drift, a problem where agents gradually lose alignment with their original objective.
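The cost dynamic described above can be sketched in a few lines: when every step must resend the full history, total tokens processed grow quadratically with workflow length. The step counts and per-step token sizes below are illustrative assumptions, not Nemotron-specific figures.

```python
# Illustrative sketch of context explosion: each agent step resends the
# entire workflow history (tool outputs, reasoning) plus its own output.
# Numbers are assumptions for illustration only.

def cumulative_tokens(steps: int, tokens_per_step: int) -> int:
    """Total tokens processed when step i must reprocess the history
    of all previous steps before adding its own output."""
    total = 0
    history = 0
    for _ in range(steps):
        history += tokens_per_step   # history grows by one step's output
        total += history             # each step reprocesses the whole history
    return total

chat = cumulative_tokens(steps=2, tokens_per_step=500)    # short chat exchange
agent = cumulative_tokens(steps=30, tokens_per_step=500)  # long multi-agent run
print(chat, agent, agent / chat)
```

A two-turn chat stays cheap, while a thirty-step agentic run reprocesses two orders of magnitude more tokens, which is why long workflows drive up costs so sharply.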
To fix this, the Nvidia Nemotron 3 Super features a native one-million-token context window. This massive memory capacity allows the model to retain the full state of a workflow, giving agents the long-term memory required to stay on track.
The second constraint is the thinking tax. Complex autonomous agents must reason at every step of a task. Relying on massive models for every tiny subtask makes these systems too expensive and sluggish for practical use. The new model solves this by utilizing its specialized architecture to balance deep reasoning with high-speed performance.
A Deeper Look at the Hybrid Architecture
Nvidia Nemotron 3 Super achieves its performance leap through a hybrid Mamba-Transformer mixture-of-experts backbone. This design blends different types of processing layers to maximize efficiency.
Mamba layers handle sequence processing, delivering four times higher memory and compute efficiency. Meanwhile, Transformer attention layers handle precise reasoning, allowing the model to find specific facts buried deep within conflicting information.
The mixture-of-experts design ensures that only a fraction of the model works at any given time. A new feature called Latent MoE compresses tokens before they reach the expert pathways. This allows the model to consult four times as many specialized experts for the same computational cost as consulting a single full-size expert. For example, it can activate experts specialized in Python syntax or SQL logic only when strictly necessary.
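The Latent MoE idea can be sketched as follows: tokens are projected down to a smaller latent dimension before routing, so several narrow experts cost the same compute as one full-width expert. All dimensions, expert counts, and the top-k routing scheme below are illustrative assumptions, not Nemotron 3 Super's actual configuration.

```python
import numpy as np

# Sketch of latent-space mixture-of-experts routing. With d_latent half
# of d_model, four latent experts cost the same FLOPs as one full-width
# expert: 4 * 256**2 == 512**2. All sizes are illustrative assumptions.

rng = np.random.default_rng(0)
d_model, d_latent, n_experts, top_k = 512, 256, 16, 4

W_down = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)   # compress
W_up = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)    # decompress
router = rng.normal(size=(d_latent, n_experts)) / np.sqrt(d_latent)
experts = rng.normal(size=(n_experts, d_latent, d_latent)) / np.sqrt(d_latent)

def latent_moe(x: np.ndarray) -> np.ndarray:
    z = x @ W_down                          # token -> compressed latent space
    logits = z @ router
    top = np.argsort(logits)[-top_k:]       # route to the top-k experts only
    weights = np.exp(logits[top])
    weights /= weights.sum()                # softmax over the selected experts
    out = sum(w * (z @ experts[i]) for i, w in zip(top, weights))
    return out @ W_up                       # back to the model dimension

y = latent_moe(rng.normal(size=d_model))
print(y.shape)
```

Because each expert operates on the compressed representation, the router can afford to consult more of them per token; the rest of the model's parameters stay idle.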
Additionally, the model uses multi-token prediction. Instead of predicting one token at a time, it predicts multiple future tokens simultaneously in a single forward pass. This forces the model to learn long-range logical dependencies and speeds up inference by three times.
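One common way to implement multi-token prediction is to run the backbone once and attach a separate output head per future position. The sketch below follows that pattern; the hidden size, vocabulary, and three prediction heads are illustrative assumptions, not the model's actual design.

```python
import numpy as np

# Sketch of multi-token prediction: a single forward pass produces a
# shared hidden state, then k heads each predict a different future
# token. Shapes and the 3 heads are illustrative assumptions.

rng = np.random.default_rng(0)
d_hidden, vocab, k_future = 64, 1000, 3

backbone = rng.normal(size=(d_hidden, d_hidden)) / np.sqrt(d_hidden)
heads = rng.normal(size=(k_future, d_hidden, vocab))  # one head per position

def predict_next_k(x: np.ndarray) -> list[int]:
    h = np.tanh(x @ backbone)                            # one forward pass
    return [int(np.argmax(h @ head)) for head in heads]  # k tokens at once

tokens = predict_next_k(rng.normal(size=d_hidden))
print(len(tokens))  # 3 future tokens from a single pass
```

Emitting several tokens per pass is what yields the inference speedup, and training each head to see further ahead pushes the model to capture longer-range dependencies.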
Hardware Efficiency and Open Availability
Nvidia optimized the model to run on its Blackwell graphics processing units using NVFP4 precision. Training natively in this four-bit format cuts memory requirements and delivers inference speeds up to four times faster than FP8 on the previous-generation Hopper platform, without sacrificing accuracy.
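The core idea behind block-scaled four-bit formats can be sketched with a toy quantizer: weights are split into small blocks, and each block stores one scale plus 4-bit codes. The block size of 16 and the simple symmetric integer scheme below are illustrative assumptions standing in for NVFP4's actual floating-point (e2m1) encoding.

```python
import numpy as np

# Toy block-scaled 4-bit quantizer in the spirit of NVFP4. Each block of
# 16 weights shares one scale; values are stored as 4-bit codes (held in
# int8 containers here). This is an illustrative sketch, not the format.

def quantize_4bit_blocks(w: np.ndarray, block: int = 16):
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # map block to [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q * scale).ravel()

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, s = quantize_4bit_blocks(w)
err = np.abs(dequantize(q, s) - w).mean()
print(q.shape, err)  # 4-bit codes per block, small mean reconstruction error
```

Per-block scaling is what keeps the error low: each group of weights gets its own dynamic range, so four bits go much further than a single global scale would allow.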
Nvidia is releasing the model with open weights, datasets, and recipes under a permissive license. The company published its methodology, though its official sources present slightly different training figures. According to the Nvidia blog, the model includes over 10 trillion tokens of pre-training and post-training datasets and used 15 reinforcement learning environments. Conversely, Nvidia’s developer blog specifies the pre-training corpus spans 10 trillion unique curated tokens, with 25 trillion total tokens seen, and notes it was post-trained across 21 environment configurations.
Developers can download the model from platforms like Hugging Face, OpenRouter, and the Nvidia build website. It is packaged as a microservice, allowing deployment across on-premises workstations, data centers, and the cloud.
Early Adoption Across Industries
Perplexity is the first partner to offer users access to the new model. The AI search company is integrating the model into its search engine and its agentic browser, Comet, to handle complex research and information synthesis.
Other organizations are already deploying the model for specialized tasks. Generative AI coding platforms like CodeRabbit, Factory, and Greptile are using it to load entire codebases into memory for end-to-end debugging. In the life sciences sector, Edison Scientific and Lila Sciences are powering agents for molecular understanding and deep literature research.
Major enterprises, including Palantir, Amdocs, Siemens, Cadence Design Systems, and Dassault Systèmes, are using the model to automate workflows in cybersecurity, telecommunications, semiconductor design, and manufacturing. Furthermore, Dell Technologies and Hewlett Packard Enterprise plan to offer the model through their enterprise agent hubs, alongside cloud integrations from Google Cloud, Oracle, Microsoft Azure, and Amazon Web Services.
