VellaTimes
© 2022 VellaTimes • All Rights Reserved.
News

Microsoft Releases Phi-4-Reasoning-Vision-15B: A Compact Multimodal AI

Sameer Katoch
Last updated: March 9, 2026

Microsoft has officially launched Phi-4-reasoning-vision-15B, an open-weight multimodal artificial intelligence model featuring 15 billion parameters. This new system is specifically tailored for vision-language applications, allowing the AI to effectively process and analyze both text and images. The model demonstrates exceptional capabilities in generating image captions, analyzing complex documents, and performing mathematical and scientific reasoning based on visual inputs.

Contents
  • A New Approach to Hybrid Reasoning
  • Mid-Fusion Architecture and Hardware Efficiency
  • Training Process and Data Refinement
  • Outperforming Larger Models on Benchmarks
  • Powering AI Agents and Visual Analysis

By introducing a compact, hardware-efficient system, Microsoft aims to provide a powerful alternative to larger, resource-heavy AI models. Phi-4-reasoning-vision-15B stands out for its hybrid reasoning capabilities. This training approach lets the model decide when a task requires a multi-step thought process and when a direct answer is sufficient, saving computing time.

A New Approach to Hybrid Reasoning

One of the most defining features of this model is its mixed reasoning and non-reasoning training strategy. Instead of forcing the AI to use a complex chain-of-thought process for every single prompt, Microsoft trained the system to alternate seamlessly between two distinct modes.

For complicated challenges, such as mathematical problems or scientific chart evaluations, the model activates a “think” mode. This generates structured, multi-step reasoning traces to ensure high accuracy. For simpler, perception-focused tasks like basic image captioning or optical character recognition, it relies on a “no-think” mode to provide immediate, low-latency responses.

This hybrid setup was achieved by making reasoning data approximately 20 percent of the overall training mixture. While the AI learns the boundary between these modes implicitly, users retain full control. Developers can override the default behavior by using specific prompt tags to force the model into either state depending on their exact needs.
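
As a rough illustration of the override described above, the sketch below prepends a mode tag to a user prompt. The tag strings `<think>` and `<nothink>` and the helper name `build_prompt` are assumptions made for this example. The article does not spell out the exact tags, so the model card should be checked before relying on them.

```python
from typing import Optional

# Hypothetical tag strings: the article only says "specific prompt tags".
MODE_TAGS = {"think": "<think>", "no-think": "<nothink>"}


def build_prompt(question: str, mode: Optional[str] = None) -> str:
    """Return the prompt, optionally prefixed with a tag that forces a mode.

    mode=None leaves the choice to the model's learned default behavior.
    """
    if mode is None:
        return question
    if mode not in MODE_TAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    return f"{MODE_TAGS[mode]} {question}"


print(build_prompt("Transcribe the text in this receipt.", mode="no-think"))
print(build_prompt("What is the slope of the line in this chart?", mode="think"))
print(build_prompt("Describe this image."))
```

Leaving `mode` unset mirrors the default case, where the model picks the mode on its own.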

Mid-Fusion Architecture and Hardware Efficiency

To balance performance with compute costs, Microsoft researchers built the model using a mid-fusion architecture. The system combines two existing components: the SigLIP-2 vision encoder and the previously released Phi-4-Reasoning language model.

SigLIP-2 encodes each image as a sequence of numerical vectors, known as visual tokens, that a neural network can process. These tokens are then projected into the language model’s embedding space. In a mid-fusion setup, only some of the model’s layers support multimodal processing, unlike early-fusion designs where every layer handles multimodal data.
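
The projection step can be pictured with a toy example. The sketch below is not Microsoft's implementation: the dimensions, the random weights, and the single linear projector are all assumptions chosen to keep the example small. It only shows how vectors from a vision encoder can be mapped to the same width as text embeddings so that both form one input sequence.

```python
import random

random.seed(0)

VISION_DIM = 4  # toy size of each vision-encoder output vector (assumed)
EMBED_DIM = 6   # toy size of the language model's embedding space (assumed)


def random_matrix(rows, cols):
    """Build a rows x cols matrix of random values as stand-in data."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(tokens, weights):
    """Apply a linear map: each VISION_DIM vector becomes an EMBED_DIM vector."""
    return [
        [sum(x * w for x, w in zip(token, column)) for column in zip(*weights)]
        for token in tokens
    ]


visual_tokens = random_matrix(3, VISION_DIM)      # stand-in for encoder output
text_embeddings = random_matrix(5, EMBED_DIM)     # stand-in for embedded text
projector = random_matrix(VISION_DIM, EMBED_DIM)  # stand-in for learned weights

projected = project(visual_tokens, projector)
# After projection both modalities share one width and can form one sequence.
sequence = projected + text_embeddings
print(len(sequence), len(sequence[0]))  # prints: 8 6
```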

This strategic design trades a minimal amount of output quality for a massive reduction in hardware usage. To lower the infrastructure footprint even further, users can completely disable the reasoning feature via prompts if they want to prioritize pure speed. The model’s dynamic resolution vision encoder supports up to 3,600 visual tokens, ensuring detailed high-resolution perception without the sluggish latency often found in larger models.
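
To give a sense of what a 3,600-token ceiling means for image size, here is a back-of-the-envelope sketch. The 16-pixel patch size, the one-token-per-patch rule, and the downscaling strategy are assumptions for illustration only. The article reports the token cap but not how the encoder tiles or resizes images.

```python
import math

MAX_VISUAL_TOKENS = 3600  # cap reported in the article
PATCH_SIZE = 16           # assumed patch edge in pixels, for illustration only


def visual_token_count(width, height, patch=PATCH_SIZE):
    """Number of patches needed to tile the image, one token per patch."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def fit_to_budget(width, height, budget=MAX_VISUAL_TOKENS, patch=PATCH_SIZE):
    """Shrink the image, keeping aspect ratio, until it fits the budget."""
    scale = 1.0
    tokens = visual_token_count(width, height, patch)
    if tokens > budget:
        scale = math.sqrt(budget / tokens)
    while True:
        w = max(1, int(width * scale))
        h = max(1, int(height * scale))
        tokens = visual_token_count(w, h, patch)
        if tokens <= budget:
            return w, h, tokens
        scale *= 0.99  # nudge down if rounding pushed us over the cap


print(fit_to_budget(640, 480))    # small image: returned unchanged
print(fit_to_budget(1920, 1080))  # large image: scaled down to fit the cap
```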

Training Process and Data Refinement

Microsoft managed to train the model efficiently over just four days using 240 B200 GPUs. The model processed 200 billion multimodal tokens during its training phase. This is a mere fraction of the trillion-plus tokens required to train other recent multimodal models currently on the market.
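
The reported figures can be turned into rough throughput numbers. The calculation below assumes the run lasted exactly four days with all 240 GPUs busy the whole time, which the article does not confirm, so the results are order-of-magnitude estimates only.

```python
GPUS = 240
DAYS = 4
TOKENS = 200e9  # 200 billion multimodal tokens

gpu_hours = GPUS * DAYS * 24
seconds = DAYS * 24 * 3600
tokens_per_second = TOKENS / seconds
tokens_per_gpu_second = tokens_per_second / GPUS

print(f"GPU-hours: {gpu_hours:,}")
print(f"Aggregate throughput: {tokens_per_second:,.0f} tokens/s")
print(f"Per-GPU throughput: {tokens_per_gpu_second:,.0f} tokens/s")
print(f"Share of one trillion tokens: {TOKENS / 1e12:.0%}")
```

Under those assumptions the run comes to roughly 23,000 GPU-hours, and 200 billion tokens is one fifth of a trillion.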

The training data primarily consisted of open-source image and text collections, but Microsoft heavily refined this data through a multi-step process. High-quality datasets were preserved, while images featuring inaccurate captions were given entirely new, corrected descriptions generated by GPT-4o and o4-mini. The researchers also enriched the training mix with internally created data, targeted acquisitions, and specific safety datasets designed to prevent harmful outputs.

Outperforming Larger Models on Benchmarks

Despite its highly compact size, the model achieved impressive results across numerous open-source evaluations using testing frameworks like Eureka ML Insights and VLMEvalKit. On the MathVista_Mini benchmark, which specifically tests multimodal mathematics, the model scored 75.2, outperforming Google’s gemma-3-12b-it by a significant 17 percent margin.
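
A "17 percent margin" can be read two ways, and the article does not say which is meant. The short calculation below shows the baseline score each reading would imply. Both baselines are derived from the 75.2 figure purely for illustration and are not reported results.

```python
PHI_SCORE = 75.2
MARGIN = 17.0

# Reading 1: a gap of 17 percentage points.
baseline_points = PHI_SCORE - MARGIN
# Reading 2: a 17 percent relative improvement over the baseline.
baseline_relative = PHI_SCORE / (1 + MARGIN / 100)

print(f"Implied baseline if 17 points:           {baseline_points:.1f}")
print(f"Implied baseline if 17 percent relative: {baseline_relative:.1f}")
```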

The model also recorded notable scores on several other tests. It achieved an 84.8 on AI2D_TEST, an 83.3 on ChartQA_TEST, an 88.2 on ScreenSpot_v2, and a 76.0 on OCRBench. Microsoft researchers note that the model delivers better accuracy than similarly fast models and offers highly competitive performance against slower models that require ten times more computing power.

Powering AI Agents and Visual Analysis

With its ability to accurately detect graphical user interface elements, the model is exceptionally well-suited for computer-use agents. It can interpret screen content, deduce the exact functions of buttons and menus from standard screenshots, and provide precise click coordinates for automation. This functionality makes it an ideal base model for navigating web, mobile, and desktop interfaces.
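
An automation script still has to turn the model's answer into an actual click. The sketch below assumes, hypothetically, that the model returns a target as normalized coordinates between 0 and 1. The article does not describe the output format, so the function name and the convention are illustrative.

```python
from typing import Tuple


def to_pixels(rel_x: float, rel_y: float, width: int, height: int) -> Tuple[int, int]:
    """Map normalized (0..1) coordinates to a pixel inside the screenshot."""
    if not (0.0 <= rel_x <= 1.0 and 0.0 <= rel_y <= 1.0):
        raise ValueError("coordinates must lie in [0, 1]")
    # Clamp so that a value of exactly 1.0 still lands on the last pixel.
    x = min(width - 1, int(rel_x * width))
    y = min(height - 1, int(rel_y * height))
    return x, y


print(to_pixels(0.5, 0.25, 1920, 1080))  # prints: (960, 270)
print(to_pixels(1.0, 1.0, 1920, 1080))   # prints: (1919, 1079)
```

Working in normalized coordinates keeps the same answer valid if the screenshot is resized before the click is issued.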

The system also excels at analyzing complicated visual assets. In a demonstration shared by Microsoft, a user uploaded a photograph of a tilted Saturn. The model accurately explained that the planet’s orientation depended entirely on the time of year and the specific position of the telescope used to capture the image. Developers can now access the model through Hugging Face, GitHub, and Azure AI Foundry.
