Meta's Muse Spark Marks a Dramatic Shift Away From Open-Source Llama
Meta has abandoned its signature open-source approach to artificial intelligence, launching Muse Spark, a proprietary model that achieves a score of 52 on the Artificial Analysis Intelligence Index, nearly tripling Llama 4 Maverick's 18 and placing it among the world's most capable AI systems. The move represents a fundamental strategic overhaul following the mixed reception of Llama 4, which faced criticism over benchmark gaming and prompted Meta founder Mark Zuckerberg to restructure the company's entire AI division in summer 2025.
What Happened to Meta's Llama Models?
For nearly three years, Meta built a massive following by releasing the Llama family of large language models (LLMs), AI systems trained on vast amounts of text to understand and generate human language, as open-source software that anyone could download and modify. This strategy made Meta a beloved figure in the AI developer community. However, the rollout of Llama 4 last year derailed that momentum: the model received mixed reviews, and Meta later admitted to gaming benchmarks, the standardized tests used to measure AI performance. The backlash prompted Zuckerberg to take dramatic action.
When asked directly whether Meta would continue developing future Llama models, a Meta spokesperson gave an evasive response, stating only that "our current Llama models will continue to be available as open source," without confirming ongoing development of new versions. This ambiguity has already raised concerns across Llama's vast user base and among the thousands of developers who built applications on the open-source models, particularly those active in communities like Reddit's r/LocalLLaMA subreddit.
How Does Muse Spark Represent a Technical Leap Forward?
Muse Spark is built from the ground up as a natively multimodal model, meaning it processes both text and images simultaneously rather than treating them as separate inputs. Unlike previous approaches that "stitched" vision and text together, Muse Spark integrates visual information directly into its core reasoning process. This architectural shift enables what Meta calls "visual chain of thought," allowing the model to annotate complex environments, identify components of intricate systems like espresso machines, and even correct a user's yoga form through side-by-side video analysis.
The most significant technical innovation is a feature called "Contemplating" mode, which orchestrates multiple sub-agents to reason in parallel, allowing Meta to compete directly with heavyweight reasoning models like Google's Gemini Deep Think and OpenAI's GPT-5.4 Pro. The model also employs a process called "thought compression," which penalizes excessive "thinking time" during reinforcement learning, a training technique that uses rewards to guide model behavior. This forces the model to solve complex problems with fewer reasoning tokens, or units of text, without sacrificing accuracy. The result is remarkable efficiency: Muse Spark achieves its reasoning capabilities using over an order of magnitude less compute than Llama 4 Maverick, its previous flagship.
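Meta has not published details of how thought compression works, but the description matches a well-known reward-shaping pattern in reinforcement learning: the reward for a correct answer is docked in proportion to how many reasoning tokens the model spends beyond a budget. The sketch below illustrates that general pattern only; the function name, budget, and penalty values are hypothetical, not Meta's.

```python
# Illustrative length-penalized reward of the kind "thought compression"
# describes: correct answers earn reward, but every reasoning token spent
# beyond a budget subtracts from it, so training favors shorter chains of
# thought. All names and constants are hypothetical, not Meta's.

def compressed_reward(is_correct: bool,
                      reasoning_tokens: int,
                      token_budget: int = 1024,
                      penalty_per_token: float = 0.001) -> float:
    """Reward = accuracy term minus a penalty on excess reasoning tokens."""
    accuracy_term = 1.0 if is_correct else 0.0
    excess = max(0, reasoning_tokens - token_budget)
    return accuracy_term - penalty_per_token * excess

# A correct answer within budget keeps the full reward; a correct but
# verbose one is worth less, nudging the policy toward concise reasoning.
reward_short = compressed_reward(True, 800)    # within budget: full reward
reward_long = compressed_reward(True, 3024)    # 2,000 tokens over: penalized
```

Under this kind of objective, the policy can only keep its accuracy reward by finding solutions that fit inside the token budget, which is consistent with the claim that Muse Spark solves problems with fewer reasoning tokens without sacrificing accuracy.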
Where Does Muse Spark Rank Against Global Competitors?
According to independent auditing from Artificial Analysis, a third-party LLM tracking firm, Muse Spark now sits within striking distance of the industry's most elite systems. It trails only Gemini 3.1 Pro Preview (57), GPT-5.4 (57), and Claude Opus 4.6 (53) on the Artificial Analysis Intelligence Index v4.0. This represents a dramatic return to form for Meta after a year-long absence from the absolute frontier of AI performance.
Muse Spark demonstrates particular dominance in multimodal reasoning, where visual figures and logic intersect. On CharXiv Reasoning, a benchmark measuring figure understanding, Muse Spark achieved a score of 86.4, significantly outperforming Claude Opus 4.6 (65.3), Gemini 3.1 Pro (80.2), and GPT-5.4 (82.8). On MMMU Pro, a comprehensive multimodal benchmark, Muse Spark scored 80.4–80.5%, making it the second-most capable vision model on the market, surpassed only by Gemini 3.1 Pro Preview at 83.9%. On Visual Factuality, Muse Spark scored 71.3, placing it ahead of GPT-5.4 (61.1) and Grok 4.2 (57.4), though it narrowly trails Gemini 3.1 Pro (72.4).
How Does Muse Spark Perform on Specialized Reasoning Tasks?
Muse Spark's "Thinking" capabilities were tested against specialized benchmarks designed to be out of reach for non-reasoning models. On Humanity's Last Exam, a multidisciplinary evaluation, Meta reports a score of 42.8 without tools and 50.4 with tools, though independent audits by Artificial Analysis tracked the model at 39.9%, trailing Gemini 3.1 Pro Preview (44.7%) and GPT-5.4 (41.6%). On GPQA Diamond, a PhD-level reasoning benchmark, Muse Spark achieved 89.5, surpassing Grok 4.2 (88.5) but trailing Opus 4.6 (92.7) and Gemini 3.1 Pro (94.3).
The model shows notable weakness in abstract reasoning. On ARC AGI 2, Muse Spark scored 42.5, far behind Gemini 3.1 Pro (76.5) and GPT-5.4 (76.1). However, in physics research benchmarks, Muse Spark achieved the fifth-highest score at 11%, marking a substantial lead over Gemini 3 Flash (9%) and Claude 4.6 Sonnet (3%).
Why Is Muse Spark Particularly Strong in Healthcare Applications?
One of the most striking results from official data is Muse Spark's performance in health-related tasks, likely a result of Meta's collaboration with over 1,000 physicians. On HealthBench Hard, Muse Spark achieved 42.8, a massive lead over Claude Opus 4.6 (14.8), Gemini 3.1 Pro (20.6), and even GPT-5.4 (40.1). On MedXpertQA, a multimodal medical benchmark, Muse Spark scored 78.4, comfortably ahead of Opus 4.6 (64.8) and Grok 4.2 (65.8), though it still trails Gemini 3.1 Pro's top-tier score of 81.3.
What Defines Muse Spark's Strategic Positioning?
- Proprietary Distribution Model: Unlike Llama, Muse Spark is confined to Meta's AI app and website, plus a private API preview for select users, with no pricing information announced yet, representing a complete reversal of Meta's open-source philosophy.
- Performance Benchmarking: Muse Spark ranks in the global top 5 for AI models, with particularly strong performance in multimodal reasoning and healthcare applications, validating Meta's new scaling trajectory after Llama 4's disappointing reception.
- Organizational Restructuring: Meta's formation of Meta Superintelligence Labs (MSL) in summer 2025, led by 29-year-old Alexandr Wang, former Scale AI co-founder and CEO, signals a fundamental shift in how the company approaches AI development and strategy.
- Efficiency Gains: Muse Spark achieves its capabilities using over an order of magnitude less compute than Llama 4 Maverick through a process called thought compression, which optimizes reasoning without sacrificing accuracy.
Alexandr Wang, who was recruited from Scale AI to lead Meta's new AI division, announced the launch on X, the social network used frequently by the machine learning community. Wang stated that Muse Spark is "the most powerful model that meta has released," and has "support for tool-use, visual chain of thought, and multi-agent orchestration." He also indicated that Muse Spark would be the start of a new Muse family of models, raising questions about what will become of Meta's popular Llama lineup.
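The "multi-agent orchestration" Wang mentions, like the "Contemplating" mode described earlier, maps onto a familiar pattern: fan a problem out to several sub-agents that reason concurrently, then aggregate their answers, for example by majority vote. The sketch below illustrates that generic pattern with a stubbed-out model call; nothing here reflects Meta's actual implementation or API.

```python
# Minimal sketch of the parallel sub-agent pattern implied by
# "Contemplating" mode: workers reason over the same problem concurrently
# and an aggregator majority-votes their answers. The sub_agent function
# is a deterministic stand-in; a real system would call an LLM here.

from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def sub_agent(problem: str, seed: int) -> str:
    # Stand-in for a model call; each seed represents a different
    # line of reasoning that may reach a different answer.
    return f"answer-{seed % 2}"  # hypothetical divergent answers

def contemplate(problem: str, n_agents: int = 5) -> str:
    """Fan out to n_agents in parallel, then majority-vote the answers."""
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        answers = list(pool.map(lambda s: sub_agent(problem, s),
                                range(n_agents)))
    return Counter(answers).most_common(1)[0][0]

print(contemplate("What is 2 + 2?"))  # prints "answer-0" (3 of 5 agree)
```

Majority voting is only one aggregation strategy; a production system might instead have a judge model rank the sub-agents' reasoning traces, but the fan-out/aggregate structure is the same.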
Muse Spark arrives not as a generic chatbot, but as the foundation for what Wang calls "personal superintelligence," an AI that doesn't just process text but "sees and understands the world around you" to act as a digital extension of the self. This vision echoes Zuckerberg's public manifesto for personal superintelligence published in summer 2025, suggesting that Meta's strategic pivot is not merely about model performance but about a fundamentally different vision for how AI should integrate into human life.
While Muse Spark excels at reasoning tasks, its performance in real-world work execution presents a more nuanced picture. On SWE-Bench Verified, which measures software engineering capabilities, Muse Spark scored 77.4, trailing Claude Opus 4.6 (80.8) and Gemini 3.1 Pro (80.6). This suggests that while the model "thinks" exceptionally well, it is still refining its ability to "act" in practical scenarios. The proprietary nature of Muse Spark, combined with the ambiguity surrounding Llama's future, represents a watershed moment for Meta and the broader AI ecosystem that has grown accustomed to the company's open-source generosity.