The interface layer for AI products is no longer fixed. Before 2026, product teams designed screens and flows with predictable inputs: keyboard, touch, or click. Today, companies must actively choose whether users interact via voice, text, image, video, documents, or some combination of these. The modality decision has become as important as the AI model itself, and getting it wrong can undermine everything that made a product valuable in the first place.

Pinterest's product teams recently clashed with their CEO over this exact question. The CEO advocated for a voice-first AI assistant, believing that conversational interfaces would make shopping feel like "talking to a friend" and align with Gen Z expectations. But the company's designers and product leaders pushed back hard, arguing that forcing voice onto a platform built around quiet, visual discovery would destroy its core value proposition.

## What Makes Modality Choice So Critical for AI Products?

The Pinterest dispute highlights a fundamental tension in 2026 product design. Before AI, the modality question barely existed because the answer was almost always the same. Now, teams must evaluate whether voice, text, image, or video best serves their users' actual behavior and expectations. A voice-first experience might feel natural for a voice assistant but catastrophic for a platform where users browse quietly and visually discover products.

Recent examples show how companies are experimenting with different modality combinations. Google's updated Stitch agent went viral in design circles for letting users talk to an infinite canvas while mixing voice, text, and images in one space, with an AI agent understanding the entire project context. This represents a different approach: combining modalities rather than choosing one.

## How to Choose the Right Modality for Your Product

- **User Context and Environment:** Consider where and how users actually interact with your product.
A shopping app used during quiet browsing sessions may suffer from voice-first design, while a task management tool used by engineers might thrive with voice delegation for quick commands.
- **Core Value Proposition Alignment:** Evaluate whether the new modality enhances or undermines what made users choose your product originally. Pinterest's visual discovery experience could be damaged by forcing conversational voice interactions.
- **Multimodal Combination Testing:** Rather than choosing a single modality, test how combining voice, text, and images works together. Google Stitch and Replit Agent 4 demonstrate that the real power lies in letting users switch between modalities within a single session.

A practical framework is emerging across leading companies. According to analysis of 30+ real-world examples from Google, Lyft, Notion, Shopify, Revolut, Zendesk, Headspace, and DoorDash, product teams are asking five core questions when introducing a new modality: Does this modality match user behavior? Does it enhance or replace existing workflows? Can users switch between modalities seamlessly? Does it improve or degrade the core experience? And what happens when modalities combine?

## Where Is Voice Actually Heading in 2026?

Voice isn't disappearing, but its role is shifting. Spotify's engineers now start every day with voice task delegation on their phones, suggesting that voice works best for quick, hands-free commands rather than as the primary interaction. This signals a move away from voice-first interfaces toward voice as one option among many.

The real innovation isn't happening in single-modality products. Companies like Google are showing what's possible when voice, image, and text work together in a single session. Users can speak to describe a project, paste images for reference, type detailed notes, and have the AI agent understand the complete context across all modalities.
This hybrid approach preserves the strengths of each modality while avoiding the weaknesses of forcing one onto users who don't want it.

The broader implication is clear: in 2026, the modality question isn't about picking the best single interface. It's about understanding your users deeply enough to know which modalities serve their actual workflows, and then designing seamless transitions between them.

Pinterest's designers understood this instinctively. Their CEO's voice-first vision might have worked for a different product, but it would have destroyed the quiet, visual experience that makes Pinterest valuable. That's the real lesson emerging from product teams across the industry right now.
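The five-question framework described earlier can be sketched as a simple checklist. This is purely illustrative: the function, question names, and the score thresholds are assumptions of this sketch, not an established tool or scoring method from any of the companies mentioned.

```python
# Hypothetical sketch of the five-question modality framework.
# All names, weights, and thresholds below are illustrative assumptions.

QUESTIONS = [
    "matches_user_behavior",        # Does this modality match user behavior?
    "enhances_existing_workflows",  # Does it enhance (not replace) existing workflows?
    "supports_seamless_switching",  # Can users switch between modalities seamlessly?
    "improves_core_experience",     # Does it improve the core experience?
    "combines_well",                # Does it play well when modalities combine?
]

def evaluate_modality(name: str, answers: dict) -> tuple:
    """Count 'yes' answers and return (score, rough verdict)."""
    score = sum(1 for q in QUESTIONS if answers.get(q, False))
    if score == len(QUESTIONS):
        verdict = "strong fit"
    elif score >= 3:
        verdict = "prototype and test"
    else:
        verdict = "likely undermines the core experience"
    return score, f"{name}: {verdict}"

# Pinterest-style example: voice-first on a quiet, visual platform.
score, verdict = evaluate_modality("voice-first", {
    "matches_user_behavior": False,        # users browse quietly
    "enhances_existing_workflows": False,  # replaces visual discovery
    "supports_seamless_switching": True,
    "improves_core_experience": False,
    "combines_well": True,                 # fine as one option among many
})
print(score, verdict)  # 2 voice-first: likely undermines the core experience
```

The point of the sketch is the shape of the decision, not the numbers: a modality that fails most of the five questions is a warning sign, regardless of how compelling it seems in isolation.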