AI Health Chatbots Are Everywhere Now, but Nobody's Really Testing Them
Major tech companies are rolling out AI health chatbots to the general public without independent expert evaluation, raising concerns about safety in a high-stakes area where mistakes can harm users. Microsoft launched Copilot Health in March 2026, Amazon expanded its Health AI tool beyond One Medical members, and OpenAI's ChatGPT Health arrived in January. These products tap into genuine demand: Microsoft reports receiving 50 million health questions daily, with health being the most popular discussion topic on its Copilot mobile app. Yet researchers worry that companies are evaluating their own products without external oversight, potentially missing critical blind spots.
Why Are Tech Companies Rushing to Launch Health Chatbots?
The timing isn't accidental. Large language models (LLMs), which are AI systems trained on vast amounts of text to generate human-like responses, have improved dramatically in their ability to discuss health topics. But capability alone doesn't explain the rush. The real driver is a healthcare access crisis. "There is a reason that these tools exist and they have a position in the overall landscape," explained Girish Nadkarni, chief AI officer at the Mount Sinai Health System. "That's because access to health care is hard, and it's particularly hard for certain populations." With millions of people unable to reach doctors easily, these chatbots offer a 24/7 alternative that feels less judgmental than calling a clinic.
Dominic King, vice president of health at Microsoft AI and a former surgeon, cited AI advancement as the core reason why the company formed its health team and launched Copilot Health. The combination of improved AI capabilities and massive user demand has created what tech leaders see as a pivotal moment. "Even before our health products, we were seeing just a rapid, rapid increase in the rate of people using ChatGPT for health-related questions," noted Karan Singhal, who leads OpenAI's Health AI team.
What's the Problem With Testing These Tools Before Release?
The concern isn't that health chatbots are inherently dangerous; it's that they're being deployed at scale without the kind of independent testing that would catch serious flaws. A recent study from Mount Sinai researchers found that ChatGPT Health sometimes recommends too much care for mild conditions and fails to identify emergencies. While OpenAI has suggested the study's methodology might not capture the full picture, the research surfaced a troubling pattern: these tools are reaching millions of users before independent experts have thoroughly evaluated them.
Companies do test their products internally. OpenAI designed HealthBench, a benchmark that scores LLMs on how they respond in realistic health conversations, though the conversations themselves are AI-generated rather than real patient interactions. When GPT-5, which powers both ChatGPT Health and Copilot Health, was released last year, OpenAI reported that it performed substantially better than previous models on HealthBench, though overall performance was "far from perfect." But internal benchmarks have blind spots that external researchers might catch.
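To make concrete what a benchmark like this measures, here is a minimal sketch of rubric-style scoring, in which each model response is checked against a list of criteria and given a fractional score. The criteria, the sample response, and the keyword-matching grader are hypothetical simplifications introduced purely for illustration; this is not how HealthBench itself is implemented.

```python
# Minimal sketch of rubric-style scoring for LLM health responses.
# All criteria, the sample answer, and the keyword-based grader below are
# hypothetical simplifications, not HealthBench's actual method.

from dataclasses import dataclass


@dataclass
class Criterion:
    description: str      # what a good answer should do
    keywords: list[str]   # crude proxy for "did the answer address this?"


def grade_response(response: str, rubric: list[Criterion]) -> float:
    """Return the fraction of rubric criteria the response appears to meet."""
    text = response.lower()
    met = sum(1 for c in rubric if any(k in text for k in c.keywords))
    return met / len(rubric) if rubric else 0.0


# Example: scoring one simulated answer to a chest-pain question.
rubric = [
    Criterion("Advises emergency care for red-flag symptoms", ["911", "emergency"]),
    Criterion("Asks about symptom duration or severity", ["how long", "severity"]),
    Criterion("Avoids giving a definitive diagnosis", ["can't diagnose", "cannot diagnose"]),
]

model_response = (
    "I can't diagnose you, but chest pain with shortness of breath can be "
    "an emergency. Please call 911 or go to the nearest emergency department."
)

print(f"Rubric score: {grade_response(model_response, rubric):.2f}")  # prints 0.67
```

A keyword grader this crude would never be used in practice, which is part of the point: how a benchmark is graded, and by whom, shapes what "performing well" actually means.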
Andrew Bean, a doctoral candidate at the Oxford Internet Institute, conducted a study revealing a critical gap: even if an LLM can accurately identify a medical condition from a fictional scenario on its own, a non-expert user given the same scenario with LLM assistance might figure it out only one-third of the time. "If they lack medical expertise, users might not know which parts of a scenario, or their real-life experience, are important to include in their prompt, or they might misinterpret the information that an LLM gives them," Bean explained. This performance gap could be significant for OpenAI's models, especially in conversations requiring the chatbot to ask follow-up questions.
How Do These Health Chatbots Actually Work?
Health chatbots use large language models to process user questions and generate responses based on patterns learned from training data. They can access user medical records if granted permission, and they're designed to provide health advice, help with triage (deciding whether someone needs emergency care), and answer general wellness questions. The technology behind them is the same as general-purpose AI assistants like ChatGPT, but fine-tuned for health contexts.
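As a rough illustration of that pattern, the sketch below routes a user's question through a generic chat-completions API with a health-oriented system prompt. The system prompt, model name, and triage framing are assumptions made for this example; the actual products are fine-tuned and connected to medical records in ways that are not public, so read this as the general shape of the approach rather than any company's implementation.

```python
# Rough sketch: a general-purpose LLM steered toward cautious health answers
# by a system prompt. Illustrative only; not how ChatGPT Health or Copilot
# Health are actually built.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical health-oriented system prompt.
SYSTEM_PROMPT = (
    "You are a health information assistant, not a clinician. Do not diagnose "
    "or prescribe. Ask clarifying questions when key details are missing, flag "
    "symptoms that may need emergency care, and recommend consulting a doctor "
    "for any treatment decision."
)


def ask_health_question(question: str) -> str:
    """Send one health question through a standard chat-completions call."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice, not the products' actual model
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(ask_health_question("I've had a mild headache for two days. Should I worry?"))
```

Notice that everything health-specific in this sketch lives in the prompt; the products described above go further, with fine-tuning and optional access to medical records, which is why their real-world behavior has to be evaluated rather than inferred.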
The vision is appealing: chatbots could improve user health while reducing pressure on the healthcare system. If triage works well, patients with emergencies might seek care earlier, and those with mild concerns might manage symptoms at home instead of clogging emergency rooms and doctors' offices. But that vision depends entirely on the chatbots being accurate and reliable in ways that haven't been proven at scale.
Steps for Using Health Chatbots Safely
- Treat them as information sources, not diagnosis tools: All major health chatbots include disclaimers stating they're not intended for diagnosis or treatment, but these warnings are easy to ignore. Use them to gather general information, not to replace medical judgment.
- Provide complete context when asking questions: Research shows that users without medical training often omit important details when describing symptoms. Be thorough about your medical history, current medications, and specific symptoms to get more accurate responses.
- Verify critical advice with a healthcare provider: If a chatbot recommends urgent care, emergency treatment, or a specific diagnosis, confirm it with a doctor before acting. Don't rely solely on AI for high-stakes health decisions.
- Understand the limitations of AI-generated conversations: These tools can miss context that a trained clinician would catch, especially in complex cases. They're best used for routine questions, not for unusual or serious symptoms.
What Would Better Testing Look Like?
Ideally, health chatbots would be subjected to controlled tests with real human users before being released to the public, similar to the study Bean conducted. This would reveal how non-experts actually interact with these tools and where they struggle. However, this approach faces practical barriers: the AI world moves fast, human testing takes time, and companies are under pressure to capitalize on demand.
All six academic experts interviewed for the MIT Technology Review piece agreed that LLM health chatbots could have real benefits given how little healthcare access some people have. But every single one expressed concerns that these tools are being launched without testing from independent researchers to assess whether they are safe. "To the extent that you always are going to need more health care, I think we should definitely be chasing every route that works," said Andrew Bean. "It's entirely plausible to me that these models have reached a point where they're actually worth rolling out. But the evidence base really needs to be there."
"We all know that people are going to use it for diagnosis and management," said Adam Rodman, an internal medicine physician and researcher at Beth Israel Deaconess Medical Center and a visiting researcher at Google.
Even the companies' own reporting hints at uneven progress: OpenAI has reported that GPT-5.4, its current flagship model, is actually worse at seeking context than GPT-5.2, an earlier version. Models can improve in some areas while regressing in others, and when evaluation happens entirely in-house, outsiders have little way to know how those trade-offs play out.
The real question isn't whether AI can help with health advice; even the researchers raising concerns agree it could. The question is whether deploying these tools to millions of people without independent expert review is responsible when the stakes are this high. As these chatbots become more integrated into how people think about their health, the pressure to test them rigorously will only grow.