The Data Trapped in Scientific Papers: How AI Is Unlocking Hidden Materials Research

A team at Tohoku University has created an AI system that solves a critical bottleneck in materials science: extracting experimental data buried inside scientific paper figures and tables. The system, called DIVE (Descriptive Interpretation of Visual Expression), uses multiple AI agents working together to read, interpret, and organize data from research papers in structured form, then proposes new material candidates. This breakthrough addresses a fundamental problem that has slowed materials discovery for years.

Why Are Figures and Tables Such a Problem for AI?

Materials scientists have long relied on data-driven artificial intelligence to explore new materials efficiently. However, much of the most valuable experimental data in materials research exists only as images embedded in scientific papers. A researcher looking to build a database of hydrogen storage materials, for example, would need to manually extract numbers from hundreds or thousands of published figures and tables, a process that is time-consuming, error-prone, and nearly impossible to scale.

General-purpose AI models struggle with this task because they try to extract and interpret everything at once. DIVE takes a different approach by breaking the problem into specialized steps, with different AI agents handling different aspects of the work.

How Does DIVE Actually Work?

The DIVE system uses a multi-agent workflow where each AI agent has a specific role in the data extraction process. Rather than one model attempting to read figures, interpret captions, and verify numbers all at once, DIVE divides the labor. One agent focuses on understanding figure content, another interprets captions, and a third verifies that the extracted numbers are consistent and accurate. This step-by-step approach achieved substantial improvements in both accuracy and applicability compared to conventional methods that rely on a single multimodal model.
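To make the division of labor concrete, here is a minimal Python sketch of three agents passing one record down a pipeline. It is an illustration only, not DIVE's actual code: the names Extraction, figure_agent, caption_agent, verification_agent, and run_pipeline are hypothetical, and simple string rules stand in for the AI models a real system would call.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Extraction:
    """Working record handed from one agent to the next (hypothetical structure)."""
    figure_text: str                      # stand-in for what a vision model reads off a figure
    caption: str
    values: List[float] = field(default_factory=list)
    unit: Optional[str] = None
    verified: bool = False
    notes: List[str] = field(default_factory=list)


def figure_agent(rec: Extraction) -> Extraction:
    # Assumption: the figure has already been turned into text; pull out the numbers.
    rec.values = [float(x) for x in re.findall(r"-?\d+(?:\.\d+)?", rec.figure_text)]
    return rec


def caption_agent(rec: Extraction) -> Extraction:
    # Assumption: the caption states the unit; only weight percent is recognized here.
    if "wt%" in rec.caption:
        rec.unit = "wt%"
    else:
        rec.notes.append("unit not found in caption")
    return rec


def verification_agent(rec: Extraction) -> Extraction:
    # Plausibility check: values must exist, have a unit, and lie between 0 and 100 wt%.
    ok = bool(rec.values) and rec.unit == "wt%" and all(0 <= v <= 100 for v in rec.values)
    rec.verified = ok
    if not ok:
        rec.notes.append("failed plausibility check")
    return rec


PIPELINE = [figure_agent, caption_agent, verification_agent]


def run_pipeline(rec: Extraction) -> Extraction:
    # Each agent does one job and passes the record on.
    for agent in PIPELINE:
        rec = agent(rec)
    return rec
```

Because each stage has a single responsibility, a failure can be traced to one agent through the notes field rather than to an opaque end-to-end model.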

The benchmark results bear this out. In tests focused on the hydrogen storage materials field, DIVE achieved extraction accuracy that was 10 to 15 percent higher than general commercial multimodal models, and more than 30 percent higher than open-source models.

Steps to Accelerate Materials Discovery With AI-Powered Data Extraction

  • Build Specialized Multi-Agent Workflows: Instead of relying on a single AI model to handle all aspects of data extraction, divide the task among specialized agents, each with a distinct role in reading, interpreting, and validating information from scientific literature.
  • Organize Extracted Data Into Structured Databases: Once figures and tables are read and interpreted, compile the data into organized, machine-readable databases that can be analyzed systematically for patterns and insights across thousands of publications.
  • Apply Inverse-Design Workflows to Propose Candidates: Use the structured database as a foundation to run computational workflows that work backward from desired material properties to suggest new candidate materials that may not yet exist in published literature.
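The second and third steps above can be sketched in Python with the standard-library sqlite3 module. This is a simplified stand-in, not the DigHyd workflow: it only filters and ranks existing entries against target properties, whereas a real inverse-design workflow would also propose compositions that are not yet in the database. The table layout, the function names build_db and propose, and every row of data are placeholders invented for illustration.

```python
import sqlite3

# Placeholder rows: (material, capacity in wt%, operating temperature in C, source).
# These values are invented for the example and come from no paper or database.
ROWS = [
    ("material_A", 6.8, 310.0, "paper_1"),
    ("material_B", 1.2, 30.0, "paper_2"),
    ("material_C", 5.1, 110.0, "paper_3"),
]


def build_db(rows):
    """Load extracted entries into an in-memory, machine-readable table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries ("
        "material TEXT, capacity_wt REAL, temp_c REAL, source TEXT)"
    )
    conn.executemany("INSERT INTO entries VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def propose(conn, min_capacity_wt, max_temp_c, limit=5):
    """Start from desired properties and return matching entries, best capacity first."""
    cur = conn.execute(
        "SELECT material, capacity_wt, temp_c, source FROM entries "
        "WHERE capacity_wt >= ? AND temp_c <= ? "
        "ORDER BY capacity_wt DESC, temp_c ASC LIMIT ?",
        (min_capacity_wt, max_temp_c, limit),
    )
    return cur.fetchall()


if __name__ == "__main__":
    db = build_db(ROWS)
    # Ask for at least 5 wt% capacity at no more than 150 C.
    for row in propose(db, min_capacity_wt=5.0, max_temp_c=150.0):
        print(row)
```

Keeping the source column alongside each value means every proposed candidate can be traced back to the publication it was extracted from.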

From Reading Papers to Discovering Materials in Minutes

The Tohoku University team demonstrated the practical power of DIVE by building DigHyd (Digital Hydrogen Platform), an AI agent-based infrastructure for systematic analysis. They organized more than 30,000 data entries extracted from over 4,000 publications into this database. Using this foundation, they constructed an inverse-design workflow capable of proposing new hydrogen storage material candidates in as little as about two minutes.

To put this in perspective, what once required weeks or months of manual literature review and data entry can now be accomplished in minutes. The system doesn't just read the papers; it interprets the extracted data using scientific reasoning and uses that interpretation to suggest materials that researchers should investigate further.

"The team developed the DIVE multi-agent AI workflow with the goal of not just reading figures and tables in scientific papers but also interpreting the extracted data on the basis of scientific reasoning," explained the research team led by Professor Hao Li and Director Shin-ichi Orimo from the Advanced Institute for Materials Research at Tohoku University.


What Materials Could Benefit From This Approach?

While the Tohoku team demonstrated DIVE's capabilities using hydrogen storage materials, the system is designed to be broadly applicable across multiple fields. The researchers identified several areas where DIVE could streamline database construction and accelerate materials exploration:

  • Battery Materials: Extract performance data from thousands of battery research papers to identify promising new electrode or electrolyte compositions.
  • Catalysts: Systematically organize experimental results from catalysis research to propose new materials for chemical reactions and industrial processes.
  • Thermoelectric Materials: Compile data on thermal and electrical properties from published research to discover materials that convert heat to electricity more efficiently.

What Comes Next for AI-Powered Materials Research?

The Tohoku team's work is far from finished. Future development will focus on expanding DIVE's compatibility with a wider range of figure and table formats, since scientific papers use diverse ways to present data. The researchers also plan to enhance autonomous materials design workflows that make use of the extracted data, moving beyond simple candidate proposal toward more sophisticated computational design.

The ultimate vision is ambitious: building a new research infrastructure in which AI reads scientific literature, including all figures and tables, to accelerate materials discovery at a scale that would be impossible for human researchers alone. This approach could fundamentally change how materials scientists work, transforming the field from one where researchers manually hunt through papers to one where AI systematically extracts and synthesizes knowledge from the entire published record.

For materials science, this represents a shift from asking "What data can we find?" to asking "What new materials should we make?" The bottleneck is no longer the lack of information in published research; it's the ability to extract and act on that information quickly. DIVE and systems like it are removing that bottleneck, one figure at a time.