The National Cancer Institute's New AI Strategy: From Data Silos to Knowledge Engines

The National Cancer Institute (NCI) is fundamentally reimagining how cancer researchers access and use data, moving from scattered databases to unified, AI-ready systems that can extract insights across genomics, imaging, and clinical information simultaneously. Under new director Anthony G. Letai, MD, PhD, the agency is prioritizing artificial intelligence as a cornerstone of its mission, with a particular focus on transforming how researchers discover patterns hidden in massive, messy datasets.

Why Is the NCI Overhauling Its Data Infrastructure Now?

For decades, cancer researchers have faced a persistent problem: data exist everywhere, but the systems that hold them don't talk to each other. A patient's genomic profile lives in one system, their imaging scans in another, and their treatment outcomes in a third. This fragmentation wastes time and obscures connections that could accelerate drug discovery and personalized treatment.

The NCI Cancer Research Data Commons (CRDC) was created to solve this, but as the volume and variety of cancer data have exploded, a new challenge has emerged. The real bottleneck is no longer data availability; it is semantic harmonization, the work of reconciling the different labels and formats that institutions use for the same medical concepts. One hospital might code a diagnosis as "adenocarcinoma, stage 2B," while another uses "AC-2B," making it nearly impossible for AI systems to learn across institutions.
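To make the labeling problem concrete, here is a minimal sketch in Python of what harmonizing those two labels could look like. The lookup table, the canonical form ("adenocarcinoma", "IIB"), and the function name are illustrative assumptions, not part of any NCI or CRDC tool.

```python
# Hypothetical lookup table: each local label maps to one canonical
# (histology, stage) pair. Real systems would draw on a curated
# terminology rather than a hand-written dictionary.
LOCAL_TO_CANONICAL = {
    "adenocarcinoma, stage 2b": ("adenocarcinoma", "IIB"),
    "ac-2b": ("adenocarcinoma", "IIB"),
}


def harmonize(label):
    """Return the canonical (histology, stage) pair, or None if unmapped."""
    # Normalize case and whitespace so trivial variations still match.
    key = " ".join(label.strip().lower().split())
    return LOCAL_TO_CANONICAL.get(key)


if __name__ == "__main__":
    for raw in ("adenocarcinoma, stage 2B", "AC-2B", "unlisted label"):
        print(raw, "->", harmonize(raw))
```

Returning None for an unmapped label, rather than guessing, keeps unknown codes visible so a curator can review them.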

How Is the NCI Leveraging AI to Connect Cancer Data?

To address this constraint on a national scale, the NCI has partnered with ARPA-H's Biomedical Data Fabric (BDF) Toolbox program, a government initiative focused on accelerating biomedical breakthroughs through better data infrastructure. Together, they are building technologies centered on three core capabilities:

  • Ontology-Driven Modeling: Creating standardized vocabularies so that a cancer diagnosis means the same thing across every hospital and research center in the country.
  • Standardized Metadata Frameworks: Ensuring that every piece of data carries consistent context, like patient age, treatment date, and outcome, so AI systems can reliably learn from it.
  • Semantic Crosswalks: Building automated bridges between different data formats so that legacy systems can communicate with modern AI pipelines without manual translation.
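The second capability, consistent metadata, can be illustrated with a short Python sketch that checks whether a record carries the context fields named above (patient age, treatment date, outcome). The field names, types, and the function are assumptions made for illustration; the article does not describe the BDF Toolbox's actual schema.

```python
from datetime import date

# Hypothetical required fields and the type each one should have.
REQUIRED_FIELDS = {
    "patient_age": int,
    "treatment_date": date,
    "outcome": str,
}


def metadata_problems(record):
    """List the ways a record falls short of the assumed metadata framework."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type: {name}")
    return problems


if __name__ == "__main__":
    complete = {
        "patient_age": 61,
        "treatment_date": date(2024, 3, 5),
        "outcome": "remission",
    }
    incomplete = {"patient_age": "61", "outcome": "remission"}
    print(metadata_problems(complete))
    print(metadata_problems(incomplete))
```

A check like this would typically run before data enters a shared repository, so that downstream models never see records with missing context.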

Within the CRDC ecosystem, these BDF capabilities strengthen cross-commons interoperability by aligning data elements across genomic, imaging, and clinical domains. More importantly, they enable construction of longitudinal, computable patient knowledge graphs, a technical approach that treats each patient's entire medical history as an interconnected network of facts that AI can reason about.
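A longitudinal patient knowledge graph can be pictured as a set of dated facts linked to a patient. The Python sketch below is one simple way to represent that idea; the class, predicate names, and sample facts are invented for illustration and do not reflect how CRDC stores data.

```python
from collections import defaultdict
from datetime import date


class PatientGraph:
    """Toy graph: each subject keeps dated (predicate, object) facts."""

    def __init__(self):
        self._facts = defaultdict(list)

    def add_fact(self, subject, predicate, obj, when):
        # Store the date first so facts sort chronologically.
        self._facts[subject].append((when, predicate, obj))

    def timeline(self, subject):
        """Return the subject's facts in date order (the longitudinal view)."""
        return sorted(self._facts.get(subject, []))

    def objects(self, subject, predicate):
        """Return every object linked to the subject by the given predicate."""
        return [o for _, p, o in self.timeline(subject) if p == predicate]


if __name__ == "__main__":
    graph = PatientGraph()
    graph.add_fact("patient-1", "received_treatment", "chemotherapy", date(2023, 2, 1))
    graph.add_fact("patient-1", "has_diagnosis", "adenocarcinoma", date(2023, 1, 10))
    graph.add_fact("patient-1", "has_imaging", "CT scan", date(2023, 1, 5))
    for fact in graph.timeline("patient-1"):
        print(fact)
```

Because genomic, imaging, and clinical facts all hang off the same patient node, a query can cross data types without knowing which source system each fact came from.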

The Childhood Cancer Data Initiative (CCDI) exemplifies the real-world impact of this strategy. By contributing harmonized pediatric oncology datasets into CRDC, CCDI leverages BDF-enabled semantic frameworks to normalize diagnoses, treatments, biospecimens, and outcomes across institutions. This means researchers studying childhood leukemia can now access standardized data from dozens of hospitals simultaneously, rather than negotiating individual data-sharing agreements with each one.

What Does This Mean for Cancer Researchers and Patients?

The shift from data aggregation to structured knowledge generation has three immediate implications. First, researchers can now build portable feature engineering pipelines for multimodal AI workflows, meaning they can train machine learning models on imaging plus genomics plus clinical data all at once, rather than analyzing each data type separately. Second, the standardized infrastructure reduces the time researchers spend on data preparation and increases the time spent on actual discovery. Third, findings from one institution can be more easily validated across other institutions, accelerating the path from research to clinical practice.
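As a rough sketch of the first implication, the Python function below joins genomic, imaging, and clinical features into one row per patient, which is the kind of step a multimodal pipeline needs before model training. The function, prefixes, and feature names are hypothetical, and the approach assumes all three sources already share harmonized patient identifiers.

```python
def join_modalities(genomic, imaging, clinical):
    """Merge per-patient feature dicts from three sources into one row each.

    Only patients present in all three sources are kept, and every feature
    name is prefixed with its source so columns never collide.
    """
    shared_ids = genomic.keys() & imaging.keys() & clinical.keys()
    sources = (("gen", genomic), ("img", imaging), ("clin", clinical))
    rows = {}
    for patient_id in sorted(shared_ids):
        row = {}
        for prefix, source in sources:
            for name, value in source[patient_id].items():
                row[f"{prefix}.{name}"] = value
        rows[patient_id] = row
    return rows


if __name__ == "__main__":
    genomic = {"p1": {"tp53_mutated": 1}, "p2": {"tp53_mutated": 0}}
    imaging = {"p1": {"tumor_mm": 23.5}}
    clinical = {"p1": {"age": 61}, "p2": {"age": 47}}
    print(join_modalities(genomic, imaging, clinical))
```

Dropping patients who lack any one data type is the simplest policy; a real pipeline might instead keep them and mark the missing values.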

"NCI has never been in a stronger position to advance its mission of reducing suffering from cancer," said Jon Retzlaff, AACR's chief policy officer, noting that Director Letai's priorities include immuno-oncology, cancer vaccines, functional precision medicine, artificial intelligence, and ensuring that the United States maintains its lead over China in medical research.

The NCI is also investing heavily in the underlying technologies that make this possible. Two cornerstone programs, the Innovative Molecular Analysis Technologies (IMAT) and Informatics Technology for Cancer Research (ITCR), provide sustained funding for technologies across their full lifecycle, from early-stage, high-risk ideas to widely adopted research tools. These programs have already supported development of transformative capabilities including liquid biopsy techniques, spatial omics platforms, multidimensional microscopy systems, extrachromosomal DNA assays, and artificial intelligence applications.

Beyond the technical infrastructure, the NCI is also strengthening its commitment to the next generation of cancer researchers. The agency offers numerous fellowships and career development pathways for students, postdoctoral scholars, and early-career scientists, including the R50 program, which provides salary support for exceptional research specialists and clinician scientists. These investments signal that the NCI views AI-enabled cancer research not as a temporary trend, but as the foundational approach for the next decade of discovery.

Director Letai will outline his full vision for the agency at the AACR Annual Meeting 2026 in San Diego, where he is expected to discuss how the NCI's renewed focus on AI, data harmonization, and international competitiveness will reshape cancer research priorities across the nation.