Basecamp Research is attempting something the broader AI industry has largely ignored: tracing where training data comes from and paying communities when it creates value. As the UK biotech startup expands its "Trillion Gene Atlas" project to collect genomic data from over 100 million species across thousands of sites worldwide, it's confronting a thorny ethical question that could reshape how AI companies source biological information from developing nations and indigenous communities.

Why Are Tech Companies Suddenly Paying for DNA Data?

The problem isn't new, but it's becoming impossible to ignore. When OpenAI trained ChatGPT on internet-scraped text, it faced dozens of lawsuits from publishers and authors claiming copyright infringement. Encyclopedia Britannica and Merriam-Webster recently sued OpenAI, alleging it used their copyrighted material to train its models and generated responses that were "substantially similar" to their work.

But genetic data presents a different ethical minefield. Unlike a published article, DNA samples come from specific communities and ecosystems. When Basecamp sends explorers to places like Cameroon, Costa Rica, the Arctic ice caps, and even Point Nemo, the most remote location in the ocean, the company is extracting biological value that could fuel billion-dollar drug discoveries. The question becomes: who owns that value, and who should benefit when AI models trained on that data generate profits?

The Financial Times noted that Basecamp's global sampling efforts have faced criticism for echoing a modern form of colonialism, extracting value from communities without adequately sharing it. That tension forced the company to rethink its approach entirely.

How Is Basecamp Actually Tracking and Paying for Genetic Data?

Since 2023, Basecamp says it has paid royalties to 60 organizations across 21 countries based on the use of digital sequence information, the genetic data that underpins its AI models.
To make this work, the company built systems to tag and track the origin of each data sample and measure how much it contributes to downstream outputs. In effect, Basecamp is attempting to create an audit trail for genetic information, allowing payments to be distributed accordingly.

This approach stands in stark contrast to how most large language models (LLMs), the AI systems powering tools like ChatGPT and Claude, are trained. LLMs are typically trained on vast, messy datasets scraped from across the internet, where ownership, consent, and individual contributions from millions of sources are nearly impossible to track. Basecamp's modular tracking system suggests an alternative is possible, at least for structured biological data.

The company's latest initiative, the Trillion Gene Atlas, developed in collaboration with Anthropic, Ultima Genomics, and PacBio, and powered by Nvidia's AI infrastructure, aims to expand what we know about genetic diversity 100-fold by collecting genomic data from more than 100 million species. Basecamp has raised $85 million in venture capital to date and is comparing this effort to the Human Genome Project, the landmark sequencing initiative that took 13 years and cost roughly $3 billion.

Steps to Building Ethical AI Data Practices in Biotech

- Data Origin Tracking: Implement systems that tag and track where each biological sample originates, creating an audit trail that connects raw data to downstream AI outputs and commercial applications.
- Contribution Measurement: Develop methods to quantify how much each data source contributes to model performance, enabling proportional compensation rather than flat fees.
- Community Engagement: Establish direct relationships with indigenous communities and local organizations before extracting samples, ensuring informed consent and benefit-sharing agreements are in place.
- Transparent Royalty Distribution: Create clear mechanisms for distributing royalties based on actual use of digital sequence information, with regular audits to verify payments reach intended recipients.
- Public Accountability: Publish annual reports detailing which communities received payments, how much was distributed, and what commercial applications used their genetic data.

The stakes are enormous. AI for science is often held up as the clearest example of what "AI for good" could look like. Curing cancer? Bring on the data. New medicines? Here's some DNA. But that framing obscures a fundamental question: good for whom? People may be far less willing to accept their data being used to generate endless streams of marketing content than they are to help advance medicine or scientific discovery. Basecamp's willingness to pay for genetic data reflects a bet that communities will share their biological information more readily when they see direct financial benefit and understand exactly how their data will be used.

What Does This Mean for the Broader AI Industry?

Basecamp's experiment reveals a critical gap in how the AI industry has approached data sourcing. While tech companies have largely treated data as a free resource to be scraped and processed at scale, biotech companies are discovering that genetic information is fundamentally different. It's tied to specific places, specific communities, and specific ecosystems. You can't anonymize a rainforest or a coral reef.

The company's Eden models, introduced earlier this year, are trained on its growing biological dataset and designed to identify patterns across genes and ecosystems that would be difficult for humans to detect, potentially accelerating discoveries in drug development. But those models only work if communities continue to provide samples. That requires trust, and trust requires transparency about how data is being used and who benefits.
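Basecamp has not published the internals of its tracking system, but the steps above can be sketched as a minimal provenance-and-royalty model: tag each sample with its origin and consent agreement, count how often each sample informs downstream outputs, and split a royalty pool in proportion. Every name and field here (`SampleRecord`, the usage counts, the agreement references) is a hypothetical illustration, not Basecamp's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleRecord:
    """Provenance tag attached to a single genetic sample (hypothetical schema)."""
    sample_id: str
    community: str      # source community or partner organization
    country: str
    agreement_ref: str  # reference to the consent / benefit-sharing agreement


def contribution_weights(usage_counts: dict[str, int]) -> dict[str, float]:
    """Normalize per-sample usage counts into contribution weights summing to 1."""
    total = sum(usage_counts.values())
    if total == 0:
        return {sid: 0.0 for sid in usage_counts}
    return {sid: count / total for sid, count in usage_counts.items()}


def royalty_split(samples: list[SampleRecord],
                  usage_counts: dict[str, int],
                  pool: float) -> dict[str, float]:
    """Distribute a royalty pool across communities in proportion to how
    often each community's samples contributed to downstream outputs."""
    weights = contribution_weights(usage_counts)
    payouts: dict[str, float] = {}
    for sample in samples:
        share = pool * weights.get(sample.sample_id, 0.0)
        payouts[sample.community] = payouts.get(sample.community, 0.0) + share
    return payouts


# Example: two tagged samples, one used three times as often as the other.
samples = [
    SampleRecord("s-001", "Community A", "Cameroon", "BSA-2023-01"),
    SampleRecord("s-002", "Community B", "Costa Rica", "BSA-2023-02"),
]
split = royalty_split(samples, {"s-001": 3, "s-002": 1}, pool=10_000.0)
# split → {"Community A": 7500.0, "Community B": 2500.0}
```

The key design choice mirrors the "Contribution Measurement" step: payouts scale with measured use rather than a flat per-sample fee, so a sample that underpins many downstream outputs earns its community proportionally more. In practice the hard part is the usage counts themselves, which is exactly the audit-trail problem the article describes.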
For Basecamp's cofounders Glen Gowers and Oliver Vince, this wasn't an abstract ethical concern. The company began with a 2019 expedition to the Arctic to discover new species and genes. They found that two-thirds of the samples they hauled back to a makeshift lab in Iceland had never been recorded before. That experience led them to bet on building what they describe as an "internet of biology" for AI models to train on. Six years later, they're still staking out highly ambitious territory, but with a crucial difference: they're trying to ensure the communities providing that data aren't left behind.

The question now is whether other AI companies will follow Basecamp's lead. As genetic data becomes increasingly valuable for training AI models, the pressure to establish ethical sourcing practices will only intensify. The alternative is a repeat of the copyright lawsuits and public backlash that have plagued the LLM industry, except this time with indigenous communities and developing nations at the center of the dispute.