AI IP Year in Review - Challenges with Protocols for Training Data Discovery in AI Litigation

Authors

Sasha S. Rao, Richard A. Crudo

As artificial intelligence (AI) litigation increases in frequency, parties may face discovery requests for their training data and must consider how best to protect it. Litigants have often relied on traditional source code inspection protocols. But the unprecedented scale and complexity of AI data require adaptation of such protocols.

The Value of Training Data

Training data is a highly valuable commercial asset. AI companies invest significant resources to collect, curate, and structure this data to create their models. Gathering this data can also be challenging given the time required and the potential inconsistency or bias in available data. Thus, AI companies often treat the data as a valuable asset whose disclosure could cause competitive harm.

This training data can also be highly valuable in AI-based copyright litigation. Training data could include copyrighted works, and large language models (LLMs) could generate outputs substantially similar to protected works. Access to this training data thus gives litigants an opportunity to see firsthand the potential unauthorized use of copyrighted materials.

Training Data Inspection Protocols: Challenges and Opportunities

Where disclosure of training data seems likely, parties have stipulated to inspection protocols that govern a plaintiff’s ability to request and review a defendant’s data. But the unprecedented scale and complexity of AI data will likely require adaptation of the source code review protocols that litigants are familiar with.

Scale and volume. A major difference between source code and AI training data is scale and volume. The scale of AI training data makes it difficult to simply “hand over” in any human-reviewable form. Defendants can use their training data’s size to their advantage by making proportionality arguments. Further, both parties benefit from negotiating shortcuts for reviewing raw training data: defendants avoid opening their entire training datasets to review while plaintiffs can review the data more efficiently.

Heterogeneity and accessibility. Training data can span multiple languages, and reviewing different data types and formats might require different types of indices and tools. Given these numerous ways to organize training data, most protocols require technical guidance for efficient review.

Takeaways

Training data’s importance, both as a commercial asset and in litigation, demands an approach that balances protection with access during discovery. Achieving this balance requires accounting for the unique size, formats, and sources of the training data. Litigants need creative solutions beyond source code review protocols to ensure efficient review while minimizing overproduction and misuse.

Read the full article, Discovery of Training Data in AI Litigation, which first appeared in Corporate Counsel.

This article appeared in the 2025 AI Intellectual Property: Analysis & Trends Year in Review report
