How Duplicate Data Kills Your RAG

10 Dec 2024

GenAI, RAG

Retrieval Augmented Generation (RAG) has become essential for enterprise AI applications, making proprietary data available for responding to user queries. Optimizing RAG's accuracy and precision ensures relevant responses while maintaining data security and privacy. However, these objectives often conflict: because enterprise data must be divided into chunks for GenAI applications, current methods for enforcing permissions on RAG vector databases can degrade results or leak data to unauthorized users.

AI's Chunked View of Data

Businesses typically handle data as files, documents, videos, or database tables--units of information we can name, organize, and manage. Metadata, such as creation time, size, creator, and applicable policies, links to these data units when stored. However, GenAI applications don't view data this way. They break down large, complex sources into smaller, semantically meaningful segments or chunks. LLMs train on chunks, vector databases store vectors created from chunks, AI applications consume data in chunks, and APIs called by AI agents move data in chunks.
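
To make this concrete, here is a minimal sketch of the kind of fixed-size chunking a RAG ingestion pipeline might apply before embedding. The sizes, overlap, and sample text are arbitrary choices for illustration; real pipelines often split on sentence or semantic boundaries instead.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping, roughly fixed-size chunks for embedding."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Each chunk -- not the whole document -- is what gets embedded, stored
# in the vector database, retrieved, and passed to the model.
sample = "Retrieval pipelines do not see documents; they see chunks. " * 40
chunks = chunk_text(sample)
print(len(chunks), "chunks from one document")
```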

Same data chunks in different Apple 10Q statements

Common paragraphs or chunks (blue) and unique chunks (orange) mixed in two Apple 10Q statements. Denying access to all chunks from the Q2 2024 statement in a vector database would also deny users access to much of the data in the Q2 2023 statement.

It seems intuitive that document-level attributes should apply to their chunks. Many AI vendors reinforce this idea. However, this assumption is fundamentally flawed and can lead to inaccurate AI responses, biased results, and compromised data security.

Data Redundancy from Chunking

Our experience with WAN optimization shows that over 90% of data chunks flowing through enterprise networks are redundant. In storage systems holding enterprise data, 50%-90% of storage blocks are duplicates.

As documents are shared, edited, and converted across platforms, identical paragraphs or images often appear in multiple versions. These redundancies accumulate as authors merge comments from various drafts, circulate files, and manage revisions.
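
A small sketch shows how such shared content can be detected: hashing each chunk makes duplicates across versions fall out immediately. Splitting on blank lines stands in here for real content-defined chunking, and the two "versions" are invented for the example.

```python
import hashlib

def chunk_hashes(text: str) -> set[str]:
    """Hash paragraph-level chunks; identical paragraphs hash identically."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return {hashlib.sha256(p.encode()).hexdigest() for p in paragraphs}

# Two hypothetical versions of a document that share a boilerplate paragraph.
draft_v1 = "Forward-looking statements involve risks.\n\nQ2 2023 revenue details."
draft_v2 = "Forward-looking statements involve risks.\n\nQ2 2024 revenue details."

shared = chunk_hashes(draft_v1) & chunk_hashes(draft_v2)
print(f"{len(shared)} chunk(s) shared between versions")  # 1
```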

Duplicate data biases AI

AI researchers at Google and the University of Pennsylvania recorded the number of duplicate chunks seen across 348 million web documents. Chunks appearing only once have Group Size = 1; eighty-five thousand chunks appeared 6–10 times. Chunk lengths are not represented in this graph. Source: https://arxiv.org/abs/2107.06499

Unintended Consequences

When enterprise documents are ingested into an LLM training set or RAG vector database, identical chunks are stored only once. Identical binary sequences generate identical vectors, which vector databases use for data lookup.

However, metadata can differ significantly between documents containing the same chunks. When two such documents are ingested, the shared chunk is first stored with metadata from the first document, which the second ingestion then overwrites with the second document's metadata. This can lead to problems including poor AI performance, data leakage, and biased responses.
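
A minimal sketch of this failure mode, assuming a store keyed by content hash with last-write-wins semantics (hypothetical, but it mirrors how upserts behave in vector databases whose chunk IDs are derived from content):

```python
import hashlib

# Hypothetical ingestion store keyed by content hash. Identical chunks
# map to the same key, so the second document's metadata silently
# replaces the first's: last write wins.
store: dict[str, dict] = {}

def ingest(chunk: str, metadata: dict) -> None:
    key = hashlib.sha256(chunk.encode()).hexdigest()
    store[key] = {"text": chunk, "metadata": metadata}  # overwrites on duplicate

shared = "The Company's fiscal year is the 52- or 53-week period..."
ingest(shared, {"doc": "10q_q2_2023.pdf", "access": "public"})
ingest(shared, {"doc": "10q_q2_2024.pdf", "access": "restricted"})

key = hashlib.sha256(shared.encode()).hexdigest()
print(store[key]["metadata"])  # {'doc': '10q_q2_2024.pdf', 'access': 'restricted'}
```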

Graph of data chunk relationships across Apple 10Q statements

Using content-defined chunking, common sets of chunks (red) in seven Apple 10Q statements (pink and blue dots) can be mapped to the statements in which they are found. Pink dots represent the entirety of the data in each statement, as seen by data governance systems that do not take chunking into account.

Poor AI Performance and Data Leakage

Consider a document with metadata allowing public access and another whose metadata restricts access to Amy. If a common chunk exists in both, the second document's restrictive permissions might overwrite the first's open-access metadata. Bob, who should be able to access the first document, might be denied access, while Amy could see data from both documents.
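
Continuing that scenario, here is a sketch of a retrieval-time permission filter that trusts the overwritten chunk metadata; the names and policy fields are illustrative, not from any particular product:

```python
def visible(user: str, chunks: list[dict]) -> list[dict]:
    """Return only the chunks the user's permissions allow."""
    return [c for c in chunks
            if c["meta"]["access"] == "public"
            or user in c["meta"].get("allowed_users", [])]

# The shared chunk was public in the first document, but the second
# ingestion left it marked as restricted to Amy.
retrieved = [{"text": "Products gross margin percentage...",
              "meta": {"access": "restricted", "allowed_users": ["amy"]}}]

print(len(visible("amy", retrieved)))  # 1: Amy sees content from both documents
print(len(visible("bob", retrieved)))  # 0: Bob loses data that was public
```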

This issue, if unchecked, diminishes RAG search precision despite improvements from graph-based RAG, cascading retrieval, and other methods. Without a monitoring system, users might receive incomplete responses or unintentionally access sensitive data like coworker salaries.

Ethical Issues and Biases

As chunk sizes increase, sub-sequences within chunks become a concern: paragraph-length chunks may contain common sentences that appear in multiple chunks. Research from Google and the University of Pennsylvania shows that duplicate data in LLM training sets can introduce biases and errors in responses, and repetition in prompts has been shown to cause errors in RAG systems.

Where limits exist on the number of chunks retrieved from RAG, duplicate chunks containing the same sentence can all score highly against a user query and crowd out other relevant chunks, causing them to be dropped and reducing response accuracy.
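
The effect is easy to reproduce with a toy ranked list; the scores below are invented for the example. With a retrieval limit of three, duplicates crowd out distinct relevant chunks, while deduplicating before truncation restores them:

```python
# Illustrative retrieval results, ranked by similarity score.
ranked = [
    ("Revenue grew 8% year over year.", 0.93),
    ("Revenue grew 8% year over year.", 0.93),
    ("Revenue grew 8% year over year.", 0.93),
    ("Services margin expanded to 71%.", 0.90),
    ("iPhone sales declined in Greater China.", 0.88),
]

k = 3
naive = [text for text, _ in ranked[:k]]  # three identical chunks

seen, deduped = set(), []
for text, _score in ranked:
    if text not in seen:  # keep only the first occurrence of each chunk
        seen.add(text)
        deduped.append(text)
deduped = deduped[:k]  # three distinct chunks

print(naive)
print(deduped)
```

Exact-text deduplication is the bluntest possible fix; it does nothing for near-duplicate chunks that differ by a few characters, which is one reason duplicate-aware ingestion matters.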

Conclusion

The rise of agentic AI--where autonomous AI agents collaborate and take intertwined actions--adds new layers of complexity to enterprise AI systems. These agents depend on dynamically retrieved and processed data, making robust data governance essential. As agents interact, risks like data mismanagement, bias, and leakage increase, elevating data governance from a best practice to a critical necessity.

Traditional data governance systems designed for documents, files, and tables--data-at-rest--are incompatible with how AI processes data. GenAI applications break data into "chunks," stripping away the context and metadata needed for effective policy enforcement. To control enterprise data use in these complex AI ecosystems, data governance products must adapt by addressing the unique challenges of chunked data. Without this shift, enterprises face degraded AI precision, reduced reliability, and potential data breaches--risks amplified by agentic AI's interconnected nature.