By Caber Team
10 Dec 2024
Retrieval-Augmented Generation (RAG) has become essential for enterprise AI applications, making proprietary data available for responding to user queries. Optimizing RAG's accuracy and precision keeps application responses relevant; maintaining data security and privacy protects the business. However, these objectives often conflict. Current methods for enforcing permissions on RAG vector databases can degrade results or leak data to unauthorized users, because enterprise data must be divided into chunks for GenAI applications.
Businesses typically handle data as files, documents, videos, or database tables--units of information we can name, organize, and manage. Metadata such as creation time, size, creator, and applicable policies is attached to these data units when they are stored. GenAI applications, however, don't view data this way. They break large, complex sources into smaller, semantically meaningful segments, or chunks. LLMs train on chunks, vector databases store vectors created from chunks, AI applications consume data in chunks, and APIs called by AI agents move data in chunks.
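For concreteness, here is a minimal sketch of the kind of chunking an ingestion pipeline performs. The fixed sizes and overlap are illustrative assumptions, not any specific vendor's strategy:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size, overlapping chunks (sizes are illustrative)."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "x" * 10_000          # stand-in for a real document
chunks = chunk_text(doc)
print(len(chunks))          # ~23 chunks, each embedded and stored
                            # independently of the source file
```

Once the loop finishes, the file as a unit is gone: every downstream component sees only the chunks.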
It seems intuitive that document-level attributes should apply to their chunks. Many AI vendors reinforce this idea. However, this assumption is fundamentally flawed and can lead to inaccurate AI responses, biased results, and compromised data security.
Our experience with WAN optimization showed that over 90% of data chunks flowing through enterprise networks are redundant. In storage systems holding enterprise data, 50%-90% of storage blocks are duplicates.
As documents are shared, edited, and converted across platforms, identical paragraphs or images often appear in multiple versions. These redundancies accumulate as authors merge comments from various drafts, circulate files, and manage revisions.
When enterprise documents are ingested into an LLM training set or RAG vector database, identical chunks are stored only once. Identical binary sequences generate identical vectors, which vector databases use for data lookup.
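One way to picture this is a store that derives each record's ID from a hash of the chunk bytes, so re-ingesting an identical chunk upserts the same record. The scheme below is a simplified sketch, not any particular vector database's implementation:

```python
import hashlib

store: dict[str, dict] = {}  # chunk_id -> {"vector": ..., "metadata": ...}

def ingest(chunk: bytes, vector: list[float], metadata: dict) -> None:
    # Identical byte sequences hash to the same ID, so an identical chunk
    # is stored only once -- and each re-ingestion replaces the stored metadata.
    chunk_id = hashlib.sha256(chunk).hexdigest()
    store[chunk_id] = {"vector": vector, "metadata": metadata}
```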
However, metadata can differ significantly between documents containing the same chunks. When two such documents are ingested, the shared chunk first receives metadata from the first document, which the second ingestion then overwrites with the second document's metadata. This can lead to problems including poor AI performance, data leakage, and biased responses.
Consider a document whose metadata allows public access and another whose metadata restricts access to Amy. If a common chunk exists in both, the second document's restrictive permissions might overwrite the first's open-access metadata. Bob, who should be able to access the first document, might be denied access, while Amy could see both documents' data.
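Playing out that scenario in the hash-keyed store sketched above (the document names, the `allowed` field, and the filter logic are all hypothetical):

```python
import hashlib

store: dict[str, dict] = {}

def ingest(chunk: str, metadata: dict) -> None:
    # The shared chunk maps to one record; the second ingestion
    # silently overwrites the first document's metadata.
    store[hashlib.sha256(chunk.encode()).hexdigest()] = {"text": chunk, "meta": metadata}

shared = "Q3 revenue grew 12% year over year."
ingest(shared, {"source": "press_release.docx", "allowed": ["public"]})
ingest(shared, {"source": "board_memo.docx",    "allowed": ["amy"]})

def visible_to(user: str) -> list[str]:
    return [r["text"] for r in store.values()
            if user in r["meta"]["allowed"] or "public" in r["meta"]["allowed"]]

print(visible_to("bob"))   # []       -- Bob loses access to once-public content
print(visible_to("amy"))   # [shared] -- Amy retrieves the chunk regardless of source
```

Note that ingestion order alone decides the outcome: reverse the two `ingest` calls and the restricted memo's content becomes public instead.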
This issue, if unchecked, diminishes RAG search precision despite improvements from graph-based RAG, cascading retrieval, and other methods. Without a monitoring system, users might receive incomplete responses or unintentionally access sensitive data like coworker salaries.
As chunk sizes increase, sub-sequences within chunks become a concern: paragraph-length chunks may contain common sentences that appear in multiple chunks. Research from Google and the University of Pennsylvania shows that duplicate data in LLM training sets can introduce biases and errors in responses, and repetition in prompts has been shown to cause errors in RAG systems.
Where a RAG pipeline limits the number of chunks retrieved per query, several chunks containing the same high-scoring sentence can fill those slots, causing other chunks relevant to the query to be dropped and reducing response accuracy.
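To see the effect, consider a retriever with a top-k cutoff of 3, where three near-duplicate chunks carry the same high-scoring sentence. The scores, texts, and normalization step are invented for illustration:

```python
# (chunk text, similarity score) pairs, as a retriever might rank them
ranked = [
    ("Salaries are reviewed in Q2. [draft v1]", 0.92),
    ("Salaries are reviewed in Q2. [draft v2]", 0.91),
    ("Salaries are reviewed in Q2. [final]", 0.90),
    ("Review criteria include tenure and role.", 0.88),  # relevant, but crowded out
]

TOP_K = 3
naive = [text for text, _ in ranked[:TOP_K]]
# naive holds three copies of the same sentence; the fourth chunk never reaches the LLM

# Deduplicating on a normalized form of the text before the cutoff recovers it
seen: set[str] = set()
deduped: list[str] = []
for text, _ in ranked:
    key = text.split("[")[0].strip()  # illustrative normalization
    if key not in seen:
        seen.add(key)
        deduped.append(text)
print(deduped[:TOP_K])  # the duplicated sentence once, plus the crowded-out chunk
```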
The rise of agentic AI--where autonomous AI agents collaborate and take intertwined actions--adds new layers of complexity to enterprise AI systems. These agents depend on dynamically retrieved and processed data, making robust data governance essential. As agents interact, risks like data mismanagement, bias, and leakage increase, elevating data governance from a best practice to a critical necessity.

Traditional data governance systems designed for documents, files, and tables--data-at-rest--are incompatible with how AI processes data. GenAI applications break data into "chunks," stripping away context and metadata needed for effective policy enforcement. To control enterprise data use in these complex AI ecosystems, data governance products must adapt by addressing the unique challenges of chunked data. Without this shift, enterprises face degraded AI precision, reduced reliability, and potential data breaches--risks amplified by agentic AI's interconnected nature.