
Published on May 5, 2025

Constructing Context Part 1 - A Document Pipeline for Intelligent Chat

Part 1 of a series of posts about building a document processing RAG pipeline for our AI-powered chatbot

Tags: chat, documents, noblebase

The Use Case: Empowering Chat with Document Context

My goal at Noble Base is to equip our AI-powered chat interface with the ability to understand and leverage user-uploaded documents during conversations about their business. The hope is to enable more informed, context-aware interactions on top of the business context we already embed within the chat, ultimately leading to better and more focused decision-making. I focused on supporting PDFs first, with plans to extend support to other document types, such as .txt, .docx, .csv, and .xlsx, as quick follow-ups.

Initial Explorations: Unstructured and Docling Libraries

I initially explored a few popular Python libraries designed for document processing. Unstructured showed promise for handling diverse document types, and its cloud solution offered a seemingly straightforward path for multimodal PDF parsing, with workflows for piecing the pipeline together visually. I ultimately decided to pass, though: the abstraction level and limited control over the pipeline itself didn't align with our need for deeper optimization and customization.

Docling was another interesting contender, and I think it will be useful for spreadsheets, but I decided to go a different direction for PDFs. Honestly, I don't have much direct criticism for Docling; the docs could be a little clearer, especially in the use case examples, but overall the software is great. I wanted an approach that was simple, reliable, and let me build around a strong core for processing documents that are usually multimodal, like PDFs, and that brought me to Mistral's OCR.

The Chosen Path: Mistral's Document OCR

Ultimately, I selected Mistral's Document OCR processor (https://docs.mistral.ai/capabilities/document/) for its accuracy, speed, and reasonable cost. Feel free to read the docs, but at a high level it takes a snapshot of each page and intelligently parses text, image, and table data from the PDF, enabling part-by-part processing of each page.
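To give a feel for the API, here's a minimal sketch of running a PDF through the OCR processor, following the patterns in Mistral's docs. The file name is a placeholder, and our actual pipeline wraps this with more error handling:

```python
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Upload a local PDF, then get a signed URL to feed the OCR processor.
uploaded = client.files.upload(
    file={"file_name": "report.pdf", "content": open("report.pdf", "rb")},
    purpose="ocr",
)
signed_url = client.files.get_signed_url(file_id=uploaded.id)

ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={"type": "document_url", "document_url": signed_url.url},
    include_image_base64=True,  # return extracted images alongside the text
)

# Each page comes back as markdown (text and tables) plus any images
# found on that page, which is what enables part-by-part processing.
for page in ocr_response.pages:
    print(page.index, page.markdown[:80])
```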

One of the things I've enjoyed most about these model companies is watching them furiously chase niches they've discovered they're pretty good at, and Mistral seems to have found at least one it can thrive in. The reliability of its output is critical for the later stages of our pipeline development, and the narrow focus of its functionality gave us room to build all the custom processing we want now and will want in the future.

From Pixels to Vectors: Summarization and Embedding

With the OCR output in hand, I cobbled together some Python code to summarize and contextualize each page of the document. This involved extracting and summarizing textual and table content and generating descriptive summaries for any extracted images. Each summary was then converted into a vector embedding using an OpenAI embedding model, and the summaries, along with highly relevant metadata and their embeddings, were stored in a vector database for future searching and referencing.
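Here's a simplified sketch of those two steps. The model choices are assumptions for illustration, not necessarily what we run in production:

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def summarize_page(page_markdown: str) -> str:
    """Condense one OCR'd page (text + tables) into a retrieval-friendly summary."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any capable chat model works here
        messages=[
            {"role": "system", "content": "Summarize this document page, preserving key facts, figures, and table data."},
            {"role": "user", "content": page_markdown},
        ],
    )
    return response.choices[0].message.content

def embed(text: str) -> list[float]:
    """Turn a summary into a vector for similarity search."""
    response = client.embeddings.create(
        model="text-embedding-3-small",  # assumed embedding model
        input=text,
    )
    return response.data[0].embedding
```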

Vector Storage: Choosing Milvus and Zilliz

After evaluating various options, I opted for Milvus, an open-source vector database, hosted through Zilliz (https://zilliz.com/), a cloud platform that runs Milvus deployments at various scales. The dedicated nature of a vector database like Milvus aligned well with our straightforward use case and the need to get this stood up relatively quickly.
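As a rough sketch of the storage side, reusing the embed() helper from above, the pymilvus client keeps this pretty simple. The collection name, fields, and dimension here are illustrative assumptions, not our production schema:

```python
from pymilvus import MilvusClient

# Connect to a Zilliz-hosted (or self-hosted) Milvus instance.
milvus = MilvusClient(
    uri="https://<your-cluster>.zillizcloud.com",  # placeholder endpoint
    token="<zilliz-api-key>",                      # placeholder credential
)

# Quick-setup collection: 1536 dims matches text-embedding-3-small's output.
milvus.create_collection(collection_name="document_pages", dimension=1536)

# Store one page's summary, its embedding, and some metadata.
page_summary = "Q3 revenue grew 12% quarter over quarter, driven by..."
milvus.insert(
    collection_name="document_pages",
    data=[{
        "id": 0,
        "vector": embed(page_summary),  # embed() from the sketch above
        "summary": page_summary,
        "document_id": "doc-123",       # hypothetical metadata fields
        "page_number": 1,
    }],
)

# Later, pull back the most relevant pages for a chat query.
hits = milvus.search(
    collection_name="document_pages",
    data=[embed("How did revenue trend this year?")],
    limit=3,
    output_fields=["summary", "document_id", "page_number"],
)
```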

Deployment on AWS Lambda: Overcoming Dependency Challenges

Now comes deploying the thing, and for that I chose AWS Lambda. I wanted something lightweight and simple to call right out of the box, so a Lambda function seemed like a good fit. When all was said and done, I was able to get the pipeline up and running, but not without a few hiccups.
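Conceptually, the function is just a thin wrapper around the pipeline. A hedged sketch, where the event shape and the process_document helper are hypothetical stand-ins for the real pipeline steps:

```python
import json

def lambda_handler(event, context):
    """Entry point: pull a document reference off the event and run the pipeline."""
    body = json.loads(event.get("body", "{}"))
    document_url = body["document_url"]  # assumed event field, not our exact contract

    # process_document is a hypothetical stand-in that chains the sketches above:
    # OCR the PDF, summarize and embed each page, write everything to Milvus.
    pages_processed = process_document(document_url)

    return {
        "statusCode": 200,
        "body": json.dumps({"pages_processed": pages_processed}),
    }

def process_document(document_url: str) -> int:
    """Hypothetical wrapper around the OCR, summarization, and storage steps."""
    ...
```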

Without boring you too much by droning on about AWS Lambda functions and layers, I'll happily give you the TL;DR. After iterating through a couple of upload options for the pipeline code and its dependencies, what worked perfectly for me was packaging it all into a Docker image, uploading it to AWS Elastic Container Registry, and referencing that image as my Lambda function. AWS's documentation covers roughly how to do it.
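For reference, the image follows AWS's standard pattern for container-based Python Lambda functions; the file names here are placeholders:

```dockerfile
# AWS's public base image for Python Lambda functions.
FROM public.ecr.aws/lambda/python:3.12

# Install the pipeline's (sizable) dependency list into the image.
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the pipeline code and point Lambda at the handler.
COPY app.py ${LAMBDA_TASK_ROOT}
CMD ["app.lambda_handler"]
```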

Based on my research, this method, while not the simplest, seems like the best route for deploying generative AI functionality. The dependency lists behind these AI tools can be massive, so containerization just makes sense, even if it means jumping through a few extra hoops to get there.

Current Status and Next Steps

So that wraps this post up, and I'm excited to talk more about the progression of this work moving forward. The current iteration of our document processing pipeline is performing well in terms of scalability and speed, providing a solid foundation to build on.

What's next? I'll be working on allowing our Rails application to directly and securely call this pipeline. This involves establishing communication with the deployed Lambda function and ensuring the processed document data is seamlessly integrated into user chat sessions. Following this integration, I'll focus on implementing observability and monitoring to ensure the pipeline's long-term stability and maintainability.