Organizations across industries are amassing vast repositories of text data, ranging from reports and research papers to transactional records and internal documents in various digital formats like PDF, TXT, HTML, and DOCX. Despite the potential insights this data holds, it presents a significant challenge: efficiently searching and extracting precise knowledge from such an unstructured and fragmented data set. Users often struggle with the following specific issues:
The aim was to overcome these obstacles with a solution that improves the precision and relevance of search results and integrates cleanly with the client’s existing data ecosystem.
To handle the client’s unstructured and diverse datasets, we started by implementing an intuitive ingestion mechanism. Users select their data repositories, which may contain a range of formats including PDF, TXT, HTML, and DOCX. The app then extracts the text and breaks the content into manageable segments. Each segment is encoded as a numerical vector using OpenAI’s embeddings API, so that segments can later be compared by semantic similarity.
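The segmentation step can be sketched as follows. This is a minimal illustration only: the function name `chunk_text`, the word-based splitting, and the `chunk_size` and `overlap` values are assumptions, since the case study does not say how segments were actually sized.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word-based segments that overlap by `overlap` words.

    Hypothetical helper: the real application may split by tokens,
    sentences, or characters instead.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Overlapping the windows is a common way to avoid cutting a sentence's context in half at a segment boundary; each resulting segment would then be sent to the embedding API.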
The solution works in two stages. First, OpenAI embeddings map each text segment into a high-dimensional vector space in which semantically similar segments lie close together. The embeddings are stored in Chroma DB, which is optimized for similarity search in that space. When a user enters a query, it is converted into a vector in the same way, and a similarity search in Chroma DB returns the top segments most closely aligned with the query’s intent. Second, the query and the retrieved segments are passed as context to OpenAI’s GPT-4 large language model, which generates a contextually informed response and returns it to the user together with the sources of the segments used. This significantly reduces the time users spend on data extraction and analysis, streamlining the entire knowledge discovery process.
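The retrieve-then-generate flow can be illustrated with a small, self-contained sketch. Everything here is a stand-in chosen so the example runs offline: `toy_embed` is a hypothetical bag-of-words substitute for OpenAI embeddings, the in-memory list replaces Chroma DB, and `build_prompt` only assembles the text that would be sent to the language model. None of these names come from the original application.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: word-count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: str, segments: list[dict], k: int = 2) -> list[dict]:
    """Return the k segments most similar to the query (in-memory search)."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(seg["text"])), seg) for seg in segments]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [seg for _, seg in scored[:k]]


def build_prompt(query: str, retrieved: list[dict]) -> str:
    """Assemble the context and question, tagging each segment with its source."""
    context = "\n".join(f"[{seg['source']}] {seg['text']}" for seg in retrieved)
    return (
        "Answer using only the context below and cite sources in brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


segments = [
    {"source": "a.pdf", "text": "quarterly revenue grew ten percent"},
    {"source": "b.txt", "text": "the office cafeteria menu changed"},
    {"source": "c.html", "text": "revenue forecast for next quarter"},
]
query = "revenue growth quarterly"
prompt = build_prompt(query, top_k(query, segments, k=2))
```

Carrying the `source` field alongside each segment is what lets the final answer point back to the documents it drew on. In a real deployment, the vector store would perform the ranking done here by `top_k`.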
The solution is underpinned by technologies chosen for their performance and reliability. Streamlit provides the framework for an intuitive user interface, allowing straightforward navigation and interaction. OpenAI’s embeddings API translates both the corpus data and user queries into a comparable vector format, and the GPT-4 API generates the answers. Chroma DB is the backbone of our data storage system, optimized for searching and retrieving high-dimensional vector data.
Following the application’s deployment, the client’s ability to navigate and use their extensive text data improved considerably. The application gave users a more efficient way to access relevant information, which translated into tangible time savings during data searches and analysis.
The retrieval of information became more accurate, with the system adeptly returning relevant results. This improvement was reflected in user feedback, which noted the system’s effectiveness in discerning the intent behind their searches.
Overall, the application proved to be a valuable tool for the client, enabling a quicker turnaround in information retrieval and supporting better-informed decision-making processes.