The goal: Identify and summarize the contents of a set of PDF documents and present them in a conversational format.
1. Document preprocessing:
Text is extracted from each uploaded file; any media embedded in the document is skipped.
The text is split into smaller chunks so that each one keeps a meaningful piece of information and fits within the context window of the language model.
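The chunking step can be sketched as follows. This is a minimal illustration, not the project's actual splitter: the function name `split_into_chunks`, the word-based size limit, and the `overlap` parameter are all assumptions, and word counts are only a rough proxy for the model's token limit.

```python
def split_into_chunks(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Example with a tiny window so the overlap is visible.
for chunk in split_into_chunks("one two three four five six seven", max_words=4, overlap=1):
    print(chunk)
```

The overlap is one possible way to avoid cutting a sentence's context exactly at a chunk boundary; a splitter that respects paragraph or sentence boundaries would keep chunks more meaningful.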
3. Document summarization:
Every set of chunks is recursively summarized using GPT-3.5 Turbo. A prefix marking the text as a summary is prepended to each summary.
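One way to read "recursively summarized" is: summarize groups of chunks, then summarize the resulting summaries, until a single text remains. The sketch below follows that reading, which is an assumption. The `summarize` callable stands in for a GPT-3.5 Turbo call, and `SUMMARY_PREFIX` and `group_size` are hypothetical names and values not given in the source.

```python
from typing import Callable

SUMMARY_PREFIX = "Summary: "  # hypothetical wording; the source does not state the exact prefix


def summarize_recursively(
    chunks: list[str],
    summarize: Callable[[str], str],
    group_size: int = 4,
) -> str:
    """Summarize groups of chunks, then summarize the summaries until one remains."""
    if not chunks:
        return ""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    level = list(chunks)
    while True:
        # Each pass replaces every group of texts with a single summary.
        level = [
            summarize("\n".join(level[i:i + group_size]))
            for i in range(0, len(level), group_size)
        ]
        if len(level) == 1:
            return SUMMARY_PREFIX + level[0]


# Stand-in summarizer for demonstration: keeps the first five words of its input.
def fake_summarize(text: str) -> str:
    return " ".join(text.split()[:5])


print(summarize_recursively(["first chunk of text", "second chunk of text"], fake_summarize))
```

Passing the summarizer in as a function keeps the recursion independent of any particular model or API client.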
Each chunk is vectorized with OpenAI’s embeddings to obtain a numerical representation that captures its semantics and main keywords.
All chunks and their summaries are stored in Chroma, a vector database that makes it easier to find content similar to a query.
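To illustrate what vectorizing and storing the chunks makes possible, here is a toy in-memory similarity search. It is a stand-in, not the project's code: `embed` uses plain word counts instead of OpenAI's embeddings, and `ToyVectorStore` is a hypothetical class that only mimics the add-then-query pattern of a vector database such as Chroma.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system would call an embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self) -> None:
        self._items: list[tuple[str, str, Counter]] = []  # (id, text, vector)

    def add(self, item_id: str, text: str) -> None:
        self._items.append((item_id, text, embed(text)))

    def query(self, text: str, n_results: int = 3) -> list[tuple[str, float]]:
        """Return the ids of the stored texts most similar to the query, best first."""
        q = embed(text)
        scored = [(item_id, cosine_similarity(q, vec)) for item_id, _, vec in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:n_results]


store = ToyVectorStore()
store.add("chunk-1", "solar panels convert sunlight into electricity")
store.add("summary-1", "Summary: the report covers quarterly bank revenue")
print(store.query("how do solar panels work", n_results=1))
```

Word counts only match exact terms; learned embeddings also place paraphrases close together, which is why the pipeline uses them.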
6. Clusterization tool:
Summaries are clustered to find topics they have in common. First, features are extracted from the text using term frequency-inverse document frequency (TF-IDF); then truncated singular value decomposition (SVD) is applied to reduce the dimensionality. The clustering algorithm is DBSCAN. Once clusters are detected, a brief summary is built for each group, emphasizing what its members have in common. Along with the descriptions, the tool reports the number of documents and the document names in each category.
A conversational agent is built on GPT-4 with access to the vector database and the clusterization tool. It can formulate several internal questions or queries to compose an answer to a user question.
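The TF-IDF, truncated SVD, and DBSCAN sequence described above could look like the following with scikit-learn. This is a sketch under assumptions: the function name `cluster_summaries`, the parameter values (`n_components`, `eps`, `min_samples`), the row normalization, and the cosine metric are illustrative choices, not settings taken from the project. The per-group description step is left out.

```python
from collections import defaultdict

from sklearn.cluster import DBSCAN
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize


def cluster_summaries(
    summaries: dict[str, str],
    n_components: int = 50,
    eps: float = 0.5,
    min_samples: int = 2,
) -> dict[int, list[str]]:
    """Map each DBSCAN label to the document names in that cluster (-1 means noise)."""
    names = list(summaries)
    tfidf = TfidfVectorizer().fit_transform([summaries[name] for name in names])
    # Keep the number of components below the feature and sample counts.
    k = max(1, min(n_components, tfidf.shape[1] - 1, tfidf.shape[0] - 1))
    reduced = TruncatedSVD(n_components=k, random_state=0).fit_transform(tfidf)
    reduced = normalize(reduced)  # unit-length rows, so cosine distance is well behaved
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(reduced)
    groups: dict[int, list[str]] = defaultdict(list)
    for name, label in zip(names, labels):
        groups[int(label)].append(name)
    return dict(groups)


example = {
    "energy_a.pdf": "solar and wind power reduce emissions from energy production",
    "energy_b.pdf": "wind and solar energy production keeps growing",
    "finance_a.pdf": "the bank raised mortgage and loan rates",
    "finance_b.pdf": "loan and mortgage rates at the bank went up",
}
for label, docs in cluster_summaries(example).items():
    print(label, len(docs), docs)
```

DBSCAN does not need the number of clusters in advance and labels outliers as noise, which suits a set of uploads whose topics are unknown beforehand.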
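The agent's ability to issue several internal queries before answering can be pictured as a simple loop: the model picks a tool, sees the result, and repeats until it is ready to answer. The sketch below is hypothetical throughout: `run_agent`, the action dictionary format, and the tool names are assumptions, and `decide` stands in for a GPT-4 call.

```python
from typing import Callable


def run_agent(
    question: str,
    decide: Callable[[str, list[str]], dict],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    """Ask the model for the next action, run the chosen tool, and feed the result back."""
    notes: list[str] = []  # tool results gathered so far
    for _ in range(max_steps):
        action = decide(question, notes)
        if action["type"] == "final":
            return action["text"]
        tool = tools[action["tool"]]
        result = tool(action["input"])
        notes.append(f"{action['tool']}({action['input']!r}) -> {result}")
    return "No answer within the step limit."


# Scripted stand-in for the language model: two tool calls, then a final answer.
def scripted_decide(question: str, notes: list[str]) -> dict:
    if len(notes) == 0:
        return {"type": "tool", "tool": "vector_search", "input": question}
    if len(notes) == 1:
        return {"type": "tool", "tool": "clusters", "input": ""}
    return {"type": "final", "text": "Answer based on: " + "; ".join(notes)}


demo_tools = {
    "vector_search": lambda query: "2 matching chunks",
    "clusters": lambda _: "3 topics found",
}
print(run_agent("What are the documents about?", scripted_decide, demo_tools))
```

The step limit is a safeguard added for the sketch so that a model that never produces a final answer cannot loop forever.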
1. Document upload:
The user uploads the PDF documents to be analyzed through the “Input” box. If needed, more documents can be added once the previous upload has completed.
2. Chatbot interaction:
Once the documents are uploaded, the user can ask questions about them in natural language. The user can clear the conversation at any time and, optionally, clear the files as well to start over.