Corpus analyzer

The goal: Identify and summarize the contents of a set of PDF documents and present them in a conversational format.

Technical procedure

1. Document preprocessing:

Text is extracted from each uploaded file; any media embedded in the document is skipped.

2. Splitting:

The text is split into smaller chunks so that each piece stays meaningful and fits within the context window of the language model.
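
As a minimal sketch of this step, the function below splits text into overlapping chunks. It is an illustration only: the word-based sizing, the overlap, and the names `split_into_chunks`, `max_words` and `overlap` are assumptions, not the tool's actual splitter, which would more likely measure chunks in model tokens.

```python
def split_into_chunks(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most `max_words` words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears with some context in the next chunk.
    Word counts are used here as a rough proxy for model tokens.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current chunk already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

The overlap trades some redundancy for continuity between neighbouring chunks; setting `overlap=0` yields disjoint chunks.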

3. Document summarization:

Each set of chunks is recursively summarized using GPT-3.5 Turbo, and a prefix identifying the text as a summary is prepended to the result.
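
One way to picture recursive summarization is the sketch below: chunks are summarized in small groups, the group summaries are summarized again, and so on until a single text remains. The `summarize` callable is a placeholder for the model call, and `SUMMARY_PREFIX`, `group_size` and the function name are hypothetical; the source does not specify the actual prompt, prefix wording or grouping.

```python
from typing import Callable

# Hypothetical wording; the actual prefix used by the tool is not documented.
SUMMARY_PREFIX = "Summary: "


def summarize_recursively(
    chunks: list[str],
    summarize: Callable[[str], str],
    group_size: int = 4,
) -> str:
    """Reduce a list of chunks to one summary by summarizing in rounds.

    `summarize` stands in for a call to a language model: it takes a text
    and returns a shorter text.
    """
    if not chunks:
        raise ValueError("no chunks to summarize")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")

    level = chunks
    while True:
        # Group neighbouring texts and summarize each group on its own.
        groups = [level[i:i + group_size] for i in range(0, len(level), group_size)]
        level = [summarize("\n\n".join(group)) for group in groups]
        if len(level) == 1:
            return SUMMARY_PREFIX + level[0]
```

Because every round shrinks the list by a factor of about `group_size`, each model call only sees a bounded amount of text, whatever the document length.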

4. Vectorization:

Each chunk is vectorized with OpenAI’s embeddings to obtain a numerical representation that captures its semantics and main keywords.

5. Storage:

All chunks and their summaries are stored in Chroma, a vector database that makes it easy to find content similar to a query.
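
The idea behind vectorization and similarity search can be sketched without any external service. The example below calls neither OpenAI nor Chroma: `embed` is a toy word-count embedding and `InMemoryVectorStore` is a hypothetical stand-in for the vector database, both introduced here only to show how a query is matched against stored texts by cosine similarity.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Hypothetical minimal store: keeps (text, vector) pairs in a list."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self._items.append((text, embed(text)))

    def query(self, question: str, k: int = 3) -> list[str]:
        """Return the k stored texts most similar to the question."""
        q = embed(question)
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(q, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]
```

A real embedding model would also match texts that share meaning but not words, which this word-count stand-in cannot do.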

6. Clustering tool:

Summaries are clustered to find the topics they have in common. Features are first extracted from the text using term frequency-inverse document frequency (TF-IDF); truncated singular value decomposition (SVD) is then applied to reduce dimensionality. Clustering is performed with DBSCAN. Once clusters are detected, a brief summary is built for each group, emphasizing what its members have in common. Along with these descriptions, the tool reports the number of documents and the document names in each category.
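
The clustering step can be illustrated with a simplified, standard-library sketch. It computes smoothed TF-IDF vectors and runs a basic DBSCAN over cosine distances; the truncated SVD step is deliberately omitted to keep the example short. The distance metric, the `eps` and `min_samples` values, and all function names are assumptions made for illustration, not the tool's actual configuration.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """L2-normalized TF-IDF vectors using a smoothed inverse document frequency."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine_distance(a: dict[str, float], b: dict[str, float]) -> float:
    """1 - cosine similarity; inputs are assumed to be L2-normalized."""
    return 1.0 - sum(w * b[t] for t, w in a.items() if t in b)


def dbscan(vectors: list[dict[str, float]], eps: float = 0.7, min_samples: int = 2) -> list[int]:
    """Return one cluster label per vector; -1 marks noise."""
    n = len(vectors)

    def neighbors(i: int) -> list[int]:
        # A point counts as its own neighbour.
        return [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]

    labels = [None] * n  # None = not visited yet
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


summaries = [
    "car loan financing promotion interest rate",
    "car financing promotion low interest rate",
    "resume software developer python experience",
    "resume developer java experience",
    "weather forecast tomorrow",
]
print(dbscan(tfidf_vectors(summaries)))  # [0, 0, 1, 1, -1]
```

DBSCAN does not need the number of clusters in advance and labels outliers as noise, which suits a corpus whose topics are unknown beforehand.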

7. Agent:

A conversational agent is built on GPT-4 with access to the vector database and the clustering tool. It can issue multiple internal questions and queries to compose an answer to a user question.

User instructions

1. Document upload:

The user uploads the PDF documents to be analyzed through the “Input” box. If needed, more documents can be added once the previous upload has completed.

2. Chatbot interaction:

Once the documents are uploaded, the user can ask questions about them in natural language. The user can clear the conversation at any time and, optionally, clear the files as well to start over.

Download the example document

Documents certifying the establishment of companies in Chile, car financing promotions and resumes

Questions

  1. What are these documents about?
  2. How can these documents be grouped? Make a detailed list
  3. Reduce the number of categories to three
  4. Which file names belong to each of these categories?
  5. What is the price for each car model offered?
  6. Which model has the lowest interest rate?
  7. What are the main differences between the developers?
  8. Make a brief description of Chahuan y Filippi Limitada
  9. Which Chilean company has the largest capital?