Corpus Analyzer

Goal: identify and summarize the contents of a set of PDF documents and present the results in a conversational format.

Technical procedure

1. Document preprocessing:

Text is extracted from each uploaded file; any media embedded in the document is skipped.

2. Splitting:

The text is split into smaller chunks so that each one keeps a meaningful piece of information and fits within the context window of the language model.
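The splitting step can be sketched as follows. This is a minimal illustration, not the tool's actual splitter: the function name `split_text`, the word-based chunk size, and the overlap between consecutive chunks are all assumptions made for the example.

```python
def split_text(text: str, chunk_words: int = 50, overlap_words: int = 10) -> list[str]:
    """Split text into chunks of at most `chunk_words` words.

    Consecutive chunks share `overlap_words` words so that a sentence cut
    at a boundary still appears whole in one of the two chunks.
    (Hypothetical helper; sizes are illustrative.)
    """
    if chunk_words <= 0 or not 0 <= overlap_words < chunk_words:
        raise ValueError("need chunk_words > 0 and 0 <= overlap_words < chunk_words")

    words = text.split()
    step = chunk_words - overlap_words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_words >= len(words):
            break
    return chunks
```

A production splitter would more likely count tokens than words, since the context window of the model is measured in tokens.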

3. Document summarization:

Each set of chunks is summarized recursively using GPT-3.5 Turbo. A prefix stating that the text is a summary is prepended to the result.
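One way to picture recursive summarization is the sketch below. The `summarize` argument is a hypothetical stand-in for a call to the language model, and the prefix text and `group_size` are assumptions chosen for illustration; the tool's real prompt and grouping are not described here.

```python
from typing import Callable, Sequence

SUMMARY_PREFIX = "Summary: "  # assumed wording of the prefix


def recursive_summary(
    chunks: Sequence[str],
    summarize: Callable[[str], str],
    group_size: int = 3,
) -> str:
    """Summarize groups of chunks, then summarize the summaries,
    until a single summary remains."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each pass shrinks the list")

    level = list(chunks)
    while True:
        # Summarize each group of `group_size` neighbouring texts.
        level = [
            summarize("\n".join(level[i:i + group_size]))
            for i in range(0, len(level), group_size)
        ]
        if len(level) == 1:
            return SUMMARY_PREFIX + level[0]
```

With seven chunks and `group_size=3`, the first pass produces three summaries and the second pass merges them into one, for four model calls in total.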

4. Vectorization:

Each chunk is vectorized with OpenAI's embeddings to obtain a numerical representation of the content that captures its semantics and main keywords.

5. Storage:

All chunks and their summaries are stored in Chroma. This vector database makes it easier to find content similar to a query.
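The idea behind similarity search in a vector database can be illustrated with plain cosine similarity. The sketch below is not Chroma's API: the in-memory `store` dictionary, the function names, and the three-dimensional toy vectors are assumptions used only to show the principle.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], store: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in store.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings; real embeddings have hundreds or thousands of dimensions.
store = {
    "car_promo": [1.0, 0.0, 0.0],
    "company_deed": [0.0, 1.0, 0.0],
    "resume": [0.0, 0.0, 1.0],
}
nearest = top_k([0.9, 0.1, 0.0], store, k=2)
```

A real vector database adds persistence and indexing so that the search does not have to compare the query against every stored vector.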

6. Clustering tool:

Summaries are clustered to find topics they have in common. First, features are extracted from the text using term frequency-inverse document frequency (TF-IDF); then truncated singular value decomposition (SVD) is applied to reduce the dimensionality. The clustering algorithm is DBSCAN. Once clusters are detected, a brief summary is built for each group, emphasizing what its members have in common. Along with the descriptions, the tool reports the number of documents and the document names in each category.
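The TF-IDF, truncated SVD, and DBSCAN sequence described above could look roughly like this with scikit-learn. The library choice, the function name `cluster_summaries`, the cosine metric, and every parameter value (`n_components`, `eps`, `min_samples`) are assumptions for illustration; the tool's actual settings are not stated.

```python
from collections import defaultdict

from sklearn.cluster import DBSCAN
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_summaries(
    summaries: dict[str, str],
    n_components: int = 3,
    eps: float = 0.3,
    min_samples: int = 2,
) -> dict[int, list[str]]:
    """Group document names by DBSCAN cluster label (-1 means noise)."""
    names = list(summaries)
    texts = [summaries[name] for name in names]

    # 1. TF-IDF features from the summary text.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    # 2. Truncated SVD to reduce the dimensionality of the sparse matrix.
    reduced = TruncatedSVD(n_components=n_components, random_state=0).fit_transform(tfidf)
    # 3. DBSCAN on the reduced vectors; the number of clusters is not fixed in advance.
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(reduced)

    groups: dict[int, list[str]] = defaultdict(list)
    for name, label in zip(names, labels):
        groups[int(label)].append(name)
    return dict(groups)


# Hypothetical file names and summaries, loosely modeled on the example set.
example = {
    "promo_a.pdf": "Car financing promotion with low interest rate for a sedan model",
    "promo_b.pdf": "Car financing promotion with monthly payments for an SUV model",
    "company_a.pdf": "Certificate of establishment of a limited company in Chile with capital",
    "company_b.pdf": "Certificate of establishment of a company in Chile listing partners and capital",
    "resume_a.pdf": "Resume of a software developer with Python experience",
    "resume_b.pdf": "Resume of a software developer with Java experience",
}
groups = cluster_summaries(example)
```

The returned dictionary gives both pieces of information the tool reports per category: the document names and, through the length of each list, the number of documents. Which documents end up together depends heavily on `eps` and on the text itself, so these values would need tuning on real summaries.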

7. Agent:

A conversational agent is built on GPT-4 with access to the vector database and the clustering tool. It can formulate several internal questions or queries to compose an answer to a user's question.

User instructions

1. Document upload:

The user uploads the PDF documents to analyze in the “Input” box. If needed, more documents can be uploaded once the previous upload has completed.

2. Chatbot interaction:

After the documents are uploaded, the user can ask questions about them in natural language. The user can clear the conversation at any time and, optionally, clear the files as well to start over.

Example documents

The example set contains documents certifying the establishment of companies in Chile, car financing promotions, and resumes.

Questions

What are these documents about?
How can these documents be grouped? Make a detailed list
Reduce the number of categories to three
Which file names belong to each of these categories?
What is the price for each car model offered?
Which model has the lowest interest rate?
What are the main differences between the developers?
Make a brief description of Chahuan y Filippi Limitada
Which Chilean company has the largest capital?

Soko Solutions is headquartered in the Washington DC metro area and has development centers in Latin America & Europe.
