Using ChatGPT for Question Answering on Your Own Data

"leverage ChatGPT to answer natural language questions on a variety of text repositories... [via a] combination of embeddings, vector search, and prompt engineering...

  • Embeddings are mathematical representations of words, phrases, or even entire documents as vectors ... sequences [with] similar meaning are “close together” in a high-dimensional space... encode the semantic meaning ...
  • Vector databases provide an efficient way to store and search for embeddings... designed to perform similarity searches... quickly identify the most relevant matches for a given query...
  • Prompt engineering ... guide the behavior of [LLMs] by carefully crafting input prompts."
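
To make "close together" concrete, here is a minimal sketch of a similarity search using cosine similarity, one common closeness measure for embeddings. The three-dimensional vectors and the helper names (`cosine_similarity`, `most_similar`, `toy_vectors`) are invented for illustration and are not from the article; real embeddings come from a trained model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical hand-written "embeddings" (assumption: illustrative values only).
toy_vectors = {
    "cat": [0.90, 0.10, 0.00],
    "kitten": [0.85, 0.15, 0.05],
    "invoice": [0.00, 0.20, 0.95],
}


def most_similar(query_vector, vectors, exclude=None):
    """Brute-force search: return the key whose vector is closest to the query."""
    candidates = {k: v for k, v in vectors.items() if k != exclude}
    return max(candidates, key=lambda k: cosine_similarity(query_vector, candidates[k]))


closest = most_similar(toy_vectors["cat"], toy_vectors, exclude="cat")
print(closest)
```

A vector database performs the same kind of lookup, but uses indexing so that it does not have to compare the query against every stored vector.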

He then sets out how to use "open-source Python package Langchain to ... streamline the following process":

  • preprocess your data
  • create embeddings of it "using a pre-trained language model like ChatGPT"
  • put them in a vector database
  • translate each user query into an embedding using the same model
  • "Perform a similarity search in the vector database to identify the most relevant matches...
  • Craft a prompt that combines the user’s query" with the text returned from the vector database, and feed it to the LLM
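
The steps above can be sketched end to end without any external service. This is not the article's LangChain code: the `embed` function is a toy word-count stand-in for a real embedding model, `InMemoryVectorStore` is a hypothetical list-backed substitute for a vector database, and the sample documents and prompt wording are made up. The final call to the LLM is left out; the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Preprocess: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy stand-in for an embedding model (assumption): a sparse word-count vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    """Hypothetical minimal vector store: keeps (embedding, text) pairs in a list."""

    def __init__(self):
        self.items = []

    def add(self, text):
        self.items.append((embed(text), text))

    def search(self, query, k=2):
        # Embed the query with the same function used for the documents.
        query_vector = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(query_vector, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]


def build_prompt(question, passages):
    """Combine the user's question with the retrieved passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up sample data (assumption: not from the article).
documents = [
    "Refunds are issued within 14 days of a return request.",
    "Our office is closed on public holidays.",
    "Shipping to Europe takes five to seven business days.",
]

store = InMemoryVectorStore()
for doc in documents:
    store.add(doc)

question = "How long do refunds take?"
passages = store.search(question, k=1)
prompt = build_prompt(question, passages)
print(prompt)  # This string is what would be fed to the LLM.
```

Word counts only match exact tokens, so this toy version misses paraphrases that a learned embedding would catch. Swapping `embed` for a real embedding model and the list for a vector database keeps the same overall shape.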

