Large language models (LLMs) like GPT-4 and ChatGPT are useful for a wide range of applications, including chatbots, language translation, and content creation. These models, however, are only as accurate as the data given to them. If that data does not include the information needed to answer a question, the model cannot answer it reliably. This is where customizing the model with your own data comes in.
By using document embeddings, you can give your LLM context from your own custom data: you prepend the relevant content to the standard prompt. Embeddings are numerical vectors that capture the semantic features of a piece of text. They are produced by a machine-learning model trained on a large dataset; OpenAI's Embedding API is one way to create them. Once a vector is created, you can store it in a "vector database" such as Faiss, developed by Facebook (Meta).
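The idea can be sketched in a few lines of plain Python. This is a minimal, illustrative stand-in: the `embed` function below is a toy bag-of-words counter over a hypothetical fixed vocabulary, not a real embedding model, and the brute-force `search` loop stands in for what a vector database like Faiss does at scale. In a real pipeline you would call an embedding API and index the vectors instead.

```python
import math
from collections import Counter

# Toy stand-in for an embedding model: counts of words from a fixed
# vocabulary. A real system would call an embedding API instead.
VOCAB = ["cat", "dog", "pet", "stock", "market", "price"]

def embed(text: str) -> list[float]:
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: the standard way to compare embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# A vector database is, at its core, stored vectors plus similarity search.
docs = ["the cat is a pet", "the dog is a pet", "the stock market price rose"]
index = [(embed(d), d) for d in docs]

def search(query: str) -> str:
    # Brute-force nearest neighbor; Faiss does this efficiently at scale.
    q = embed(query)
    return max(index, key=lambda pair: cosine(q, pair[0]))[1]

print(search("my pet cat"))  # returns the document closest in vector space
```

The retrieved document is what would then be prepended to the prompt as context before the prompt is sent to the LLM.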
This whole workflow is accessible through LangChain, a Python library for building LLM applications. With LangChain you can mix and match different embeddings, LLMs, and vector databases.
When building the application, there are a few things to keep in mind. Use the same embedding model for documents and prompts, so their vectors live in the same space. LLMs have token limits that need to be considered: keep documents and prompts to around a thousand tokens or less, and split longer documents into chunks with roughly 100 tokens of overlap so that context is not lost at chunk boundaries. Another thing to consider is fine-tuning the model, as it can reduce the time and money spent.
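The overlapping-chunk rule above can be sketched as follows. As a simplifying assumption, "tokens" here are whitespace-split words; a real pipeline would use the model's own tokenizer (e.g. tiktoken for OpenAI models), and the 1000/100 defaults simply mirror the numbers mentioned in the text.

```python
# Split a long token sequence into overlapping chunks so that no
# chunk exceeds the size limit and adjacent chunks share context.
def chunk(tokens: list[str], size: int = 1000, overlap: int = 100) -> list[list[str]]:
    step = size - overlap  # advance by 900 tokens per chunk by default
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last chunk already reached the end of the text
    return chunks

# Example: a 2500-"token" document becomes three chunks, and each
# adjacent pair shares 100 tokens of overlap.
words = [f"w{i}" for i in range(2500)]
parts = chunk(words)
print([len(p) for p in parts])
```

Each chunk is then embedded and stored separately, so a query can retrieve just the relevant piece of a long document.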
The person mentioned in this article is Harrison Chase, the creator of LangChain. The company mentioned in the article is OpenAI, founded as a nonprofit with a mission to ensure that artificial general intelligence benefits all of humanity. OpenAI has released powerful models such as GPT-3 and GPT-4 and launched public services such as its embedding API. OpenAI has also contributed to developing safer and more reliable AI systems, and to robotics research such as a robotic hand trained to solve a Rubik's Cube.