Question answering (QA) is an NLP technique that allows machines to answer questions posed in natural language. It uses various approaches such as rule-based, information retrieval-based, and machine learning-based systems, including transformer-based language models like BERT and GPT, to generate accurate answers for a variety of applications. Here I use the OpenAI API. Check my GitHub for the notebook.

LangChain is an open-source Python library for building applications on top of large language models; it provides composable building blocks such as chains, prompts, document loaders, and vector stores for researchers, developers, and NLP practitioners.

The end goal: define a list of Wikipedia articles and then use an NLP question-answering approach to answer questions about those predefined texts. I implemented two functions, one without and one with a context feature. The context-aware version keeps past questions and answers in mind, which helps it handle follow-up questions. We first import the required libraries; as you can see, they are mostly LangChain-related.

# libraries
import os

import requests
from langchain import OpenAI
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.docstore.document import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.faiss import FAISS

# replace the placeholder below with your own OpenAI API key
os.environ["OPENAI_API_KEY"] = '12345_openAi_api_key_12345'

The code below defines a Python function that retrieves a Wikipedia article through the Wikipedia API for a given title. It can fetch either the entire page or just the introduction, and returns a Document object containing the page text plus the source URL as metadata.

def query_wikipedia(title, first_paragraph_only=False):
  # build the MediaWiki API request for the plain-text extract of the page
  base_url = "https://en.wikipedia.org"
  url = f"{base_url}/w/api.php?format=json&action=query&prop=extracts&explaintext=1&titles={title}"
  if first_paragraph_only:
    url += "&exintro=1"  # restrict the extract to the introduction
  data = requests.get(url).json()
  # wrap the extract in a Document, keeping the article URL as its source
  return Document(
    metadata={"source": f"{base_url}/wiki/{title}"},
    page_content=list(data["query"]["pages"].values())[0]["extract"],
  )

sources = [
  query_wikipedia("Philosophy_of_Friedrich_Nietzsche"),
  query_wikipedia("Plato"),
  query_wikipedia("Confucius"),
  query_wikipedia("Immanuel_Kant"),
]
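A quick sanity check (not in the notebook above, just a suggestion) confirms each article came through and shows how much text we are working with:

# optional: confirm each article was fetched and inspect its size
for doc in sources:
    print(doc.metadata["source"], "-", len(doc.page_content), "characters")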

The Python function below, qa_vector_store, takes three parameters: chain, question, and sources. It splits each source text into chunks, wraps every chunk in a Document object, and uses them to build a FAISS vector store. It then runs a similarity search on the vector store with the question and feeds the retrieved chunks, together with the question, to the QA chain. The function returns the chain's output text, which is (ideally) the answer to the original question.

def qa_vector_store(chain, question, sources):
    # split each article into ~1024-character chunks that fit the prompt
    splitter = CharacterTextSplitter(separator=" ", chunk_size=1024, chunk_overlap=0)
    chunks = []
    for src in sources:
        for chunk in splitter.split_text(src.page_content):
            document = Document(page_content=chunk, metadata=src.metadata)
            chunks.append(document)
    # embed the chunks and index them in a FAISS vector store
    vector_store = FAISS.from_documents(chunks, OpenAIEmbeddings())
    # retrieve the 4 chunks most similar to the question and pass them to the chain
    inputs = {
        "input_documents": vector_store.similarity_search(question, k=4),
        "question": question
    }
    response = chain(inputs, return_only_outputs=True)
    outputs = response["output_text"]
    return outputs
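One thing to be aware of: this function re-embeds every chunk and rebuilds the FAISS index on each call, which costs time and API tokens. A minimal sketch of building the store once and reusing it across questions; the names build_vector_store and qa_cached are my own, not part of the original notebook:

# sketch: build the index once, then reuse it for every question
def build_vector_store(sources):
    splitter = CharacterTextSplitter(separator=" ", chunk_size=1024, chunk_overlap=0)
    chunks = [
        Document(page_content=chunk, metadata=src.metadata)
        for src in sources
        for chunk in splitter.split_text(src.page_content)
    ]
    return FAISS.from_documents(chunks, OpenAIEmbeddings())

def qa_cached(chain, question, vector_store):
    inputs = {
        "input_documents": vector_store.similarity_search(question, k=4),
        "question": question,
    }
    return chain(inputs, return_only_outputs=True)["output_text"]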

# This code initializes an OpenAI language model "llm" with the "temperature" parameter set to 0 for deterministic answers.
# It then builds a question-answering chain "chain" that prompts the LLM to answer from the supplied documents and cite its sources.
llm = OpenAI(model_name='text-davinci-003', temperature=0, openai_api_key=os.environ["OPENAI_API_KEY"])
chain = load_qa_with_sources_chain(llm)

Now we can ask questions and let the model search the previously defined Wikipedia articles for answers, as shown below.

qa_vector_store(chain, "What is the meaning of life for Plato?", sources)
' Plato believed that the meaning of life was to gain knowledge of the Forms, which are eternal and unchanging.\nSOURCES: https://en.wikipedia.org/wiki/Plato'

Interesting! It is fun to play around with and see what it is good at and what it struggles with. To slightly improve the question answering we can add context: the information from previous questions and the answers received.

# basic implementation of context
def qa_vector_store(chain, question, context, sources):
    splitter = CharacterTextSplitter(separator=" ", chunk_size=1024, chunk_overlap=0)
    chunks = []
    for src in sources:
        for chunk in splitter.split_text(src.page_content):
            document = Document(page_content=chunk, metadata=src.metadata)
            chunks.append(document)
    vector_store = FAISS.from_documents(chunks, OpenAIEmbeddings())
    inputs = {
        # prepend the conversation so far to the query, so retrieval can use it
        "input_documents": vector_store.similarity_search(context + " [SEP] " + question, k=4),
        "question": question
    }
    response = chain(inputs, return_only_outputs=True)
    outputs = response["output_text"]
    return outputs

txt = ''
answer = qa_vector_store(chain, "Give a two sentence summary of the ideology of Confucius.", txt, sources)
txt += " [SEP] " + answer
answer
' Confucius was a Chinese philosopher and politician who emphasized personal and governmental morality, correctness of social relationships, justice, kindness, and sincerity. His followers competed with many other schools during the Hundred Schools of Thought era, only to be suppressed in favor of the Legalists during the Qin dynasty.\nSOURCES: https://en.wikipedia.org/wiki/Confucius'
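Since the answer was appended to txt, a follow-up can now build on the previous turn. A sketch of a second call (my own example question, output omitted); note that in this implementation the context only steers the similarity search, while the chain itself still sees just the bare question:

# follow-up turn: the accumulated context steers the retrieval step
answer = qa_vector_store(chain, "How does his view of morality compare to Kant's?", txt, sources)
txt += " [SEP] " + answer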

The power of LLMs like (Chat)GPT can be harnessed through appropriate prompts and contextual information, allowing for great results. It of course depends on the size of the dataset used, and in this case it is limited to a single use case (talking with philosophers). An instance of this is demonstrated through LangChain’s capability to interface with OpenAI’s GPT and construct prompts that include relevant context to effectively answer questions about documents.

Rather than fine-tuning the model, the focus is on selecting the relevant information and letting the LLM work with it. Although LangChain has many other potential uses, this example highlights how general-purpose LLMs, models trained on a vast range of data, can be productively applied to specific tasks through prompt engineering.

