Question answering (QA) is an NLP technique that allows machines to answer questions posed in natural language. It uses various approaches such as rule-based, information retrieval-based, and machine learning-based systems, including transformer-based language models like BERT and GPT, to generate accurate answers for a variety of applications. Here I use the OpenAI API. Check my GitHub for the notebook.

LangChain is an open-source Python library for building applications on top of large language models; it provides composable building blocks such as chains, prompts, document loaders, and vector stores for researchers, developers, and NLP practitioners.

The end goal: define a list of Wikipedia articles and then use an NLP question-answering approach to answer questions about those predefined texts. I implemented two functions, one without and one with a context feature. The context-aware version keeps past questions and answers in mind, which helps it handle follow-up questions. We first import the required libraries; as you can see, they are mostly LangChain-related.

# libraries
import os

import requests
from langchain import OpenAI
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.docstore.document import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.faiss import FAISS

# replace the placeholder below with your own OpenAI API key
os.environ["OPENAI_API_KEY"] = '12345_openAi_api_key_12345'

The code below defines a Python function that retrieves a Wikipedia article through the Wikipedia API for a given title. It can fetch either the entire page or just the introduction, and returns a Document object containing the page text plus the source URL as metadata.

def query_wikipedia(title, first_paragraph_only=False):
  # build the MediaWiki API request for the plain-text extract of the page
  base_url = "https://en.wikipedia.org"
  url = f"{base_url}/w/api.php?format=json&action=query&prop=extracts&explaintext=1&titles={title}"
  if first_paragraph_only:
    url += "&exintro=1"  # restrict the extract to the introduction
  data = requests.get(url).json()
  # wrap the extract in a Document, keeping the article URL as its source
  return Document(
    metadata={"source": f"{base_url}/wiki/{title}"},
    page_content=list(data["query"]["pages"].values())[0]["extract"],
  )

sources = [
  query_wikipedia("Philosophy_of_Friedrich_Nietzsche"),
  query_wikipedia("Plato"),
  query_wikipedia("Confucius"),
  query_wikipedia("Immanuel_Kant"),
]
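A quick sanity check (not in the notebook above, just a suggestion) confirms each article came through and shows how much text we are working with:

# optional: confirm each article was fetched and inspect its size
for doc in sources:
    print(doc.metadata["source"], "-", len(doc.page_content), "characters")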

The Python function below, qa_vector_store, takes three parameters: chain, question, and sources. It splits each source text into chunks, wraps every chunk in a Document object, and uses them to build a FAISS vector store. It then runs a similarity search on the vector store with the question and feeds the retrieved chunks, together with the question, to the QA chain. The function returns the chain's output text, which is (ideally) the answer to the original question.

def qa_vector_store(chain, question, sources):
    # split each article into ~1024-character chunks that fit the prompt
    splitter = CharacterTextSplitter(separator=" ", chunk_size=1024, chunk_overlap=0)
    chunks = []
    for src in sources:
        for chunk in splitter.split_text(src.page_content):
            document = Document(page_content=chunk, metadata=src.metadata)
            chunks.append(document)
    # embed the chunks and index them in a FAISS vector store
    vector_store = FAISS.from_documents(chunks, OpenAIEmbeddings())
    # retrieve the 4 chunks most similar to the question and pass them to the chain
    inputs = {
        "input_documents": vector_store.similarity_search(question, k=4),
        "question": question
    }
    response = chain(inputs, return_only_outputs=True)
    outputs = response["output_text"]
    return outputs
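One thing to be aware of: this function re-embeds every chunk and rebuilds the FAISS index on each call, which costs time and API tokens. A minimal sketch of building the store once and reusing it across questions; the names build_vector_store and qa_cached are my own, not part of the original notebook:

# sketch: build the index once, then reuse it for every question
def build_vector_store(sources):
    splitter = CharacterTextSplitter(separator=" ", chunk_size=1024, chunk_overlap=0)
    chunks = [
        Document(page_content=chunk, metadata=src.metadata)
        for src in sources
        for chunk in splitter.split_text(src.page_content)
    ]
    return FAISS.from_documents(chunks, OpenAIEmbeddings())

def qa_cached(chain, question, vector_store):
    inputs = {
        "input_documents": vector_store.similarity_search(question, k=4),
        "question": question,
    }
    return chain(inputs, return_only_outputs=True)["output_text"]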

# This code initializes an OpenAI language model "llm" with the "temperature" parameter set to 0 for deterministic answers.
# It then builds a question-answering chain "chain" that prompts the LLM to answer from the supplied documents and cite its sources.
llm = OpenAI(model_name='text-davinci-003', temperature=0, openai_api_key=os.environ["OPENAI_API_KEY"])
chain = load_qa_with_sources_chain(llm)

Now we can ask questions and let the model search the previously defined Wikipedia articles for answers, as shown below.

qa_vector_store(chain, "What is the meaning of life for Plato?", sources)
' Plato believed that the meaning of life was to gain knowledge of the Forms, which are eternal and unchanging.\nSOURCES: https://en.wikipedia.org/wiki/Plato'

Interesting! It is fun to play around with and see what it is good at and what it struggles with. To slightly improve the question answering we can add context: the information from previous questions and the answers received.

# basic implementation of context
def qa_vector_store(chain, question, context, sources):
    splitter = CharacterTextSplitter(separator=" ", chunk_size=1024, chunk_overlap=0)
    chunks = []
    for src in sources:
        for chunk in splitter.split_text(src.page_content):
            document = Document(page_content=chunk, metadata=src.metadata)
            chunks.append(document)
    vector_store = FAISS.from_documents(chunks, OpenAIEmbeddings())
    inputs = {
        # prepend the conversation so far to the query, so retrieval can use it
        "input_documents": vector_store.similarity_search(context + " [SEP] " + question, k=4),
        "question": question
    }
    response = chain(inputs, return_only_outputs=True)
    outputs = response["output_text"]
    return outputs

txt = ''
answer = qa_vector_store(chain, "Give a two sentence summary of the ideology of Confucius.", txt, sources)
txt += " [SEP] " + answer
answer
' Confucius was a Chinese philosopher and politician who emphasized personal and governmental morality, correctness of social relationships, justice, kindness, and sincerity. His followers competed with many other schools during the Hundred Schools of Thought era, only to be suppressed in favor of the Legalists during the Qin dynasty.\nSOURCES: https://en.wikipedia.org/wiki/Confucius'
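Since the answer was appended to txt, a follow-up can now build on the previous turn. A sketch of a second call (my own example question, output omitted); note that in this implementation the context only steers the similarity search, while the chain itself still sees just the bare question:

# follow-up turn: the accumulated context steers the retrieval step
answer = qa_vector_store(chain, "How does his view of morality compare to Kant's?", txt, sources)
txt += " [SEP] " + answer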

The power of LLMs like (Chat)GPT can be harnessed through appropriate prompts and contextual information, allowing for great results. It of course depends on the size of the dataset used, and in this case it is limited to a single use case (talking with philosophers). An instance of this is demonstrated through LangChain’s capability to interface with OpenAI’s GPT and construct prompts that include relevant context to effectively answer questions about documents.

Rather than fine-tuning the model, the focus is on selecting the relevant information and letting the LLM work with it. Although LangChain has many other potential uses, this example highlights how general-purpose LLMs, models trained on a vast range of data, can be productively applied to specific tasks through prompt engineering.

