At waypost.ai we have built an in-house question-answering method on top of LLM technology that not only answers questions about an application but also guides users by telling them which buttons to click to achieve their goal. Common LLM issues such as hallucinations are an even bigger problem here, because they would send users to the wrong place and waste their time.
Advances in making LLM outputs more reliable arrive rapidly, but the methods seem to be converging, which points to emerging standards for improving LLM reliability and validation. This article first explains why reliability and validation matter, then dives into the technical details and recent innovations.
The problem of LLM reliability and validation
Ensuring that an LLM is reliable, and validating that it performs properly, has been a challenge for several reasons. In fact, it is often the main obstacle to running LLMs in production environments.
1. Generalization and Overfitting: Did the model actually generalize from its training data, or does it merely 'copy-paste' that data back to end users? And has it overfitted, so that it performs poorly on new data?
2. Bias and Fairness: Bias present in the training data carries over into the LLM's output.
3. Interpretability: How did a model arrive at a specific answer? LLMs shouldn’t be black boxes.
4. Safety and Misinformation: There are concerns about LLMs generating inaccurate, misleading, or harmful information, especially when used without proper validation or in sensitive applications.
5. Model Robustness: Identifying misleading prompts designed to trick or hack the model is crucial to guarantee its proper functioning.
6. Validation Limitations: It’s challenging to validate LLMs across all possible queries or prompts due to their vast potential input space. This means that while we can test and validate them in specific contexts, it’s nearly impossible to ensure their reliability universally.
7. Dependency on Prompting: The way a question or prompt is framed can greatly influence the output of the LLM. This can lead to inconsistencies in responses based on slight variations in how a query is posed, as the sketch after this list illustrates.
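As a concrete illustration of points 6 and 7, one lightweight check is to pose the same question in several phrasings and compare the answers. The sketch below uses the OpenAI Python client; the model name, paraphrases, and similarity threshold are placeholders for illustration, not a recommendation.

```python
# Minimal sketch: probe prompt sensitivity by paraphrasing the same question
# and checking whether the answers stay consistent. Model name and threshold
# are illustrative placeholders.
from difflib import SequenceMatcher
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

paraphrases = [
    "How do I export my report as a PDF?",
    "What is the way to save a report in PDF format?",
    "Can I download my report as a PDF file?",
]

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": question}],
        temperature=0,
    )
    return response.choices[0].message.content

answers = [ask(q) for q in paraphrases]

# Flag the batch if any pair of answers diverges too much.
for i in range(len(answers)):
    for j in range(i + 1, len(answers)):
        similarity = SequenceMatcher(None, answers[i], answers[j]).ratio()
        if similarity < 0.6:  # arbitrary threshold for illustration
            print(f"Inconsistent answers for paraphrases {i} and {j} "
                  f"(similarity {similarity:.2f})")
```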
How the problem is commonly managed
1. Fine-tuning: Using narrower, domain-specific datasets can help in refining the model’s responses.
2. Prompt Engineering: Designing prompts more carefully can yield more reliable and consistent results.
3. External Validation: Using external tools or models to validate the outputs of an LLM.
4. User Feedback: Incorporating feedback loops to allow users to flag incorrect or inappropriate outputs.
5. Improved datasets: Raising dataset quality by pruning irrelevant information and optimizing for the information relevant to the LLM's tasks.
External validation is the focus of this article. The other methods for managing LLM reliability are fairly clear-cut and vary in relevance depending on your use case. There are many methods for validating LLMs within specific domains [1], but we are interested in general QnA use cases. External validation is mostly use-case independent and shows the most promise for managing the issue.
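To make the idea concrete before looking at dedicated tooling, here is a minimal sketch of external validation using a second model call as a judge. The function name, grading prompt, and model are hypothetical placeholders; the tools below go far beyond this.

```python
# Minimal sketch of external validation: a second LLM call grades whether an
# answer is supported by the retrieved context. All names and the grading
# prompt are illustrative, not a production recipe.
from openai import OpenAI

client = OpenAI()

def validate_answer(question: str, context: str, answer: str) -> bool:
    """Ask a judge model whether the answer is grounded in the context."""
    verdict = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{
            "role": "user",
            "content": (
                "You are a strict validator. Given the context, question and "
                "answer below, reply with exactly PASS if the answer is fully "
                "supported by the context, otherwise reply FAIL.\n\n"
                f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer: {answer}"
            ),
        }],
        temperature=0,
    )
    return verdict.choices[0].message.content.strip().upper().startswith("PASS")
```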
External validation of LLMs
LangSmith
Unsurprisingly, LangSmith is developed by the same company behind the open-source LangChain framework. LangSmith is a platform for building production-grade LLM applications: it lets you debug, test, evaluate, and monitor chains and intelligent agents built on any LLM framework [2], and it integrates seamlessly with LangChain. It is currently in beta, but worth keeping in mind as it aims to cover the whole spectrum of LLM validation, from training to production [3].
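As a rough sketch, tracing can be switched on through environment variables and small question/answer datasets can be registered for later evaluation runs. The exact SDK surface changes between releases, so treat this as an outline rather than the definitive API; the project and dataset names are made up.

```python
# Sketch: enable LangSmith tracing for LangChain runs and register a tiny
# evaluation dataset. Names are placeholders; check the LangSmith docs for
# the current SDK surface.
import os
from langsmith import Client

os.environ["LANGCHAIN_TRACING_V2"] = "true"         # trace LangChain runs
os.environ["LANGCHAIN_API_KEY"] = "<your-key>"      # LangSmith API key
os.environ["LANGCHAIN_PROJECT"] = "qna-validation"  # project to log runs under

client = Client()
dataset = client.create_dataset("qna-regression-set")
client.create_example(
    inputs={"question": "How do I export my report as a PDF?"},
    outputs={"answer": "Open the report and click Export > PDF."},
    dataset_id=dataset.id,
)
```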
LangKit
A slightly less versatile option for validation, but one with a lower barrier to entry, is LangKit. It can be imported as a Python library and immediately checks text quality, relevance, security, sentiment, and toxicity. It is released under the Apache-2.0 license, so feel free to use it in production and commercial environments [4].
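A minimal usage sketch, based on the project README (metric modules and the whylogs integration may differ between versions):

```python
# Sketch: profile a prompt/response pair with LangKit's LLM metrics
# (text quality, relevance, sentiment, toxicity, ...) via whylogs.
import whylogs as why
from langkit import llm_metrics

schema = llm_metrics.init()  # registers LangKit's LLM metrics with whylogs

results = why.log(
    {"prompt": "How do I export my report as a PDF?",
     "response": "Open the report and click Export > PDF."},
    schema=schema,
)
print(results.view().to_pandas().columns)  # inspect the computed metrics
```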
BLEU / ROUGE
Both the BLEU and ROUGE algorithms are commonly used to evaluate the quality of machine-translated text and summaries. The advantage of both methods is that they are quick and inexpensive to implement, and they have been shown to correlate well with human evaluation. The main difference is that BLEU focuses on precision and ROUGE on recall [5]. Each method has its own Python library; for more detailed evaluation, consider the SuperGLUE benchmark [6].
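For example, with the nltk and rouge-score packages (the reference and candidate strings are made up):

```python
# Sketch: score a generated answer against a reference with BLEU (precision-
# oriented) and ROUGE (recall-oriented). Strings are illustrative.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "Open the report and click Export to PDF."
candidate = "Open your report, then click Export to PDF."

bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```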
ReLM
ReLM stands for 'regular expression engine for Language Models'. The method is especially valuable for LLMs with a wide range of use cases: how can we predict and identify bias, unsuitable language, data memorization, toxicity, and plain incorrect language? As the researchers note: 'ReLM's success stems from using a compact graph representation of the solution space, which is derived from regular expressions and then compiled into an LLM-specific representation before being executed. Therefore, users are not required to be familiar with the LLM's inner workings; tests produce the same results as if all possible strings existed in the real world' [7]. Check out its GitHub repo [8].
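The ReLM API itself is best learned from the repository. As a much simpler illustration of the underlying idea, plain regular expressions can already gate obviously malformed outputs; the pattern below is a made-up example for our button-guidance answers, not the ReLM library.

```python
# Simplified illustration of regex-based output validation (NOT the ReLM API):
# require answers to follow a fixed "Click the '<Button>' button ..." format.
import re

ALLOWED_FORMAT = re.compile(r"^Click the '[\w\s]+' button(,? then .+)?\.$")

def is_well_formed(answer: str) -> bool:
    return bool(ALLOWED_FORMAT.match(answer))

print(is_well_formed("Click the 'Export' button, then choose PDF."))  # True
print(is_well_formed("I think maybe you could try settings?"))        # False
```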
OpenAI Evals
As expected, OpenAI is tackling this problem as well; one of its products is Evals [9]. The advantage of OpenAI Evals is that the evaluations feed back into the underlying models, so OpenAI is in effect crowdsourcing the evaluation of its models. In my own experience I got results faster with the other methods mentioned here. Still, crowdsourced model evaluation seems a great approach for OpenAI to improve their models, and it is easy to set up if you are already using the OpenAI API [10].
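A basic Evals run is driven by a JSONL file of samples plus a registry entry. The snippet below sketches the sample format used by the simple match-style templates and the CLI call; file and eval names are placeholders, so check the repo for the current layout.

```python
# Sketch: write a tiny samples file in the JSONL format used by OpenAI Evals'
# basic match templates. File and eval names are placeholders.
import json

samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with the exact button name."},
            {"role": "user", "content": "Which button exports a report as PDF?"},
        ],
        "ideal": "Export",
    },
]

with open("waypost_qna.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# After registering the eval in a registry YAML that points at this file:
#   oaieval gpt-3.5-turbo waypost-qna
```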
BIG-bench
Google's BIG-bench is more ambitious than most of the methods described above: its aim is not just to provide a numerical measure of model performance but also to predict the future capabilities of LLMs [11]. That second aim is not very relevant for a small team of developers building an internal chatbot. But if you are building an application that uses an LLM for multiple use cases, BIG-bench is a great benchmark to consider [12].
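For orientation, BIG-bench tasks can be contributed as JSON files of input/target examples. The sketch below writes such a file from Python; the field names roughly follow the JSON task format described in the repository and should be checked against the current schema, and the task itself is hypothetical.

```python
# Sketch: a minimal BIG-bench-style JSON task definition written from Python.
# Field names approximate the repo's JSON task format; verify against the
# current schema before contributing. The task content is made up.
import json

task = {
    "name": "waypost_button_guidance",  # hypothetical task name
    "description": "Answer which button achieves a goal in the app.",
    "keywords": ["question_answering", "zero-shot"],
    "metrics": ["exact_str_match"],
    "examples": [
        {"input": "Which button exports a report as PDF?", "target": "Export"},
        {"input": "Which button deletes a draft?", "target": "Delete"},
    ],
}

with open("task.json", "w") as f:
    json.dump(task, f, indent=2)
```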
Future Directions in Large Language Model Evaluation
The field of Natural Language Processing is evolving rapidly, which demands precise and robust evaluation methods for Large Language Models (LLMs). The methods described here are not the only ones that exist, but they are reliable options worth considering. Current industry standards, including metrics like ROUGE and frameworks such as BIG-bench, face challenges such as static benchmarks, inherent dataset biases, and the difficulty of assessing comprehension comprehensively. Effective LLM evaluation should span varied datasets and reflect real-world applications [13]. Key elements for future assessments include regularly updated benchmarks, proactive bias mitigation, and the ongoing incorporation of feedback.
As AI increasingly shapes our era, there’s a growing need for evaluations that are rigorous, adaptable, and ethically responsible. The standards set now will shape the AI advancements of the future. Upcoming evaluation methods in AI and NLP will likely emphasize understanding context, emotional resonance, and nuanced linguistic aspects. Ethical considerations, particularly concerning bias and fairness, are poised to become central to evaluation processes. User feedback is expected to become a pivotal element in ongoing model refinement and assessment.
Literature
1. https://toloka.ai/blog/evaluating-llms/
2. https://docs.smith.langchain.com/
3. https://blog.logrocket.com/langsmith-test-llms-ai-applications/
4. https://github.com/whylabs/langkit
5. https://clementbm.github.io/theory/2021/12/23/rouge-bleu-scores.html
6. https://super.gluebenchmark.com/
7. https://www.marktechpost.com/2023/06/08/cmu-researchers-introduce-relm-an-ai-system-for-validating-and-querying-llms-using-standard-regular-expressions/
8. https://github.com/Relm4/Relm4
9. https://github.com/openai/evals
10. https://techcrunch.com/2023/03/14/with-evals-openai-hopes-to-crowdsource-ai-model-testing/
11. https://deepgram.com/learn/big-bench-llm-benchmark-guide
12. https://github.com/google/BIG-bench
13. https://www.lakera.ai/blog/large-language-model-evaluation