On the 12th of April 2023, Databricks released the first “open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use”, named Dolly 2.0. I had been waiting for a free LLM for commercial use for a couple of months, hoping one would appear. Databricks did it first, and I am curious what the future holds.
I have written this article to share my knowledge and increase the chance that others will start to play around with Dolly 2.0. Besides promoting these kinds of initiatives by lowering the bar for people to use them, a second motivation is simply to increase my own expertise on the subject. I work as a data scientist on chat bots, and multiple projects have been flipped around as a result of LLMs (flipped around by me). So now it is up to me to find ways to get value out of these new technologies. Here is an introduction to that attempt.
Background
In November 2022, the proprietary instruction-following model ChatGPT was released, trained on trillions of words from the web using massive amounts of GPUs. This pushed other companies like Google to release their own instruction-following models. In February 2023, Meta released the weights of their LLaMA language models. In March, Stanford built the Alpaca model based on LLaMA, fine-tuned on a small dataset of human-like questions and answers, resulting in ChatGPT-like interactivity.
On March 23, 2023, Databricks came out with Dolly 1.0, stating that: ‘Dolly works by taking an existing open source 6 billion parameter model from EleutherAI and modifying it ever so slightly to elicit instruction following capabilities such as brainstorming and text generation not present in the original model, using data from Alpaca‘.
Dolly 2.0 was able to come out so quickly after Dolly 1.0 because the main change was to the training data. Dolly 1.0 was fine-tuned on the Stanford Alpaca team’s dataset, which contains output from ChatGPT and therefore made the model not free to use commercially. For Dolly 2.0 that data was removed and replaced with 13,000 question-answer pairs generated by their 5,000 Databricks employees to improve the model. The resulting model is now free for research and commercial use, which seems like a big step forward.
Technical details of dolly-v2-12b
For completeness’ sake I will add some of the technical details of the model here. Scroll down if you’re just interested in running it. dolly-v2-12b is the name of the model, and Hugging Face hosts it with a detailed description. It is based on pythia-12b, trained on ~15k instruction/response fine-tuning records from databricks-dolly-15k, generated by Databricks employees. Databricks state that it ‘is not a state-of-the-art model, but does exhibit surprisingly high quality instruction following behavior not characteristic of the foundation model on which it is based‘.
- It has 12 billion parameters.
- Based on pythia-12b.
- Released under a permissive license (CC-BY-SA).
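If you want a feel for the fine-tuning data, the databricks-dolly-15k dataset is hosted on the Hugging Face Hub and can be loaded with the datasets library. A minimal sketch, assuming datasets is installed and the dataset keeps its current name and layout:
# import library (install with: pip install datasets)
from datasets import load_dataset
# load the human-generated instruction/response records
dolly_data = load_dataset("databricks/databricks-dolly-15k")
# peek at the first record and the total number of records
print(dolly_data["train"][0])
print(len(dolly_data["train"]))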
Let’s spin it up
This will be a quick intro to setting it all up. Check the Databricks repo for great information with example notebooks and detailed descriptions.
# import libraries
from transformers import pipeline
# define pipeline
instruct_pipeline = pipeline(model="databricks/dolly-v2-12b", trust_remote_code=True, device_map="auto")
# run the QA pipeline
instruct_pipeline("Explain to me the difference between nuclear fusion and fusion.")
It’s that simple? Yes!
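To actually see the answer you can capture the return value. Based on the model card, the pipeline returns the generated text as a list of dicts, so something along these lines should work:
# capture the pipeline output and print the generated answer
res = instruct_pipeline("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])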
We can make some adjustments such as loading the model with bfloat16 to reduce memory usage. They state that ‘It does not appear to impact output quality. It is also fine to remove it if there is sufficient memory‘.
# import libraries
import torch
from transformers import pipeline
# define pipeline
instruct_pipeline = pipeline(model="databricks/dolly-v2-12b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
# run the QA pipeline
instruct_pipeline("Explain to me the difference between nuclear fusion and fusion.")
The instruction-following pipeline can be loaded using the pipeline function as shown above. This loads a custom InstructionTextGenerationPipeline found in the model repo here, which is why trust_remote_code=True is required.
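The pipeline also accepts the usual text-generation settings per call. As a hedged sketch (exactly which arguments the custom pipeline forwards to generation is an assumption here, so check the pipeline code if something seems to be ignored):
# pass generation settings per call (assumed to be forwarded to the model's generate method)
instruct_pipeline(
    "Write a short poem about data pipelines.",
    max_new_tokens=256,  # limit the length of the generated answer
    do_sample=True,      # sample instead of greedy decoding
    top_p=0.92,          # nucleus sampling threshold
)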
Not using trust_remote_code=True is also possible: download instruct_pipeline.py, store it alongside the notebook you are using, and construct the pipeline from the loaded model and tokenizer:
# import the downloaded pipeline class and the transformers loaders
from instruct_pipeline import InstructionTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer
# load tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-12b", device_map="auto")
# construct the instruction-following pipeline without trust_remote_code
generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)
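The resulting generate_text pipeline can then be used in the same way as instruct_pipeline above (the same list-of-dicts output format is assumed here):
# query the manually constructed pipeline and print the answer
res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])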
Training a model and running unit tests are also discussed in the Dolly repo and worth a read. I will end this article here, as it gives you enough to start playing around with the model and to investigate the chance of using it in production environments. If you find applications for it I am always curious to learn about them, so feel free to share.
I’d love to try it but I’m a total beginner (about everything one needs to know, I fear). Is there a guide or something that I can study?
The 4 links I shared will give a lot of relevant info on LLMs and Dolly. More generally, you can run the code blocks shared here in a Jupyter Notebook. Before you can run the code you need to install the libraries using pip on the command line (so for instance: pip install torch). That is all you need to ask questions and receive answers from Dolly 2.0. Let me know if you need more specific info!
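As a concrete starting point, something like the following on the command line should cover the snippets in this article (a sketch; versions are not pinned here, and accelerate is assumed to be needed because the snippets use device_map="auto"):
# install the libraries used in the code blocks above
pip install transformers torch accelerate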