💫 StarCoder
Paper | Model | Playground | VSCode | Chat
What is this about?
News
- May 9, 2023: We've fine-tuned StarCoder to act as a helpful coding assistant 💬! Check out the `chat/` directory for the training code and play with the model here.
Disclaimer
Before you can use the model, go to hf.co/bigcode/starcoder and accept the agreement. Also make sure you are logged into the Hugging Face Hub with:
```bash
huggingface-cli login
```
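If you prefer to authenticate from Python rather than the CLI, the `huggingface_hub` library provides an equivalent call. A minimal sketch; reading the token from an `HF_TOKEN` environment variable is just one possible setup:

```python
import os

from huggingface_hub import login

# Equivalent to `huggingface-cli login`; HF_TOKEN is assumed to hold an access
# token created at hf.co/settings/tokens.
login(token=os.environ["HF_TOKEN"])
```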
Quickstart
StarCoder was trained on GitHub code, thus it can be used to perform code generation. More precisely, the model can complete the implementation of a function or infer the following characters in a line of code. This can be done with the help of Hugging Face's transformers library.
Installation
First, we have to install all the libraries listed in `requirements.txt`:

```bash
pip install -r requirements.txt
```

Code generation
The code generation pipeline is as follows:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# to save memory consider using fp16 or bf16 by specifying torch_dtype=torch.float16 for example
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```

or
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

checkpoint = "bigcode/starcoder"
model = AutoModelForCausalLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)
print(pipe("def hello():"))
```
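As the comment in the first snippet suggests, memory usage can be reduced by loading the weights in half precision. A minimal sketch of that variant; the fp16 choice and `device_map="auto"` (which requires the `accelerate` package) are illustrative rather than required:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Load the weights in fp16 and let accelerate place them on the available devices.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.float16,
    device_map="auto",
)

inputs = tokenizer("def print_hello_world():", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
```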
Text-generation-inference

```bash
docker run --gpus '"device=0"' -p 8080:80 -v $PWD/data:/data -e HUGGING_FACE_HUB_TOKEN=<YOUR BIGCODE ENABLED TOKEN> -e HF_HUB_ENABLE_HF_TRANSFER=0 -d ghcr.io/huggingface/text-generation-inference:sha-880a76e --model-id bigcode/starcoder --max-total-tokens 8192
```

For more details, see here.
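Once the container is up, the server can be queried over HTTP; for example (a sketch, assuming the port mapping above and default generation parameters):

```bash
curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "def print_hello_world():", "parameters": {"max_new_tokens": 64}}'
```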
Fine-tuning
Here, we showcase how we can fine-tune this LM on a specific downstream task.
Step by step installation with conda
Create a new conda environment and activate it:

```bash
conda create -n env
conda activate env
```

Install the PyTorch version compatible with your CUDA version (see here); for example, the following command works with CUDA 11.6:
```bash
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia
```

Install `transformers` and `peft`:
```bash
conda install -c huggingface transformers
pip install git+https://github.com/huggingface/peft.git
```

Note that you can install the latest development version of `transformers` directly from GitHub with
```bash
pip install git+https://github.com/huggingface/transformers
```

Install `datasets`, `accelerate` and `huggingface_hub`:
```bash
conda install -c huggingface -c conda-forge datasets
conda install -c conda-forge accelerate
conda install -c conda-forge huggingface_hub
```

Finally, install `bitsandbytes` and `wandb`:
```bash
pip install bitsandbytes
pip install wandb
```

To get the full list of arguments with descriptions, you can run the following command on any script:
```bash
python scripts/some_script.py --help
```
Before you run any of the scripts, make sure you are logged in and can push to the Hub:

```bash
huggingface-cli login
```

Make sure you are logged in to wandb:
```bash
wandb login
```

Now that everything is done, you can clone the repository and get into the corresponding directory.
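For example (assuming the public GitHub location of this repository):

```bash
git clone https://github.com/bigcode-project/starcoder.git
cd starcoder
```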
Datasets
For instruction fine-tuning, the dataset should consist of instruction-answer pairs. Unfortunately, such datasets are not ubiquitous, but thanks to the Hugging Face Hub we can access some good proxies, such as the Stack Exchange dataset described below.
Stack Exchange (SE)
Stack Exchange is a well-known network of Q&A websites on topics in diverse fields. It is a place where a user can ask a question and obtain answers from other users. Those answers are scored and ranked based on their quality. The `stack-exchange-instruction` dataset was obtained by scraping the site in order to build a collection of Q&A pairs. A language model can then be fine-tuned on that dataset so that it elicits strong and diverse question-answering skills.
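As a quick sanity check before launching a run, you can stream a few examples and inspect the `question` and `response` columns referenced by the command below. A minimal sketch; mapping the `data/finetune` subset to `data_dir` is an assumption, so check the dataset's page for the exact layout:

```python
from datasets import load_dataset

# Stream the dataset so it does not have to be downloaded in full.
dataset = load_dataset(
    "ArmelR/stack-exchange-instruction",
    data_dir="data/finetune",  # assumption: the --subset below maps to data_dir
    split="train",
    streaming=True,
)

sample = next(iter(dataset))
print(sample["question"][:200])   # --input_column_name in the fine-tuning command
print(sample["response"][:200])   # --output_column_name in the fine-tuning command
```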
To execute the fine-tuning script, run the following command:

```bash
python finetune/finetune.py \
  --model_path="bigcode/starcoder" \
  --dataset_name="ArmelR/stack-exchange-instruction" \
  --subset="data/finetune" \
  --split="train" \
  --size_valid_set 10000 \
  --streaming \
  --seq_length 2048 \
  --max_steps 1000 \
  --batch_size 1 \
  --input_column_name="question" \
  --output_column_name="response" \
  --gradient_accumulation_steps 16 \
  --learning_rate 1e-4 \
  --lr_scheduler_type="cosine" \
  --num_warmup_steps 100 \
  --weight_decay 0.05 \
  --output_dir="./checkpoints"
```

The size of the SE dataset is more manageable when using streaming. We also have to specify the split of the dataset to use. For more details, check the dataset's page on the Hugging Face Hub. To launch the training on multiple GPUs, use the following command:
```bash
python -m torch.distributed.launch \
  --nproc_per_node number_of_gpus finetune/finetune.py \
  --model_path="bigcode/starcoder" \
  --dataset_name="ArmelR/stack-exchange-instruction" \
  --subset="data/finetune" \
  --split="train" \
  --size_valid_set 10000 \
  --streaming \
  --seq_length 2048 \
  --max_steps 1000 \
  --batch_size 1 \
  --input_column_name="question" \
  --output_column_name="response" \
  --gradient_accumulation_steps 16 \
  --learning_rate 1e-4 \
  --lr_scheduler_type="cosine" \
  --num_warmup_steps 100 \
  --weight_decay 0.05 \
  --output_dir="./checkpoints"
```
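Note that recent PyTorch releases deprecate `python -m torch.distributed.launch` in favor of `torchrun`; if your installation ships it, only the launcher needs to change (a sketch; pass the same fine-tuning arguments as in the command above):

```bash
# torchrun replaces "python -m torch.distributed.launch";
# append the same fine-tuning arguments as in the command above.
torchrun --nproc_per_node number_of_gpus finetune/finetune.py \
  --model_path="bigcode/starcoder" \
  --output_dir="./checkpoints"
```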
Merging PEFT adapter layers

If you train a model with PEFT, you'll need to merge the adapter layers with the base model if you want to run inference or evaluation. To do so, run:
```bash
python finetune/merge_peft_adapters.py --model_name_or_path model_to_merge --peft_model_path model_checkpoint

# Push merged model to the Hub
python finetune/merge_peft_adapters.py --model_name_or_path model_to_merge --peft_model_path model_checkpoint --push_to_hub
```

For example:
```bash
python finetune/merge_peft_adapters.py --model_name_or_path bigcode/starcoder --peft_model_path checkpoints/checkpoint-1000 --push_to_hub
```
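Under the hood, merging amounts to loading the base model, attaching the trained adapter, and folding its weights back into the base weights. A minimal sketch with the `peft` API (the paths are placeholders; the repository's script may differ in details such as dtype handling):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "bigcode/starcoder"          # base checkpoint (placeholder)
adapter_path = "checkpoints/checkpoint-1000"   # trained PEFT adapter (placeholder)

# Load the base model and attach the trained adapter layers.
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(base_model, adapter_path)

# Fold the adapter weights into the base model and save a standalone checkpoint.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("starcoder-merged")
AutoTokenizer.from_pretrained(base_model_name).save_pretrained("starcoder-merged")
```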