AutoCoder

Contributions welcome

A basic and simple tool for code auto-completion, fine-tuned from the PyTorch pre-trained GPT-2 variants offered by the awesome 🤗 Transformers library.

Demo

(demo GIF)

Features

  • Supports auto-completion for Python and Java.

Blog linked to this project

Quick Start

Three quick-start options are provided below.

Load from 🤗transformers models

Two fine-tuned models have been uploaded to the 🤗 Transformers model hub. They can be used easily as long as you pip install transformers:

from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("congcongwang/gpt2_medium_fine_tuned_coder")
model = AutoModelWithLMHead.from_pretrained("congcongwang/gpt2_medium_fine_tuned_coder")
# or
# tokenizer = AutoTokenizer.from_pretrained("congcongwang/distilgpt2_fine_tuned_coder")
# model = AutoModelWithLMHead.from_pretrained("congcongwang/distilgpt2_fine_tuned_coder")

use_cuda = True
context = "def factorial"
lang = "python"  # can be "java" as well

if use_cuda:
    model.to("cuda")

# Prefix the context with a language control token so the model
# knows which language to complete.
prompt = ("<python> " if lang == "python" else "<java> ") + context
input_ids = tokenizer.encode(prompt, return_tensors="pt")

outputs = model.generate(input_ids=input_ids.to("cuda") if use_cuda else input_ids,
                         max_length=128,
                         do_sample=True,  # temperature only takes effect when sampling
                         temperature=0.7,
                         num_return_sequences=1)

decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)
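The temperature argument above only changes the output when sampling is enabled; conceptually, it rescales the logits before the softmax, so lower values make the distribution sharper. A minimal, library-free sketch of temperature sampling (the function and its names are illustrative, not part of transformers):

```python
import math
import random


def sample_with_temperature(logits, temperature=0.7, rng=random.Random(0)):
    """Sample an index from logits after temperature scaling."""
    # Scale logits: temperature < 1 sharpens, > 1 flattens the distribution.
    scaled = [l / temperature for l in logits]
    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw from the resulting categorical distribution.
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

With a strongly peaked logit vector, low temperature makes the top token nearly certain, which is why 0.7 yields fairly conservative completions.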

Ready-to-go Interaction

git clone https://github.com/wangcongcong123/auto_coding.git
pip install -r requirements.txt
  1. Download the fine-tuned models; two versions are provided.
  2. Unzip the model and move it to ./model (create the directory first).
  3. Run the interactive script: python interact.py

Fine-tuning your own

git clone <this repository>
pip install -r requirements.txt
  1. Prepare the dataset.
  2. Start fine-tuning the model: python train.py --model_select distilgpt2
  3. After fine-tuning, the model is saved to ./model/distilgpt2_fine_tuned_coder/0_GPTSingleHead, which is exactly the fine-tuned version provided in Ready-to-go Interaction.

* For more training parameter options, run python train.py -h

Generative examples

Good Python generation examples by the fine-tuned GPT2-medium

-------------Example 1--------------------------------------
Context code: def factorial
Generated: 
def factorial(n):
    return 1 if n == 1 else n * factorial(n - 1)

--------------Example 2-------------------------------------
Context code: def sum_of_series(num_of_terms:int, power: int)
Generated:
def sum_of_series(num_of_terms:int, power: int) -> float:
    """Calculate Sum of Series.
    >>> sum_of_series(1, 1, 10)
    55.0
    >>> sum_of_series(1, 10, 100)
    49600.0
    """
    sum = (num_of_terms / 2) * power
    return sum

--------------Example 3-------------------------------------
Context code: def quick_sort(collection)
Generated:
def quick_sort(collection):
    """
    >>> quick_sort([0, 5, 3, 2, 2])
    [0, 2, 2, 3, 5]
    >>> quick_sort([])
    []
    >>> quick_sort([-2, -5, -45])
    [-45, -5, -2]
    """
    length = len(collection)
    if length <= 1:
        return collection
    else:
        # Use the last element as the first pivot
        pivot = collection.pop()
        # Put elements greater than pivot in greater list
        # Put elements lesser than pivot in lesser list
        greater, lesser = [], []
        for element in collection:
            if element > pivot:
                ...
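For comparison, here is a hand-completed version of the quick-sort pattern the model appears to be reproducing in Example 3 (the truncated tail is filled in by us, not by the model):

```python
def quick_sort(collection):
    """Recursive quick sort using the last element as the pivot."""
    if len(collection) <= 1:
        return collection
    # Use the last element as the pivot.
    pivot = collection.pop()
    # Partition the remaining elements around the pivot.
    greater = [e for e in collection if e > pivot]
    lesser = [e for e in collection if e <= pivot]
    return quick_sort(lesser) + [pivot] + quick_sort(greater)
```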

Good Java generation examples by the fine-tuned GPT2-medium

--------------Example 1-------------------------------------
Context code: Scanner input= new Scanner(System.in)
Generated:
Scanner input= new Scanner(System.in)
System.out.println("Enter the Octal number: ");
// Take octal number as input from user in a string
String oct = input.next();

--------------Example 2-------------------------------------
Context code: private static int CountCharacters(String str)
Generated:
private static int CountCharacters(String str) {
        return str.replaceAll("\\s", "").length();
}

* Although some generated examples look good, judgments of the model's actual performance should be taken with a grain of salt: the model may simply have memorized existing code from the training set.
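One cheap way to probe this memorization concern is a verbatim-overlap check of generated snippets against the training corpus. A naive sketch (the function and the corpus representation are hypothetical, not part of this repository):

```python
def appears_in_corpus(generated, corpus):
    """Return True if the generated snippet occurs verbatim
    (up to whitespace normalization) in any training document."""
    norm = " ".join(generated.split())
    return any(norm in " ".join(doc.split()) for doc in corpus)
```

A completion flagged by such a check tells us the model reproduced training data rather than generalized; fuzzier matching (e.g., token-level n-gram overlap) would catch near-copies as well.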

TODO list

  • Expand the dataset (and construct it more carefully) and increase the context window. Try larger generative models, such as GPT-2 large or even the recently proposed GPT-3 variants, if computational resources allow.
  • Remove overlap between training examples and dev examples for contamination studies, i.e., measure to what extent the model memorizes examples rigidly or relies on surface-level heuristics learned during training.
  • Try some adversarial examples (more complicated ones, to probe the model's reasoning capability) to test the robustness of the model.
  • Integrate this into a real-life use case such as a code editor (e.g., Sublime Text), where a threshold on the joint probability may need to be studied for code-snippet recommendations.
  • Try some ideas for location-aware code generation. For example, if a human coder is writing a comment, the auto-coder should be aware of the coder's context (left and right, if available) to help complete the corresponding content.
  • Address model size and inference efficiency, which are problems in real-life use cases.
  • Survey this problem domain to get a general idea of what work has been done in the literature on this particular problem.
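The joint-probability threshold mentioned in the editor-integration item could work roughly like this: sum the per-token log-probabilities of a candidate completion and only surface it when the total clears a tuned threshold. A hypothetical sketch (the function names and the threshold value are assumptions, not part of this project):

```python
import math


def joint_log_prob(token_probs):
    """Log probability of a completion = sum of per-token log probs."""
    return sum(math.log(p) for p in token_probs)


def should_recommend(token_probs, threshold=-5.0):
    """Only surface completions the model is sufficiently confident in.

    token_probs: the model's probability for each generated token.
    threshold:   tuned on held-out data; -5.0 is an arbitrary placeholder.
    """
    return joint_log_prob(token_probs) >= threshold
```

In practice the threshold would likely need length normalization (e.g., dividing by token count), since longer completions accumulate lower joint probabilities.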

Extra notes

  • Multi-GPU training only works with torch==1.4.0; it does not work with torch==1.5.0. No fix for this issue is known so far.
