GitHub Issue Labelling

Author(s): Anthony Ter-Saakov

Date Created: 10/28/2021

Date Updated: 10/28/2021

Tags: word embedding, NLP, github, issue tracking, text preprocessing

Abstract

Issue labels are a valuable resource in open-source issue-tracking systems. Engineers who specialize in a particular area use labels to filter for the issues that matter to them, and labels also improve organization and give code owners a sense of the direction their project is moving in. However, since anyone can open an issue in an open-source project, new issues commonly go unlabelled. We aim to remedy this problem with an automatic issue-labelling model, building on a substantial body of previous work in this direction.

Introduction

When researching this project, we found a 40-page paper on automated issue labelling in issue-tracking systems [1], which effectively serves as a survey of the field. The standard method goes as follows: collect a large dataset of labelled issues from GitHub and train a model to predict whether each issue is a bug, a feature request, or a question. This mirrors how a general machine learning problem is tackled: collect a large dataset and do with it what you can. There are clear drawbacks to this method, namely that it cannot apply multiple tags and cannot apply tags other than bug, feature, or question. The previous work restricted itself to those three labels because it wanted a large dataset, trading personalization for accuracy: these are the only tags that apply to essentially every repository and are likely used the same way across repositories, and the remaining labels had to be dropped because of class imbalance. To keep the number of data points per class equal, they simply discarded data points from the “bug” and “feature” labels until they matched the number of “question” data points. This left plenty of good data to train well-performing models on. The survey covers many of the standard NLP text-processing techniques, with SVMs the best performer among the “classical” models and fastText [2] the newcomer that outperformed all of them. This heavily motivated the techniques used in the github-labeler project.
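As a rough illustration of the balancing step described above, here is a minimal sketch (with hypothetical column names and toy data) of downsampling the larger classes to the size of the rarest class:

```python
import pandas as pd

# Hypothetical dataframe of labelled issues with columns "text" and "label",
# where "label" is one of "bug", "feature", or "question".
issues = pd.DataFrame({
    "text": ["crash on startup", "please add dark mode", "how do I configure X?"] * 100,
    "label": ["bug", "feature", "bug"] * 100,
})

# Downsample every class to the size of the rarest class so the
# resulting training set is perfectly balanced.
minority_size = issues["label"].value_counts().min()
balanced = (
    issues.groupby("label", group_keys=False)
    .apply(lambda group: group.sample(n=minority_size, random_state=42))
)
```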

One of the overarching goals of our GitHub labeler was to make it work on smaller, repository-specific issue datasets such as openshift/origin. A model built for this purpose should be able to predict more than just three labels, and it must be able to predict multiple labels for one issue. The methods from previous work in this field suited neither requirement. To handle both class imbalance and multi-label classification, we recast the problem as many binary classification problems, i.e. a separate YES-or-NO model for each label. Running an issue through each model yields a list of YESes that we use to tag the issue. While we can now create a model for each label, we can give no guarantee that every model will be any good, since some tags may have very few data points and be difficult for a model to predict. We used two different model types: fastText and support vector machine (SVM) classifiers. For each label, we train both, apply k-fold cross-validation, and save whichever performs better. We discuss the effectiveness and drawbacks of both models below.
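A minimal sketch of this per-label scheme, assuming hypothetical trainer functions standing in for the project's actual SVM and fastText wrappers: score each candidate with k-fold cross-validation, keep the better one, and at prediction time collect the YESes from every per-label model.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def pick_best_model(texts, y, candidate_trainers, k=5):
    """Score each candidate trainer with k-fold cross-validation and keep the best.

    candidate_trainers: dict mapping a name to a function (train_texts, train_y) -> model,
    where a model exposes .predict(list_of_texts) -> array of 0/1. These trainers are
    hypothetical stand-ins for the project's SVM and fastText wrappers.
    """
    scores = {}
    for name, train in candidate_trainers.items():
        fold_scores = []
        for train_idx, test_idx in StratifiedKFold(n_splits=k).split(texts, y):
            model = train([texts[i] for i in train_idx], y[train_idx])
            preds = model.predict([texts[i] for i in test_idx])
            fold_scores.append(np.mean(preds == y[test_idx]))
        scores[name] = np.mean(fold_scores)
    best = max(scores, key=scores.get)
    # Retrain the winning candidate on the full dataset before saving it.
    return candidate_trainers[best](texts, y)


def predict_labels(per_label_models, issue_text):
    """Run an issue through every per-label binary model and collect the YESes."""
    return [label for label, model in per_label_models.items()
            if model.predict([issue_text])[0] == 1]
```

Each label's target vector is simply a 0/1 indicator of whether that label appears on the issue, so building the full dictionary of per-label models is just a loop over the labels of interest.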

Data Collection

To collect data, we created a notebook that downloads all of the issue data for any repository or user via the GitHub API. For experimenting with different methods we extracted the openshift/origin issues, which proved to be a sizable dataset: many labels have several hundred data points, and the most popular label, kind/bug, has over 3000. One of the trickiest parts of applying NLP methods to GitHub issues is that they mix code with English prose (and the model only supports English). Because of this, some heavy preprocessing functions were used, and a separate vocabulary was built for each of the two models. In preprocessing, large code blocks were removed entirely and replaced by the token codeblock, and inline code was likewise replaced by codeinline. By the end, most of the messiness was dealt with, but the remaining noise was handled separately for each classifier.
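A rough sketch of this kind of preprocessing, assuming GitHub-flavoured markdown issue bodies; the regexes and the extra cleanup steps here are illustrative rather than the project's exact functions:

```python
import re


def preprocess_issue(body: str) -> str:
    """Replace markdown code with placeholder tokens and strip noisy characters."""
    # Fenced code blocks (``` ... ```) become a single placeholder token.
    body = re.sub(r"```.*?```", " codeblock ", body, flags=re.DOTALL)
    # Inline code (`...`) becomes its own placeholder token.
    body = re.sub(r"`[^`]+`", " codeinline ", body)
    # Drop URLs, which mostly add noise for a bag-of-words model.
    body = re.sub(r"https?://\S+", " ", body)
    # Keep letters only and lowercase everything.
    body = re.sub(r"[^A-Za-z\s]", " ", body)
    return re.sub(r"\s+", " ", body).strip().lower()


print(preprocess_issue("Fails with `KeyError`\n```py\nraise KeyError\n```"))
# -> "fails with codeinline codeblock"
```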

Models

We start with the support vector machine (SVM). A staple of data science, it is consistently one of the most effective classifiers, and we use it in a fairly standard way. The data first runs through a preprocessing notebook, where every issue in the dataset is preprocessed, split into individual words, the words are stemmed, and word counts are calculated. We set a cutoff so that only the most frequent stems, those that together account for the top 80% of all word occurrences, are added to the vocabulary; this effectively discards words mentioned so rarely that they are nothing but noise to an SVM. The vocabulary is then saved, and each issue is transformed into a tf-idf vector before being passed to a binary SVM classifier. This model works quite well and outperforms fastText on many labels. Although fastText is the more “powerful” model, the SVM is simpler and works better when the number of data points is low, which is the more common situation in our usage: if a handful of specific words makes an issue likely to carry a certain tag, the SVM will pick that up.
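A condensed sketch of that SVM path using scikit-learn and NLTK's Porter stemmer; the frequency-coverage cutoff is an interpretation of the 80% rule above, and the function names are illustrative rather than the project's actual ones:

```python
from collections import Counter

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

stemmer = PorterStemmer()


def tokenize(text):
    """Split preprocessed text into stemmed tokens."""
    return [stemmer.stem(tok) for tok in text.split()]


def build_vocabulary(texts, coverage=0.8):
    """Keep the most frequent stems that together cover `coverage` of all tokens."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    total = sum(counts.values())
    vocab, covered = [], 0
    for word, count in counts.most_common():
        if covered / total >= coverage:
            break
        vocab.append(word)
        covered += count
    return vocab


def train_label_svm(texts, y):
    """Fit a tf-idf vectorizer over the capped vocabulary and a binary SVM for one label."""
    vocab = build_vocabulary(texts)
    vectorizer = TfidfVectorizer(tokenizer=tokenize, vocabulary=vocab)
    X = vectorizer.fit_transform(texts)   # tf-idf vectors over the capped vocabulary
    clf = LinearSVC().fit(X, y)           # binary YES/NO SVM for one label
    return vectorizer, clf
```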

However, if understanding an issue's category requires some nuance and basic English comprehension, such as synonyms or sentence structure, the SVM is less powerful. This is where fastText usually outperforms it. fastText is a word embedding system that transforms every word into an n-dimensional vector, so a GitHub issue becomes nothing but a list of vectors. The fastText classifier is trained by SGD over the word embeddings and a hidden layer. In a good word embedding, words with similar meanings such as “bug”, “issue”, and “error” end up close together in the n-dimensional space. Thus, while an SVM sees “bug” and “issue” as two totally different words, a well-trained fastText model knows they are similar. Since each word is represented by an n-dimensional vector rather than a single dimension in a bag-of-words vector, fastText is more complex, which is why it needs more data points to perform well (and SGD also wants a lot of data!).
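For comparison, a minimal sketch of training a binary fastText classifier with the official Python bindings; the file path and hyperparameters are illustrative, not the project's actual settings:

```python
import fasttext

# fastText's supervised mode expects one example per line with __label__ prefixes,
# e.g.: "__label__yes codeblock fails with codeinline traceback ..."
model = fasttext.train_supervised(
    input="kind_bug_train.txt",  # hypothetical per-label training file
    dim=100,                     # embedding dimension
    epoch=25,
    lr=0.5,
    wordNgrams=2,
)

labels, probs = model.predict("panic when starting the api server")
print(labels, probs)  # e.g. ('__label__yes',) with a confidence score
```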

Pre-training

Another advantage of fastText is that it allows for pre-training: we can train a fastText model on other data before applying it to the classification problem, and this is exactly what we do to improve performance. We download a heavyweight fastText model with over a million words in 300 dimensions, pre-trained on Wikipedia and news text. We do some processing to exclude most of these words, then perform a PCA reduction to lower the dimensionality to around 170. This matters because we want to simplify this very large, powerful model before applying it to our low-data classification problem. To include vocabulary specific to GitHub issues, we pre-train on data from public issues across GitHub, which we extract using GHArchive. We collect issue data spanning 50 days, preprocess it, and perform a word count similar to the SVM preprocessing, keeping the words that make up the top 95% of all word occurrences. We add these newly found GitHub words to the fastText word vectors we downloaded and processed. Then we set up a model for unsupervised training on two years of GitHub issues. Unsupervised training works by predicting the words surrounding other words, which yields meaningful vector representations; the longer the model trains, the better these representations become. This is the pre-trained model that we then pass on to our classification problem. The high-level idea behind pre-training is that a lot of data helps the model “learn” English and a little GitHub issue lingo, after which predicting labels from a small number of data points becomes much easier.
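A sketch of the dimensionality-reduction step, assuming the downloaded vectors are in fastText's plain-text .vec format; the target dimension and file names are illustrative, and the real pipeline's exact processing may differ:

```python
import numpy as np
from sklearn.decomposition import PCA


def load_vec(path):
    """Read a fastText .vec file: first line is 'count dim', then 'word v1 v2 ...'."""
    words, vectors = [], []
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the header line
        for line in f:
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vectors.append(np.asarray(parts[1:], dtype=np.float32))
    return words, np.vstack(vectors)


words, vectors = load_vec("wiki-news-300d-1M.vec")       # 300-dimensional pretrained vectors
reduced = PCA(n_components=170).fit_transform(vectors)   # project down to ~170 dimensions

# Write the reduced vectors back out in the same .vec format.
with open("reduced-170d.vec", "w", encoding="utf-8") as out:
    out.write(f"{len(words)} {reduced.shape[1]}\n")
    for word, vec in zip(words, reduced):
        out.write(word + " " + " ".join(f"{x:.4f}" for x in vec) + "\n")
```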

Conclusion

Applying our methods to the openshift/origin repository, we get fairly good results: there are many labels for which we can make a binary prediction with 80% accuracy or better. Some of these labels have several hundred examples on the openshift issue board, while others have under 100. The SVM model almost always outperformed the fastText model, and no fastText model reached 80% accuracy. At the time of writing, we find that our pre-training methods do not consistently improve fastText performance, so pre-training is not currently part of the main pipeline. The github-labeler project is an open-source repository that welcomes all contributors and can be found at https://github.com/aicoe-aiops/github-labeler. Any questions or concerns can be posted on the issue board or directed to atersaak@redhat.com. Thanks for reading!

References

[1] https://arxiv.org/abs/2003.05357

[2] https://fasttext.cc