MAST

Machine Learning for the Analysis of Source Code Text



Who we are

We are a group of researchers at Edinburgh University's School of Informatics interested in Machine learning techniques for the Analysis of Source code Text (MAST). To get an idea of our research interests, take a look at our reading group.

What we do

As part of our research, we strive to create useful tools and resources that use machine learning to help software developers. In particular, we have developed the following tools and resources:

Naturalize is a language-agnostic framework that learns coding conventions from a codebase and then exploits this information to suggest better identifier names and formatting changes in the code.
Website: http://groups.inf.ed.ac.uk/naturalize/
Code: https://github.com/mast-group/naturalize
Paper: Learning natural coding conventions.

A convolutional attention neural network that learns to summarize source code into a short, method-name-like summary using only the source code tokens.
Code: https://github.com/mast-group/convolutional-attention
Paper: A Convolutional Attention Network for Extreme Summarization of Source Code.

IIM is a novel algorithm that mines the itemsets that are most interesting under a probabilistic model of transactions. Our model is able to efficiently infer interesting itemsets directly from the transaction database.
Code: https://github.com/mast-group/itemset-mining
Paper: A Bayesian Network Model for Interesting Itemsets.

ISM is a novel algorithm that mines the most interesting sequences under a probabilistic model. It is able to efficiently infer interesting sequences directly from the database.
Code: https://github.com/mast-group/sequence-mining
Paper: A Subsequence Interleaving Model for Sequential Pattern Mining.

PAM is a near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences.
Code: https://github.com/mast-group/api-mining
Paper: Parameter-Free Probabilistic API Mining across GitHub.

TASSAL is a tool for the automatic summarization of source code using autofolding. Autofolding automatically creates a summary of a source code file by folding non-essential code and comment blocks.
Demo: http://code-summarizer.herokuapp.com
Code: https://github.com/mast-group/tassal
Paper: Autofolding for Source Code Summarization.

The GitHub Java corpus is a set of Java projects collected from GitHub that we have used in a number of our publications. The corpus consists of 14,785 projects and 352,312,696 LOC.
Website: http://groups.inf.ed.ac.uk/cup/javaGithub/
Paper: Mining Source Code Repositories at Massive Scale using Language Modeling.

Group Members

  • Miltos Allamanis (@mallamanis) is a PhD student and a Microsoft Research PhD Scholar at the University of Edinburgh under the supervision of Charles Sutton. He is pursuing a PhD on Statistical Natural Language Processing for Programming Language Text.

  • Pankajan Chanthirasegaran (@pankajan) is a Research Assistant Developer at the University of Edinburgh. He has industry experience in Big Data Analytic Programming and hands-on research experience in the application of Natural Language Processing to Source Code.

  • Rafael-Michael Karampatsis (@mpatsis) is a first-year PhD student in the Centre for Doctoral Training in Data Science at the University of Edinburgh under the supervision of Charles Sutton. He is pursuing a PhD on Statistical Natural Language Processing for Programming Language Text.

  • Jaroslav Fowkes (@jfowkes) is a Postdoc at the University of Edinburgh and a member of the machine learning group. His research focuses on developing natural language processing techniques for the analysis of program source code text, as well as novel statistical methods for exploratory data analysis.

  • Charles Sutton (@casutton) is a Reader (= US Associate Professor) at the University of Edinburgh and a member of the machine learning group. His research aims at new statistical machine learning methods designed to handle data about the operation and performance of large-scale computer systems, with the ultimate goal of improving techniques for developing, managing, and debugging computer systems.