MAST

Who we are

We are a group of researchers at Edinburgh University's School of Informatics interested in Machine learning techniques for the Analysis of Source code Text (MAST). To get an idea of our research interests, take a look at our reading group.

What we do

As part of our research we strive to create useful tools and resources that leverage machine learning to aid source code developers. In particular, we have developed the following tools/resources:

Naturalize

Naturalize is a language-agnostic framework that learns coding conventions from a codebase and then exploits this information to suggest better identifier names and formatting changes in the code.
Website: http://groups.inf.ed.ac.uk/naturalize/
Code: https://github.com/mast-group/naturalize
Paper: Learning natural coding conventions (FSE 2014, ACM Distinguished Paper Award).

Extreme Source Code Summarization

A convolutional attention neural network that learns to summarize source code into a short, method-name-like summary using only the source code tokens.
Website: http://groups.inf.ed.ac.uk/cup/codeattention
Code: https://github.com/mast-group/convolutional-attention
Paper: A Convolutional Attention Network for Extreme Summarization of Source Code (ICML 2016).

Probabilistic API Miner

PAM is a near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences.
Code: https://github.com/mast-group/api-mining
Paper: Parameter-Free Probabilistic API Mining across GitHub (FSE 2016).

Interesting Itemset Miner

IIM is a novel algorithm that mines the itemsets that are most interesting under a probabilistic model of transactions. Our model is able to efficiently infer interesting itemsets directly from the transaction database.
Code: https://github.com/mast-group/itemset-mining
Paper: A Bayesian Network Model for Interesting Itemsets (ECML/PKDD 2016).

Interesting Sequence Miner

ISM is a novel algorithm that mines the most interesting sequences under a probabilistic model. It is able to efficiently infer interesting sequences directly from the database.
Code: https://github.com/mast-group/sequence-mining
Paper: A Subsequence Interleaving Model for Sequential Pattern Mining (KDD 2016).

TASSAL

TASSAL is a tool for the automatic summarization of source code using autofolding. Autofolding automatically creates a summary of a source code file by folding non-essential code and comment blocks.
Demo: http://code-summarizer.herokuapp.com (ICSE 2016 demo).
Code: https://github.com/mast-group/tassal
Paper: Autofolding for Source Code Summarization.

GitHub Java Corpus

The GitHub Java corpus is a set of Java projects collected from GitHub that we have used in a number of our publications. The corpus consists of 14,785 projects and 352,312,696 LOC.
Website: http://groups.inf.ed.ac.uk/cup/javaGithub/
Paper: Mining Source Code Repositories at Massive Scale using Language Modeling (MSR 2013).

Group Members

Rafael-Michael Karampatsis (@mpatsis) is a first-year PhD student in the Centre for Doctoral Training in Data Science at the University of Edinburgh, under the supervision of Charles Sutton. He is pursuing a PhD on Statistical Natural Language Processing for Programming Language Text.
Maria Gorinova (@mgorinova) is a postgraduate student in the EPSRC CDT in Data Science at the University of Edinburgh, working with Andrew Gordon and Charles Sutton.
Irene Vlassi-Pandi (@irenevp) is a PhD student at the University of Edinburgh and a Microsoft Research PhD Scholar, working on machine learning and program analysis.
Annie Louis is a research associate at the University of Edinburgh. Her research is in natural language processing and machine learning techniques, especially for discourse- and pragmatic-level language understanding and generation. She completed her PhD at the University of Pennsylvania and was previously a Newton International Fellow at Edinburgh.
Charles Sutton (@casutton) is a Reader (equivalent to a US Associate Professor) at the University of Edinburgh and a member of the machine learning group. His research aims to develop new statistical machine learning methods for handling data about the operation and performance of large-scale computer systems, with the ultimate goal of improving techniques for developing, managing, and debugging computer systems.

Group Alumni

Miltos Allamanis (@mallamanis) completed his PhD as a Microsoft Research PhD Scholar at the University of Edinburgh under the supervision of Charles Sutton. He is now a postdoctoral researcher at Microsoft Research.
Jaroslav Fowkes (@jfowkes), former postdoctoral researcher, now a researcher at the University of Oxford.
Pankajan Chanthirasegaran (@pankajan), former Research Assistant Developer.
Hao Peng, summer intern from Peking University, now a PhD student at the University of Washington.