Machine Learning for the Analysis of Source Code Text

Who we are

We are a group of researchers at Edinburgh University's School of Informatics interested in Machine learning techniques for the Analysis of Source code Text (MAST). To get an idea of our research interests, take a look at our reading group.

What we do

As part of our research we strive to create useful tools and resources that leverage machine learning to aid source code developers. In particular, we have developed the following tools/resources:

Naturalize is a language-agnostic framework that learns coding conventions from a codebase and then exploits this information to suggest better identifier names and formatting changes in the code.
Paper: Learning natural coding conventions (FSE 2014, ACM Distinguished Paper Award).

A convolutional attention neural network that learns to summarize source code into a short, method-name-like summary by looking only at the source code tokens.
Paper: A Convolutional Attention Network for Extreme Summarization of Source Code (ICML 2016).

PAM is a near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences.
Paper: Parameter-Free Probabilistic API Mining across GitHub (FSE 2016).

IIM is a novel algorithm that mines the itemsets that are most interesting under a probabilistic model of transactions. Our model is able to efficiently infer interesting itemsets directly from the transaction database.
Paper: A Bayesian Network Model for Interesting Itemsets (ECML/PKDD 2016).

ISM is a novel algorithm that mines the most interesting sequences under a probabilistic model. It is able to efficiently infer interesting sequences directly from the database.
Paper: A Subsequence Interleaving Model for Sequential Pattern Mining (KDD 2016).

TASSAL is a tool for the automatic summarization of source code using autofolding. Autofolding automatically creates a summary of a source code file by folding non-essential code and comment blocks.
Demo: (ICSE 2016 demo).
Paper: Autofolding for Source Code Summarization.

The GitHub Java corpus is a set of Java projects collected from GitHub that we have used in a number of our publications. The corpus consists of 14,785 projects and 352,312,696 LOC.
Paper: Mining Source Code Repositories at Massive Scale using Language Modeling (MSR 2013).

Group Members

  • Rafael-Michael Karampatsis (@mpatsis) is a first-year PhD student in the Centre for Doctoral Training in Data Science at the University of Edinburgh, under the supervision of Charles Sutton. He is pursuing a PhD on Statistical Natural Language Processing for Programming Language Text.

  • Maria Gorinova (@mgorinova) is a postgraduate student in the EPSRC CDT in Data Science at the University of Edinburgh, working with Andrew Gordon and Charles Sutton.

  • Irene Vlassi-Pandi (@irenevp) is a PhD student at the University of Edinburgh and a Microsoft Research PhD Scholar, working on machine learning and program analysis.

  • Charles Sutton (@casutton) is a Reader (= US Associate Professor) at the University of Edinburgh and a member of the machine learning group. His research aims at new statistical machine learning methods designed to handle data about the operation and performance of large-scale computer systems, with the ultimate goal of improving techniques for developing, managing, and debugging computer systems.

Group Alumni

  • Miltos Allamanis (@mallamanis) completed his PhD as a Microsoft Research PhD Scholar at the University of Edinburgh under the supervision of Charles Sutton. He is now a postdoctoral research scientist at Microsoft Research.

  • Jaroslav Fowkes (@jfowkes) is a former postdoctoral researcher, now a researcher at the University of Oxford.

  • Pankajan Chanthirasegaran (@pankajan) is a former Research Assistant Developer.

  • Hao Peng was a summer intern from Peking University and is now a PhD student at the University of Washington.