We are a group of researchers at Edinburgh University's School of Informatics interested in Machine learning techniques for the Analysis of Source code Text (MAST). To get an idea of our research interests, take a look at our reading group.
As part of our research we strive to create useful tools and resources that leverage machine learning to aid source code developers. In particular, we have developed the following tools/resources:
Naturalize is a language-agnostic framework that learns coding conventions from a codebase and then exploits this information to suggest better identifier names and formatting changes in the code.
Paper: Learning natural coding conventions.
A convolutional attention neural network that learns to summarize source code into a short method name-like summary by just looking at the source code tokens.
Paper: A Convolutional Attention Network for Extreme Summarization of Source Code.
IIM is a novel algorithm that mines the itemsets that are most interesting under a probabilistic model of transactions. Our model is able to efficiently infer interesting itemsets directly from the transaction database.
Paper: A Bayesian Network Model for Interesting Itemsets.
ISM is a novel algorithm that mines the most interesting sequences under a probabilistic model. It is able to efficiently infer interesting sequences directly from the database.
Paper: A Subsequence Interleaving Model for Sequential Pattern Mining.
PAM is a near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences.
Paper: Parameter-Free Probabilistic API Mining across GitHub.
TASSAL is a tool for the automatic summarization of source code using autofolding. Autofolding automatically creates a summary of a source code file by folding non-essential code and comment blocks.
Paper: Autofolding for Source Code Summarization.
The GitHub Java corpus is a set of Java projects collected from GitHub that we have used in a number of our publications. The corpus consists of 14,785 projects and 352,312,696 LOC.
Paper: Mining Source Code Repositories at Massive Scale using Language Modeling.
Miltos Allamanis (@mallamanis) is a PhD student and a Microsoft Research PhD Scholar at the University of Edinburgh under the supervision of Charles Sutton. He is pursuing a PhD on Statistical Natural Language Processing for Programming Language Text.
Pankajan Chanthirasegaran (@pankajan) is a Research Assistant Developer at the University of Edinburgh. He has industry experience in Big Data Analytic Programming and hands-on research experience in the application of Natural Language Processing to Source Code.
Rafael-Michael Karampatsis (@mpatsis) is a first-year PhD student in the Centre for Doctoral Training in Data Science at the University of Edinburgh under the supervision of Charles Sutton. He is pursuing a PhD on Statistical Natural Language Processing for Programming Language Text.
Jaroslav Fowkes (@jfowkes) is a Postdoc at the University of Edinburgh and a member of the machine learning group. His research focuses on developing natural language processing techniques for the analysis of program source code text, as well as novel statistical methods for exploratory data analysis.
Charles Sutton (@casutton) is a Reader (= US Associate Professor) at the University of Edinburgh and a member of the machine learning group. His research aims to develop new statistical machine learning methods designed to handle data about the operation and performance of large-scale computer systems, with the ultimate goal of improving techniques for developing, managing, and debugging computer systems.