Machine Learning on Go Code

This blog post is the written form of my recent talk at GopherCon 2018, Machine Learning on Go Code, which you can now watch on YouTube.

I had definitely heard that codebases keep growing, but I didn’t realize by how much until recently, when I found this article (Infographic: How Many Millions of Lines of Code Does It Take?), which shows the number of lines of code in popular pieces of software and how that number has evolved over time.

I invite you to review the amazing graphics in that article now, but in case you’d rather continue here, the most impressive facts are that Windows NT 3.1 already had 4 to 5 million lines of code, the latest version of Chrome has 18 million, and a Ford pickup has 150 million.

In comparison, our transportation tools have evolved greatly, to the point that our cars now understand their surroundings and are able to warn us in dangerous situations, or even take action to avoid imminent accidents.

Machine Learning on Source Code (aka ML on Code) is Machine Learning applied to source code: rather than taking images, videos, or natural text as input, we feed source code to our models and train them to predict interesting characteristics of codebases.

This is why we decided to create a public dataset named Public Git Archive which, based on GH Archive, downloads the contents of every repository with 50 GitHub stars or more and makes them available in a convenient format. The dataset contains over 4TB of source code in hundreds of programming languages.

For these tasks some open source tools are available, some of them by source{d}. To make many of these tools available in an easy, unified way, and to provide a simple way to analyze source code repositories, we’re working on the source{d} engine, a simple server that provides a SQL interface to your git repositories.

Source code can be viewed at (at least) four different levels of abstraction. The higher levels of abstraction provide more information to the model, giving it a better chance of predicting advanced concepts.

But there’s a trade-off: for instance, an analysis by token will never predict new identifiers, since only those we have already seen can be predicted, while analyzing code as a sequence of characters can produce brand new identifiers never seen before.

Lastly, with code2vec we could identify when the name of a function is not adequate, therefore minimizing the possibility of mistakes in later code using these functions where their stated intention (through their name) and the actual implementation do not match.

source{d} lookout

Integrating all of these analyzers powered by Machine Learning with existing linters that use more traditional techniques, we’ve developed source{d} lookout: a GitHub bot that will review your PRs and help you identify possible mistakes with higher accuracy.

This is what we call assisted code review, and it’s just the beginning of the many use cases we believe can benefit from ML on Code. In the future we’d also like to predict bugs, enforce style guides automatically, and maybe one day we will even be able to generate code automatically from unit tests, specifications, or even natural language descriptions.

In the same way that architects did not disappear when CAD tools came to be, developers will simply become more efficient, and we hope this will empower them to create even better software that will, in turn, improve how the rest of society performs its own tasks.
