AI News, Source Code Classification Using Deep Learning

Programming languages are the primary tool of the software development industry.

Since the 1940s, hundreds of them have been created, and a huge number of new lines of code in diverse programming languages are written and pushed to active repositories every day.

We believe that a source code classifier that can identify the programming language a piece of code is written in would be a very useful tool for automatic syntax highlighting and label suggestion on platforms such as StackOverflow and technical wikis.

This inspired us to train a model for classifying code snippets based on their language, leveraging recent AI techniques for text classification.

Before training our model, we had to process the raw data to remove or mitigate some unwanted characteristics of code found in the wild.

Thousands of repositories were inspected, but those larger than 100 MB were ignored to avoid spending too much time on downloading and preprocessing.
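The size cut-off described above can be sketched as a simple filter. This is a minimal illustration only: the `Repo` record and its `size_mb` field are hypothetical, since the article does not describe how repository metadata was stored.

```python
from dataclasses import dataclass

MAX_REPO_SIZE_MB = 100  # threshold mentioned in the text


@dataclass
class Repo:
    # Hypothetical record; the crawler's real metadata format is not given.
    name: str
    size_mb: float


def select_repos(repos):
    """Keep repositories whose size does not exceed the threshold."""
    return [r for r in repos if r.size_mb <= MAX_REPO_SIZE_MB]


candidates = [
    Repo("small-lib", 12.5),
    Repo("huge-monorepo", 2048.0),
    Repo("edge", 100.0),
]
kept = select_repos(candidates)
print([r.name for r in kept])  # ['small-lib', 'edge']
```

Filtering on metadata before downloading is what saves time here, because the oversized repositories are never fetched at all.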

Figure: Crawled files.

Looking carefully at the raw data, we find some challenging behaviours and characteristics, which is not surprising given that the data is pulled from arbitrary real-world repositories.

When a single source code file mixes several languages, we would like to keep only the snippets that belong to the primary language of the file (inferred from its extension) and strip everything else.
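Inferring the primary language from the file extension could look roughly like the sketch below. The mapping is a hypothetical, deliberately small example, because the article does not list the languages or extensions it covered. Stripping the embedded foreign snippets is a separate step that is not shown here.

```python
from pathlib import Path

# Hypothetical mapping for illustration; the real list of languages
# and extensions used by the authors is not given in the text.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript",
    ".c": "c",
    ".java": "java",
}


def primary_language(filename):
    """Return the language implied by the file extension, or None if unknown."""
    return EXTENSION_TO_LANGUAGE.get(Path(filename).suffix.lower())


print(primary_language("src/app.JS"))   # javascript
print(primary_language("Makefile"))     # None
```

Files whose extension is unknown return `None`, so a caller can skip them rather than guess a label.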

Figure: JavaScript snippet with “hidden” C code embedded.

After the preprocessing step, which also includes escaping newline and tab characters, we need to tokenize all of our text.
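The escaping and tokenization steps might be sketched as follows. The article does not specify its tokenization scheme, so the regular expression here is an assumption: it treats escaped newlines and tabs, identifiers, integers, and single punctuation characters as separate tokens.

```python
import re


def escape_whitespace(code):
    """Replace newline and tab characters with visible escape sequences."""
    return code.replace("\n", " \\n ").replace("\t", " \\t ")


# Assumed token classes: escaped newline/tab, identifier, integer,
# or any other single non-space character.
TOKEN_PATTERN = re.compile(r"\\[nt]|[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    """Escape whitespace, then split the code into a flat list of tokens."""
    return TOKEN_PATTERN.findall(escape_whitespace(code))


tokens = tokenize("if (x) {\n\treturn 1;\n}")
print(tokens)
```

Keeping newlines and tabs as explicit tokens preserves layout information, which can be a strong signal for indentation-sensitive languages. Note that this simple version cannot tell a real newline from a literal backslash-n that already appears inside a string in the source.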

Our model uses a word embedding layer followed by a convolutional layer with multiple filters, a max-pooling layer and finally a softmax layer (Figure 3).
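To make the data flow through these layers concrete, here is a small forward pass written in plain Python. It is a sketch only, not the authors' implementation: the layer sizes, the filter widths, the ReLU activation, and the random weights are all assumptions, and no training is performed.

```python
import math
import random

random.seed(0)

# Illustrative sizes, not taken from the article.
VOCAB_SIZE, EMBED_DIM, NUM_CLASSES = 50, 8, 3
FILTER_WIDTHS = (2, 3, 4)  # assumed; one filter per width for brevity


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


embedding = rand_matrix(VOCAB_SIZE, EMBED_DIM)
# Each filter spans `width` consecutive token embeddings, flattened to one vector.
filters = {
    w: [random.uniform(-0.1, 0.1) for _ in range(w * EMBED_DIM)]
    for w in FILTER_WIDTHS
}
output_weights = rand_matrix(NUM_CLASSES, len(FILTER_WIDTHS))


def forward(token_ids):
    """Embedding -> 1D convolution (ReLU) -> max-over-time pooling -> softmax.

    Expects at least max(FILTER_WIDTHS) token ids.
    """
    vectors = [embedding[t] for t in token_ids]
    pooled = []
    for width, weights in filters.items():
        activations = []
        for start in range(len(vectors) - width + 1):
            window = [x for vec in vectors[start:start + width] for x in vec]
            score = sum(a * b for a, b in zip(window, weights))
            activations.append(max(0.0, score))  # ReLU (assumed)
        pooled.append(max(activations))  # max over all positions
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in output_weights]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = forward([3, 17, 42, 8, 5, 29])
print(probs)
```

Max-over-time pooling reduces each filter's output to a single number, so snippets of different lengths all produce a fixed-size vector for the softmax layer. In practice a deep learning framework would be used for both the layers and the training loop.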

Figure 3: Convolutional Neural Network model (figure based on [2]).

We tested the model on a 10% split of the data and calculated the accuracy, precision, recall and F1-score for each label.
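For reference, the per-label metrics mentioned above can be computed as in the following sketch. The labels and predictions are made-up toy data, not results from the article.

```python
from collections import Counter


def per_label_metrics(y_true, y_pred):
    """Compute precision, recall and F1-score for every label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data for illustration only.
y_true = ["python", "python", "c", "c", "java"]
y_pred = ["python", "c", "c", "c", "java"]

report = per_label_metrics(y_true, y_pred)
overall = accuracy(y_true, y_pred)
print(overall)
print(report["python"])
```

In this toy example one Python snippet is mislabelled as C, which lowers the recall for `python` and the precision for `c` while leaving `java` unaffected.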

Also, versioned data for each programming language could be obtained to make it possible to assign a specific version to a source code snippet.
