New open-source Machine Learning Framework written in Java

I am happy to announce that the Datumbox Machine Learning Framework is now open-sourced under GPL 3.0 and you can download its code from GitHub!

The main focus of the framework is to include a large number of machine learning algorithms &

Even though the framework aims to assist the development of models from various fields, it also provides tools that are particularly useful in Natural Language Processing and Text Analysis applications.

The Statistics layer provides classes for calculating descriptive statistics, performing various types of sampling, estimating CDFs and PDFs from commonly used probability distributions and performing over 35 parametric and non-parametric tests.

Classes of this kind are usually needed for exploratory data analysis, sampling and feature selection.
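To make the kind of functionality described above concrete, here is a minimal sketch in plain Java of a few descriptive statistics and of the PDF and CDF of one common distribution (the exponential). The class and method names are hypothetical and chosen only for illustration; they are not the Datumbox API, whose actual classes and signatures should be looked up in the framework's own code.

```java
import java.util.Arrays;

// Plain-Java illustration only: these helper names are hypothetical
// and do NOT correspond to classes in the Datumbox Framework.
class DescriptiveStatsSketch {

    // Arithmetic mean of the sample.
    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    // Unbiased sample variance (divides by n - 1); needs at least two values.
    static double variance(double[] x) {
        double m = mean(x);
        double sumSq = 0.0;
        for (double v : x) {
            sumSq += (v - m) * (v - m);
        }
        return sumSq / (x.length - 1);
    }

    // Median: the middle value of the sorted sample, or the average
    // of the two middle values when the sample size is even.
    static double median(double[] x) {
        double[] sorted = x.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    // Density of the exponential distribution with rate lambda > 0.
    static double exponentialPdf(double x, double lambda) {
        return x < 0 ? 0.0 : lambda * Math.exp(-lambda * x);
    }

    // Cumulative probability P(X <= x) for the same distribution.
    static double exponentialCdf(double x, double lambda) {
        return x < 0 ? 0.0 : 1.0 - Math.exp(-lambda * x);
    }

    public static void main(String[] args) {
        double[] sample = {2.0, 4.0, 4.0, 5.0, 7.0, 9.0};
        System.out.println("mean     = " + mean(sample));
        System.out.println("variance = " + variance(sample));
        System.out.println("median   = " + median(sample));
        System.out.println("exp pdf  = " + exponentialPdf(1.0, 0.5));
        System.out.println("exp cdf  = " + exponentialCdf(1.0, 0.5));
    }
}
```

A statistics layer in a real framework would typically add input validation, more distributions and the hypothesis tests mentioned above; the sketch only shows the shape of such utilities.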

This means that it can be incorporated more easily into production code, tweaked more easily to reduce memory consumption, and used in real-time systems.

Finally, even though the Datumbox Framework can currently handle medium-sized datasets, I plan to extend it to handle large datasets.

In August 2013 I decided to start Datumbox as a personal project and develop a framework that provides the tools for building machine learning models, with a focus on NLP and Text Classification.

My goal was to build a framework that could be reused in the future to quickly develop machine learning models, to incorporate them into projects that require machine learning components, or to offer them as a service (Machine Learning as a Service).

As with every piece of software (and especially open-source projects in alpha version), the Datumbox Machine Learning Framework comes with its own unique and adorable limitations.

Finally, I would like to thank my love Kyriaki for tolerating me while writing this project, my friend and super-ninja-Java-developer Eleftherios Bampaletakis for helping out with important Java issues, and you for getting involved in the project.


To use it, add the framework as a dependency in your pom.xml. The latest snapshot version of the framework is 0.8.2-SNAPSHOT (Build 20180410).

To test the snapshot, update your pom.xml to reference that version. The develop branch is the development branch (the default GitHub branch), while the master branch contains the latest stable version of the framework.

In addition, it provides several implemented algorithms, including Max Entropy, Naive Bayes, SVM, Bootstrap Aggregating, AdaBoost, K-means, Hierarchical Clustering, Dirichlet Process Mixture Models, Softmax Regression, Ordinal Regression, Linear Regression, Stepwise Regression, PCA and several other techniques that can be used for feature selection, ensemble learning, linear programming and recommender systems.

Other important enhancements include improving the documentation, the test coverage and the examples, improving the architecture of the framework and supporting more Machine Learning and Statistical Models.

If you make any useful changes to the code, please consider contributing them by sending a pull request.
