Tom M. Mitchell, renowned computer scientist and professor at Carnegie Mellon University, USA, defined machine learning as: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Machine learning tasks are broadly classified into three categories, depending on the nature of the learning ‘signal’ or ‘feedback’ available to a learning system.
- Supervised learning: This is the machine learning task of inferring a function from labelled training data. In supervised learning, each example is a pair consisting of an input object (vector) and a desired output value (supervisory signal).
- Unsupervised learning: This is the machine learning task of inferring a function to describe hidden structures from unlabelled data. It is closely related to the problem of density estimation in statistics.
- Reinforcement learning: This is the area of machine learning concerned with how software agents should take actions in an environment so as to maximise some notion of cumulative reward. It is also studied in disciplines such as game theory, information theory, swarm intelligence, statistics and genetic algorithms. In machine learning, the environment is typically formulated as a Markov decision process (MDP), because many reinforcement learning algorithms rely on dynamic programming techniques.
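To make the link between an MDP and dynamic programming concrete, here is a minimal sketch of value iteration in plain Python. The two-state MDP (its states, actions, probabilities and rewards) is invented purely for illustration, and the names `TRANSITIONS`, `q_value` and `value_iteration` are hypothetical.

```python
# Hypothetical two-state MDP, invented for illustration only.
# TRANSITIONS[state][action] = list of (probability, next_state, reward)
TRANSITIONS = {
    "low": {
        "wait": [(1.0, "low", 0.0)],
        "charge": [(1.0, "high", -1.0)],
    },
    "high": {
        "wait": [(1.0, "high", 0.0)],
        "work": [(0.8, "high", 2.0), (0.2, "low", 2.0)],
    },
}


def q_value(outcomes, values, gamma):
    """Expected reward plus discounted value of the next state for one action."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)


def value_iteration(transitions, gamma=0.9, tol=1e-8):
    """Repeat the Bellman optimality update until the values stop changing."""
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            best = max(q_value(o, values, gamma) for o in actions.values())
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    # The greedy policy picks, in each state, the action with the highest value.
    policy = {
        s: max(actions, key=lambda a: q_value(actions[a], values, gamma))
        for s, actions in transitions.items()
    }
    return values, policy


values, policy = value_iteration(TRANSITIONS)
print(policy)
```

The agent never sees a labelled "correct" action here. The policy falls out of the rewards alone, which is what distinguishes this setting from supervised learning.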
There is a wide range of open source machine learning frameworks available, which enable machine learning engineers to build, implement and maintain machine learning systems, and to start new projects.
Let’s take a look at some of the top open source machine learning frameworks available.
The Singa Project was initiated by the DB System Group at the National University of Singapore in 2014, with a primary focus on distributed deep learning by partitioning the model and data onto nodes in a cluster and parallelising the training. Apache Singa provides a simple programming model and works across a cluster of machines. It is primarily used in natural language processing (NLP) and image recognition. The Singa prototype, accepted by the Apache Incubator in March 2015, provides a flexible architecture for scalable distributed training and is extensible to run on a wide range of hardware.
Apache Singa was designed with an intuitive programming model based on layer abstraction. A wide variety of popular deep learning models are supported, such as feed-forward models like convolutional neural networks (CNN), energy models like Restricted Boltzmann Machine (RBM), and recurrent neural networks (RNN). Based on a flexible architecture, Singa runs various synchronous, asynchronous and hybrid training frameworks.
Singa’s software stack has three main components: Core, IO and Model. The Core component is concerned with memory management and tensor operations. IO contains classes for reading and writing data to the disk and the network. Model includes data structures and algorithms for machine learning models.
- Includes tensor abstraction for strong support for more advanced machine learning models
- Supports device abstraction for running on varied hardware devices
- Makes use of CMake for compilation rather than GNU Autotools
- Improved Python binding, and includes more deep learning models like VGG and ResNet
- Includes enhanced IO classes for reading, writing, encoding and decoding files and data
Website: http://singa.apache.org/en/index.html
Shogun was initiated by Soeren Sonnenburg and Gunnar Raetsch in 1999 and is currently under rapid development by a large team of programmers. This free and open source toolbox written in C++ provides algorithms and data structures for machine learning problems. Shogun can be used via a unified interface from C++, Python, Octave, R, Java, Lua and C#, and can run on Windows, Linux and even macOS. Shogun is designed for unified large-scale learning for a broad range of feature types and learning settings, like classification, regression, dimensionality reduction, clustering, etc. It contains a number of state-of-the-art algorithms, such as a wealth of efficient SVM implementations, multiple kernel learning, kernel hypothesis testing, Krylov methods, etc.
Shogun supports bindings to other machine learning libraries like LibSVM, LibLinear, SVMLight, LibOCAS, libqp, VowpalWabbit, Tapkee, SLEP, GPML and many more.
Its features include one-class classification, multi-class classification, regression, structured output learning, pre-processing, built-in model selection strategies, visualisation and test frameworks; and semi-supervised, multi-task and large scale learning.
The latest version is 4.1.0.
Website: http://www.shogun-toolbox.org/
Apache Mahout is a free and open source project of the Apache Software Foundation. Its goal is to develop free, distributed or scalable machine learning algorithms for diverse areas like collaborative filtering, clustering and classification. Mahout provides Java libraries and Java collections for various kinds of mathematical operations.
Apache Mahout is implemented on top of Apache Hadoop using the MapReduce paradigm. Once Big Data is stored on the Hadoop Distributed File System (HDFS), Mahout provides the data science tools to automatically find meaningful patterns in these Big Data sets, turning this into ‘big information’ quickly and easily.
- Building a recommendation engine: Mahout provides tools for building a recommendation engine via the Taste library, a fast and flexible engine for collaborative filtering (CF).
- Clustering with Mahout: Several clustering algorithms are supported by Mahout, like Canopy, k-Means, Mean-Shift, Dirichlet, etc.
- Categorising content with Mahout: Mahout uses the simple Map-Reduce-enabled naïve Bayes classifier.
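As an illustration of what a clustering algorithm such as k-means does, here is a minimal single-machine sketch in plain Python. It is not Mahout code and does not use Mahout's API. The function name `kmeans`, the sample points and the choice of the first k points as starting centroids are all assumptions made for the example.

```python
import math


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm. Centroids start at the first k points (an
    assumption that keeps the example deterministic)."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clustering has converged
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment


points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.5)]
centroids, assignment = kmeans(points, k=2)
print(assignment)
```

The two steps map naturally onto the MapReduce paradigm: assigning points to centroids can be done independently per point (map), while recomputing each centroid aggregates over its members (reduce).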
The latest version is 0.12.2.
Website: https://mahout.apache.org/
Apache Spark MLlib is a machine learning library, the primary objective of which is to make practical machine learning scalable and easy. It comprises common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction as well as lower-level optimisation primitives and higher-level pipeline APIs.
Spark MLlib is a distributed machine learning framework on top of Spark Core which, mainly due to the distributed memory-based Spark architecture, is reported to be almost nine times as fast as the disk-based implementation used by Apache Mahout.
The various common machine learning and statistical algorithms that have been implemented and included with MLlib are:
- Summary statistics, correlations, hypothesis testing, random data generation
- Classification and regression: Support vector machines, logistic regression, linear regression, naïve Bayes classification
- Collaborative filtering techniques including Alternating Least Squares (ALS)
- Cluster analysis methods including k-means and Latent Dirichlet Allocation (LDA)
- Optimisation algorithms such as stochastic gradient descent and limited-memory BFGS (L-BFGS)
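To show what an optimisation primitive like stochastic gradient descent does, here is a minimal single-machine sketch that fits a line to a handful of points. It is plain Python rather than MLlib code. The function name `sgd_linear_regression`, the learning rate, the epoch count and the data are assumptions chosen for the example.

```python
import random


def sgd_linear_regression(xs, ys, lr=0.05, epochs=500, seed=0):
    """Fit y = w * x + b by updating on one example at a time."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a random order
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= lr * error * xs[i]
            b -= lr * error
    return w, b


# Noise-free data generated from y = 2x + 1.
w, b = sgd_linear_regression([0.0, 1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 5.0, 7.0, 9.0])
print(round(w, 3), round(b, 3))
```

Each update touches a single example, which is why the method scales to data sets that are too large to process in one pass on one machine.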
The latest version is 2.0.1.
Website: http://spark.apache.org/mllib/
TensorFlow is an open source software library for machine learning developed by the Google Brain Team for various sorts of perceptual and language understanding tasks, and to conduct sophisticated research on machine learning and deep neural networks. It is Google Brain’s second generation machine learning system and can run on multiple CPUs and GPUs. TensorFlow is deployed in various products of Google like speech recognition, Gmail, Google Photos and even Search.
TensorFlow performs numerical computations using data flow graphs. These describe the mathematical computations as a directed graph of nodes and edges. Nodes implement mathematical operations and can also represent endpoints to feed in data, push out results or read/write persistent variables. Edges describe the input/output relationships between nodes. Data edges carry dynamically-sized multi-dimensional data arrays or tensors.
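The idea of a data flow graph can be sketched in a few lines of plain Python. This is not TensorFlow's API. The names `Node`, `placeholder`, `add`, `mul` and `run` are hypothetical, and plain numbers stand in for tensors.

```python
class Node:
    """A graph node: an operation plus the nodes whose outputs feed into it."""

    def __init__(self, op, inputs=(), name=""):
        self.op = op                  # function to apply, or None for an input
        self.inputs = tuple(inputs)   # incoming edges
        self.name = name


def placeholder(name):
    """An endpoint through which data is fed into the graph."""
    return Node(op=None, name=name)


def add(a, b):
    return Node(lambda x, y: x + y, (a, b), "add")


def mul(a, b):
    return Node(lambda x, y: x * y, (a, b), "mul")


def run(node, feed, cache=None):
    """Evaluate a node by first evaluating everything upstream of it.
    `feed` maps placeholder nodes to concrete values."""
    cache = {} if cache is None else cache
    if node in cache:
        return cache[node]
    if node.op is None:
        value = feed[node]
    else:
        value = node.op(*(run(i, feed, cache) for i in node.inputs))
    cache[node] = value
    return value


# Build the graph for y = w * x + b once, then feed in values.
x, w, b = placeholder("x"), placeholder("w"), placeholder("b")
y = add(mul(w, x), b)
print(run(y, {x: 3.0, w: 2.0, b: 1.0}))
```

Building the graph and running it are separate steps: nothing is computed until `run` is called with concrete inputs, so the same graph can be evaluated many times with different data.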
Its features are listed below.
- Highly flexible: TensorFlow enables users to write their own higher-level libraries on top of it by using C++ and Python, and express the neural network computation as a data flow graph.
- Portable: It can run on varied CPUs or GPUs, and even on mobile computing platforms. It also supports Docker and running via the cloud.
- Auto-differentiation: The user defines the computational architecture of the predictive model together with an objective function, and TensorFlow computes the required derivatives automatically.
- Diverse language options: It has an easy Python based interface and enables users to write code, and see visualisations and data flow graphs.
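Automatic differentiation, mentioned in the feature list above, can be illustrated with a small reverse-mode sketch in plain Python. This is not how TensorFlow is implemented and it does not use TensorFlow's API. The `Var` class and `backward` function are hypothetical names, and only addition, subtraction and multiplication are supported.

```python
class Var:
    """A value that remembers how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (input Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Var(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every upstream node."""
    order, seen = [], set()

    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for parent, _ in v.parents:
                visit(parent)
            order.append(v)  # inputs are listed before their consumers

    visit(output)
    output.grad = 1.0
    # Walk from the output back to the inputs, applying the chain rule.
    for v in reversed(order):
        for parent, local in v.parents:
            parent.grad += local * v.grad


# Squared error of a one-feature linear model: loss = (w * x + b - y)^2
w, x, b, y = Var(2.0), Var(3.0), Var(1.0), Var(10.0)
diff = w * x + b - y
loss = diff * diff
backward(loss)
print(loss.value, w.grad, b.grad)
```

The user only writes the forward computation of the objective. The gradients with respect to `w` and `b`, which a training loop would use to update the parameters, come from the recorded graph.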
The latest version is 0.10.0.
Website: www.tensorflow.org
Oryx 2 is a realisation of Lambda architecture built on Apache Spark and Apache Kafka for real-time large scale machine learning. It is designed for building applications and includes packaged, end-to-end applications for collaborative filtering, classification, regression and clustering.
Oryx 2 comprises the following three tiers.
- General Lambda architecture tier: Provides batch, speed and serving layers, which are not specific to machine learning.
- Specialisation on top of it, which provides machine learning abstractions for hyperparameter selection, etc.
- End-to-end implementation of the same standard machine learning algorithms as an application (ALS, random decision forests, k-means) on top.
Oryx 2 consists of the following layers:
- Batch layer: Used for computing new results from historical data and previous results.
- Speed layer: Produces and publishes incremental model updates from a stream of new data.
- Serving layer: Receives models and updates, and implements a synchronous API, exposing query operations on results.
- Data transport layer: Moves data between layers and takes input from external sources.
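The way the batch, speed and serving layers cooperate can be sketched with a toy example in plain Python. It does not use Oryx, Spark or Kafka. The "model" is just a count of events per item, a `deque` stands in for the data transport layer, and all names (`batch_layer`, `speed_layer`, `ServingLayer`, `update_topic`) are hypothetical.

```python
from collections import Counter, deque


def batch_layer(historical_events):
    """Recompute the full model (item -> count) from all historical data."""
    return Counter(historical_events)


def speed_layer(new_events):
    """Turn each newly arrived event into a small incremental model update."""
    for item in new_events:
        yield (item, 1)


class ServingLayer:
    """Holds the current model, applies updates and answers queries."""

    def __init__(self):
        self.model = Counter()

    def load_model(self, model):
        self.model = Counter(model)

    def apply_update(self, update):
        item, delta = update
        self.model[item] += delta

    def most_popular(self, n=1):
        return [item for item, _ in self.model.most_common(n)]


# A deque stands in for the data transport layer between the tiers.
update_topic = deque()

serving = ServingLayer()
serving.load_model(batch_layer(["a", "b", "a", "c"]))  # periodic full rebuild
update_topic.extend(speed_layer(["b", "b"]))           # fresh events stream in
while update_topic:
    serving.apply_update(update_topic.popleft())
print(serving.most_popular())
```

The batch layer is slow but complete, while the speed layer is fast but incremental. The serving layer merges the two, so queries reflect recent data without waiting for the next full rebuild.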
The latest version is 2.2.1.
Website: http://oryx.io/
The Accord.NET framework is divided into libraries, which are available via the installer, compressed archives and NuGet packages. These include Accord.Math, Accord.Statistics, Accord.MachineLearning, Accord.Neuro, Accord.Imaging, Accord.Audio, Accord.Vision, Accord.Controls, Accord.Controls.Imaging, Accord.Controls.Audio, Accord.Controls.Vision, etc.
Its features are:
- Matrix library for an increase in code reusability, and gradual change of existing algorithms over standard .NET structures.
- Consists of more than 40 different statistical distributions like hidden Markov models and mixture models.
- Consists of more than 30 hypothesis tests like ANOVA, two-sample, multiple-sample, etc.
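- Consists of more than 38 kernel functions for use with kernel methods such as kernel support vector machines, kernel principal component analysis (KPCA) and kernel discriminant analysis (KDA).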
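For readers unfamiliar with kernel functions, here is a minimal sketch of one common example, the Gaussian (RBF) kernel, together with the Gram matrix that kernel methods work on. It is written in plain Python rather than .NET and does not use the framework's API. The names `gaussian_kernel` and `gram_matrix` and the sample points are assumptions.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Similarity that decays with the squared distance between x and y."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gram_matrix(points, kernel):
    """Matrix of pairwise kernel values, the input to most kernel methods."""
    return [[kernel(p, q) for q in points] for p in points]


points = [(0.0, 0.0), (1.0, 1.0), (3.0, 0.0)]
for row in gram_matrix(points, gaussian_kernel):
    print([round(v, 3) for v in row])
```

A kernel method only ever sees the data through these pairwise values, so swapping in a different kernel function changes the notion of similarity without changing the learning algorithm.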
Website: www.accord-framework.net
The key concepts used in Amazon ML are listed below.
- Datasources: Contain metadata associated with data inputs to Amazon ML.
- ML models: Generate predictions using the patterns extracted from the input data.
- Evaluations: Measure the quality of ML models.
- Batch predictions: Asynchronously generate predictions for multiple input data observations.
- Real-time predictions: Synchronously generate predictions for individual data observations.
- Supports multiple data sources within its system.
- Allows users to create a data source object from data residing in Amazon Redshift – the data warehouse Platform as a Service.
- Allows users to create a data source object from data stored in the MySQL database.
- Supports three types of models: binary classification, multi-class classification and regression.