GNU/Linux AI & Alife HOWTO: Statistical & Machine Learning

7. Statistical & Machine Learning

This section is about getting machines to learn to do something rather than explicitly programming them to do it. The field tends to deal with pattern matching a lot and is heavily based on math and statistics. Technically, Connectionism falls under this category, but it is such a large sub-field that I'm keeping it in a separate section.

7.1 Libraries

Libraries or frameworks used for writing machine learning systems.

CognitiveFoundry

Web site: http://foundry.sandia.gov/

The Cognitive Foundry is a modular Java software library for the research and development of cognitive systems. It contains many reusable components for machine learning, statistics, and cognitive modeling. It is primarily designed to be easy to plug into applications to provide adaptive behaviors.

CompLearn

Web site: http://complearn.org/

CompLearn is a software system built to support compression-based learning in a wide variety of applications. It provides this support in the form of a library written in highly portable ANSI C that runs in most modern computer environments with minimal confusion. It also supplies a small suite of simple, composable command-line utilities as simple applications that use this library. Together with other commonly used machine-learning tools such as LibSVM and GraphViz, CompLearn forms an attractive offering in machine-learning frameworks and toolkits.
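
To give a feel for compression-based learning, here is a minimal Python sketch of the normalized compression distance (NCD), a similarity measure commonly used in this area. It does not use the CompLearn library; zlib stands in for the compressor, and the function names and sample data are made up for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a rough stand-in for its complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical sample data: a repetitive English phrase and a digit stream.
english = b"the quick brown fox jumps over the lazy dog " * 20
digits = " ".join(str((i * 7919) % 10007) for i in range(200)).encode()

same = ncd(english, english)      # an object compared with itself
different = ncd(english, digits)  # two unrelated objects

print("english vs english:", round(same, 3))
print("english vs digits: ", round(different, 3))
```

Objects that share structure compress well together, so their distance comes out small; unrelated objects end up close to 1.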

Elefant

Web site: http://elefant.developer.nicta.com.au/

Elefant (Efficient Learning, Large-scale Inference, and Optimisation Toolkit) is an open source library for machine learning licensed under the Mozilla Public License (MPL). The toolkit provides:

algorithms for machine learning utilising the power of multi-core/multi-threaded processors/operating systems (Linux, Windows, Mac OS X),
a graphical user interface for users who want to quickly prototype machine learning experiments,
tutorials to support learning about Statistical Machine Learning (Statistical Machine Learning at The Australian National University), and
detailed and precise documentation for each of the above.

Maximum Entropy Toolkit

Web site: http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html

The Maximum Entropy Toolkit provides a set of tools and a library for constructing maximum entropy (maxent) models in either Python or C++.

The maximum entropy model is a general-purpose machine learning framework that has proved to be highly expressive and powerful in statistical natural language processing, statistical physics, computer vision, and many other fields.
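
As a rough illustration of what a maxent model is, the following Python sketch fits a small conditional log-linear model, p(label | tokens) proportional to exp(sum of feature weights). It does not use the toolkit's API. The indicator features on (token, label) pairs, the plain gradient ascent training loop, the function names, and the toy data are all assumptions made for this example.

```python
import math
from collections import defaultdict


def predict_proba(weights, tokens, labels):
    """Return p(label | tokens) under the log-linear model."""
    scores = {lab: sum(weights.get((tok, lab), 0.0) for tok in tokens)
              for lab in labels}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {lab: math.exp(s - top) for lab, s in scores.items()}
    total = sum(exps.values())
    return {lab: e / total for lab, e in exps.items()}


def train_maxent(samples, labels, epochs=200, rate=0.5):
    """Fit weights for (token, label) indicator features by gradient ascent."""
    weights = defaultdict(float)
    for _ in range(epochs):
        grad = defaultdict(float)
        for tokens, gold in samples:
            probs = predict_proba(weights, tokens, labels)
            for tok in tokens:
                grad[(tok, gold)] += 1.0            # observed feature count
                for lab in labels:
                    grad[(tok, lab)] -= probs[lab]  # expected feature count
        for feat, g in grad.items():
            weights[feat] += rate * g / len(samples)
    return weights


# Hypothetical toy data: guess a word class from the words in a phrase.
LABELS = ["noun", "verb"]
DATA = [
    (["the", "dog"], "noun"),
    (["a", "cat"], "noun"),
    (["to", "run"], "verb"),
    (["will", "eat"], "verb"),
]

model = train_maxent(DATA, LABELS)
probs = predict_proba(model, ["the", "cat"], LABELS)
print(probs)
```

The gradient is the difference between observed and expected feature counts, which is the defining property of maxent training; production toolkits typically use faster optimizers than this loop.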

Milk

Web site: http://packages.python.org/milk/
Web site: https://github.com/luispedro/milk

Milk is a machine learning toolkit in Python. Its focus is on supervised classification, with several classifiers available: SVMs (based on libsvm), k-NN, random forests, and decision trees. It also performs feature selection. These classifiers can be combined in many ways to form different classification systems. For unsupervised learning, milk supports k-means clustering and affinity propagation.
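
For readers new to k-means clustering, here is a minimal pure-Python sketch of the standard assign-then-update loop (Lloyd's algorithm). It is not milk's implementation or API; seeding the centroids with the first k points is a simplification chosen to keep the example deterministic, and the data is invented.

```python
def kmeans(points, k, iterations=20):
    """Cluster points with Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


# Hypothetical data: two well-separated groups of 2-D points.
POINTS = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
centroids, assignment = kmeans(POINTS, 2)
print(centroids)
print(assignment)
```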

NLTK

Web site: http://nltk.org/

NLTK, the Natural Language Toolkit, is a suite of Python libraries and programs for symbolic and statistical natural language processing. NLTK includes graphical demonstrations and sample data. It is accompanied by extensive documentation, including tutorials that explain the underlying concepts behind the language processing tasks supported by the toolkit.

NLTK is ideally suited to students who are learning NLP (natural language processing) or conducting research in NLP or closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning. NLTK has been used successfully as a teaching tool, as an individual study tool, and as a platform for prototyping and building research systems.

peach

Web site: http://code.google.com/p/peach/

Peach is a pure-Python module, based on SciPy and NumPy, that implements algorithms for computational intelligence and machine learning. Methods implemented include, but are not limited to, artificial neural networks, fuzzy logic, genetic algorithms, and swarm intelligence.

The aim of this library is primarily educational. Nonetheless, care was taken to make the methods implemented also very efficient.

pebl

Web site: http://code.google.com/p/pebl-project/

Pebl is a Python library and command-line application for learning the structure of a Bayesian network given prior knowledge and observations. Pebl includes the following features:

Can learn with observational and interventional data
Handles missing values and hidden variables using exact and heuristic methods
Provides several learning algorithms; makes creating new ones simple
Has facilities for transparent parallel execution using several cluster/grid resources
Calculates edge marginals and consensus networks
Presents results in a variety of formats

PyBrain

Web site: http://pybrain.org/

PyBrain is a modular machine learning library for Python. Its goal is to offer flexible, easy-to-use, yet still powerful algorithms for machine learning tasks, along with a variety of predefined environments to test and compare your algorithms.

PyBrain contains algorithms for neural networks, for reinforcement learning (and the combination of the two), for unsupervised learning, and for evolution. Since most current problems deal with continuous state and action spaces, function approximators (like neural networks) must be used to cope with the large dimensionality. The library is built around neural networks at its core, and all of the training methods accept a neural network as the instance to be trained. This makes PyBrain a powerful tool for real-life tasks.

MBT

Web site: http://ilk.uvt.nl/mbt/

MBT is a memory-based tagger-generator and tagger in one. The tagger-generator part can generate a sequence tagger on the basis of a training set of tagged sequences; the tagger part can tag new sequences. MBT can, for instance, be used to generate part-of-speech taggers or chunkers for natural language processing. It has also been used for named-entity recognition, information extraction in domain-specific texts, and disfluency chunking in transcribed speech.

MLAP book samples

Web site: http://seat.massey.ac.nz/personal/s.r.marsland/MLBook.html

Not a library per se, but a whole slew of example machine learning algorithms from the book "Machine Learning: An Algorithmic Perspective" by Stephen Marsland. All code is written in Python.

scikits.learn

Web site: http://scikit-learn.org/stable/

scikits.learn (now distributed as scikit-learn) is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific Python packages (numpy, scipy, matplotlib). It aims to provide simple and efficient solutions to learning problems that are accessible to everybody and reusable in various contexts: machine learning as a versatile tool for science and engineering.

Shogun

Web site: http://www.shogun-toolbox.org/

SHOGUN is a machine learning toolbox focused on large-scale kernel methods, especially Support Vector Machines (SVMs). It provides a generic SVM object that interfaces to several different SVM implementations, among them the state-of-the-art LibSVM and SVMLight. Each of the SVMs can be combined with a variety of kernels. The toolbox provides efficient implementations of the most common kernels, such as the Linear, Polynomial, Gaussian, and Sigmoid kernels, and also comes with a number of recent string kernels, e.g. the Locality Improved, Fisher, TOP, Spectrum, and Weighted Degree (with shifts) kernels. For the latter, the efficient LINADD optimizations are implemented. SHOGUN also offers the freedom of working with custom pre-computed kernels. One of its key features is the combined kernel, which is constructed as a weighted linear combination of a number of sub-kernels, each of which need not work on the same domain. An optimal sub-kernel weighting can be learned using Multiple Kernel Learning.

Currently, SVM 2-class classification and regression problems can be dealt with. However, SHOGUN also implements a number of linear methods such as Linear Discriminant Analysis (LDA), Linear Programming Machine (LPM), and (Kernel) Perceptrons, and features algorithms to train hidden Markov models. The input feature objects can be dense, sparse, or strings, of type int/short/double/char, and can be converted into different feature types. Chains of preprocessors (e.g. subtracting the mean) can be attached to each feature object, allowing for on-the-fly pre-processing.

SHOGUN is implemented in C++ and interfaces to Matlab(tm), R, Octave and Python.
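
The combined kernel idea, a weighted linear combination of sub-kernels, can be sketched in a few lines of plain Python. This is not SHOGUN's interface; the function names, the kernel parameters, and the fixed weights are illustrative assumptions, and for simplicity all sub-kernels here work on the same numeric vectors.

```python
import math


def linear_kernel(x, y):
    """Plain dot product."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, c=1.0):
    """(x . y + c) ** degree."""
    return (linear_kernel(x, y) + c) ** degree


def gaussian_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def combined_kernel(sub_kernels, weights):
    """Return k(x, y) = sum_i weights[i] * sub_kernels[i](x, y)."""
    def kernel(x, y):
        return sum(w * k(x, y) for w, k in zip(weights, sub_kernels))
    return kernel


def gram_matrix(kernel, points):
    """Kernel value for every pair of points."""
    return [[kernel(p, q) for q in points] for p in points]


# Hypothetical weights; Multiple Kernel Learning would learn these instead.
kernel = combined_kernel([linear_kernel, gaussian_kernel], [0.5, 0.5])
POINTS = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
gram = gram_matrix(kernel, POINTS)
for row in gram:
    print([round(v, 3) for v in row])
```

A non-negative weighted sum of valid kernels is itself a valid kernel, which is why the combination can be handed to an SVM like any single kernel.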

timbl

Web site: http://ilk.uvt.nl/timbl/

The Tilburg Memory Based Learner, TiMBL, is a tool for NLP research and for many other domains where classification tasks are learned from examples. It is an efficient implementation of the k-nearest neighbor classifier.

TiMBL's features are:

Fast, decision-tree-based implementation of k-nearest neighbor classification;
Implementations of IB1 and IB2, IGTree, TRIBL, and TRIBL2 algorithms;
Similarity metrics: Overlap, MVDM, Jeffrey Divergence, Dot product, Cosine;
Feature weighting metrics: information gain, gain ratio, chi squared, shared variance;
Distance weighting metrics: inverse, inverse linear, exponential decay;
Extensive verbosity options to inspect nearest neighbor sets;
Server functionality and extensive API;
Fast leave-one-out testing and internal cross-validation;
Handling of user-defined example weighting.
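
To show what memory-based classification with the Overlap metric looks like, here is a brute-force Python sketch. It is not TiMBL's API and does none of its tree-based speedups. The function names, the optional per-feature weights (standing in for a feature weighting metric such as information gain), and the toy data are assumptions for illustration.

```python
from collections import Counter


def overlap_distance(a, b, weights=None):
    """Weighted count of feature positions where two examples differ."""
    if weights is None:
        weights = [1.0] * len(a)
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def knn_classify(memory, query, k=1, weights=None):
    """Majority vote among the k stored examples nearest to the query."""
    ranked = sorted(memory, key=lambda ex: overlap_distance(ex[0], query, weights))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical memory of examples: three symbolic features and a class label.
MEMORY = [
    (("d", "o", "g"), "noun"),
    (("l", "o", "g"), "noun"),
    (("r", "u", "n"), "verb"),
    (("r", "a", "n"), "verb"),
]

first = knn_classify(MEMORY, ("f", "o", "g"), k=1)
second = knn_classify(MEMORY, ("r", "o", "n"), k=3)
print(first, second)
```

"Training" is just storing the examples; all the work happens at classification time, which is why fast lookup structures matter in a real implementation.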

7.2 Applications

Full applications that implement various machine learning or statistical systems oriented toward general learning (i.e., no spam filters and the like).

dbacl

Web site: http://dbacl.sourceforge.net/

The dbacl project consists of a set of lightweight UNIX/POSIX utilities which can be used, either directly or in shell scripts, to classify text documents automatically, according to Bayesian statistical principles.
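
For a rough idea of Bayesian text classification, here is a generic multinomial naive Bayes sketch in Python with add-one smoothing. It is not dbacl's model or command-line interface; the function names, the whitespace tokenization, and the toy documents are assumptions for this example.

```python
import math
from collections import Counter


def train(docs):
    """docs is a list of (text, category); returns word and document counts."""
    word_counts, doc_counts = {}, Counter()
    for text, cat in docs:
        word_counts.setdefault(cat, Counter()).update(text.lower().split())
        doc_counts[cat] += 1
    return word_counts, doc_counts


def classify(model, text):
    """Return the category with the highest posterior log-probability."""
    word_counts, doc_counts = model
    vocab = set().union(*word_counts.values())
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for cat, counts in word_counts.items():
        score = math.log(doc_counts[cat] / total_docs)  # log prior
        denom = sum(counts.values()) + len(vocab)       # add-one smoothing
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best, best_score = cat, score
    return best


# Hypothetical training documents.
DOCS = [
    ("kernel panic after driver update", "linux"),
    ("compile the kernel with gcc", "linux"),
    ("bake the bread in a hot oven", "cooking"),
    ("knead the dough and bake", "cooking"),
]

MODEL = train(DOCS)
print(classify(MODEL, "gcc kernel driver"))
print(classify(MODEL, "bake the dough"))
```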

Torch

Web site: http://www.torch.ch/
Old versions: Torch5, Torch3

Torch provides a Matlab-like environment for state-of-the-art machine learning algorithms. It is easy to use and provides a very efficient implementation, thanks to an easy and fast scripting language (Lua) and an underlying C implementation.

Vowpal Wabbit

Web site: http://hunch.net/~vw/

Vowpal Wabbit is a fast online learning algorithm. It features:

flexible input data specification
speedy learning
scalability (bounded memory footprint, suitable for distributed computation)
feature pairing

The core algorithm is specialist gradient descent (GD) on a loss function (several are available). The code should be easily usable.
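
The general idea of online gradient descent on a loss function can be sketched as below. This is not Vowpal Wabbit's code or command-line interface. The class name, the squared loss, the learning rate, and the hashed fixed-size weight table (one simple way to keep the memory footprint bounded) are assumptions chosen for illustration.

```python
import zlib


class OnlineLinearLearner:
    """Online gradient descent on squared loss over a fixed-size weight table."""

    def __init__(self, bits=10, rate=0.1):
        self.size = 1 << bits            # memory stays bounded at 2**bits weights
        self.weights = [0.0] * self.size
        self.rate = rate

    def _index(self, name):
        # Hash the feature name into the table (colliding names share a weight).
        return zlib.crc32(name.encode("utf-8")) % self.size

    def predict(self, features):
        return sum(self.weights[self._index(n)] * v for n, v in features.items())

    def learn(self, features, target):
        """Take one gradient step on (prediction - target)**2 / 2."""
        error = self.predict(features) - target
        for name, value in features.items():
            self.weights[self._index(name)] -= self.rate * error * value
        return error


# Hypothetical stream generated by target = 2*a - 1*b, seen one example at a time.
STREAM = [
    ({"a": 1.0, "b": 0.0}, 2.0),
    ({"a": 0.0, "b": 1.0}, -1.0),
    ({"a": 1.0, "b": 1.0}, 1.0),
]

learner = OnlineLinearLearner()
for _ in range(200):
    for features, target in STREAM:
        learner.learn(features, target)

print(round(learner.predict({"a": 1.0, "b": 1.0}), 4))
```

Each example is used for one update and then discarded, so memory use depends only on the size of the weight table and not on the amount of data.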
