Machine Learning
Outlier detection
Thesis J. Janssens
- github: jeroenjanssens/phd-thesis
- sos.jeroenjanssens.com: Stochastic Outlier Selection
- slideshare: Outlier Selection and One Class Classification
- github: jeroenjanssens/scikit-sos
Stochastic Outlier Selection
- Unsupervised outlier selection algorithm
- Employs concept of affinity
- Computes outlier probabilities
- One parameter: perplexity
- Inspired by t-SNE
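The maintained implementation is the scikit-sos package linked above. As a rough illustration of the mechanics (dissimilarities to affinities to binding probabilities to outlier probabilities), here is a minimal, unoptimised R sketch; the bisection depth and the toy data are choices made for this example, not taken from the thesis.

```r
# Minimal R sketch of the SOS idea: dissimilarities -> affinities ->
# binding probabilities -> outlier probabilities.
sos <- function(X, perplexity = 10) {
  n  <- nrow(X)
  D2 <- as.matrix(dist(X))^2               # squared Euclidean dissimilarities
  B  <- matrix(0, n, n)                    # binding probabilities b_ij

  for (i in 1:n) {
    beta <- 1; lo <- 0; hi <- Inf          # precision of the Gaussian kernel
    for (iter in 1:50) {                   # bisection so that the entropy of
      a <- exp(-beta * D2[i, -i])          # point i's affinities matches
      b <- a / sum(a)                      # log(perplexity)
      H <- -sum(b[b > 0] * log(b[b > 0]))
      if (H > log(perplexity)) {           # too flat: sharpen the kernel
        lo <- beta
        beta <- if (is.finite(hi)) (lo + hi) / 2 else beta * 2
      } else {                             # too peaked: widen the kernel
        hi <- beta
        beta <- (lo + hi) / 2
      }
    }
    B[i, -i] <- b
  }

  # A point is an outlier when no other point binds to it:
  # P(outlier_j) = prod over i of (1 - b_ij)
  apply(1 - B, 2, prod)
}

set.seed(1)
X <- rbind(matrix(rnorm(100), ncol = 2), c(6, 6))   # 50 inliers plus one far point
round(sos(X, perplexity = 10), 2)
```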
Tools
rattle
library(rattle)
rattle()
- Click Execute
- Click Yes (load the sample weather dataset)
- Click the Model tab
- Click Execute (to build a decision tree)
- Click Draw to display the decision tree (loads other packages as required)
- Click the Forest radio button
- Click Execute (to build a random forest - loads packages as required)
- Click the Evaluate tab
- Click the Risk radio button (installs packages as required)
- Click Execute to display two Risk (Cumulative) performance plots
- Click the Log tab
- Click the Export button to save the script to the file weather_script.R in the home folder
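For comparison, roughly the same analysis scripted by hand (a sketch; the actual script exported from the Log tab will differ in detail, and it assumes the rpart, randomForest and rpart.plot packages are installed):

```r
library(rattle)                    # ships the sample weather data set
library(rpart)
library(randomForest)

data(weather, package = "rattle")
# Drop identifier columns and the outcome-leaking RISK_MM variable;
# RainTomorrow is the target.
ds <- na.omit(subset(weather, select = -c(Date, Location, RISK_MM)))

# Model tab -> Tree -> Execute, then Draw
fit_tree <- rpart(RainTomorrow ~ ., data = ds)
fancyRpartPlot(fit_tree)

# Model tab -> Forest -> Execute
fit_rf <- randomForest(RainTomorrow ~ ., data = ds, ntree = 500)
print(fit_rf)
```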
Supervised Learning
Random Forests
R
- awesome-machine-learning: R
- rpart: Recursive Partitioning Using the RPART Routines
- party: A Laboratory for Recursive Partytioning
- Coursera: Predictive Analysis
- Coursera: Practical Machine Learning
- wikipedia: Decision Tree Learning
Random Forest Exercises
Data Version Control
- dataversioncontrol.com: Make your data science projects reproducible and shareable
R libraries
caret
- caret documentation
- github: topepo: caret
- companion page to Applied Predictive Modeling by Max Kuhn
- github APM exercises
- webinar on caret
- Article in JSS
- github: topepo: useR2016 Slides and code for the 2016 useR! tutorial “Never Tell Me the Odds! Machine Learning with Class Imbalances”
- Applied Predictive Modeling: useR! 2014 morning tutorial
Methods
Bagging
Some models perform bagging themselves; in the train function, consider the method options:
- bagEarth
- treebag
- bagFDA
Alternatively, bag any model using the bag function (see the sketch below).
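A hedged sketch of both routes, using the iris data shipped with R (method = "treebag" relies on the ipred package, ldaBag on MASS):

```r
library(caret)

# Bagged CART via train()
set.seed(123)
fit <- train(Species ~ ., data = iris,
             method = "treebag",                       # bagged trees
             trControl = trainControl(method = "cv", number = 5))
print(fit)

# bag() wraps an arbitrary model; bagControl() supplies the fit / predict /
# aggregate functions (ldaBag is one of the helper sets that ship with caret).
fit_bag <- bag(iris[, 1:4], iris$Species, B = 10,
               bagControl = bagControl(fit       = ldaBag$fit,
                                       predict   = ldaBag$pred,
                                       aggregate = ldaBag$aggregate))
predict(fit_bag, head(iris[, 1:4]))
```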
Links
- Cortana Intelligence Gallery
- A Super Harsh Guide to Machine Learning
- DataTau
- kaggle in class: Academic Machine Learning Competitions
- UC Irvine Machine Learning Repository, available from the mlbench R package
- KDNuggets: The 10 Algorithms Machine Learning Engineers Need to Know
- Machine Learning: An In-Depth, Non-Technical Guide
- Deep Learning Book by Ian Goodfellow, Yoshua Bengio and Aaron Courville
Papers
Machine Learning for Hackers
Machine Learning for Hackers by Drew Conway and John Myles White (O’Reilly). Copyright 2012 Drew Conway and John Myles White, 978-1-449-30371-6.
Resources
- p.37 01_heights_weights_genders.csv
Models
- github: rushter: MLAlgorithms (minimal and clean examples of machine learning algorithms)
- Classification (Spam Filtering)
- Ranking (Priority Inbox)
- Regression (Predicting Page Views)
- Regularization (Text Regression / Logistic regression)
- Optimization (breaking codes / Ridge regression)
- PCA (construct market index / unsupervised learning)
- MDS (visual exploration / distance metrics)
- knn (Recommendation Systems)
- Social Graph Analysis
- tree-based models
- gradient boosting: LightGBM
logistic regression (classification algorithm)
- a qualitative concept encoded using numeric values that represent a Boolean distinction: 1 means true, whereas 0 means false ("dummy coded")
- this numeric coding style is required by some machine learning algorithms (e.g. logistic regression via the glm function in R)
Logistic regression is, deep down, essentially a form of regression in which one predicts the probability that an item belongs to one of two categories. (p.175)
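A small glm() illustration of this 0/1 coding, using the mtcars data shipped with R rather than the book's heights/weights data:

```r
# Dummy-coded logistic regression with glm(): mtcars$am is already 0/1
# (1 = manual transmission); family = binomial models P(am = 1).
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(fit)

# Predicted probability that a 3,000 lb, 110 hp car has a manual gearbox
predict(fit, newdata = data.frame(wt = 3.0, hp = 110), type = "response")
```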
kNN k-Nearest Neighbors Algorithm
SVM Support Vector Machine
Deep Learning / Neural Networks
- youtube: Tensorflow and deep learning
- Codelabs: TensorFlow and deep learning
- github: Lasagne/Lasagne
Other Models
- Markov models
- Generalized Linear Models (GLM)
- Probabilistic Graphical Models
- Latent Variable Models
- Time-Series Model
- Real-Time Learning
Conferences
Master Programs
Deep Learning
Torsten Hothorn (UZH) on Big Data Science
17 Great Machine Learning Libraries
Source: daoudclarke.github.io/machine-learning-libraries
- CNTK
- Torch
- Caffe
- scikit-learn
- comprehensive and easy to use; I wrote a whole article on why I like this library.
install
pip install git+https://github.com/scikit-learn/scikit-learn.git --user
Python
- Tensorflow
- open source software library for numerical computation using data flow graphs
- PyBrain
- Neural networks are one thing that is missing from scikit-learn, but this module makes up for it.
- nltk
- really useful if you’re doing anything NLP or text mining related.
- Theano
- efficient computation of mathematical expressions using GPU. Excellent for deep learning.
- Pylearn2
- machine learning toolbox built on top of Theano - in very early stages of development.
- MDP (Modular toolkit for Data Processing)
- a framework that is useful when setting up workflows.
Java
- SystemML
- SystemML is a flexible, scalable machine learning (ML) language written in Java. SystemML’s distinguishing characteristics are: (1) algorithm customizability, (2) multiple execution modes, including Standalone, Hadoop Batch, and Spark Batch, and (3) automatic optimization.
- Spark
- Apache’s new upstart, supposedly up to a hundred times faster than Hadoop, now includes MLLib, which contains a good selection of machine learning algorithms, including classification, clustering and recommendation generation. Currently undergoing rapid development. Development can be in Python as well as JVM languages.
- Mahout
- Apache’s machine learning framework built on top of Hadoop, this looks promising, but comes with all the baggage and overhead of Hadoop.
- Weka
- this is a Java based library with a graphical user interface that allows you to run experiments on small datasets. This is great if you restrict yourself to playing around to get a feel for what is possible with machine learning. However, I would avoid using this in production code at all costs: the API is very poorly designed, the algorithms are not optimised for production use and the documentation is often lacking.
- Mallet
- another Java based library with an emphasis on document classification. I’m not so familiar with this one, but if you have to use Java this is bound to be better than Weka.
- JSAT
- stands for “Java Statistical Analysis Tool” - created by Edward Raff and was born out of his frustration with Weka (I know the feeling). Looks pretty cool.
.NET
- Accord.NET
- this seems to be pretty comprehensive, and comes recommended by primaryobjects on Reddit. There is perhaps a slight slant towards image processing and computer vision, as it builds on the popular library AForge.NET for this purpose.
Another option is to use one of the Java libraries compiled to .NET using IKVM - I have used this approach with success in production.
C++
- Vowpal Wabbit
- designed for very fast learning and released under a BSD license, this comes recommended by terath on Reddit.
- MultiBoost
- a fast C++ framework implementing some boosting algorithms as well as some cascades (like the Viola-Jones cascades). It’s mainly focused on AdaBoost.MH so it is multi-class/multi-label.
- Shogun
- large machine learning library with a focus on kernel methods and support vector machines. Bindings to Matlab, R, Octave and Python.
General
- LibSVM and LibLinear
- these are C libraries for support vector machines; there are also bindings or implementations for many other languages. These are the libraries used for support vector machine learning in Scikit-learn.
Coursera
- Machine Learning by Andrew Ng, Stanford University
- github: faridcher/machine-learning-course R version assignments of Stanford machine learning course
- github: Borye/machine-learning-coursera-1
- github: JWarmenhoven/Coursera-Machine-Learning
People
- Graham Williams: togaware.com
Books
- From Linear Models to Machine Learning
- Author: Norman Matloff. Publisher: CRC Press.
- The Elements of Statistical Learning
- Author: Hastie, Tibshirani, and Friedman. Content: formal specifications of basic machine learning techniques (mathematics, statistics, computer science). URL: www-stat.stanford.edu/~tibs/ElemStatLearn. MOOC: r-bloggers: In-depth introduction to machine learning in 15 hours of expert videos.
- Machine Learning
- Author: Mitchell, T.M. Publisher: McGraw-Hill, NY. Year: 1997. URL: http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/mlbook.html. Course: http://www.cs.cmu.edu/~tom/10701_sp11/
- Data Mining: Practical Machine Learning Tools and Techniques
- Author: Witten, I., Frank, E. and Hall, M. Edition: 3rd. Publisher: Morgan Kaufmann, San Mateo, CA. Year: 2011.
Articles
ASA
- Statistics as a Science, Not an Art: The Way to Survive in Data Science
- 2015-02, statscience_feb2015
- Statistics Losing Ground to Computer Science
- 2014-11, statistics-losing-ground-to-computer-science
- Time to Embrace a New Identity?
- 2014-10, statview-oct14
- Statistics Training: A Big Role in Big Data?
- 2014-05, statview-big-data
- Leo Breiman, Statistical Modeling: The Two Cultures, Statistical Science 16(3), 2001
- breiman.pdf
- Pedro Domingos, A Few Useful Things to Know about Machine Learning
- Communications of the ACM, Vol. 55 No. 10, Pages 78-87, 2012
- cacm12.pdf, ml-intro-domingos2012.pdf
- github: shagunsodhani: papers-i-read
- Why becoming a data scientist is NOT actually easier than you think
Data Sources
- Fluentd
- data collector for unified logging layer: http://www.fluentd.org/