Machine Learning
Outlier detection
Thesis J. Janssens
 github: jeroenjanssens/phd-thesis
 sos.jeroenjanssens.com Stochastic Outlier Selection
 slideshare: Outlier Selection and One Class Classification
 github: jeroenjanssens/scikit-sos
Stochastic Outlier Selection
 Unsupervised outlier selection algorithm
 Employs concept of affinity
 Computes outlier probabilities
 One parameter: perplexity
 Inspired by t-SNE
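The points above (affinity, outlier probabilities, a single perplexity parameter) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the scikit-sos API: it assumes squared Euclidean dissimilarity, a Gaussian affinity, and a simple bisection search for the per-point precision `beta`. The function names `binding_row`, `perplexity_of` and `sos` are hypothetical.

```python
import math

def binding_row(sq_dists, i, beta):
    """Binding probabilities from point i to every other point.

    Affinity is exp(-beta * d_ij^2). Subtracting the smallest distance first
    avoids underflow and cancels out in the normalisation.
    """
    d_min = min(d for j, d in enumerate(sq_dists) if j != i)
    aff = [0.0 if j == i else math.exp(-beta * (d - d_min))
           for j, d in enumerate(sq_dists)]
    total = sum(aff)
    return [a / total for a in aff]

def perplexity_of(probs):
    """Perplexity = exp(Shannon entropy) of a probability vector."""
    return math.exp(-sum(p * math.log(p) for p in probs if p > 0.0))

def sos(points, perplexity=3.0, steps=60):
    """Return one outlier probability per point (assumed sketch of SOS)."""
    n = len(points)
    sq = [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
          for p in points]
    binding = []
    for i in range(n):
        # Bisection on beta so that row i reaches the requested perplexity.
        lo, hi, beta = 0.0, None, 1.0
        for _ in range(steps):
            if perplexity_of(binding_row(sq[i], i, beta)) > perplexity:
                lo = beta  # neighbourhood too wide: sharpen
                beta = beta * 2.0 if hi is None else (lo + hi) / 2.0
            else:
                hi = beta  # neighbourhood too narrow: widen
                beta = (lo + hi) / 2.0
        binding.append(binding_row(sq[i], i, beta))
    # A point is an outlier when no other point binds to it.
    return [math.prod(1.0 - binding[i][j] for i in range(n) if i != j)
            for j in range(n)]

# Made-up data: five clustered points and one distant point.
points = [(0.0, 0.0), (0.2, 1.0), (1.1, 0.1), (0.9, 1.2), (0.4, 0.6), (8.0, 8.0)]
scores = sos(points, perplexity=3.0)
print([round(s, 3) for s in scores])
```

The perplexity must be smaller than the number of points minus one, since it acts as a soft count of neighbours per point.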
Tools
rattle
library(rattle)
rattle()
 Click Execute
 Click Yes (load the sample weather dataset)
 Click the Model tab
 Click Execute (to build a decision tree)
 Click Draw to display the decision tree (loads other packages as required)
 Click the Forest radio button
 Click Execute (to build a random forest; loads packages as required)
 Click the Evaluate tab
 Click the Risk radio button (installs packages as required)
 Click Execute to display two Risk (Cumulative) performance plots
 Click the Log tab
 Click the Export button to save the script to the file weather script.R in the home folder
Supervised Learning
Random Forests
R
 awesome-machine-learning: R
 rpart: Recursive Partitioning Using the RPART Routines
 party: A Laboratory for Recursive Partytioning
 Coursera: Predictive Analysis
 Coursera: Practical Machine Learning
 wikipedia: Decision Tree Learning
Random Forest Exercises
Data Version Control
 dataversioncontrol.com: Make your data science projects reproducible and shareable
R libraries
caret
 caret documentation
 github: topepo: caret
 companion page to Applied Predictive Modeling by Max Kuhn
 github APM exercises
 webinar on caret
 Article in JSS
 github: topepo: useR2016 Slides and code for the 2016 useR! tutorial “Never Tell Me the Odds! Machine Learning with Class Imbalances”
 Applied Predictive Modeling: useR! 2014 morning tutorial
Methods
Bagging
Some models perform bagging; in the train function, consider these method options:
 bagEarth
 treebag
 bagFDA
Alternatively, bag any model using the bag function.
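To make the idea behind bagging concrete, here is a small language-neutral sketch in plain Python rather than caret's R interface. It is an illustration under assumptions, not caret's `bag` implementation: the base learner is a one-feature decision stump, and the names `fit_stump` and `bag` are hypothetical. Each model is fitted on a bootstrap sample (drawn with replacement) and predictions are combined by majority vote.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Fit a one-feature threshold classifier by exhaustive search over split points."""
    best = None
    for t in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right

def bag(fit, xs, ys, n_models=25, seed=0):
    """Bootstrap aggregating: fit n_models base learners, predict by majority vote."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample, with replacement
        models.append(fit([xs[i] for i in idx], [ys[i] for i in idx]))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict

# Made-up, well-separated training data (illustration only).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 6.0, 6.5, 7.0, 7.5, 8.0]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
predict = bag(fit_stump, xs, ys)
print(predict(1.0), predict(7.0))
```

Because `bag` only needs a `fit` callable that returns a prediction function, any base learner can be swapped in, which mirrors the "bag any model" idea.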
Links
 Cortana Intelligence Gallery
 A Super Harsh Guide to Machine Learning
 DataTau
 kaggle in class: Academic Machine Learning Competitions
 UC Irvine Machine Learning Repository (datasets also available from the mlbench R package)
 KDNuggets: The 10 Algorithms Machine Learning Engineers Need to Know
 Machine Learning: An In-Depth, Non-Technical Guide
 Deep Learning Book by Ian Goodfellow, Yoshua Bengio and Aaron Courville
Papers
Machine Learning for Hackers
Machine Learning for Hackers by Drew Conway and John Myles White (O’Reilly). Copyright 2012 Drew Conway and John Myles White, 9781449303716.
Resources
 p.37 01_heights_weights_genders.csv
Models

github: rushter: MLAlgorithms Minimal and clean examples of machine learning algorithms
 Classification (Spam Filtering)
 Ranking (Priority Inbox)
 Regression (Predicting Page Views)
 Regularization (Text Regression / Logistic regression)
 Optimization (breaking codes / Ridge regression)
 PCA (construct market index / unsupervised learning)
 MDS (visual exploration / distance metrics)
 knn (Recommendation Systems)
 Social Graph Analysis
 tree-based models
 gradient boosting: LightGBM
logistic regression (classification algorithm)
 qualitative concept encoded using numeric values that represent a Boolean distinction: 1 means true, whereas 0 means false (“dummy coded”)
 numeric coding style required by some machine learning algorithms (e.g. logistic regression, glm function in R)
Logistic regression is, deep down, essentially a form of regression in which one predicts the probability that an item belongs to one of two categories. (p.175)
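The notes above on dummy coding and on predicting a two-category probability can be illustrated with a short plain-Python sketch. This is not the book's code and not R's glm: it assumes a single centred feature, fits the model by batch gradient descent, and uses made-up data. The names `dummy_code`, `fit_logistic` and `predict_proba` are hypothetical.

```python
import math

def dummy_code(labels, positive):
    """Encode a two-level qualitative variable as 1 (positive level) or 0."""
    return [1 if label == positive else 0 for label in labels]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit P(y = 1 | x) = sigmoid(w * x + b) by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        residuals = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(r * x for r, x in zip(residuals, xs)) / n
        b -= lr * sum(residuals) / n
    return w, b

def predict_proba(x, w, b):
    """Predicted probability that an item with feature x belongs to class 1."""
    return sigmoid(w * x + b)

# Made-up, centred measurements with overlapping classes (illustration only).
xs = [-2.0, -1.0, -0.5, 0.5, -0.2, 1.0, 1.5, 2.0]
labels = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]
ys = dummy_code(labels, positive="yes")
w, b = fit_logistic(xs, ys)
print(predict_proba(2.0, w, b), predict_proba(-2.0, w, b))
```

The output of `predict_proba` is a probability between 0 and 1; thresholding it at 0.5 turns the regression into a classifier.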
kNN: k-Nearest Neighbors Algorithm
SVM: Support Vector Machine
Deep Learning / Neural Networks
 youtube: Tensorflow and deep learning
 Codelabs: TensorFlow and deep learning
 github: Lasagne/Lasagne
Other Models
 Markov models
 Generalized Linear Models (GLM)
 Probabilistic Graphical Models
 Latent Variable Models
 Time-Series Model
 Real-Time Learning
Conferences
Master Programs
Deep Learning
Torsten Hothorn (UZH) on Big Data Science
17 Great Machine Learning Libraries
Source: daoudclarke.github.io/machinelearninglibraries
 CNTK
 Torch
 Caffe
scikit-learn
comprehensive and easy to use; I wrote a whole article on why I like this library.
install
pip install git+https://github.com/scikit-learn/scikit-learn.git --user
Python
 Tensorflow
 open source software library for numerical computation using data flow graphs
 PyBrain
 Neural networks are one thing that is missing from scikit-learn, but this module makes up for it.
 nltk
 really useful if you’re doing anything NLP or text mining related.
 Theano
 efficient computation of mathematical expressions using GPU. Excellent for deep learning.
 Pylearn2
 machine learning toolbox built on top of Theano; in very early stages of development.
 MDP (Modular toolkit for Data Processing)
 a framework that is useful when setting up workflows.
Java
 SystemML
 SystemML is a flexible, scalable machine learning (ML) language written in Java. SystemML’s distinguishing characteristics are: (1) algorithm customizability, (2) multiple execution modes, including Standalone, Hadoop Batch, and Spark Batch, and (3) automatic optimization.
 Spark
 Apache’s new upstart, supposedly up to a hundred times faster than Hadoop, now includes MLlib, which contains a good selection of machine learning algorithms, including classification, clustering and recommendation generation. Currently undergoing rapid development. Development can be in Python as well as JVM languages.
 Mahout
 Apache’s machine learning framework built on top of Hadoop, this looks promising, but comes with all the baggage and overhead of Hadoop.
 Weka
 this is a Java based library with a graphical user interface that allows you to run experiments on small datasets. This is great if you restrict yourself to playing around to get a feel for what is possible with machine learning. However, I would avoid using this in production code at all costs: the API is very poorly designed, the algorithms are not optimised for production use and the documentation is often lacking.
 Mallet
 another Java based library with an emphasis on document classification. I’m not so familiar with this one, but if you have to use Java this is bound to be better than Weka.
 JSAT
 stands for “Java Statistical Analysis Tool”; created by Edward Raff, it was born out of his frustration with Weka (I know the feeling). Looks pretty cool.
.NET
 Accord.NET
 this seems to be pretty comprehensive, and comes recommended by primaryobjects on Reddit. There is perhaps a slight slant towards image processing and computer vision, as it builds on the popular library AForge.NET for this purpose.
Another option is to use one of the Java libraries compiled to .NET using IKVM; I have used this approach with success in production.
C++
 Vowpal Wabbit
 designed for very fast learning and released under a BSD license, this comes recommended by terath on Reddit.
 MultiBoost
 a fast C++ framework implementing some boosting algorithms as well as some cascades (like the Viola-Jones cascades). It’s mainly focused on AdaBoost.MH so it is multi-class/multi-label.
 Shogun
 large machine learning library with a focus on kernel methods and support vector machines. Bindings to Matlab, R, Octave and Python.
General
 LibSVM and LibLinear
 these are C libraries for support vector machines; there are also bindings or implementations for many other languages. These are the libraries used for support vector machine learning in scikit-learn.
Coursera
 Machine Learning by Andrew Ng, Stanford University
 github: faridcher/machinelearningcourse R version assignments of Stanford machine learning course
 github: Borye/machinelearningcoursera1
 github: JWarmenhoven/CourseraMachineLearning
People
 Graham Williams: togaware.com
Books
 From linear models to machine learning
 Author: Norman Matloff
 Publisher: CRC Press
 The Elements of Statistical Learning
 Author: Hastie, Tibshirani, and Friedman
 Content: formal specifications of basic machine learning techniques (mathematics, statistics, computer science)
 URL: www-stat.stanford.edu/~tibs/ElemStatLearn
 MOOC: R-bloggers: In-depth introduction to machine learning in 15 hours of expert videos
 Machine Learning
 Author: Mitchell, T.M.
 Publisher: McGraw-Hill, NY
 Year: 1997
 URL: http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/mlbook.html
 Course: http://www.cs.cmu.edu/~tom/10701_sp11/
 Data Mining: Practical Machine Learning Tools and Techniques
 Author: Witten, I., Frank, E. and Hall, M.
 Edition: 3rd
 Publisher: Morgan Kaufmann, San Mateo, CA
 Year: 2011
Articles
ASA
 Statistics as a Science, Not an Art: The Way to Survive in Data Science
 2015-02
 statscience_feb2015
 Statistics Losing Ground to Computer Science
 2014-11
 statisticslosinggroundtocomputerscience
 Time to Embrace a New Identity?
 2014-10
 statviewoct14
 Statistics Training: A Big Role in Big Data?
 2014-05
 statviewbigdata
 Leo Breiman, Statistical Modeling: The Two Cultures, Statistical Science 16(3), 2001
 breiman.pdf
 Pedro Domingos, A Few Useful Things to Know about Machine Learning
 Communications of the ACM, Vol. 55 No. 10, Pages 78-87, 2012
 cacm12.pdf
 mlintrodomingos2012.pdf
 github: shagunsodhani: papersiread
 Why becoming a data scientist is NOT actually easier than you think
Data Sources
 FLUENTD
 data collector for unified logging layer
http://www.fluentd.org/