<!--
.. title: Machine Learning Primer
.. slug: machine-learning-primer
.. date: 2014/03/15 10:22:26
.. tags: machine learning, scikit-learn, data, python
.. link:
.. description:
.. type: text
-->

How do you classify pictures by what they show? How do you cluster
similar clients? How do you predict future traffic rates on your server?

Machine learning is a great tool for solving these kinds of problems.
It turns out to be really easy to use in Python, and you don't
have to be a machine learning expert. To use the Python tools you do need
to know Python; here is a [tutorial](http://docs.python.org/2/tutorial/).

[Numpy](http://www.numpy.org/) and [scipy](http://scipy.org/) are the
backbone of *scientific* and *numerical* computing in Python.
It's good to know at least the basics of both.
[Here](http://scipy-lectures.github.io/) is a tutorial to get you started.

To visualize data, features and results of learning I use
[matplotlib](http://matplotlib.org/). It's a cool, powerful and
useful tool.


#Kinds of problems you can solve with machine learning

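Machine learning offers methods for solving different kinds of
problems. We can divide them into classification, regression and
clustering.

There is also a split between supervised learning, where you train on
labeled examples (classification and regression), and unsupervised
learning, where there are no labels (clustering).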

Here is a quick overview:

##What is a classification problem?

[![svm](http://scikit-learn.org/stable/_images/plot_iris_1.png)](http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html)

The main goal of classification is to decide which category a new element belongs to.

Algorithms:

- SVM
- nearest neighbors
- random forest

If you want to learn more, this [lecture on classification](http://www.cs.princeton.edu/~schapire/talks/picasso-minicourse.pdf)
from Princeton gives some more examples.
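
To make this concrete, here is a minimal sketch of classification with scikit-learn. It assumes a reasonably recent scikit-learn version (where `train_test_split` lives in `sklearn.model_selection`) and uses the bundled iris dataset; the kernel, the split size and the random seed are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the iris dataset: 150 flowers, 4 measurements each, 3 species.
iris = load_iris()

# Hold out a quarter of the flowers so we can check the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

# Fit a support vector classifier on the training part only.
clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X_train, y_train)

# Fraction of held-out flowers that were assigned the right species.
accuracy = clf.score(X_test, y_test)
print("test accuracy: %.2f" % accuracy)

# Categorize one "new" element (here simply the first held-out flower).
predicted_species = iris.target_names[clf.predict(X_test[:1])[0]]
print("predicted species:", predicted_species)
```

Swapping `SVC` for `KNeighborsClassifier` or `RandomForestClassifier` should need no other changes, because scikit-learn classifiers share the same `fit` / `predict` / `score` interface.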

##Regression - how to predict continuous variables?

![regression](http://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Linear_regression.svg/438px-Linear_regression.svg.png)

Algorithms:

- SVR
- ridge regression
- Lasso
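
A small sketch of regression, assuming scikit-learn and numpy are installed. The data is synthetic and made up for this example (a noisy line with slope 3 and intercept 2), and `alpha=1.0` is just the library default, not a tuned value.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (an assumption for illustration): y = 3x + 2 plus noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.5, size=100)

# Ridge regression is linear regression with an L2 penalty on the weights.
model = Ridge(alpha=1.0)
model.fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_

# Predict a continuous value for an input the model has not seen.
prediction = model.predict([[4.0]])[0]
print("slope=%.2f intercept=%.2f prediction at x=4: %.2f"
      % (slope, intercept, prediction))
```

The fitted slope and intercept should land close to the values used to generate the data. `Lasso` and `SVR` can be dropped in the same way.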

##Clustering - grouping similar things together!

>Cluster analysis or clustering is the task of
>grouping a set of objects in such a way that objects in the same
>group (called a cluster) are more similar (in some sense or another)
>to each other than to those in other groups (clusters).
-- [Wikipedia](http://en.wikipedia.org/wiki/Cluster_analysis)

Applications:

- customer segmentation
- grouping experiment outcomes
- learning more about data

Algorithms:

- k-Means
- spectral clustering
- affinity propagation
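
As a rough sketch of clustering, the snippet below runs k-Means on synthetic two-dimensional points. The data comes from scikit-learn's `make_blobs` helper, and the number of clusters (3) is an assumption we pass in ourselves; on real data you usually have to choose it.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate 300 unlabeled 2-D points scattered around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# k-Means needs the number of clusters up front.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)

# Each point gets the index (0, 1 or 2) of the cluster it was assigned to.
labels = kmeans.fit_predict(X)

print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])
print("cluster centers:")
print(kmeans.cluster_centers_)
```

No labels are used anywhere here, which is what makes this an unsupervised method.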


#What are good tools and how can I start using them?

If you don't know where to start solving your machine learning problems,
start with a tool. One example of a great machine learning tool is the
[scikit-learn](http://scikit-learn.org/stable/index.html) library.

Its documentation is just amazing. You can learn not only about
the library and ways to use it, but also how these methods work (the logic
behind them). Look at their
[clustering guide](http://scikit-learn.org/stable/modules/clustering.html#clustering).

There are tutorials and examples, and you can click on any figure to learn how
it was generated. As an example: [classifier comparison](http://scikit-learn.org/stable/auto_examples/plot_classifier_comparison.html).

#I'm twelve and what is this? - How to get some insight?

Although scikit-learn offers great tools to solve problems, it doesn't tell you
which one is best for a particular case or how the algorithms work in depth.

To gain some insight into using machine learning in Python, I recommend the
[Building Machine Learning Systems with Python](http://www.packtpub.com/building-machine-learning-systems-with-python/book)
book, along with its [source code](https://github.com/luispedro/BuildingMachineLearningSystemsWithPython).
Reading this book was both educational and entertaining. It was also
pretty easy to follow.

It isn't heavy mathematics, but rather a guided, hands-on tutorial that solves
toy problems on real datasets. You will get accustomed to
the machine learning approach and learn some basic concepts.

It's not easy to choose the best algorithm for your problem.
When you choose a particular algorithm, it's great to
understand it well. To learn how particular ML algorithms work,
I would recommend some YouTube
tutorials on machine learning such as [these](https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA).
They are a great starting point for learning machine learning.

If that's still not enough for you, I found out that Coursera restarts its
machine learning course from Stanford this month; it's
[here](https://www.coursera.org/course/ml).

#Tools for more specific problems

If you have a problem which is connected to image processing, you may consider
using [scikit-image](http://scikit-image.org/) or [mahotas](http://luispedro.org/software/mahotas),
which can make computer vision less painful.

If you work with text, look at [nltk](http://www.nltk.org/). It's the best tool
for natural language processing I know. If you want to do some
semantic analysis, check out [gensim](http://radimrehurek.com/gensim/).


##Holistic approach

It's easy to forget that machine learning isn't only a pack
of fancy algorithms.

If you have a clustering or classification problem,
you have to get the features right.

This is a very tricky part, often more complicated than choosing the right
machine learning algorithm, because most algorithms
work pretty well for your problem, with different tradeoffs (speed, accuracy,
etc).

Without good features, however, classification starts to be no better
than choosing at random. Luckily, there are
some helpful algorithms for feature selection too.
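
One possible sketch of feature selection uses scikit-learn's `SelectKBest` with the ANOVA F-test (`f_classif`). The setup is artificial and assumed for illustration: six columns of pure random noise are appended to the four real iris measurements, and the selector is asked to keep the four columns that look most related to the class labels.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()

# Append 6 columns of random noise to the 4 real measurements.
rng = np.random.RandomState(0)
noise = rng.normal(size=(iris.data.shape[0], 6))
X = np.hstack([iris.data, noise])

# Score every column against the labels and keep the 4 highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X, iris.target)

# Column indices (in the widened matrix) that survived the selection.
kept = selector.get_support(indices=True)
print("kept columns:", kept)
print("shape before:", X.shape, "after:", X_selected.shape)
```

Since the noise columns carry no information about the species, the real measurements (columns 0 to 3) should be the ones that are kept. On real data the picture is rarely this clean, so treat the scores as a hint rather than a verdict.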


##Practice and challenges

There are lots of datasets on the internet.
Here are [dumps of Wikipedia](http://en.wikipedia.org/wiki/Wikipedia:Database_download)
for example. For testing your algorithms [mlcomp](http://mlcomp.org/)
can be helpful.

And if you're looking for challenges - take a look at [kaggle](http://www.kaggle.com/competitions),
a good place to start. It's a platform hosting competitions on predictive
modelling.

Happy hacking with Machine Learning on board!
