<!--
.. title: Classification with scikit-learn
.. slug: classification-with-scikit-learn
.. date: 2014/03/24 19:18:32
.. tags: python, scikit-learn, programming, machine learning, matplotlib
.. link:
.. description: Machine learning getting started guide with python and scikit learn
.. type: text
-->

Imagine that you have a collection of images that can be divided
into a few separate groups. Sorting them into those groups is
a classification problem if you know the groups in advance,
and a clustering problem if you don't.

Today we will learn how to build a simple machine learning classifier
using these Python libraries:

- scikit-learn
- numpy
- matplotlib


What is a classifier? A classifier is an algorithm that you train on
items with known classes and that can then predict the classes
of new items.

To solve our image classification problem we will use [scikit-learn](http://scikit-learn.org/).


Scikit-learn is a Python library for machine learning.
It comes with state-of-the-art classifiers already implemented
and is simple to use.

##Very simple classification problem

We have to start with data. Let's imagine that we have a zoo.

In our zoo, there are three kinds of animals:

- mice
- elephants
- giraffes

These animals have features such as height and weight.
Given a training set of already known animals, how do we classify newly
arrived animals?

##Preparing data
Let's create our data:

```python
from random import random


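# Each sample is a (weight, height) tuple.
# NOTE: the mice definition is missing from the original listing; the ranges
# below are an assumption (small weight and height, close to the origin).
mice_features = [(random() * 4, random() * 2) for x in range(4)]
giraffe_features = [(random() * 4 + 3, random() * 2 + 30) for x in range(4)]
elephant_features = [(random() * 3 + 20, (random() - 0.5) * 4 + 23)
                     for x in range(6)]

# Features and labels must be in the same order.
xs = mice_features + elephant_features + giraffe_features
ys = ['mouse'] * len(mice_features) + ['elephant'] * len(elephant_features) +\
     ['giraffe'] * len(giraffe_features)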

```

##Visualization of features
OK, they're just numbers. Let's visualize them with matplotlib:

```python
from matplotlib import pyplot as plt

fig, axis = plt.subplots(1, 1)

mice_weight, mice_height = zip(*mice_features)
axis.plot(mice_weight, mice_height, 'ro', label='mice')

elephant_weight, elephant_height = zip(*elephant_features)
axis.plot(elephant_weight, elephant_height, 'bo', label='elephants')

giraffe_weight, giraffe_height = zip(*giraffe_features)
axis.plot(giraffe_weight, giraffe_height, 'yo', label='giraffes')

axis.legend(loc=4)
axis.set_xlabel('Weight')
axis.set_ylabel('Height')
```

![plot](http://i.imgur.com/tOjSkBa.png)


##First approach to classification

That looks simple to classify. Now we'll build and train a classifier with
scikit-learn. Scikit-learn offers a very wide range of classifiers with
different characteristics. [Here](http://scikit-learn.org/stable/auto_examples/plot_classifier_comparison.html)
is a comparison example with pictures.

Every classifier has its own benefits and drawbacks. For our example we will
use the Gaussian naive Bayes classifier.


```python
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()

clf.fit(xs, ys)

new_xses = [[2, 3], [3, 31], [21, 23], [12, 16]]

# Predicted class for each new sample
print(clf.predict(new_xses))

# Probability of each class for each new sample
print(clf.predict_proba(new_xses))
```

```python
['mouse' 'giraffe' 'elephant' 'elephant']
[[  0.00000000e+000   0.00000000e+000   1.00000000e+000]
 [  9.65249329e-273   1.00000000e+000   2.21228571e-285]
 [  1.00000000e+000   5.47092266e-083   0.00000000e+000]
 [  1.00000000e+000   2.73586896e-132   0.00000000e+000]]
```

It looks good!
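
To read the probability table, it helps to know that the columns of `predict_proba` follow the order of `clf.classes_`, which scikit-learn keeps sorted. The sketch below uses a tiny hand-made dataset (the numbers and the name `clf_demo` are illustrative assumptions, not the zoo data above) to pair each probability with its class name.

```python
from sklearn.naive_bayes import GaussianNB

# Tiny hand-made (weight, height) samples; the values are illustrative only.
features = [[1, 1], [2, 2], [21, 22], [22, 24], [4, 30], [6, 31]]
labels = ['mouse', 'mouse', 'elephant', 'elephant', 'giraffe', 'giraffe']

clf_demo = GaussianNB()
clf_demo.fit(features, labels)

# classes_ holds the labels in sorted order; predict_proba columns follow it.
class_order = list(clf_demo.classes_)
print(class_order)

# Pair every probability of one new sample with its class name.
probabilities = clf_demo.predict_proba([[2, 3]])[0]
for name, probability in zip(class_order, probabilities):
    print("%s: %.3f" % (name, probability))
```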

Summing up what we did:

- extracted features: weight and height for each imaginary animal
- prepared labels that map features to particular types of animals
- visualized the three groups of animals in feature space (weight on the x axis
and height on the y axis) using matplotlib
- chose a classifier and trained it with our data
- predicted classes for new samples


We were able to predict classes for new elements. But we don't know how well
our classifier performs, so we cannot guarantee anything.

We need a way to score our classifiers so that we can find the best one.

##Testing our model

Scikit-learn has a [guide](http://scikit-learn.org/stable/model_selection.html#model-selection)
on model selection and evaluation. It's worth reading.

The first things we can do are [cross-validation](http://scikit-learn.org/stable/modules/cross_validation.html),
scoring, and visualization of decision boundaries.

```python
import numpy as np
from matplotlib import pyplot as pl
from matplotlib.colors import ListedColormap
# In scikit-learn 0.18 and later train_test_split lives in model_selection;
# older releases exposed it as sklearn.cross_validation.train_test_split.
from sklearn.model_selection import train_test_split


def plot_classification_results(clf, X, y, title):
    # Divide dataset into training and testing parts
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2)

    # Fit the data with classifier.
    clf.fit(X_train, y_train)

    # Create color maps
    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
    cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

    h = .02  # step size in the mesh
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    pl.figure()
    pl.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    pl.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cmap_bold)

    y_predicted = clf.predict(X_test)
    score = clf.score(X_test, y_test)
    pl.scatter(X_test[:, 0], X_test[:, 1], c=y_predicted, alpha=0.5, cmap=cmap_bold)
    pl.xlim(xx.min(), xx.max())
    pl.ylim(yy.min(), yy.max())
    pl.title(title)
    return score
```
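
The mesh-grid step inside the function can be tried in isolation. The following is a minimal sketch on a deliberately coarse grid; the threshold rule standing in for `clf.predict` and the names `grid_x`, `grid_y`, `grid_points` and `grid_labels` are assumptions made for illustration. It shows how `np.meshgrid` and `np.c_` turn two axes into a list of points, and how the predictions are reshaped back into the grid for plotting.

```python
import numpy as np

# A deliberately coarse grid: x takes the values 0, 1, 2 and y takes 0, 1.
grid_x, grid_y = np.meshgrid(np.arange(0, 3, 1), np.arange(0, 2, 1))
print(grid_x.shape)

# Flatten both coordinate arrays and stack them as columns:
# one row per grid point, in the (x, y) layout that predict expects.
grid_points = np.c_[grid_x.ravel(), grid_y.ravel()]
print(grid_points.shape)

# Stand-in for clf.predict: label a point 1 when its x coordinate is above 0.
fake_predictions = (grid_points[:, 0] > 0).astype(int)

# Reshape the flat predictions back to the grid shape so that
# pcolormesh can color every cell.
grid_labels = fake_predictions.reshape(grid_x.shape)
print(grid_labels)
```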

We can later use this function like this:

```python
xs = np.array(xs)
ys = [0] * len(mice_features) + [1] * len(elephant_features) + [2] * len(giraffe_features)

score = plot_classification_results(clf, xs, ys, "3-Class classification")
print("Classification score was: %s" % score)
```

```python
Classification score was: 1.0
```

![decision boundaries](http://i.imgur.com/uuz4P6G.png)

Cool! But what actually happened there?

First we converted the features to a numpy array and the labels to integer values
instead of string names. It doesn't change much, but it helps with
visualization.

In the plotting function we:

- split the dataset into training and testing parts
- trained the classifier with the `fit` method
- created a meshgrid and predicted Z values on it to generate
decision boundaries
- plotted the decision boundaries
- plotted the training data
- plotted the testing data in a lighter color on the same plot
- scored the classifier on the testing data and returned the score

Our dataset was extremely simple to classify. Real datasets are
messier.

##Testing our model on a more complicated dataset
How will our method work on a more complicated dataset?
Scikit-learn has a module with popular machine learning datasets.

One of them is the [iris dataset](http://en.wikipedia.org/wiki/Iris_flower_data_set):

```python
import numpy as np
from sklearn import datasets
# scikit-learn 0.18 and later; older releases used sklearn.cross_validation
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# there are three classes of iris flowers
print(np.unique(iris.target))
```

Let's look in more depth at how our cross-validation split works.

We use the standard function `train_test_split`.
We pass it the features and the labels and get back two
randomized subsets of the desired size. It's very handy.

```python
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```
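
Because the split is random, the shapes stay the same between runs but the selected rows do not. If a repeatable split is needed, one option is sketched below, assuming a recent scikit-learn where `train_test_split` accepts `random_state` and `stratify`. The `_s` suffixed names are only there to avoid overwriting the variables above.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# random_state fixes the shuffle so the same rows are chosen on every run;
# stratify keeps the class proportions equal in both subsets.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0,
    stratify=iris.target)

print(X_train_s.shape, X_test_s.shape)

# Number of test samples per class
print(np.bincount(y_test_s))
```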

There are also more sophisticated cross-validation methods that make better use
of our training data, which is valuable to us.

One of the most popular is K-fold (`KFold` in scikit-learn). It divides the dataset into K groups,
uses K - 1 of them for training and leaves the remaining one for testing. The held-out
group can be chosen in K ways, so `KFold` works as a generator of K pairs of
training and testing indices.
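
As a rough illustration of K-fold, the sketch below runs 5 folds over the iris data and scores a `GaussianNB` on each one. It assumes a recent scikit-learn where `KFold` and `cross_val_score` live in `sklearn.model_selection`; the choice of 5 folds and of `random_state=0` is arbitrary. Shuffling is switched on because the iris samples are stored sorted by class.

```python
from sklearn import datasets
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()

# Shuffle before splitting: iris is sorted by class, so unshuffled folds
# would each contain an unbalanced mix of classes.
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)

# Each iteration yields the row indices of one training/testing pair.
fold_sizes = []
for train_index, test_index in k_fold.split(iris.data):
    fold_sizes.append((len(train_index), len(test_index)))
print(fold_sizes)

# cross_val_score fits and scores a fresh copy of the classifier per fold.
scores = cross_val_score(GaussianNB(), iris.data, iris.target, cv=k_fold)
print(scores)
print(scores.mean())
```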

So let's test how well our classifier performs on the iris dataset.
We will use only two of the four features so that we can visualize the result on a plane.

```python
clf = GaussianNB()

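# Keep only the first two feature columns so the result fits on a 2D plot
score = plot_classification_results(clf, X_train[:, :2], y_train, "3-Class classification")
print("Classification score was: %s" % score)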
```

![iris boundaries](http://i.imgur.com/zEqCeZh.png)

And our score is `0.83` (1 is the best possible).

##Summary
It could be better. We could use all four available features, tune the
classifier's parameters, or choose another classifier... There are
many ways to improve our classification.
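
As a quick check of the first idea, the sketch below compares the two-feature setup with all four features, using 5-fold cross-validation instead of a single split. It assumes a recent scikit-learn with `cross_val_score` in `sklearn.model_selection`; the exact numbers depend on the library version, so none are quoted here.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()

# Mean accuracy over 5 folds, first with the two features used for the plot,
# then with all four features.
score_two = cross_val_score(
    GaussianNB(), iris.data[:, :2], iris.target, cv=5).mean()
score_all = cross_val_score(
    GaussianNB(), iris.data, iris.target, cv=5).mean()

print("two features:  %.2f" % score_two)
print("four features: %.2f" % score_all)
```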

In the next post we'll learn how to create and choose good features and
how to choose the best options for the model.