Playing around with K-means clustering and images

I wanted to see what kind of machine learning tools can be used on maps (I like maps). My first thought was "I can't go about labeling each pixel on a map or satellite image, that would take forever". My next thought was "I should use an unsupervised algorithm". Of course, this is a vast field of study (duh!) and there are tons of applications of unsupervised algorithms to imaging. I'm going to see if K-means clustering can distinguish features in images. This should be fun because, unlike many other applications of clustering, the results here will be totally visual.

In [2]:

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline

Grayscale Flowers

I'm going to start by loading a grayscale image of a flower that I found in an edX course called The Analytics Edge (it's a great course).

In [4]:

flower=pd.read_csv('flower.csv', header=None)

In [7]:

flower.head()

Out[7]:

	0	1	2	3	4	5	6	7	8	9	...	40	41	42	43	44	45	46	47	48	49
0	0.099138	0.112069	0.133621	0.137931	0.137931	0.137931	0.129310	0.116379	0.112069	0.120690	...	0.034483	0.025862	0.025862	0.030172	0.025862	0.025862	0.017241	0.021552	0.021552	0.030172
1	0.099138	0.107759	0.116379	0.137931	0.133621	0.129310	0.116379	0.103448	0.099138	0.107759	...	0.034483	0.025862	0.025862	0.030172	0.025862	0.017241	0.017241	0.012931	0.021552	0.034483
2	0.103448	0.112069	0.120690	0.120690	0.125000	0.120690	0.103448	0.103448	0.107759	0.112069	...	0.038793	0.034483	0.038793	0.034483	0.025862	0.017241	0.008621	0.012931	0.021552	0.038793
3	0.103448	0.116379	0.116379	0.120690	0.116379	0.107759	0.107759	0.103448	0.112069	0.116379	...	0.047414	0.043103	0.051724	0.051724	0.038793	0.025862	0.021552	0.017241	0.034483	0.060345
4	0.103448	0.107759	0.112069	0.112069	0.112069	0.112069	0.112069	0.116379	0.116379	0.125000	...	0.064655	0.056034	0.060345	0.060345	0.047414	0.034483	0.025862	0.030172	0.060345	0.077586

5 rows × 50 columns

It's a low-resolution image. The rows and columns of the array correspond to the locations of the pixels, and each value is the grayscale intensity of that pixel.

Let's make it a NumPy array.

In [5]:

mat=flower.to_numpy()  # DataFrame.as_matrix() was removed in pandas 1.0; flower.values also works

In [9]:

mat.shape

Out[9]:

(50, 50)

In [6]:

plt.imshow(mat,cmap='gray')

Out[6]:

In [12]:

vec=mat.flatten()

In [13]:

vec

Out[13]:

array([ 0.09913793,  0.10775862,  0.11637931, ...,  0.13793103,
        0.13793103,  0.00517241])

In [18]:

from sklearn.cluster import KMeans

I can see at least 4 distinct colors, two on the flower and two on the background. I'm going to set K=4.

In [20]:

kmeans = KMeans(n_clusters=4)

The sklearn method requires an input array of dimensions (n_samples, n_features). Since we only have one feature (the grayscale values), I'm going to flatten the array.

In [21]:

vec=mat.flatten()

In [22]:

vec.shape

Out[22]:

(2500,)

2500 is indeed the number of pixels (50x50). I also have to turn vec into a column array (a vector) of shape (2500, 1). Here is an interesting Stack Overflow post on why it is not a column vector yet: https://stackoverflow.com/questions/22053050/difference-between-numpy-array-shape-r-1-and-r.

In [23]:

vec=vec.reshape(-1,1)

In [24]:

vec.shape

Out[24]:

(2500, 1)
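Here is a small sketch of the difference between the two shapes. It uses a made-up six-element array rather than the flower data, so the names `flat` and `column` are just for illustration.

```python
import numpy as np

# Hypothetical stand-in for the flattened image: six grayscale values.
flat = np.arange(6) / 10.0

# A 1-D array has a single axis, so it is ambiguous whether it holds
# six samples with one feature or one sample with six features.
print(flat.shape)        # (6,)

# reshape(-1, 1) keeps the data and adds a second axis of length 1,
# giving the (n_samples, n_features) layout with one feature per pixel.
column = flat.reshape(-1, 1)
print(column.shape)      # (6, 1)

# Both arrays hold the same values in the same order.
same_values = np.array_equal(column[:, 0], flat)
```

The `-1` tells NumPy to work out the number of rows from the size of the array.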

Calling the K-means algorithm:

In [25]:

kmeans.fit(vec)

Out[25]:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=4, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

The attribute 'labels_' of the fitted K-means object contains the class of each point, as determined by the algorithm, as a one-dimensional array.

In [27]:

kmeans.labels_

Out[27]:

array([0, 0, 0, ..., 0, 0, 3])

In order to see the results, let's turn that array into a 50x50 matrix again.

In [28]:

vecr=kmeans.labels_.reshape((50, 50))

In [30]:

plt.matshow(vecr,cmap='Accent')

Out[30]:

This is not bad, but we can see that the pixels at the border between the flower and the background are assigned either to a background class or a flower class, when they should belong to a class of their own. Let's increase K and see if this can be improved.

In [31]:

kmeans = KMeans(n_clusters=5)

In [32]:

kmeans.fit(vec)

Out[32]:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [33]:

vecr=kmeans.labels_.reshape((50, 50))

In [34]:

plt.matshow(vecr,cmap='Accent')

Out[34]:

In [39]:

kmeans = KMeans(n_clusters=7)

In [40]:

kmeans.fit(vec)

Out[40]:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=7, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [41]:

vecr=kmeans.labels_.reshape((50, 50))

In [42]:

plt.matshow(vecr)

Out[42]:

In [43]:

kmeans = KMeans(n_clusters=10)

In [44]:

kmeans.fit(vec)

Out[44]:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=10, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [45]:

vecr=kmeans.labels_.reshape((50, 50))

In [46]:

plt.matshow(vecr)

Out[46]:

The image is getting a bit more confusing with a large number of centroids. The color scheme doesn't help the image look "pretty". I've chosen a scheme that helps us see exactly which pixel belongs to each class; a qualitative color map. I could define a custom color map, but even then, each time K-means is run, the class labels change (what was once class one might be called class two in a subsequent run).
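One possible workaround for the changing labels, at least for grayscale images where each centroid is a single brightness value, is to renumber the classes by centroid brightness after each run. The sketch below is my own helper, not something scikit-learn provides; `relabel_by_center` and the toy arrays are made up for illustration.

```python
import numpy as np

def relabel_by_center(labels, centers):
    """Renumber clusters so label 0 is the darkest centroid, 1 the next, etc."""
    # order[i] is the old label of the i-th darkest centroid.
    order = np.argsort(centers.ravel())
    # remap[old_label] = new_label
    remap = np.empty_like(order)
    remap[order] = np.arange(order.size)
    return remap[labels]

# Hypothetical output of two runs that found the same clusters under different names.
centers_a = np.array([[0.8], [0.1], [0.5]])
labels_a = np.array([1, 1, 2, 0, 0])
centers_b = np.array([[0.1], [0.5], [0.8]])
labels_b = np.array([0, 0, 1, 2, 2])

print(relabel_by_center(labels_a, centers_a))  # [0 0 1 2 2]
print(relabel_by_center(labels_b, centers_b))  # [0 0 1 2 2]
```

In the notebook this would be called with `kmeans.labels_` and `kmeans.cluster_centers_` before reshaping. With the same K, similar clusters would then get the same color across runs.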

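It's also worth noting that it can be useful to run K-means more than once in order to obtain different, and possibly better, results. This is due to the random nature of the initialization. As the output of fit shows, scikit-learn already helps with this by default: init='k-means++' picks initial centroids that are spread out, and n_init=10 runs the algorithm ten times and keeps the best result. The result can still change from run to run unless random_state is set.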
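To make the effect of restarts concrete, here is a bare-bones NumPy sketch of K-means with purely random starting points, run several times and keeping the run with the lowest inertia (sum of squared distances to the nearest centroid). This is not scikit-learn's implementation; the function names `lloyd` and `best_of` and the toy data are assumptions for illustration.

```python
import numpy as np

def _sq_dists(X, centers):
    # Squared distance from every point to every center, shape (n, k).
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

def lloyd(X, k, rng, n_iter=50):
    """Plain K-means (Lloyd's algorithm) from one random start."""
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = _sq_dists(X, centers).argmin(axis=1)
        # Move each center to the mean of its points; keep it if it has none.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    d2 = _sq_dists(X, centers)
    labels = d2.argmin(axis=1)
    inertia = d2[np.arange(len(X)), labels].sum()
    return centers, labels, inertia

def best_of(X, k, n_starts=10, seed=0):
    """Run lloyd() n_starts times and keep the run with the lowest inertia."""
    rng = np.random.default_rng(seed)
    runs = [lloyd(X, k, rng) for _ in range(n_starts)]
    return min(runs, key=lambda run: run[2])

# Made-up 1-D "pixel" data with three well-separated brightness levels.
data_rng = np.random.default_rng(1)
X = np.concatenate(
    [data_rng.normal(m, 0.01, 100) for m in (0.1, 0.5, 0.9)]
).reshape(-1, 1)
centers, labels, inertia = best_of(X, k=3)
print(np.sort(centers.ravel()).round(2), round(inertia, 4))
```

A single random start can get stuck with two centroids sharing one brightness level. Taking the best of several starts makes that less likely, though it doesn't rule it out.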

This first flower image was easy, because it was grayscale and low-resolution. Now I'm going to use a higher resolution grayscale flower image that I found on Google.

In [10]:

import matplotlib.image as mpimg

In [11]:

img1=mpimg.imread('flower2.jpg')
imgplot = plt.imshow(img1,cmap='gray')

In [13]:

img1.shape

Out[13]:

(361, 480)

I'm going to go right ahead and simply reshape the array, instead of flattening it first.

In [14]:

361*480

Out[14]:

173280

In [16]:

vec2=img1.reshape(173280,1)
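As an aside, the pixel count doesn't have to be computed by hand. A minimal sketch, using a blank array of the same shape as a stand-in for img1 (the name `img` is made up):

```python
import numpy as np

# Hypothetical stand-in for img1, with the same shape as the flower photo.
img = np.zeros((361, 480), dtype=np.uint8)

# -1 asks NumPy to infer the row count (361 * 480) from the array size.
pixels = img.reshape(-1, 1)
print(pixels.shape)    # (173280, 1)

# The original shape can be reused later to turn labels back into an image.
restored = pixels.reshape(img.shape)
print(restored.shape)  # (361, 480)
```

The same idea works for the labels: reshaping them with `img.shape` avoids hard-coding (361, 480).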

In [19]:

kmeans = KMeans(n_clusters=5)

In [20]:

kmeans.fit(vec2)

Out[20]:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [21]:

vecr2=kmeans.labels_.reshape((361, 480))

In [22]:

plt.matshow(vecr2,cmap='Accent')

Out[22]:

Now with K=10.

In [73]:

kmeans = KMeans(n_clusters=10)

In [74]:

kmeans.fit(vec2)

Out[74]:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=10, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [75]:

vecr2=kmeans.labels_.reshape((361, 480))

In [76]:

plt.matshow(vecr2,cmap='Accent')

Out[76]:

Back down to 7.

In [77]:

kmeans = KMeans(n_clusters=7)

In [78]:

kmeans.fit(vec2)

Out[78]:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=7, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [79]:

vecr2=kmeans.labels_.reshape((361, 480))

In [80]:

plt.matshow(vecr2,cmap='Accent')

Out[80]:

I think the results with K=5 and K=7 actually look better than with K=10, although it might be argued that K=10 captures more detail. This also illustrates that it's hard to find the right K, even when we can look at the results.
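One common heuristic for picking K, which I am not using above, is the "elbow" method: fit K-means for a range of K values and look at where the inertia (the `inertia_` attribute of the fitted model) stops dropping quickly. Below is a sketch on made-up one-feature data with three brightness levels. On a real photo the elbow is often much less clear, so treat it as a rough guide only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up stand-in for a flattened grayscale image: three brightness levels.
rng = np.random.default_rng(0)
pixels = np.concatenate(
    [rng.normal(mean, 0.02, 500) for mean in (0.2, 0.5, 0.8)]
).reshape(-1, 1)

# inertia_ is the sum of squared distances from each sample to its centroid.
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 3))
```

On this toy data the inertia should fall sharply up to K=3 and only slowly after that.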

RGB Flower

Now, let's try a flower from an RGB image.

In [24]:

img2=mpimg.imread('flower3.jpg')
imgplot = plt.imshow(img2)

The array now has 678x960 pixels, each one a triple of numbers giving its red, green and blue values.

In [25]:

img2.shape

Out[25]:

(678, 960, 3)

In [26]:

arr2=img2.reshape(650880,3)

In [85]:

arr2.shape

Out[85]:

(650880, 3)

In [102]:

kmeans = KMeans(n_clusters=4)

In [103]:

kmeans.fit(arr2)

Out[103]:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=4, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [104]:

vecr3=kmeans.labels_.reshape((678, 960))

In [105]:

plt.matshow(vecr3,cmap='Accent')

Out[105]:
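An alternative to a qualitative color map for RGB images is to paint each pixel with the color of its cluster centroid, which gives a "posterized" version of the original. The sketch below assumes an 8-bit image (values 0 to 255, as JPEGs usually load). The helper name `paint_with_centroids` and the toy arrays are made up.

```python
import numpy as np

def paint_with_centroids(labels, centers, height, width):
    """Replace every pixel's label with the RGB value of its cluster centroid."""
    # Fancy indexing: centers[labels] has shape (n_pixels, 3).
    flat = centers[labels]
    # Centroids are floats; round and cast back to 8-bit for display.
    return np.rint(flat).astype(np.uint8).reshape(height, width, 3)

# Tiny hypothetical example: a 2x2 image clustered into two colors.
labels = np.array([0, 1, 1, 0])
centers = np.array([[250.2, 10.0, 10.0],    # reddish cluster
                    [20.0, 30.4, 200.0]])   # bluish cluster
quantized = paint_with_centroids(labels, centers, 2, 2)
print(quantized.shape)   # (2, 2, 3)
print(quantized[0, 1])   # [ 20  30 200]
```

In the notebook this would be called with `kmeans.labels_`, `kmeans.cluster_centers_`, 678 and 960, and the result passed to `plt.imshow`.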

We could probably use a larger value of K.

In [106]:

kmeans = KMeans(n_clusters=5)

In [107]:

kmeans.fit(arr2)

Out[107]:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [108]:

vecr3=kmeans.labels_.reshape((678, 960))

In [109]:

plt.matshow(vecr3,cmap='Accent')

Out[109]:

We can see some improvement, but there is room for more. Let's go crazy.

In [94]:

kmeans = KMeans(n_clusters=25)

In [95]:

kmeans.fit(arr2)

Out[95]:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=25, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [96]:

vecr3=kmeans.labels_.reshape((678, 960))

In [97]:

plt.matshow(vecr3,cmap='Accent')

Out[97]:

Let's go really crazy.

In [98]:

kmeans = KMeans(n_clusters=60)

In [99]:

kmeans.fit(arr2)

Out[99]:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=60, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [100]:

vecr3=kmeans.labels_.reshape((678, 960))

In [101]:

plt.matshow(vecr3,cmap='Accent')

Out[101]:

That looks strange, but I feel that a lot of detail is being captured. We could go for an even higher value of K, but that would take ages to run and I want to get to the next section.
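If run time becomes the bottleneck, one option (which I have not tried on the flower) is to fit K-means on a random sample of the pixels and then use `predict` to label all of them. The sketch below uses made-up RGB data; the sample size and the three base colors are arbitrary assumptions. The centroids from a sample will generally differ a little from those of a full fit.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up stand-in for arr2: 20,000 RGB pixels drawn around three colors.
rng = np.random.default_rng(0)
base = np.array([[200, 30, 30], [30, 160, 60], [40, 60, 200]], dtype=float)
pixels = base[rng.integers(0, 3, size=20_000)] + rng.normal(0, 5, size=(20_000, 3))

# Fit on a random 10% sample of the pixels...
sample_idx = rng.choice(len(pixels), size=2_000, replace=False)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(pixels[sample_idx])

# ...then assign every pixel to its nearest learned centroid.
labels = model.predict(pixels)
print(labels.shape)            # (20000,)
print(np.unique(labels).size)  # 3
```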

Satellite images

Atlantic coast of France

Now, let's finally play around with some maps. I will begin with a satellite image of a bit of the Atlantic coast of France.

In [27]:

img3=mpimg.imread('france1.jpg')
imgplot = plt.imshow(img3)

In [28]:

img3.shape

Out[28]:

(115, 122, 3)

In [113]:

115*122

Out[113]:

14030

In [29]:

mat=img3.reshape(14030,3)

Choosing a value for K is actually easier now, because my goal is simply to see if the algorithm can distinguish between simple features of the country's surface. For example, let's see if it can tell water from land.

In [115]:

kmeans = KMeans(n_clusters=2)

In [116]:

kmeans.fit(mat)

Out[116]:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [117]:

vecr3=kmeans.labels_.reshape((115, 122))

In [118]:

plt.matshow(vecr3,cmap='Accent')

Out[118]:

Not bad, but I think some parts of brownish-colored land were assigned to the water class.

In [119]:

kmeans = KMeans(n_clusters=3)

In [120]:

kmeans.fit(mat)

Out[120]:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [121]:

vecr3=kmeans.labels_.reshape((115, 122))

In [122]:

plt.matshow(vecr3,cmap='Accent')

Out[122]:

Great! The green part is water, the other two classes are land.
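Once the classes make sense, a quick way to put a number on them is to count the pixels in each class. A minimal sketch on a made-up 3x4 label image (in the notebook, `label_img` would be the reshaped `kmeans.labels_`):

```python
import numpy as np

# Hypothetical 3x4 label image standing in for the reshaped kmeans.labels_.
label_img = np.array([[0, 0, 1, 1],
                      [0, 2, 1, 1],
                      [2, 2, 1, 1]])

# Count the pixels in each class, then convert to fractions of the image.
counts = np.bincount(label_img.ravel())
fractions = counts / label_img.size
for cls, frac in enumerate(fractions):
    print(f"class {cls}: {frac:.2f}")
```

Which class number corresponds to water still has to be read off the plot, since the numbering is arbitrary.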

Google Maps images of Europe

Now I'm going to use a larger satellite image of Western Europe, downloaded from Google Maps.

In [14]:

west=mpimg.imread('Capture2.jpg')

In [18]:

fig, ax = plt.subplots(figsize=(10, 20))
ax.imshow(west,cmap='Accent')

Out[18]:

Now, we can see we have water, green land, brownish-yellow land and snow cover in the Alps and Pyrénées.

In [19]:

west.shape

Out[19]:

(849, 1725, 3)

In [20]:

849*1725

Out[20]:

1464525

In [21]:

mat_west=west.reshape((1464525,3))

In [42]:

kmeans_k4 = KMeans(n_clusters=4)

In [43]:

kmeans_k4.fit(mat_west)

Out[43]:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=4, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [45]:

vec_west=kmeans_k4.labels_.reshape((849, 1725))

In [46]:

fig, ax = plt.subplots(figsize=(10, 20))
ax.matshow(vec_west,cmap='Accent')

Out[46]:

Good, but it seems like the snow is being classified as water, which is technically correct, but let's try with K=5.

In [53]:

kmeans_k5 = KMeans(n_clusters=5)

In [54]:

kmeans_k5.fit(mat_west)

Out[54]:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [55]:

vec_west=kmeans_k5.labels_.reshape((849, 1725))

In [33]:

fig, ax = plt.subplots(figsize=(10, 20))
ax.matshow(vec_west,cmap='Accent')

Out[33]:

Hmmmm... Now the snow is in the brownish-yellow land category, which is not surprising since they are both lightly colored. Let's try with 6.

In [34]:

kmeans = KMeans(n_clusters=6)

In [35]:

kmeans.fit(mat_west)

Out[35]:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=6, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [36]:

vec_west=kmeans.labels_.reshape((849, 1725))

In [37]:

fig, ax = plt.subplots(figsize=(10, 20))
ax.matshow(vec_west,cmap='Accent')

Out[37]:

Not much better: we're getting more detail in the water, but the snow problem is still there. I will stop here and stick with K=4 for the sake of simplicity, and because that seems to be able to at least distinguish land from water.

Now I will load an image of Northern Europe, again from Google Maps, and see if we can use what the algorithm has learned on Western Europe to classify the features in it.

In [38]:

north=mpimg.imread('Capture1.jpg')  # misc was never imported; use the same reader as for the other images

In [39]:

fig, ax = plt.subplots(figsize=(10, 20))
ax.imshow(north,cmap='Accent')

Out[39]:

In [47]:

north.shape

Out[47]:

(843, 1748, 3)

In [48]:

843*1748

Out[48]:

1473564

In [49]:

mat_north=north.reshape((1473564,3))

In [51]:

vec_north=kmeans_k4.predict(mat_north).reshape((843,1748))
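For K-means, `predict` assigns each new sample to the closest centroid found during `fit`. Here is a NumPy sketch of that idea, with made-up centroids and pixels; `assign_to_nearest` is my own name, not a scikit-learn function.

```python
import numpy as np

def assign_to_nearest(pixels, centers):
    """Label each row of `pixels` with the index of its closest centroid."""
    # Cast to float first so uint8 pixel values cannot wrap around on subtraction.
    diff = pixels[:, None, :].astype(float) - centers[None, :, :]
    # Squared Euclidean distances, shape (n_pixels, n_centers).
    d2 = (diff ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Hypothetical centroids "learned" on one image: dark blue, green, white.
centers = np.array([[10.0, 20.0, 90.0],
                    [40.0, 120.0, 40.0],
                    [240.0, 240.0, 240.0]])
# Pixels from a different image are forced into those same three classes.
new_pixels = np.array([[12, 25, 100],
                       [230, 235, 250],
                       [50, 110, 60]], dtype=np.uint8)
print(assign_to_nearest(new_pixels, centers))  # [0 2 1]
```

This also shows the limitation: a color that never appeared in the first image (fresh water, for instance) can only be pushed into whichever existing class is nearest.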

In [52]:

fig, ax = plt.subplots(figsize=(10, 20))
ax.matshow(vec_north,cmap='Accent')

Out[52]:

Looks like we succeeded in differentiating between water and land. Nonetheless, the snow is being classified as water once again, and the fresh water is not appearing. I should say that these Google Maps images seem to have been modified to make the features clearer. For example, I think that fresh water is a different color than sea water. The cloud cover also seems to have been removed. This is not the case with "pure" satellite images, but I prefer using the Google Maps images for consistency.

I'll try with K=5.

In [56]:

vec_north=kmeans_k5.predict(mat_north).reshape((843,1748))

In [57]:

fig, ax = plt.subplots(figsize=(10, 20))
ax.matshow(vec_north,cmap='Accent')

Out[57]:

Even with K=5 the fresh water does not appear. My guess is that its color is too close to the dark green land. I will try using the algorithm directly on the 'north' dataset.

In [59]:

kmeans_k4_north = KMeans(n_clusters=4)

In [60]:

kmeans_k4_north.fit(mat_north)

Out[60]:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=4, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [61]:

vec_north=kmeans_k4_north.labels_.reshape((843,1748))

In [62]:

fig, ax = plt.subplots(figsize=(10, 20))
ax.matshow(vec_north,cmap='Accent')

Out[62]:

Still, not great. Let's try with K=6.

In [64]:

kmeans_k6_north = KMeans(n_clusters=6)

In [65]:

kmeans_k6_north.fit(mat_north)

Out[65]:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=6, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [66]:

vec_north=kmeans_k6_north.labels_.reshape((843,1748))

In [67]:

fig, ax = plt.subplots(figsize=(10, 20))
ax.matshow(vec_north,cmap='Accent')

Out[67]:

Much better, but part of the freshwater is still being classified as land. The snow seems to be in its own category now.

Next steps

One possible next step is using spatial information as features, since the objects in the images are localized, and not spread out all over it (their colors are not the only thing that makes them similar). I think that if this is used directly in K-means, it will require much larger values of K, since the categories are going to be defined by both space and color.

Another thing to investigate is how to use what the algorithm learns from the images to classify them into something that is human readable. This would be a much more complex task, but one that I would like to explore in the future.
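A rough sketch of what "spatial information as features" could look like: append each pixel's row and column to its RGB values before clustering. Everything here is an assumption on my part, including the [0, 1] scaling, the 8-bit input, and the `spatial_weight` knob that sets how much position counts relative to color.

```python
import numpy as np

def add_pixel_coordinates(img, spatial_weight=1.0):
    """Return an (n_pixels, 5) array: R, G, B, row, col, all scaled to [0, 1]."""
    height, width, _ = img.shape
    rows, cols = np.mgrid[0:height, 0:width]
    # Assumes an 8-bit image, so dividing by 255 maps colors to [0, 1].
    colors = img.reshape(-1, 3).astype(float) / 255.0
    # Scale coordinates to [0, 1] so they are comparable to the colors,
    # then weight them; spatial_weight is a made-up tuning knob.
    coords = np.column_stack([
        rows.ravel() / max(height - 1, 1),
        cols.ravel() / max(width - 1, 1),
    ]) * spatial_weight
    return np.hstack([colors, coords])

# Hypothetical 2x3 RGB image.
img = np.zeros((2, 3, 3), dtype=np.uint8)
features = add_pixel_coordinates(img, spatial_weight=0.5)
print(features.shape)    # (6, 5)
print(features[-1, 3:])  # [0.5 0.5]
```

The resulting array could be passed to `KMeans.fit` in place of the plain (n_pixels, 3) color array, and the labels reshaped to (height, width) as before.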

Felipe's Data Science Blog

Using K-means on images
