Using K-means on images
Playing around with K-means clustering and images¶
I wanted to see what kind of machine learning tools can be used on maps (I like maps). My first thought was "I can't go about labeling each pixel on a map or satellite image, that would take forever". My next though was "I should use an unsupervised algorithm". Of course, this a vast field of study (duh!) and there are tons of applications of unsupervised algorithms to imaging. I'm going to try to see if K-means clustering can distinguish features in images. This should be fun, because unlike many other applications of clustering, the results here will be totally visual.
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
Grayscale Flowers¶
I'm going to start by loading a greyscale image of a flower, that I found in an EdX course called The Analytics Edge (it's a great course).
flower=pd.read_csv('flower.csv', header=None)
flower.head()
It's a low resolution image, the rows and columns of the array represent the location of the pixels
Let's make it a numpy array
mat=pd.DataFrame.as_matrix(flower)
mat.shape
plt.imshow(mat,cmap='gray')
vec=mat.flatten()
vec
from sklearn.cluster import KMeans
I can see at least 4 distinct colors, two on the flower and two on the background. I'm going to set K=4.
kmeans = KMeans(n_clusters=4)
The sklearn method requires an input array of dimensions (n_samples, n_features). Since we only have one feature (the grayscale values), I'm going to flatten the array.
vec=mat.flatten()
vec.shape
2500 is indeed the number of pixels (50x50). I also have to turn vec into a column array (a vector). Here as an interesting stack-exchange post on why this is not a vector yet: https://stackoverflow.com/questions/22053050/difference-between-numpy-array-shape-r-1-and-r.
vec=vec.reshape(-1,1)
vec.shape
Calling the K-means algorithm:
kmeans.fit(vec)
The method 'labels_' of K-means contains the classes of each point, as determined by the algorithm, as one dimensional array.
kmeans.labels_
In order to see the results, let's turn that array into a 50x50 matrix again.
vecr=kmeans.labels_.reshape((50, 50))
plt.matshow(vecr,cmap='Accent')
This is not bad, but we can see that the pixels at the border between the flower and the background are assigned either to a background class or a flower class, when they should belong to a class of their own. Let's increase K and see if this can be improved.
kmeans = KMeans(n_clusters=5)
kmeans.fit(vec)
vecr=kmeans.labels_.reshape((50, 50))
plt.matshow(vecr,cmap='Accent')
kmeans = KMeans(n_clusters=7)
kmeans.fit(vec)
vecr=kmeans.labels_.reshape((50, 50))
plt.matshow(vecr)
kmeans = KMeans(n_clusters=10)
kmeans.fit(vec)
vecr=kmeans.labels_.reshape((50, 50))
plt.matshow(vecr)
The image is getting a bit more confusing with a large number of centroids. The color scheme doesn't help the image look "pretty". I've chosen a scheme that helps us see exactly which pixel belongs to each class; a qualitative color map. I could define a custom color map, but even then, each time K-means is run, the class labels change (what was once class one might be called class two in a subsequent run).
It's also worth noting that it can be useful to run K-means more than once in order to obtain different, and possibly better, results. This is due to the random nature of the initialization. I am not using any kind of smart way to choose the initial positions of the centroids (but they do exist).
This first flower image was easy, because it was greyscale and low-resolution. Now I'm going to use a higher resolution grayscale flower image that I found on google.
import matplotlib.image as mpimg
img1=mpimg.imread('flower2.jpg')
imgplot = plt.imshow(img1,cmap='gray')
img1.shape
I'm going to go right ahead and simply reshape the array, instead of flattening it first.
361*480
vec2=img1.reshape(173280,1)
kmeans = KMeans(n_clusters=5)
kmeans.fit(vec2)
vecr2=kmeans.labels_.reshape((361, 480))
plt.matshow(vecr2,cmap='Accent')
Now with K=10.
kmeans = KMeans(n_clusters=10)
kmeans.fit(vec2)
vecr2=kmeans.labels_.reshape((361, 480))
plt.matshow(vecr2,cmap='Accent')
Back down to 7.
kmeans = KMeans(n_clusters=7)
kmeans.fit(vec2)
vecr2=kmeans.labels_.reshape((361, 480))
plt.matshow(vecr2,cmap='Accent')
I think the results with K=5 and K=7 actually look better than with K=10, although it might be argued that K=10 captures more detail. This also illustrates that it's hard to find the right K, even when we can look at the results.
RGB Flower¶
Now, let's try a flower from an RGB image.
img2=mpimg.imread('flower3.jpg')
imgplot = plt.imshow(img2)
The array is now composed of 678x960 of triples, each one of these three number corresponds to a value of red, green or blue.
img2.shape
arr2=img2.reshape(650880,3)
arr2.shape
kmeans = KMeans(n_clusters=4)
kmeans.fit(arr2)
vecr3=kmeans.labels_.reshape((678, 960))
plt.matshow(vecr3,cmap='Accent')
We could probably use a larger value of K.
kmeans = KMeans(n_clusters=5)
kmeans.fit(arr2)
vecr3=kmeans.labels_.reshape((678, 960))
plt.matshow(vecr3,cmap='Accent')
We can see some improvement, but there is room for more. Let's go crazy
kmeans = KMeans(n_clusters=25)
kmeans.fit(arr2)
vecr3=kmeans.labels_.reshape((678, 960))
plt.matshow(vecr3,cmap='Accent')
Let's go really crazy.
kmeans = KMeans(n_clusters=60)
kmeans.fit(arr2)
vecr3=kmeans.labels_.reshape((678, 960))
plt.matshow(vecr3,cmap='Accent')
That looks strange, but I feel that lot of detail is being captured. We could go for an even higher value of K, but that would take ages to run and I want to get to the next session.
Satellite images¶
Atlantic coast of France¶
Now, let's finally play around with some maps. I will begin with a satellite image of a bit of the Atlantic coast of France.
img3=mpimg.imread('france1.jpg')
imgplot = plt.imshow(img3)
img3.shape
115*122
mat=img3.reshape(14030,3)
Choosing a value for K is actually easier now, because my goal is to simply see if the algorithm can make a difference between simple features of the country's surface. For example, let's see if it can make a difference between water and land.
kmeans = KMeans(n_clusters=2)
kmeans.fit(mat)
vecr3=kmeans.labels_.reshape((115, 122))
plt.matshow(vecr3,cmap='Accent')
Not bad, but I think some parts of brownish-colored land were assigned to the water class.
kmeans = KMeans(n_clusters=3)
kmeans.fit(mat)
vecr3=kmeans.labels_.reshape((115, 122))
plt.matshow(vecr3,cmap='Accent')
Great! The green part is water, the other two classes are land.
Google map images of Europe¶
Now I'm going to use a larger satellite image of Western Europe, downloaded from google maps.
west=mpimg.imread('Capture2.jpg')
fig, ax = plt.subplots(figsize=(10, 20))
ax.imshow(west,cmap='Accent')
Now, we can see we have water, green land, brownish-yellow land and snow cover in the Alps and Pyrénées.
west.shape
849*1725
mat_west=west.reshape((1464525,3))
kmeans_k4 = KMeans(n_clusters=4)
kmeans_k4.fit(mat_west)
vec_west=kmeans_k4.labels_.reshape((849, 1725))
fig, ax = plt.subplots(figsize=(10, 20))
ax.matshow(vec_west,cmap='Accent')
Good, but it seems like the snow is being classified as water, which is technically correct, but let's try with K=5.
kmeans_k5 = KMeans(n_clusters=5)
kmeans_k5.fit(mat_west)
vec_west=kmeans_k5.labels_.reshape((849, 1725))
fig, ax = plt.subplots(figsize=(10, 20))
ax.matshow(vec_west,cmap='Accent')
Hmmmm... Now the snow is in the brownish-yellow land category, not surprising since they are both lightly colored. Let's try with 6.
kmeans = KMeans(n_clusters=6)
kmeans.fit(mat_west)
vec_west=kmeans.labels_.reshape((849, 1725))
fig, ax = plt.subplots(figsize=(10, 20))
ax.matshow(vec_west,cmap='Accent')
Not much better, we're getting more details in the water, but the snow problem is still there. I will stop here and stick with K=4 for the sake of simplicity, and because that seems to be able to at least distinguish land from water.
Now I will load an image of Northern Europe, again from Google Maps, and see if we can use what the algorithm has learned on Western Europe to classifiy the features in it.
north=misc.imread('Capture1.jpg')
fig, ax = plt.subplots(figsize=(10, 20))
ax.imshow(north,cmap='Accent')
north.shape
843*1748
mat_north=north.reshape((1473564,3))
vec_north=kmeans_k4.predict(mat_north).reshape((843,1748))
fig, ax = plt.subplots(figsize=(10, 20))
ax.matshow(vec_north,cmap='Accent')
Looks like we succeeded in differentiating between water and land. Nonetheless, the snow is being classified as water once again, and the fresh water is not appearing. I should say that these Google Map images seem like they have been modified to make the features clearer. For example, I think that fresh water is a different color than sea water. The cloud cover also seems to have been removed. This is not the case with "pure" satellite images, but I prefer using the Google Maps images for consistency.
I'll try with K=5.
vec_north=kmeans_k5.predict(mat_north).reshape((843,1748))
fig, ax = plt.subplots(figsize=(10, 20))
ax.matshow(vec_north,cmap='Accent')
Even with K=5 the fresh water does not appear. My guess is that its color is too close to the dark green land. I will try using the algorithm directly on the 'north' dataset.
kmeans_k4_north = KMeans(n_clusters=4)
kmeans_k4_north.fit(mat_north)
vec_north=kmeans_k4_north.labels_.reshape((843,1748))
fig, ax = plt.subplots(figsize=(10, 20))
ax.matshow(vec_north,cmap='Accent')
Still, not great. Let's try with K=6
kmeans_k6_north = KMeans(n_clusters=6)
kmeans_k6_north.fit(mat_north)
vec_north=kmeans_k6_north.labels_.reshape((843,1748))
fig, ax = plt.subplots(figsize=(10, 20))
ax.matshow(vec_north,cmap='Accent')
Much better, but part of the freshwater is still being classified as land. The snow seems to be in its own category now.
Next steps¶
One possible next step is using spatial information as features, since the objects in the images are localized, and not spread out all over it (their color are not the only thing that makes them similar). I think that if this is used directly in K-means, it will require much larger values of K, since the categories are going to be defined by both space and color.
Another thing to investigate, is how to use what the algorithm can learn from the images and see if there is a way of classifying them into something that is human readable. This would be a much more complex task, but that would like to explore in the future.