To Be Continued...

Now let's extract the two features we care about in our data, Annual Income and Spending Score.

X = df.iloc[:, [3,4]].values

Now we have an array X which contains these two values. Let's import scikit-learn so that we can use KMeans to create the clusters. K-Means needs a fixed number of clusters, K, chosen up front. Instead of picking K by trial and error, we use the Elbow Method: we fit K-Means for a range of K values (with k-means++ initialization, which picks well-spread starting centroids) and look for the point where the within-cluster sum of squares (WCSS) stops dropping sharply. We can then use that K for the final KMeans model.

#Building the Model
#Use the Elbow Method to decide the optimum number of clusters (K) for KMeans
#Each fit uses k-means++ initialization to choose the starting centroids
from sklearn.cluster import KMeans
wcss=[]

#we assume the number of clusters is at most 10
#this upper bound is hard-coded; raise it if the curve has not flattened by K=10

for i in range(1,11):
    kmeans = KMeans(n_clusters= i, init='k-means++', random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

    #inertia_ is the WCSS: the sum of squared distances from each point to its closest centroid
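To make `inertia_` concrete: scikit-learn documents it as the sum of squared distances of samples to their closest cluster center, which is the WCSS used by the Elbow Method. Here is a minimal sketch that recomputes it by hand. It uses synthetic data (`X_demo`) as a stand-in for our customer array, and the names `X_demo`, `model` and `manual_wcss` are only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the customer data (an assumption, not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2))

model = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(X_demo)

# Look up the centroid assigned to each point, then sum the squared distances
assigned_centers = model.cluster_centers_[model.labels_]
manual_wcss = float(((X_demo - assigned_centers) ** 2).sum())

print(manual_wcss, model.inertia_)
```

The two printed numbers should agree up to floating-point rounding.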

Now let's plot the WCSS for each value of K.

plt.plot(range(1,11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters (K)')
plt.ylabel('WCSS')
plt.show()

The plot shows the curve bending at K = 5. Beyond that point the WCSS decreases only slowly, so the optimal number of clusters is 5.
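Reading the elbow off a plot is a judgment call. If you want a rough programmatic check, one common heuristic (not something scikit-learn provides) is to pick the point on the curve that lies farthest from the straight line joining its first and last points. The sketch below is a hypothetical helper, `elbow_point`, and the WCSS values passed to it are made up for illustration, not the results from our data.

```python
import numpy as np

def elbow_point(ks, wcss):
    """Return the K whose (K, WCSS) point is farthest from the chord
    joining the first and last points of the curve (a heuristic)."""
    ks = np.asarray(ks, dtype=float)
    wcss = np.asarray(wcss, dtype=float)
    # Scale both axes to [0, 1] so neither dominates the distance
    x = (ks - ks.min()) / (ks.max() - ks.min())
    y = (wcss - wcss.min()) / (wcss.max() - wcss.min())
    x0, y0, x1, y1 = x[0], y[0], x[-1], y[-1]
    # Perpendicular distance of every point from the line through the endpoints
    dist = np.abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)
    dist /= np.hypot(x1 - x0, y1 - y0)
    return int(ks[np.argmax(dist)])

# Made-up WCSS values with a visible bend at K=5 (illustration only)
example_wcss = [1000, 700, 450, 250, 100, 90, 80, 70, 60, 50]
best_k = elbow_point(range(1, 11), example_wcss)
print(best_k)  # 5
```

With the real data you would call `elbow_point(range(1, 11), wcss)` and still compare the answer against the plot, since the heuristic can disagree with what your eye picks.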

Now let's apply KMeans to our data with K=5 and plot the clusters along with their centroids. Don't worry if it looks like a lot; many of the lines below are just there to add detail to the plot :)

kmeansmodel = KMeans(n_clusters= 5, init='k-means++', random_state=0)
y_kmeans= kmeansmodel.fit_predict(X)
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
plt.scatter(kmeansmodel.cluster_centers_[:, 0], kmeansmodel.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

This plots the 5 clusters, with Annual Income on the x-axis and Spending Score on the y-axis.

The output shows that we have 5 different types of customers, each with a distinct combination of annual income and spending score. This tells us which groups to target and what to offer each of them, so that we can attract more customers and keep our current customers happy.
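To put numbers behind the plot, it can help to summarize each cluster's size and average income and spending score. The sketch below is one possible way to do that with pandas. It runs on randomly generated data (`X_demo`) rather than the real customers, and the column names `income` and `score` are placeholders chosen for this example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Random stand-in data: 200 "customers" with income (k$) and spending score
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.uniform(15, 140, size=200),   # placeholder for Annual Income (k$)
    rng.uniform(1, 100, size=200),    # placeholder for Spending Score (1-100)
])

labels = KMeans(n_clusters=5, init='k-means++', n_init=10,
                random_state=0).fit_predict(X_demo)

# One row per cluster: how many customers, and their average income and score
profile = (
    pd.DataFrame(X_demo, columns=['income', 'score'])
    .assign(cluster=labels + 1)       # number clusters 1..5 like the plot legend
    .groupby('cluster')
    .agg(customers=('income', 'size'),
         mean_income=('income', 'mean'),
         mean_score=('score', 'mean'))
)
print(profile.round(1))
```

On the real data, replacing `X_demo` with `X` and `labels` with `y_kmeans` would give a table you can read alongside the scatter plot, for example to spot a high-income, low-spending group.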
