Customer Segmentation

The problem we are solving here with clustering is customer segmentation: grouping customers based on their spending score. This helps retail store owners identify their potential customers.

First, let's import the libraries in the first cell and run it. os is imported for operating-system functions in case we need them, so it is handy to have. NumPy is a library for creating arrays and matrices and performing mathematical operations on them. Pandas provides DataFrames, which are similar to tables; we can load our dataset into a DataFrame to perform any sort of transformation or manipulation. Matplotlib and Seaborn both provide functions and APIs for data visualization.

import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt #Data Visualization 
import seaborn as sns  #Python library for Visualization

After the libraries are imported, we need to load the dataset so that we can start working with the data. Click the connections button on the top left and insert the data file into the code as a Pandas DataFrame.

This will add the code below to the code cell. Run it and you will see the first 5 rows of the dataset as the output.

import types
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_2acb0ba6cc4740a2ba864d0d152e3d7b = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='4C5n94L9iKN5Z9iYe2sb4_J6Y89Eeot3wobUzJEy9MfD',
    ibm_auth_endpoint="https://iam.ng.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_2acb0ba6cc4740a2ba864d0d152e3d7b.get_object(Bucket='customersegmentation-donotdelete-pr-rjjxnn9l3u9plo',Key='Mall_Customers.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df = pd.read_csv(body)
df.head()

So that's the output of our head() call, which displays the first five rows of our dataset.

As the result shows, our dataset contains CustomerID, Age, Gender, Annual Income, and a Spending Score that the mall assigns according to its own criteria.
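If you are following along outside the IBM notebook environment, the generated connection code above will not be available. A minimal sketch of an alternative, assuming you have saved the dataset locally as Mall_Customers.csv (the helper name load_customers and the local path are hypothetical, not part of the original notebook):

```python
from pathlib import Path

import pandas as pd


def load_customers(csv_path):
    """Read the customer CSV from a local path into a DataFrame.

    Hypothetical helper for illustration; the original notebook reads
    the file from IBM Cloud Object Storage instead.
    """
    path = Path(csv_path)
    if not path.exists():
        # Fail early with a clear message instead of a pandas traceback.
        raise FileNotFoundError(f"Dataset not found at {path}")
    return pd.read_csv(path)


# Example usage (assumes the file sits next to the notebook):
# df = load_customers("Mall_Customers.csv")
# df.head()
```

Either way, the result is an ordinary DataFrame named df, so the rest of the steps work the same.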

Before applying any algorithm to your data, you should run a few basic checks: look at the shape of the data, check whether it contains any null values, and so on. I have pasted three such calls below. Run them separately in different cells to see their outputs.

df.shape
df.info()
df.isnull().sum()
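To make it clearer what these three checks report, here is a small self-contained sketch. It uses a few made-up rows (not taken from the real dataset), and the column names are simplified assumptions based on the description above; one value is left blank on purpose so the null check has something to find.

```python
import io

import pandas as pd

# Made-up sample rows for illustration only; the real file has more rows
# and its exact column headers may differ from these assumed names.
sample_csv = io.StringIO(
    "CustomerID,Gender,Age,Annual Income,Spending Score\n"
    "1,Male,25,30,40\n"
    "2,Female,31,45,\n"
    "3,Female,47,60,75\n"
)
sample_df = pd.read_csv(sample_csv)

# shape is an attribute (no parentheses): a (rows, columns) tuple.
n_rows, n_cols = sample_df.shape
print("rows:", n_rows, "columns:", n_cols)

# info() prints column names, non-null counts, and dtypes.
sample_df.info()

# isnull().sum() counts the missing values in each column.
missing_per_column = sample_df.isnull().sum()
print(missing_per_column)
```

If the null count for a column is greater than zero, you would typically decide whether to drop or fill those rows before clustering.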

Now let's move on to applying the algorithm to see the clusters in the data.
