ML5
#!/usr/bin/env python
# coding: utf-8
In[1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#Importing the required libraries.
In[2]:
from sklearn.cluster import KMeans, k_means #For clustering
from sklearn.decomposition import PCA # Linear Dimensionality reduction.
In[3]:
df = pd.read_csv("sales_data_sample.csv",encoding='latin1') #Loading the dataset.
In[4]:
df.head()
In[5]:
df.shape
In[6]:
df.describe()
In[7]:
df.info()
In[8]:
df.isnull().sum()
df.dtypes
In[9]:
df_drop = ['ADDRESSLINE1', 'ADDRESSLINE2', 'STATUS','POSTALCODE', 'CITY', 'TERRITORY', 'PHONE', 'STATE', 'CONTACTFIRSTNAME', 'CONTACTLASTNAME', 'CUSTOMERNAME', 'ORDERNUMBER']
df = df.drop(df_drop, axis=1) #Dropping the unnecessary categorical columns along with the columns that hold null values. The null values can't be imputed because there are too many of them.
In[10]:
df.isnull().sum()
In[11]:
df.dtypes
In[12]:
df.duplicated(keep='first').sum() #Counting duplicate rows (keeping the first occurrence).
In[13]:
df.isna().sum() #finding missing values
In[14]:
Checking the categorical columns.
df['COUNTRY'].unique()
In[15]:
df['PRODUCTLINE'].unique()
In[16]:
df['DEALSIZE'].unique()
In[17]:
productline = pd.get_dummies(df['PRODUCTLINE']) #Converting the categorical columns.
Dealsize = pd.get_dummies(df['DEALSIZE'])
In[18]:
df = pd.concat([df,productline,Dealsize], axis = 1)
In[19]:
df
In[20]:
df_drop = ['COUNTRY','PRODUCTLINE','DEALSIZE'] #Dropping COUNTRY (too many distinct values to encode) along with the original PRODUCTLINE and DEALSIZE columns, which are now represented by dummies.
df = df.drop(df_drop, axis=1)
In[21]:
df['PRODUCTCODE'] = pd.Categorical(df['PRODUCTCODE']).codes #Encoding PRODUCTCODE as integer category codes.
In[22]:
df.drop('ORDERDATE', axis=1, inplace=True) #Dropping ORDERDATE since the order month is already available as a separate column.
In[23]:
df.dtypes #All the datatypes are converted into numeric
In[24]:
Before we implement k-means and assign the centers of our data, we can also run a quick analysis to find the optimal number of clusters (centers) using the Elbow Method, one of the most popular methods for determining this optimal value of k.
Distortion: the average of the squared distances from the samples to their respective cluster centers, typically using the Euclidean distance metric.
Inertia: the sum of squared distances of samples to their closest cluster center.
The KMeans algorithm clusters data by trying to separate the samples into n groups of equal variance, minimizing a criterion known as inertia, i.e. the within-cluster sum of squares. Inertia can be recognized as a measure of how internally coherent the clusters are.
The k-means algorithm divides a set of N samples X into K disjoint clusters C, each described by the mean μ_j of the samples in that cluster. These means are commonly called the cluster centroids.
The K-means algorithm aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion.
from sklearn.cluster import KMeans
distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(df)
    distortions.append(kmeanModel.inertia_) #Recording the inertia for each value of k.
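#A minimal sketch contrasting the two metrics defined above: kmeanModel.inertia_ is the *sum* of
#squared distances to the closest centroid, while the distortion is the *mean* of those squared
#distances. Since distortion is just inertia divided by the number of samples, both curves show
#the same elbow; only the scale differs. The variable names here are illustrative.
mean_distortions = []
for k in K:
    km = KMeans(n_clusters=k)
    dists = km.fit_transform(df)            #Distance from every sample to every centroid.
    closest_sq = np.min(dists, axis=1)**2   #Squared distance to the nearest centroid.
    mean_distortions.append(closest_sq.mean())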
In[25]:
kmeanModel
In[26]:
kmeanModel.cluster_centers_
In[27]:
kmeanModel.inertia_
Lower values of inertia are better, with zero being optimal.
The model inspected above is the last one fitted in the loop, and its inertia is still very high, so it is not a good fit to the data.
In[29]:
Inspecting the cluster label assigned to each sample by the last fitted model (this is unsupervised clustering, so there are no ground-truth labels to compare against).
label = kmeanModel.labels_
In[30]:
label
In[31]:
plt.figure(figsize=(16,8))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
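#A rough, hedged heuristic for confirming the visual elbow programmatically: pick the k where the
#drop in distortion slows down the most (largest second difference of the curve). This is only a
#sanity check on the plot above, not a definitive rule.
second_diff = np.diff(distortions, n=2)        #Discrete curvature of the elbow curve.
elbow_k = K[int(np.argmax(second_diff)) + 1]   #+1 because each second difference is centered on the middle k.
print("Suggested elbow at k =", elbow_k)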
In[32]:
X_train = df.values #Returns a numpy array.
In[33]:
X_train.shape
In[34]:
model = KMeans(n_clusters=3,random_state=2) #Number of clusters = 3, based on the elbow plot above.
model = model.fit(X_train) #Fitting the model to the data.
predictions = model.predict(X_train) #Predicting the cluster label (0, 1, or 2) for each sample.
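#A quick, hedged sanity check on the choice of 3 clusters: the silhouette score rates how well
#separated the clusters are (values close to 1 are better; values near 0 indicate overlap).
from sklearn.metrics import silhouette_score
print("Silhouette score for k=3:", silhouette_score(X_train, predictions))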
In[35]:
predictions
#The 3 clusters are labeled 0, 1, and 2. We can also merge the cluster labels back into our data table, as shown in the sketch below:
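#A minimal sketch of that merge (the column name 'Cluster' is illustrative): attach the predicted
#label as a new column on the working DataFrame; the same labels could also be joined back onto
#the raw CSV by index if the original columns are needed.
df_clustered = df.copy()
df_clustered['Cluster'] = predictions
df_clustered.head()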
In[36]:
unique,counts = np.unique(predictions,return_counts=True)
In[37]:
unique
In[39]:
counts
In[40]:
counts = counts.reshape(1,3) #Reshaping to 1 row and 3 columns.
In[41]:
counts
In[42]:
counts_df = pd.DataFrame(counts,columns=['Cluster1','Cluster2','Cluster3'])
In[43]:
counts_df.head()
In[44]:
pca = PCA(n_components=2) #Reducing all the features to 2 components using Principal Component Analysis, to make visualization easier.
In[45]:
reduced_X = pd.DataFrame(pca.fit_transform(X_train),columns=['PCA1','PCA2']) #Creating a DataFrame.
In[46]:
reduced_X.head()
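#A quick hedged check on how much information the 2-D projection keeps: explained_variance_ratio_
#reports the fraction of the total variance captured by each principal component.
print(pca.explained_variance_ratio_)
print("Variance retained by 2 components:", pca.explained_variance_ratio_.sum())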
In[48]:
#Plotting the normal Scatter Plot
plt.figure(figsize=(14,10))
plt.scatter(reduced_X['PCA1'],reduced_X['PCA2'])
In[49]:
model.cluster_centers_ #Finding the centroids (3 in total; each row holds one centroid's coordinates across all the features).
In[50]:
reduced_centers = pca.transform(model.cluster_centers_) #Projecting the 3 centroids into the 2-D PCA space (x and y coordinates).
In[51]:
reduced_centers
In[52]:
plt.figure(figsize=(14,10))
plt.scatter(reduced_X['PCA1'],reduced_X['PCA2'])
plt.scatter(reduced_centers[:,0],reduced_centers[:,1],color='black',marker='x',s=300) #Plotting the centroids.
In[53]:
reduced_X['Clusters'] = predictions #Adding the Clusters to the reduced dataframe.
In[54]:
reduced_X.head()
In[55]:
#Plotting the clusters
plt.figure(figsize=(14,10))
#For each cluster label, taking that cluster's PCA1 values (first column) and PCA2 values (second column) and assigning it a distinct color.
plt.scatter(reduced_X[reduced_X['Clusters'] == 0].loc[:,'PCA1'],reduced_X[reduced_X['Clusters'] == 0].loc[:,'PCA2'],color='slateblue')
plt.scatter(reduced_X[reduced_X['Clusters'] == 1].loc[:,'PCA1'],reduced_X[reduced_X['Clusters'] == 1].loc[:,'PCA2'],color='springgreen')
plt.scatter(reduced_X[reduced_X['Clusters'] == 2].loc[:,'PCA1'],reduced_X[reduced_X['Clusters'] == 2].loc[:,'PCA2'],color='indigo')
plt.scatter(reduced_centers[:,0],reduced_centers[:,1],color='black',marker='x',s=300)
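#A more compact, hedged variant of the same plot: loop over the cluster labels and let matplotlib
#build a legend, instead of repeating one scatter call per cluster.
plt.figure(figsize=(14,10))
for cluster_id, colour in zip([0, 1, 2], ['slateblue', 'springgreen', 'indigo']):
    subset = reduced_X[reduced_X['Clusters'] == cluster_id]
    plt.scatter(subset['PCA1'], subset['PCA2'], color=colour, label=f'Cluster {cluster_id}')
plt.scatter(reduced_centers[:,0], reduced_centers[:,1], color='black', marker='x', s=300, label='Centroids')
plt.legend()
plt.show()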