The Sparks Foundation

By: Karan

Task-3 Exploratory Data Analysis

Problem statement

● As a business manager, try to find the weak areas where you can work to make more profit.

● What business problems can you derive by exploring the data?

● Dataset: http://bit.ly/w-data

Importing libraries

In [38]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler

Reading CSV

In [4]:
data = pd.read_csv("SampleSuperstore.csv")
data
Out[4]:
Ship Mode Segment Country City State Postal Code Region Category Sub-Category Sales Quantity Discount Profit
0 Second Class Consumer United States Henderson Kentucky 42420 South Furniture Bookcases 261.9600 2 0.00 41.9136
1 Second Class Consumer United States Henderson Kentucky 42420 South Furniture Chairs 731.9400 3 0.00 219.5820
2 Second Class Corporate United States Los Angeles California 90036 West Office Supplies Labels 14.6200 2 0.00 6.8714
3 Standard Class Consumer United States Fort Lauderdale Florida 33311 South Furniture Tables 957.5775 5 0.45 -383.0310
4 Standard Class Consumer United States Fort Lauderdale Florida 33311 South Office Supplies Storage 22.3680 2 0.20 2.5164
... ... ... ... ... ... ... ... ... ... ... ... ... ...
9989 Second Class Consumer United States Miami Florida 33180 South Furniture Furnishings 25.2480 3 0.20 4.1028
9990 Standard Class Consumer United States Costa Mesa California 92627 West Furniture Furnishings 91.9600 2 0.00 15.6332
9991 Standard Class Consumer United States Costa Mesa California 92627 West Technology Phones 258.5760 2 0.20 19.3932
9992 Standard Class Consumer United States Costa Mesa California 92627 West Office Supplies Paper 29.6000 4 0.00 13.3200
9993 Second Class Consumer United States Westminster California 92683 West Office Supplies Appliances 243.1600 2 0.00 72.9480

9994 rows × 13 columns

In [6]:
nRow, nCol = data.shape
print(f'There are {nRow} rows and {nCol} columns')
There are 9994 rows and 13 columns

Checking for null values

In [7]:
data.isnull().sum()   
Out[7]:
Ship Mode       0
Segment         0
Country         0
City            0
State           0
Postal Code     0
Region          0
Category        0
Sub-Category    0
Sales           0
Quantity        0
Discount        0
Profit          0
dtype: int64

Checking for duplicates in the data

In [8]:
data.duplicated().sum()
Out[8]:
0
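Here `duplicated()` found nothing to drop, but if it had, `drop_duplicates()` would be the usual fix. A minimal sketch on a toy frame (the column values below are made up, not from the Superstore data):

```python
import pandas as pd

# Toy frame standing in for the Superstore data; one row is an exact repeat.
df = pd.DataFrame({'State': ['Kentucky', 'Kentucky', 'Florida'],
                   'Sales': [261.96, 261.96, 957.58]})

print(df.duplicated().sum())    # 1 duplicate row found
deduped = df.drop_duplicates()  # keeps the first occurrence by default
print(len(deduped))             # 2 rows remain
```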
In [9]:
data.nunique()
Out[9]:
Ship Mode          4
Segment            3
Country            1
City             531
State             49
Postal Code      631
Region             4
Category           3
Sub-Category      17
Sales           5825
Quantity          14
Discount          12
Profit          7287
dtype: int64

Dropping the 'Postal Code' column (an identifier, not a measurement).

In [10]:
col = ['Postal Code']
data1 = data.drop(columns=col)

Correlation Between Variables.

In [11]:
data1.corr(numeric_only=True)
Out[11]:
Sales Quantity Discount Profit
Sales 1.000000 0.200722 -0.028311 0.479067
Quantity 0.200722 1.000000 0.008678 0.066211
Discount -0.028311 0.008678 1.000000 -0.219662
Profit 0.479067 0.066211 -0.219662 1.000000

Covariance of columns

In [12]:
data1.cov(numeric_only=True)
Out[12]:
Sales Quantity Discount Profit
Sales 389028.396022 278.765576 -3.645637 70057.067126
Quantity 278.765576 4.958001 0.003990 34.565743
Discount -3.645637 0.003990 0.042624 -10.632751
Profit 70057.067126 34.565743 -10.632751 54970.478824
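As a sanity check, each correlation above is just the covariance divided by the product of the two standard deviations. Verifying the Sales/Profit pair from the printed matrices:

```python
import math

# Values copied from the cov() output above.
cov_sales_profit = 70057.067126
var_sales = 389028.396022
var_profit = 54970.478824

# corr(X, Y) = cov(X, Y) / (std(X) * std(Y))
corr = cov_sales_profit / (math.sqrt(var_sales) * math.sqrt(var_profit))
print(round(corr, 6))  # close to 0.479067, the corr() value above
```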

EDA

In [13]:
plt.figure(figsize=(16,8))
sns.countplot(x='Sub-Category', hue='Category', data=data1)
plt.title('Category vs Sub-Category')
plt.xlabel('Sub-Category')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
In [14]:
data1.hist(bins=50 ,figsize=(20,15))
plt.show();

Counting orders per state

In [15]:
data1['State'].value_counts()
Out[15]:
California              1996
New York                1127
Texas                    983
Pennsylvania             586
Washington               502
Illinois                 491
Ohio                     468
Florida                  383
Michigan                 254
North Carolina           249
Arizona                  224
Virginia                 224
Georgia                  184
Tennessee                183
Colorado                 182
Indiana                  149
Kentucky                 139
Massachusetts            135
New Jersey               130
Oregon                   123
Wisconsin                110
Maryland                 105
Delaware                  96
Minnesota                 89
Connecticut               82
Missouri                  66
Oklahoma                  66
Alabama                   61
Arkansas                  60
Rhode Island              56
Utah                      53
Mississippi               53
South Carolina            42
Louisiana                 42
Nevada                    39
Nebraska                  38
New Mexico                37
Iowa                      30
New Hampshire             27
Kansas                    24
Idaho                     21
Montana                   15
South Dakota              12
Vermont                   11
District of Columbia      10
Maine                      8
North Dakota               7
West Virginia              4
Wyoming                    1
Name: State, dtype: int64
In [16]:
plt.figure(figsize=(15,15))
sns.countplot(x='State', data=data1)
plt.xticks(rotation=90)
plt.title("STATE")
plt.show()

Next, the "Copiers" sub-category has earned the highest profit with no loss. Other sub-categories also avoid losses, but their profit margins are low.

The sub-category suffering the highest loss is "Machines".

In [31]:
sns.set(style="whitegrid")
plt.figure(2, figsize=(20,15))
sns.barplot(x='Sub-Category',y='Profit', data=data, palette='Spectral')
plt.suptitle('Pie Consumption Patterns in the United States', fontsize=16)
plt.show()
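The bar chart ranks sub-categories visually; the same claim can be checked numerically by summing Profit per sub-category. A sketch of the pattern on toy rows standing in for `data1` (the profit values are made up):

```python
import pandas as pd

# Toy stand-in for data1; in the notebook, group the real frame instead.
df = pd.DataFrame({
    'Sub-Category': ['Copiers', 'Copiers', 'Machines', 'Machines', 'Labels'],
    'Profit': [550.0, 420.0, -120.0, -340.0, 6.87],
})

# Total profit per sub-category, worst first: loss-makers surface at the top.
profit_by_sub = df.groupby('Sub-Category')['Profit'].sum().sort_values()
print(profit_by_sub)
```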
In [32]:
sns.pairplot(data1, hue='Sub-Category')
plt.show()

There are clear outliers in these plots; let's explore them further.

First, let's check how Profit behaves as the Discount increases.

In [34]:
plt.figure(figsize=(10,4))
sns.lineplot(x='Discount', y='Profit', data=data1, color='y', label='Discount')
plt.legend()
plt.show()
In [36]:
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
In [39]:
state_code = {'Alabama': 'AL','Alaska': 'AK','Arizona': 'AZ','Arkansas': 'AR','California': 'CA','Colorado': 'CO','Connecticut': 'CT','Delaware': 'DE','Florida': 'FL','Georgia': 'GA','Hawaii': 'HI','Idaho': 'ID','Illinois': 'IL','Indiana': 'IN','Iowa': 'IA','Kansas': 'KS','Kentucky': 'KY','Louisiana': 'LA','Maine': 'ME','Maryland': 'MD','Massachusetts': 'MA','Michigan': 'MI','Minnesota': 'MN','Mississippi': 'MS','Missouri': 'MO','Montana': 'MT','Nebraska': 'NE','Nevada': 'NV','New Hampshire': 'NH','New Jersey': 'NJ','New Mexico': 'NM','New York': 'NY','North Carolina': 'NC','North Dakota': 'ND','Ohio': 'OH','Oklahoma': 'OK','Oregon': 'OR','Pennsylvania': 'PA','Rhode Island': 'RI','South Carolina': 'SC','South Dakota': 'SD','Tennessee': 'TN','Texas': 'TX','Utah': 'UT','Vermont': 'VT','Virginia': 'VA','District of Columbia': 'DC','Washington': 'WA','West Virginia': 'WV','Wisconsin': 'WI','Wyoming': 'WY'}
data1['state_code'] = data1.State.map(state_code)
In [40]:
state_data = data1[['Sales', 'Profit', 'state_code']].groupby(['state_code']).sum()


fig = go.Figure(data=go.Choropleth(
    locations=state_data.index, 
    z = state_data.Sales, 
    locationmode = 'USA-states', 
    colorscale = 'Reds',
    colorbar_title = 'Sales in USD',
))

fig.update_layout(
    title_text = 'Total State-Wise Sales',
    geo_scope='usa',
    height=800,
)

fig.show()

We have a few questions to answer here.

1. What products do the most profit-making states buy?

2. What products do the loss-bearing states buy?

3. What product segment needs to be improved in order to drive profits higher?
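Questions 1 and 2 reduce to a state-level profit aggregate followed by a category breakdown for the states of interest. A sketch of the pattern on toy rows standing in for `data1` (the values are illustrative, not from the dataset):

```python
import pandas as pd

# Toy rows standing in for data1.
df = pd.DataFrame({
    'State':    ['Texas', 'Texas', 'California', 'Ohio'],
    'Category': ['Furniture', 'Technology', 'Technology', 'Furniture'],
    'Profit':   [-383.03, 120.0, 219.58, -41.91],
})

# Total profit per state; negative totals flag the loss-bearing states.
state_profit = df.groupby('State')['Profit'].sum()
loss_states = state_profit[state_profit < 0].index.tolist()
print(loss_states)

# What those states buy, broken down by category.
mix = df[df['State'].isin(loss_states)].groupby('Category')['Profit'].sum()
print(mix)
```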

In [42]:
def state_data_viewer(states):
    """Plot the turnover generated by product categories and sub-categories
    for the given states.
    Args:
        states: list of state names to plot
    Returns:
        None
    """
    product_data = data1.groupby(['State'])
    for state in states:
        data = product_data.get_group(state).groupby(['Category'])
        fig, ax = plt.subplots(1, 3, figsize = (28,5))
        fig.suptitle(state, fontsize=14)        
        ax_index = 0
        for cat in ['Furniture', 'Office Supplies', 'Technology']:
            cat_data = data.get_group(cat).groupby(['Sub-Category']).sum()
            sns.barplot(x = cat_data.Profit, y = cat_data.index, ax = ax[ax_index])
            ax[ax_index].set_ylabel(cat)
            ax_index +=1
        plt.show()
states = ['California', 'Washington', 'Mississippi', 'Arizona', 'Texas']
state_data_viewer(states)

Using Cluster Analysis (K-Means Clustering)

In [43]:
x = data.iloc[:, [9, 10, 11, 12]].values  # Sales, Quantity, Discount, Profit

from sklearn.cluster import KMeans
wcss = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', 
                    max_iter = 300, n_init = 10, random_state = 0).fit(x)
    wcss.append(kmeans.inertia_)

# Elbow curve: look for the k where the drop in WCSS levels off.
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

sns.set_style("whitegrid") 
sns.FacetGrid(data, hue="Sub-Category", height=6).map(plt.scatter, 'Sales', 'Quantity')
# Centroids from the last (k = 10) fit; columns 0 and 1 are Sales and Quantity.
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 
            s = 100, c = 'yellow', label = 'Centroids')

plt.legend()
plt.show()
In [44]:
sns.set_style("whitegrid") 
sns.FacetGrid(data, hue="Sub-Category", height=6).map(plt.scatter, 'Sales', 'Profit')
# Columns 0 and 3 of the centroids are Sales and Profit.
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 3], 
            s = 100, c = 'yellow', label = 'Centroids')

plt.legend()
plt.show()
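Because Sales is orders of magnitude larger than Discount, unscaled Euclidean distances are dominated by Sales; the `StandardScaler` imported earlier can be put to use before clustering. A sketch on synthetic stand-in data (the distributions below are made up, not fitted to the dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic stand-in for data1[['Sales', 'Quantity', 'Discount', 'Profit']].
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.gamma(2.0, 100.0, 300),          # Sales: large magnitude
    rng.integers(1, 14, 300),            # Quantity
    rng.choice([0.0, 0.2, 0.45], 300),   # Discount
    rng.normal(30.0, 120.0, 300),        # Profit
])

# Standardise each feature to zero mean / unit variance, then cluster,
# so no single feature dominates the distance computation.
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.labels_[:10])
```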
In [45]:
fig, ax = plt.subplots(figsize = (10 , 6))
ax.scatter(data1["Sales"] , data1["Profit"])
ax.set_xlabel('Sales')
ax.set_ylabel('Profit')
ax.set_title('Sales vs Profit')
plt.show()

From the above visualizations and clustering we can see in which states and categories sales and profits are high or low. We can improve performance in those states by offering discounts in a preferred range, so that both the company and the consumer profit. Deciding that range requires further analysis, for example through factor analysis or neural networks.
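A simple first cut at that discount range is the mean profit at each discount level; the level where profit turns negative bounds the range worth offering. A sketch on toy rows (in the notebook, the real `data1` would be grouped instead):

```python
import pandas as pd

# Toy rows standing in for data1; the real discounts run from 0.0 to 0.8.
df = pd.DataFrame({
    'Discount': [0.0, 0.0, 0.2, 0.2, 0.45, 0.45],
    'Profit':   [41.91, 219.58, 19.39, 2.52, -383.03, -120.0],
})

# Mean profit per discount level.
profit_by_discount = df.groupby('Discount')['Profit'].mean()
print(profit_by_discount)

# The smallest loss-making discount level caps the range worth offering.
losing = profit_by_discount[profit_by_discount < 0]
print(losing.index.min())
```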
