Group numeric values into categories (bins)

We often have a column with many numeric values that we want to group into bins or buckets, such as age groups or value tiers. In Excel we can do this using any of the following options:

  • Data Menu -> Data Analysis -> Histogram or Rank and Percentile

  • LOOKUP (=LOOKUP(A1,{0,7,14,31,90,180,360},{"0-6","7-13","14-30","31-89","90-179","180-359",">360"}), for example)

  • IF (=IF(A1>30,"large",IF(A1>20,"medium",IF(A1>=10,"small",IF(A1<10,"tiny","")))), for example), or

  • INDEX (=INDEX({"Small","Medium","Large"},LARGE(IF(A1>{0,11,21},{1,2,3}),1)), for example)


Cut Function

Since this is a common need, Pandas has the built-in functions cut and qcut that make this grouping more flexible and accurate. We start by importing pandas.

import pandas as pd

Real-life example

Let’s analyze data regarding the impact of weather on the number of bike rentals. We know that when it is too cold or too hot, people rent fewer bikes. But what are the cutoff temperature values? When we hear the weather forecast for the next 10 days, how should we prepare our bike rental fleet to meet the expected demand?

Loading Data

As usual, let’s load a dataset to work on. We will use the “Bike Share” dataset that we used before, which describes the demand for a bike-share service in Seoul. Please note that we need to change the default encoding of read_csv from ‘UTF-8’ to ‘latin1’.

bike_share_data = (
    pd
    .read_csv(
        'https://archive.ics.uci.edu/ml/machine-learning-databases/00560/SeoulBikeData.csv', 
        encoding='latin1'
    )
)
bike_share_data
Date Rented Bike Count Hour Temperature(°C) Humidity(%) Wind speed (m/s) Visibility (10m) Dew point temperature(°C) Solar Radiation (MJ/m2) Rainfall(mm) Snowfall (cm) Seasons Holiday Functioning Day
0 01/12/2017 254 0 -5.2 37 2.2 2000 -17.6 0.0 0.0 0.0 Winter No Holiday Yes
1 01/12/2017 204 1 -5.5 38 0.8 2000 -17.6 0.0 0.0 0.0 Winter No Holiday Yes
2 01/12/2017 173 2 -6.0 39 1.0 2000 -17.7 0.0 0.0 0.0 Winter No Holiday Yes
3 01/12/2017 107 3 -6.2 40 0.9 2000 -17.6 0.0 0.0 0.0 Winter No Holiday Yes
4 01/12/2017 78 4 -6.0 36 2.3 2000 -18.6 0.0 0.0 0.0 Winter No Holiday Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8755 30/11/2018 1003 19 4.2 34 2.6 1894 -10.3 0.0 0.0 0.0 Autumn No Holiday Yes
8756 30/11/2018 764 20 3.4 37 2.3 2000 -9.9 0.0 0.0 0.0 Autumn No Holiday Yes
8757 30/11/2018 694 21 2.6 39 0.3 1968 -9.9 0.0 0.0 0.0 Autumn No Holiday Yes
8758 30/11/2018 712 22 2.1 41 1.0 1859 -9.8 0.0 0.0 0.0 Autumn No Holiday Yes
8759 30/11/2018 584 23 1.9 43 1.3 1909 -9.3 0.0 0.0 0.0 Autumn No Holiday Yes

8760 rows × 14 columns

Simple Data Visualizations

  • Start with the table above

  • Create Histograms for numeric columns

  • Plot the histograms

(
    bike_share_data
    [['Rented Bike Count']]
    .plot(
        kind='hist', 
        alpha=0.5, 
        title='Rented Bike Count'
    )
);
../_images/02.08_group_values_6_0.png
(
    bike_share_data
    [['Temperature(°C)']]
    .plot(
        kind='hist', 
        alpha=0.5, 
        title='Temperature(°C)'
    )
);
../_images/02.08_group_values_7_0.png

Creating bins with Cut

  • Start with the table above

  • Focus on the temperature column

  • Create 5 bins

(
    pd
    .cut(
        bike_share_data
        ['Temperature(°C)'], 
        5
    )  
)
0       (-6.36, 5.08]
1       (-6.36, 5.08]
2       (-6.36, 5.08]
3       (-6.36, 5.08]
4       (-6.36, 5.08]
            ...      
8755    (-6.36, 5.08]
8756    (-6.36, 5.08]
8757    (-6.36, 5.08]
8758    (-6.36, 5.08]
8759    (-6.36, 5.08]
Name: Temperature(°C), Length: 8760, dtype: category
Categories (5, interval[float64, right]): [(-17.857, -6.36] < (-6.36, 5.08] < (5.08, 16.52] < (16.52, 27.96] < (27.96, 39.4]]

We see the interval for each of the bins: [(-17.857, -6.36] < (-6.36, 5.08] < (5.08, 16.52] < (16.52, 27.96] < (27.96, 39.4]]. They are in order and split the temperature range into more or less equal-width intervals. Let’s see how well they split the data:

  • Split the temperature values into 5 bins

  • Count the number of records in each bin

  • Sort the values by the temperature range (the index of the series)

(
    pd
    .cut(
        bike_share_data
        ['Temperature(°C)'], 
        5
    )
    .value_counts()
    .sort_index()
)
(-17.857, -6.36]     529
(-6.36, 5.08]       1989
(5.08, 16.52]       2428
(16.52, 27.96]      2923
(27.96, 39.4]        891
Name: Temperature(°C), dtype: int64

First, we see that the bins sort in their natural numeric order, not alphabetically. More importantly, we see that they are not equal in the number of records they contain, nor based on any other meaningful split.
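
If we want to see the exact edges that cut computed, we can pass retbins=True to get them back alongside the binned values. A minimal sketch using the same call as above:

# retbins=True also returns the computed bin edges
binned_temperature, bin_edges = pd.cut(
    bike_share_data['Temperature(°C)'],
    5,
    retbins=True
)
print(bin_edges)  # 6 edges defining the 5 equal-width intervals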

Setting the bins’ limits

We can make the bins more meaningful by setting their limits explicitly. As bicycle experts, we might know the temperature ranges that are suitable for different accessories and clothing. Let’s fix the ranges based on this domain knowledge:

  • Split the temperature values into 5 bins based on the professional bicycle ranges

  • Count the number of records in each bin

  • Sort the values by the temperature range (the index of the series)

(
    pd
    .cut(
        bike_share_data
        ['Temperature(°C)'], 
        bins=[-18, 0, 8, 16, 24, 40]
    )
    .value_counts()
    .sort_index()
)
(-18, 0]    1454
(0, 8]      1731
(8, 16]     1651
(16, 24]    2148
(24, 40]    1776
Name: Temperature(°C), dtype: int64
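
One detail worth noting: by default cut creates intervals that are open on the left and closed on the right, so a temperature of exactly -18 would fall outside the first bin and become NaN. If the lowest edge should be included, we can pass include_lowest=True. A minimal sketch:

# include_lowest=True closes the first interval on the left as well,
# so a temperature of exactly -18 would still land in the first bin
(
    pd
    .cut(
        bike_share_data['Temperature(°C)'],
        bins=[-18, 0, 8, 16, 24, 40],
        include_lowest=True
    )
    .value_counts()
    .sort_index()
)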

Adding meaningful labels

  • Create a new column in the table for the temperature ranges

  • Split the temperature values into 5 bins based on the professional bicycle ranges

  • Add human-readable labels to the bins

  • Count the number of records in each bin

  • Sort the values by the temperature range (the index of the series)

bike_share_data['Temperature Range'] = (
    pd
    .cut(
        bike_share_data
        ['Temperature(°C)'], 
        bins=[-18, 0, 8, 16, 24, 40],
        labels=['Below Freezing', 'Freezing', 'Cold', 'Warm','Sizzling']
    )
    
)
(
    bike_share_data
    ['Temperature Range']
    .value_counts()
    .sort_index()
)
Below Freezing    1454
Freezing          1731
Cold              1651
Warm              2148
Sizzling          1776
Name: Temperature Range, dtype: int64

We see that the labels still sort in bin order rather than alphabetically, and the split is more balanced in the number of records per bin.
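
Because cut with labels returns an ordered categorical, we can also compare the new column directly against a label. A minimal sketch that keeps only the rows warmer than the 'Cold' range:

# The column is an ordered categorical, so comparisons respect the bin order
warm_or_hotter = bike_share_data[
    bike_share_data['Temperature Range'] > 'Cold'
]
warm_or_hotter['Temperature Range'].value_counts().sort_index()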

Boxplot for each bin

Now we can take every group of records and plot a box plot for it, showing the median and quartiles of each group.

(
    bike_share_data
    [['Rented Bike Count','Temperature Range']]
    .boxplot(by='Temperature Range')
);
../_images/02.08_group_values_17_0.png

What is wrong with Histograms

As popular and simple to plot as histograms are, they have several limitations:

  • It depends (too much) on the number of bins (see the sketch below).

  • It depends (too much) on the variable’s maximum and minimum.

  • It doesn’t allow us to detect relevant values.

  • It doesn’t allow us to distinguish continuous from discrete variables.

  • It makes it hard to compare distributions.
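
For example, the very same rental counts tell a rather different story with 5 bins than with 50 bins, which demonstrates the first limitation. A minimal sketch:

# The same data plotted with very different bin counts
(
    bike_share_data
    [['Rented Bike Count']]
    .plot(
        kind='hist',
        bins=5,
        alpha=0.5,
        title='Rented Bike Count (5 bins)'
    )
);
(
    bike_share_data
    [['Rented Bike Count']]
    .plot(
        kind='hist',
        bins=50,
        alpha=0.5,
        title='Rented Bike Count (50 bins)'
    )
);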

We will use a couple of other plot options to see the data distributions more accurately.

Kernel Density Estimator (KDE)

This method calculates and plots an estimate of the data’s distribution, and it is available through Pandas’ built-in plotting functions:

(
    bike_share_data
    ['Rented Bike Count']
    .plot
    .kde(
        title='Rented Bike Count',
        grid=True
    )
);
../_images/02.08_group_values_20_0.png
(
    bike_share_data
    ['Temperature(°C)']
    .plot
    .kde(
        title='Temperature(°C)', 
        grid=True
    )
);
../_images/02.08_group_values_21_0.png
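
The smoothness of a KDE curve depends on its bandwidth, which pandas exposes through the bw_method argument, so it is worth checking that the overall shape is not just an artifact of the default smoothing. A minimal sketch:

# A smaller bw_method produces a less smoothed, more detailed estimate
(
    bike_share_data
    ['Temperature(°C)']
    .plot
    .kde(
        bw_method=0.1,
        title='Temperature(°C), narrow bandwidth',
        grid=True
    )
);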

Cumulative Distribution Function (CDF)

The second option is to plot the cumulative distribution of the data. A cumulative plot makes it easy to find the value at each percentile and to calculate the percentage of data points between any two given values.

%pip install statsmodels
from statsmodels.distributions.empirical_distribution import ECDF
import matplotlib.pyplot as plt
ecdf = ECDF(bike_share_data['Temperature(°C)'])
plt.plot(ecdf.x, ecdf.y)
plt.grid(True)
plt.title('Temperature(°C)'); 
../_images/02.08_group_values_25_0.png
ecdf = ECDF(bike_share_data['Rented Bike Count'])
plt.plot(ecdf.x, ecdf.y)
plt.grid(True)
plt.title('Rented Bike Count'); 
../_images/02.08_group_values_26_0.png
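
The ECDF object is also callable, so instead of eyeballing the plot we can read values off the curve directly. A minimal sketch computing the fraction of hours with a temperature of at most 20°C:

# Evaluate the empirical CDF at a specific value:
# the fraction of records with Temperature(°C) <= 20
ecdf_temperature = ECDF(bike_share_data['Temperature(°C)'])
print(ecdf_temperature(20))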

Group by Quantiles using qcut

When working with a target score, such as grades or the number of rentals, it makes sense to split the values by the quantiles of the scores.

bike_share_data['Usage Level'] = (
    pd
    .qcut(
        bike_share_data['Rented Bike Count'], 
        q=4,
        labels=['Low', 'Medium', 'High', 'Very High']
    )
)
(
    bike_share_data
    ['Usage Level']
    .value_counts()
    .sort_index()
)
Low          2194
Medium       2186
High         2190
Very High    2190
Name: Usage Level, dtype: int64
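As with cut, we can ask qcut to return the actual rental-count thresholds it picked for the quartiles by passing retbins=True. A minimal sketch:

# retbins=True also returns the quantile edges qcut computed
usage_level, rental_edges = pd.qcut(
    bike_share_data['Rented Bike Count'],
    q=4,
    labels=['Low', 'Medium', 'High', 'Very High'],
    retbins=True
)
print(rental_edges)  # rental counts at the 0th, 25th, 50th, 75th and 100th percentiles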
(
    bike_share_data
    [['Temperature(°C)','Usage Level']]
    .boxplot(by='Usage Level')
);
../_images/02.08_group_values_29_0.png

Heat map with the new value groups

Now that we have the new value categories, we can create a heat map to see which hours of the day are the peak hours that could be priced at higher tiers.

import seaborn as sns
sns.heatmap(
    bike_share_data
    [['Usage Level','Hour','Date']]
    .pivot_table(
        index='Hour', 
        columns='Usage Level',
        aggfunc='count'
    )
    .droplevel(0, axis='columns'),
    cmap="YlOrRd"
);
../_images/02.08_group_values_32_0.png
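
The same Hour-by-Usage-Level table of counts can also be built with pd.crosstab, which counts the combinations directly and avoids pivoting on a third column. A minimal sketch of an equivalent approach:

# pd.crosstab counts the Hour / Usage Level combinations directly
sns.heatmap(
    pd.crosstab(
        bike_share_data['Hour'],
        bike_share_data['Usage Level']
    ),
    cmap="YlOrRd"
);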