Group numeric values into categories (bins)

We often have a column with many numeric values that we want to group into bins or buckets, such as age groups or value tiers. In Excel we can do this using any of the following options:

  • Data Menu -> Data Analysis -> Histogram or Rank and Percentile

  • LOOKUP (=LOOKUP(A1,{0,7,14,31,90,180,360},{"0-6","7-13","14-30","31-89","90-179","180-359",">360"}), for example)

  • IF (=IF(A1>30,"large",IF(A1>20,"medium",IF(A1>=10,"small",IF(A1<10,"tiny","")))), for example), or

  • INDEX (=INDEX({"Small","Medium","Large"},LARGE(IF(A1>{0,11,21},{1,2,3}),1)), for example)


Cut Function

Since this is a common need, Pandas has the built-in functions cut and qcut that make this grouping more flexible and accurate. We start by importing pandas.

import pandas as pd

Real-life example

Let’s analyze data regarding the impact of weather on the number of bike rentals. We know that when it is too cold or too hot, people rent fewer bikes. But what are the cutoff temperature values? When we hear the weather forecast for the next 10 days, how should we prepare our bike rental fleet to meet the expected demand?

Loading Data

As usual, let’s load a dataset to work on. We will use the “Bike Share” dataset that we used before, which describes the demand for a bike-share service in Seoul. Please note that we need to change the default encoding of read_csv from ‘UTF-8’ to ‘latin1’.

bike_share_data = (
    pd
    .read_csv(
        'https://archive.ics.uci.edu/ml/machine-learning-databases/00560/SeoulBikeData.csv', 
        encoding='latin1'
    )
)
bike_share_data
Date Rented Bike Count Hour Temperature(°C) Humidity(%) Wind speed (m/s) Visibility (10m) Dew point temperature(°C) Solar Radiation (MJ/m2) Rainfall(mm) Snowfall (cm) Seasons Holiday Functioning Day
0 01/12/2017 254 0 -5.2 37 2.2 2000 -17.6 0.0 0.0 0.0 Winter No Holiday Yes
1 01/12/2017 204 1 -5.5 38 0.8 2000 -17.6 0.0 0.0 0.0 Winter No Holiday Yes
2 01/12/2017 173 2 -6.0 39 1.0 2000 -17.7 0.0 0.0 0.0 Winter No Holiday Yes
3 01/12/2017 107 3 -6.2 40 0.9 2000 -17.6 0.0 0.0 0.0 Winter No Holiday Yes
4 01/12/2017 78 4 -6.0 36 2.3 2000 -18.6 0.0 0.0 0.0 Winter No Holiday Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8755 30/11/2018 1003 19 4.2 34 2.6 1894 -10.3 0.0 0.0 0.0 Autumn No Holiday Yes
8756 30/11/2018 764 20 3.4 37 2.3 2000 -9.9 0.0 0.0 0.0 Autumn No Holiday Yes
8757 30/11/2018 694 21 2.6 39 0.3 1968 -9.9 0.0 0.0 0.0 Autumn No Holiday Yes
8758 30/11/2018 712 22 2.1 41 1.0 1859 -9.8 0.0 0.0 0.0 Autumn No Holiday Yes
8759 30/11/2018 584 23 1.9 43 1.3 1909 -9.3 0.0 0.0 0.0 Autumn No Holiday Yes

8760 rows × 14 columns

Simple Data Visualizations

  • Start with the table above

  • Create Histograms for numeric columns

  • Plot the histograms

(
    bike_share_data
    [['Rented Bike Count']]
    .plot(
        kind='hist', 
        alpha=0.5, 
        title='Rented Bike Count'
    )
);
../_images/02.08_group_values_6_0.png
(
    bike_share_data
    [['Temperature(°C)']]
    .plot(
        kind='hist', 
        alpha=0.5, 
        title='Temperature(°C)'
    )
);
../_images/02.08_group_values_7_0.png

Creating bins with Cut

  • Start with the table above

  • Focus on the temperature column

  • Create 5 bins

(
    pd
    .cut(
        bike_share_data
        ['Temperature(°C)'], 
        5
    )  
)
0       (-6.36, 5.08]
1       (-6.36, 5.08]
2       (-6.36, 5.08]
3       (-6.36, 5.08]
4       (-6.36, 5.08]
            ...      
8755    (-6.36, 5.08]
8756    (-6.36, 5.08]
8757    (-6.36, 5.08]
8758    (-6.36, 5.08]
8759    (-6.36, 5.08]
Name: Temperature(°C), Length: 8760, dtype: category
Categories (5, interval[float64, right]): [(-17.857, -6.36] < (-6.36, 5.08] < (5.08, 16.52] < (16.52, 27.96] < (27.96, 39.4]]

We see the interval for each of the bins: [(-17.857, -6.36] < (-6.36, 5.08] < (5.08, 16.52] < (16.52, 27.96] < (27.96, 39.4]]. They are in order and split the temperature range into more or less equal-width intervals. Let’s see how well they split the data:

  • Split the temperature values into 5 bins

  • Count the number of records in each bin

  • Sort the values by the temperature range (the index of the series)

(
    pd
    .cut(
        bike_share_data
        ['Temperature(°C)'], 
        5
    )
    .value_counts()
    .sort_index()
)
(-17.857, -6.36]     529
(-6.36, 5.08]       1989
(5.08, 16.52]       2428
(16.52, 27.96]      2923
(27.96, 39.4]        891
Name: Temperature(°C), dtype: int64

First, we see that the bins sort in their natural numeric order, not alphabetically. More importantly, we see that they are not equal in the number of records they contain, nor based on any other meaningful split.
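
If we want to see the exact edges that cut computed, we can pass retbins=True to get them back alongside the binned values. A minimal sketch using the same call as above:

# retbins=True also returns the computed bin edges
binned_temperature, bin_edges = pd.cut(
    bike_share_data['Temperature(°C)'],
    5,
    retbins=True
)
print(bin_edges)  # 6 edges defining the 5 equal-width intervals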

Setting the bins’ limits

We can make the bins more meaningful by setting their limits explicitly. As bicycle experts, we might know the temperature ranges that are suitable for different accessories and clothing. Let’s fix the ranges based on this domain knowledge:

  • Split the temperature values into 5 bins based on the professional bicycle ranges

  • Count the number of records in each bin

  • Sort the values by the temperature range (the index of the series)

(
    pd
    .cut(
        bike_share_data
        ['Temperature(°C)'], 
        bins=[-18, 0, 8, 16, 24, 40]
    )
    .value_counts()
    .sort_index()
)
(-18, 0]    1454
(0, 8]      1731
(8, 16]     1651
(16, 24]    2148
(24, 40]    1776
Name: Temperature(°C), dtype: int64
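
One detail worth noting: by default cut creates intervals that are open on the left and closed on the right, so a temperature of exactly -18 would fall outside the first bin and become NaN. If the lowest edge should be included, we can pass include_lowest=True. A minimal sketch:

# include_lowest=True closes the first interval on the left as well,
# so a temperature of exactly -18 would still land in the first bin
(
    pd
    .cut(
        bike_share_data['Temperature(°C)'],
        bins=[-18, 0, 8, 16, 24, 40],
        include_lowest=True
    )
    .value_counts()
    .sort_index()
)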

Adding meaningful labels

  • Create a new column in the table for the temperature ranges

  • Split the temperature values into 5 bins based on the professional bicycle ranges

  • Add human-readable labels to the bins

  • Count the number of records in each bin

  • Sort the values by the temperature range (the index of the series)

bike_share_data['Temperature Range'] = (
    pd
    .cut(
        bike_share_data
        ['Temperature(°C)'], 
        bins=[-18, 0, 8, 16, 24, 40],
        labels=['Below Freezing', 'Freezing', 'Cold', 'Warm','Sizzling']
    )
    
)
(
    bike_share_data
    ['Temperature Range']
    .value_counts()
    .sort_index()
)
Below Freezing    1454
Freezing          1731
Cold              1651
Warm              2148
Sizzling          1776
Name: Temperature Range, dtype: int64

We see that the labels still sort in bin order rather than alphabetically, and the split is more balanced in the number of records per bin.
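
Because cut with labels returns an ordered categorical, we can also compare the new column directly against a label. A minimal sketch that keeps only the rows warmer than the 'Cold' range:

# The column is an ordered categorical, so comparisons respect the bin order
warm_or_hotter = bike_share_data[
    bike_share_data['Temperature Range'] > 'Cold'
]
warm_or_hotter['Temperature Range'].value_counts().sort_index()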

Boxplot for each bin

Now we can take every group of records and plot a box plot for it, showing the median and quartiles of each group.

(
    bike_share_data
    [['Rented Bike Count','Temperature Range']]
    .boxplot(by='Temperature Range')
);
../_images/02.08_group_values_17_0.png

What is wrong with Histograms

As popular and simple to plot as histograms are, they have several limitations:

  • It depends (too much) on the number of bins (see the sketch below).

  • It depends (too much) on the variable’s maximum and minimum.

  • It doesn’t allow us to detect relevant values.

  • It doesn’t allow us to distinguish continuous from discrete variables.

  • It makes it hard to compare distributions.
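
For example, the very same rental counts tell a rather different story with 5 bins than with 50 bins, which demonstrates the first limitation. A minimal sketch:

# The same data plotted with very different bin counts
(
    bike_share_data
    [['Rented Bike Count']]
    .plot(
        kind='hist',
        bins=5,
        alpha=0.5,
        title='Rented Bike Count (5 bins)'
    )
);
(
    bike_share_data
    [['Rented Bike Count']]
    .plot(
        kind='hist',
        bins=50,
        alpha=0.5,
        title='Rented Bike Count (50 bins)'
    )
);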

We will use a couple of other plot options to see the data distributions more accurately.

Kernel Density Estimator (KDE)

This method calculates and plots an estimate of the data’s distribution, and it is available through Pandas’ built-in plotting functions:

(
    bike_share_data
    ['Rented Bike Count']
    .plot
    .kde(
        title='Rented Bike Count',
        grid=True
    )
);
../_images/02.08_group_values_20_0.png
(
    bike_share_data
    ['Temperature(°C)']
    .plot
    .kde(
        title='Temperature(°C)', 
        grid=True
    )
);
../_images/02.08_group_values_21_0.png
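
The smoothness of a KDE curve depends on its bandwidth, which pandas exposes through the bw_method argument, so it is worth checking that the overall shape is not just an artifact of the default smoothing. A minimal sketch:

# A smaller bw_method produces a less smoothed, more detailed estimate
(
    bike_share_data
    ['Temperature(°C)']
    .plot
    .kde(
        bw_method=0.1,
        title='Temperature(°C), narrow bandwidth',
        grid=True
    )
);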

Cumulative Distribution Function (CDF)

The second option is to plot the cumulative distribution of the data. A cumulative plot makes it easy to find the value at each percentile and to calculate the percentage of data points between any two given values.

%pip install statsmodels
from statsmodels.distributions.empirical_distribution import ECDF
import matplotlib.pyplot as plt
ecdf = ECDF(bike_share_data['Temperature(°C)'])
plt.plot(ecdf.x, ecdf.y)
plt.grid(True)
plt.title('Temperature(°C)'); 
../_images/02.08_group_values_25_0.png
ecdf = ECDF(bike_share_data['Rented Bike Count'])
plt.plot(ecdf.x, ecdf.y)
plt.grid(True)
plt.title('Rented Bike Count'); 
../_images/02.08_group_values_26_0.png
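
The ECDF object is also callable, so instead of eyeballing the plot we can read values off the curve directly. A minimal sketch computing the fraction of hours with a temperature of at most 20°C:

# Evaluate the empirical CDF at a specific value:
# the fraction of records with Temperature(°C) <= 20
ecdf_temperature = ECDF(bike_share_data['Temperature(°C)'])
print(ecdf_temperature(20))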

Group by Quantiles using qcut

When working with a target score, such as grades or the number of rentals, it makes sense to split the values by the quantiles of the scores.

bike_share_data['Usage Level'] = (
    pd
    .qcut(
        bike_share_data['Rented Bike Count'], 
        q=4,
        labels=['Low', 'Medium', 'High', 'Very High']
    )
)
(
    bike_share_data
    ['Usage Level']
    .value_counts()
    .sort_index()
)
Low          2194
Medium       2186
High         2190
Very High    2190
Name: Usage Level, dtype: int64
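As with cut, we can ask qcut to return the actual rental-count thresholds it picked for the quartiles by passing retbins=True. A minimal sketch:

# retbins=True also returns the quantile edges qcut computed
usage_level, rental_edges = pd.qcut(
    bike_share_data['Rented Bike Count'],
    q=4,
    labels=['Low', 'Medium', 'High', 'Very High'],
    retbins=True
)
print(rental_edges)  # rental counts at the 0th, 25th, 50th, 75th and 100th percentiles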
(
    bike_share_data
    [['Temperature(°C)','Usage Level']]
    .boxplot(by='Usage Level')
);
../_images/02.08_group_values_29_0.png

Heat map with the new value groups

Now that we have the new value categories, we can create a heat map to see which hours of the day are the peak hours that could be priced at higher tiers.

import seaborn as sns
sns.heatmap(
    bike_share_data
    [['Usage Level','Hour','Date']]
    .pivot_table(
        index='Hour', 
        columns='Usage Level',
        aggfunc='count'
    )
    .droplevel(0, axis='columns'),
    cmap="YlOrRd"
);
../_images/02.08_group_values_32_0.png
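
The same Hour-by-Usage-Level table of counts can also be built with pd.crosstab, which counts the combinations directly and avoids pivoting on a third column. A minimal sketch of an equivalent approach:

# pd.crosstab counts the Hour / Usage Level combinations directly
sns.heatmap(
    pd.crosstab(
        bike_share_data['Hour'],
        bike_share_data['Usage Level']
    ),
    cmap="YlOrRd"
);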