Date Functions

In the previous section we created an advanced analysis with merge and groupby and other Pandas capabilities, which are harder to do in Excel. In this section, we will check the more traditional functions that are often used in Excel spreadsheet and we will see how to use them easily in Pandas.

Open In Studio Lab

Date Functions

Let’s review some common functions in Excel and their replacement in Pandas

TODAY

=TODAY()

Like many of the Pandas functions, most of the arguments are optional and the default values are often as expected. For example, the time stamp and even the date depends on the timezone needed. In the next example, we will specific the optional tz value to ‘America/Chicago’:

You can see that the data includes columns such as Seasons or Functioning Day, however, we might want to calculate such columns differently.

import pandas as pd
now = (
    pd
    .Timestamp
    .today(
        tz='America/Chicago'
    )
)
now
Timestamp('2022-05-22 15:27:59.957694-0500', tz='America/Chicago')

DAY, MONTH, YEAR

We can apply all the various date functions from excel such as DAY, MINUTE, HOUR etc.

now.day, now.month, now.year
(22, 5, 2022)

DATE

we can create a date with given values such as day, month, and year

=Date(2021,1,1)

and in pandas:

new_year = (
    pd
    .Timestamp(
        year=2021,
        month=1,
        day=1
    )
)
new_year
Timestamp('2021-01-01 00:00:00')

NETWORKDAYS

A common function in business context is =NETWORKDAYS(), that calculates the number of working days between two dates. In Pandas we can use the following:

import numpy as np 

(
    np
    .busday_count(
        new_year.date(), 
        now.date()
    )
)
361

WORKDAY

The next useful function is =WORKDAY(NUMBER) that can be used when you want to get the date after a given number of working days

new_year = pd.Timestamp("2021-01-01")
new_year.day_name()
'Friday'
workday_100 = new_year + 100 * pd.offsets.BusinessDay()
workday_100
Timestamp('2021-05-21 00:00:00')

The one hundred working day in 2021 is May 21st.

Pandas has a wide range of options for date offsets, such as weeks, month-end, semi-month-end (15th day), querter, retail year (aka 52-53 week), and many others.

Real life example

Loading Data

As usual, let’s load a dataset to work on. We will take another dataset that is used in machine learning education, “Bike Share”. This is a data set about the demand of bike share service in Seoul. Please note that we need to modify the default encoding of read_csv from ‘UTF-8’ to ‘latin1’.

bike_share_data = (
    pd
    .read_csv(
        'https://archive.ics.uci.edu/ml/machine-learning-databases/00560/SeoulBikeData.csv', 
        encoding='latin1'
    )
)
bike_share_data
Date Rented Bike Count Hour Temperature(°C) Humidity(%) Wind speed (m/s) Visibility (10m) Dew point temperature(°C) Solar Radiation (MJ/m2) Rainfall(mm) Snowfall (cm) Seasons Holiday Functioning Day
0 01/12/2017 254 0 -5.2 37 2.2 2000 -17.6 0.0 0.0 0.0 Winter No Holiday Yes
1 01/12/2017 204 1 -5.5 38 0.8 2000 -17.6 0.0 0.0 0.0 Winter No Holiday Yes
2 01/12/2017 173 2 -6.0 39 1.0 2000 -17.7 0.0 0.0 0.0 Winter No Holiday Yes
3 01/12/2017 107 3 -6.2 40 0.9 2000 -17.6 0.0 0.0 0.0 Winter No Holiday Yes
4 01/12/2017 78 4 -6.0 36 2.3 2000 -18.6 0.0 0.0 0.0 Winter No Holiday Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8755 30/11/2018 1003 19 4.2 34 2.6 1894 -10.3 0.0 0.0 0.0 Autumn No Holiday Yes
8756 30/11/2018 764 20 3.4 37 2.3 2000 -9.9 0.0 0.0 0.0 Autumn No Holiday Yes
8757 30/11/2018 694 21 2.6 39 0.3 1968 -9.9 0.0 0.0 0.0 Autumn No Holiday Yes
8758 30/11/2018 712 22 2.1 41 1.0 1859 -9.8 0.0 0.0 0.0 Autumn No Holiday Yes
8759 30/11/2018 584 23 1.9 43 1.3 1909 -9.3 0.0 0.0 0.0 Autumn No Holiday Yes

8760 rows × 14 columns

Date Columns Manipulations

We will start with creating a few columns based on the date column, such as day of the week and month.

  • Start with the table we loaded above

  • Make sure that the Date column is in datetime format that we see (note the day before the month)

  • Calculate the day-of-week (DOW) of each row (to use the built-in functions we need to access the date accessor (.dt))

  • Calculate the month value of each row

enriched_bike_share_data = (
    bike_share_data
    .assign(Date = pd.to_datetime(bike_share_data.Date, format='%d/%m/%Y', errors='coerce'))
    .assign(DOW = lambda x : x.Date.dt.day_name())
    .assign(month = lambda x : x.Date.dt.month)
)
enriched_bike_share_data
Date Rented Bike Count Hour Temperature(°C) Humidity(%) Wind speed (m/s) Visibility (10m) Dew point temperature(°C) Solar Radiation (MJ/m2) Rainfall(mm) Snowfall (cm) Seasons Holiday Functioning Day DOW month
0 2017-12-01 254 0 -5.2 37 2.2 2000 -17.6 0.0 0.0 0.0 Winter No Holiday Yes Friday 12
1 2017-12-01 204 1 -5.5 38 0.8 2000 -17.6 0.0 0.0 0.0 Winter No Holiday Yes Friday 12
2 2017-12-01 173 2 -6.0 39 1.0 2000 -17.7 0.0 0.0 0.0 Winter No Holiday Yes Friday 12
3 2017-12-01 107 3 -6.2 40 0.9 2000 -17.6 0.0 0.0 0.0 Winter No Holiday Yes Friday 12
4 2017-12-01 78 4 -6.0 36 2.3 2000 -18.6 0.0 0.0 0.0 Winter No Holiday Yes Friday 12
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8755 2018-11-30 1003 19 4.2 34 2.6 1894 -10.3 0.0 0.0 0.0 Autumn No Holiday Yes Friday 11
8756 2018-11-30 764 20 3.4 37 2.3 2000 -9.9 0.0 0.0 0.0 Autumn No Holiday Yes Friday 11
8757 2018-11-30 694 21 2.6 39 0.3 1968 -9.9 0.0 0.0 0.0 Autumn No Holiday Yes Friday 11
8758 2018-11-30 712 22 2.1 41 1.0 1859 -9.8 0.0 0.0 0.0 Autumn No Holiday Yes Friday 11
8759 2018-11-30 584 23 1.9 43 1.3 1909 -9.3 0.0 0.0 0.0 Autumn No Holiday Yes Friday 11

8760 rows × 16 columns

Simple Data Visulizations

  • Start with the enriched table above

  • Group the data by date column

  • Calculate the mean of values per each day

  • Plot the results with date as x and number of rented bike as y

(
    enriched_bike_share_data
    .groupby('Date')
    .mean()
    .plot(y='Rented Bike Count')
);
../_images/02.02_date_functions_20_0.png
  • Start with the enriched table above

  • Group the data by date column

  • Calculate the mean of values per each day

  • Plot the results

  • with scatter plot with temp as x and number of rented bike as y

(
    enriched_bike_share_data
    .groupby('Date')
    .mean()
    .plot
    .scatter(x='Temperature(°C)', y='Rented Bike Count')
);
../_images/02.02_date_functions_22_0.png

Analyzing Corrlations between the columns

  • Start with the enriched table above

  • Group the data by date column

  • Calculate the mean of values per each day

  • Calculate the correlation values between the numeric values

  • Take the values that are related to the first column (Rented Bike Count), skip the first two rows (self correlation, and hour column)

  • Add style to the output

  • Highlight the maximum correlation value with green background

  • and Highlight the minimum correlation value with red background

(
    enriched_bike_share_data
    .groupby('Date')
    .mean()
    .corr()
    .iloc[2:,[0]]
    .style
    .highlight_max(color='green')
    .highlight_min(color='red')
)
  Rented Bike Count
Temperature(°C) 0.699824
Humidity(%) 0.028158
Wind speed (m/s) -0.174859
Visibility (10m) 0.145376
Dew point temperature(°C) 0.602648
Solar Radiation (MJ/m2) 0.684069
Rainfall(mm) -0.214252
Snowfall (cm) -0.240836
month 0.201733

We can see that the higher correlation is with the temperature (the higher the temperature, the more bikes are rented), and with the Snowfall (the more snowfall, the fewer bikes are reneted), which make sense.