Logistics Analysis¶

In this example, we will analyze the accuracy of a public transit system. The example, is analyzing three lines in Seattle, and it is based on the great blog post The Waiting Time Paradox, or, Why Is My Bus Always Late?

import pandas as pd
import numpy as np

Loading the data¶

The data that we have is capturing a couple of bus lines in Seattle, WA. Here is a map of the various lines that operate in the city.

from IPython.display import Image
Image(url='http://jakevdp.github.io/images/seattle-transit-map.png') 

url = 'https://gist.githubusercontent.com/jakevdp/82409002fcc5142a2add0168c274a869/raw/1bbabf78333306dbc45b9f33662500957b2b6dc3/arrival_times.csv'

import requests
from io import StringIO

response = requests.get(url)
response

<Response [200]>

The CSV file is read through a URL and therefore, we need to convert the text of the response to a simple string as we get when we read a local file. For that we will use the StringIO functionality as follows:

bus_arrival_times = (
    pd
    .read_csv(
        StringIO(
            response.text
        )
    )
)

bus_arrival_times

	OPD_DATE	VEHICLE_ID	RTE	DIR	TRIP_ID	STOP_ID	STOP_NAME	SCH_STOP_TM	ACT_STOP_TM
0	2016-03-26	6201	673	S	30908177	431	3RD AVE & PIKE ST (431)	01:11:57	01:13:19
1	2016-03-26	6201	673	S	30908033	431	3RD AVE & PIKE ST (431)	23:19:57	23:16:13
2	2016-03-26	6201	673	S	30908028	431	3RD AVE & PIKE ST (431)	21:19:57	21:18:46
3	2016-03-26	6201	673	S	30908019	431	3RD AVE & PIKE ST (431)	19:04:57	19:01:49
4	2016-03-26	6201	673	S	30908252	431	3RD AVE & PIKE ST (431)	16:42:57	16:42:39
...	...	...	...	...	...	...	...	...	...
39152	2016-05-27	6215	674	N	30905957	578	3RD AVE & PIKE ST (578)	10:50:25	10:49:50
39153	2016-05-27	6215	674	N	30905720	578	3RD AVE & PIKE ST (578)	13:02:27	13:07:10
39154	2016-05-27	6215	674	N	30905729	578	3RD AVE & PIKE ST (578)	15:14:29	15:15:55
39155	2016-05-27	6215	674	N	30905740	578	3RD AVE & PIKE ST (578)	17:26:29	17:25:40
39156	2016-05-27	6216	674	N	30905885	578	3RD AVE & PIKE ST (578)	18:06:29	18:07:01

39157 rows × 9 columns

Data Exploration¶

Let’s check how many days, buses, routes, trips, stops, etc. do we have in the data.

Start with the bus arrival times data above
Convert every cell to string value
Describe the table (count, unique values, top and frequency)
Focus only on the unique values count

(
    bus_arrival_times
    .astype(str)
    .describe()
    .loc['unique']
)

OPD_DATE          63
VEHICLE_ID       115
RTE                3
DIR                2
TRIP_ID         1660
STOP_ID            2
STOP_NAME          2
SCH_STOP_TM     1357
ACT_STOP_TM    27633
Name: unique, dtype: object

We see that we have 63 days of data, for 115 buses, in 3 lines (RTE) and 2 directions and schedule and arrival times in 2 stops over 1660 trips.

Data Cleanup¶

Let’s make the data easier to understand and to calculate the accuracy of the lines

Start with the bus arrival times table above
Filter out invalid records with TRIP_ID = 0
Create a scheduled column timestamp with the combination of the date and the scheduled stop time
Create a actual column timestamp with the combination of the date and the actual stop time
Calculate the difference between the above timestamps
Calculate the difference in minutes by scaling to seconds and then 60 (seconds in a minute)
Fix the rows that fall across midnights (difference larger than 1000 minutes)
Add a column with the route symbol (C, D or E)
Add a column with the direction of the trip (northbound or southbound)
Focus only on the new columns we created here

trips_accuracy = (
    bus_arrival_times
    .query('TRIP_ID > 0')
    .assign(scheduled = lambda x : pd.to_datetime(x.OPD_DATE + ' ' + x.SCH_STOP_TM, errors='raise'))
    .assign(actual = lambda x : pd.to_datetime(x.OPD_DATE + ' ' + x.ACT_STOP_TM))
    .assign(
        time_diff = lambda x : (x.actual - x.scheduled)
    )
    .assign(
        minutes_late = lambda x : x.time_diff.dt.total_seconds() / 60
    )
    .assign(minutes_late = lambda x: np.where(
        x.minutes_late < 1000, 
        x.minutes_late, 
        x.minutes_late - 60*24)
    )
    .assign(minutes_late = lambda x: np.where(
        x.minutes_late > -1000, 
        x.minutes_late, 
        x.minutes_late + 60*24)
    )
    .assign(route = lambda x: x.RTE.replace({
        673: 'C', 
        674: 'D', 
        675: 'E'})
        .astype('category')
    )
    .assign(direction = lambda x :x.DIR.replace({
        'N': 'northbound', 
        'S': 'southbound'})
        .astype('category')
    )
    .loc[:,['route', 'direction', 'scheduled', 'actual', 'minutes_late']]
)
trips_accuracy

	route	direction	scheduled	actual	minutes_late
0	C	southbound	2016-03-26 01:11:57	2016-03-26 01:13:19	1.366667
1	C	southbound	2016-03-26 23:19:57	2016-03-26 23:16:13	-3.733333
2	C	southbound	2016-03-26 21:19:57	2016-03-26 21:18:46	-1.183333
3	C	southbound	2016-03-26 19:04:57	2016-03-26 19:01:49	-3.133333
4	C	southbound	2016-03-26 16:42:57	2016-03-26 16:42:39	-0.300000
...	...	...	...	...	...
39152	D	northbound	2016-05-27 10:50:25	2016-05-27 10:49:50	-0.583333
39153	D	northbound	2016-05-27 13:02:27	2016-05-27 13:07:10	4.716667
39154	D	northbound	2016-05-27 15:14:29	2016-05-27 15:15:55	1.433333
39155	D	northbound	2016-05-27 17:26:29	2016-05-27 17:25:40	-0.816667
39156	D	northbound	2016-05-27 18:06:29	2016-05-27 18:07:01	0.533333

38917 rows × 5 columns

Data Visualization¶

Accuracy Histogram¶

We can quickly graph the distribution of the accuracy

(
    trips_accuracy
    .hist(bins=50)
);

../_images/08.01_Logistics_analysis_13_0.png

Or even break it down by the different routes and directions to see if the grouops are behaving in a similar way

(
    trips_accuracy
    .query('minutes_late < 30')
    .hist(
        bins=50,
        sharex=True, 
        by=['route','direction'], 
        figsize=(10,6)
    )
);

../_images/08.01_Logistics_analysis_15_0.png

Fancier Visualization¶

The default visualization functionality of Pandas is often enhanced with a set of graphic libraries that we will cover in a dedicated chapter. Here is an improved visualization using Seaborn.

%pip install seaborn --quiet

Note: you may need to restart the kernel to use updated packages.

import seaborn as sns
import matplotlib.pyplot as plt
g = (
    sns
    .FacetGrid(
        trips_accuracy, 
        row="direction", 
        col="route"
    )
)

g.map(
    plt.hist, 
    "minutes_late", 
    bins=np.arange(-10, 20)
)
g.set_titles('{col_name} {row_name}')
g.set_axis_labels('minutes late', 'number of trips');

../_images/08.01_Logistics_analysis_18_0.png

You might expect that the buses stick closer to their schedule near the beginning of each one-way trip and show more spread near the end, and this is borne out in the data: the southbound C-line and northbound D and E lines are near the beginning of their respective routes, and in the opposite direction they are near the end.

Scheduled and Actual Arrival Intervals¶

Start with the trips_accuracy table above
Sort the value based on the Scheduled time
Group by route and direction
Focus only on the Scheduled and actual time columns
calculate the diff between every adjacent trips in each Group
Convert null values to 0
Convert the time delta values to minutes
Merge with the original table based on the similar index of both tables
Change the name of the column from the diff table to include the suffix _interval

trips_intervals = (
    trips_accuracy
    .sort_values('scheduled')
    .groupby(['route', 'direction'])
    ['scheduled','actual']
    .diff()
    .fillna(pd.Timedelta(seconds=0))
    .astype('timedelta64[m]')
    .merge(
        trips_accuracy,
        left_index=True, 
        right_index=True,
        suffixes=("_interval","")
    )
)
trips_intervals

/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ipykernel_launcher.py:5: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
  """

	scheduled_interval	actual_interval	route	direction	scheduled	actual	minutes_late
19512	0.0	0.0	C	northbound	2016-03-26 00:00:25	2016-03-26 00:05:01	4.600000
79	0.0	0.0	C	southbound	2016-03-26 00:04:57	2016-03-26 00:05:18	0.350000
19725	0.0	0.0	D	northbound	2016-03-26 00:05:25	2016-03-26 00:05:06	-0.316667
155	0.0	0.0	E	southbound	2016-03-26 00:27:59	2016-03-26 00:28:04	0.083333
19569	0.0	0.0	E	northbound	2016-03-26 00:28:10	2016-03-26 00:29:46	1.600000
...	...	...	...	...	...	...	...
19462	15.0	13.0	D	southbound	2016-05-27 23:44:58	2016-05-27 23:45:36	0.633333
4609	15.0	16.0	C	southbound	2016-05-27 23:49:57	2016-05-27 23:49:35	-0.366667
9735	30.0	27.0	E	southbound	2016-05-27 23:53:59	2016-05-27 23:57:15	3.266667
29313	30.0	30.0	E	northbound	2016-05-27 23:55:10	2016-05-27 23:54:09	-1.016667
19426	15.0	-1425.0	D	southbound	2016-05-27 23:59:58	2016-05-27 00:01:28	1.500000

38917 rows × 7 columns

We can see that we have a few issues with the midnights and some of the interval are a day apart. Previously, we fixed it in the flow, however, now we can use a function that is fixing this issues on a specific column to simplify our code.

Start with the given table
If the value is smaller than 1,000
Assign to the given column name the original value of the column
Otheriwse, assign a value that is (60 minutes by 24 hours) minutes smaller
Repeat for values that are larger than -1,000

def fix_midnights(df, col_name):
    fixed_df = (
        df
        .assign(**{col_name : lambda x: np.where(
            x[col_name] < 1_000, 
            x[col_name], 
            x[col_name] - 60*24)}
        )
        .assign(**{col_name : lambda x: np.where(
            x[col_name] > -1_000, 
            x[col_name], 
            x[col_name] + 60*24)}
        )
    )
    return fixed_df

Now that we have a business logic function that can fix the midnight problem that we saw in our data, we can apply on the two problematic columns in our table:

Start with the trip intervals table above
Fix the midnight problem in scheduled_interval column
Fix the midnight problem in actual_interval column

fixed_trip_intervals = (
    trips_intervals
    .pipe(fix_midnights, 'scheduled_interval')
    .pipe(fix_midnights, 'actual_interval')
)
fixed_trip_intervals

	scheduled_interval	actual_interval	route	direction	scheduled	actual	minutes_late
19512	0.0	0.0	C	northbound	2016-03-26 00:00:25	2016-03-26 00:05:01	4.600000
79	0.0	0.0	C	southbound	2016-03-26 00:04:57	2016-03-26 00:05:18	0.350000
19725	0.0	0.0	D	northbound	2016-03-26 00:05:25	2016-03-26 00:05:06	-0.316667
155	0.0	0.0	E	southbound	2016-03-26 00:27:59	2016-03-26 00:28:04	0.083333
19569	0.0	0.0	E	northbound	2016-03-26 00:28:10	2016-03-26 00:29:46	1.600000
...	...	...	...	...	...	...	...
19462	15.0	13.0	D	southbound	2016-05-27 23:44:58	2016-05-27 23:45:36	0.633333
4609	15.0	16.0	C	southbound	2016-05-27 23:49:57	2016-05-27 23:49:35	-0.366667
9735	30.0	27.0	E	southbound	2016-05-27 23:53:59	2016-05-27 23:57:15	3.266667
29313	30.0	30.0	E	northbound	2016-05-27 23:55:10	2016-05-27 23:54:09	-1.016667
19426	15.0	15.0	D	southbound	2016-05-27 23:59:58	2016-05-27 00:01:28	1.500000

38917 rows × 7 columns

Data Visualization¶

Actual Intervals Distribution¶

Create grid of graphs from the fixed trip intervals table
each row will have a different direction
each column will have a different route
In each graph create an histogram with the actual interval value
with bins between 0 and 50
Add a title for each graph with the route and direction
Add labels to the axis

g = (
    sns
    .FacetGrid(
        fixed_trip_intervals, 
        row="direction", 
        col="route"
    )
)
g.map(
    plt.hist, 
    "actual_interval", 
    bins=np.arange(50)
)
g.set_titles('{col_name} {row_name}')
g.set_axis_labels('actual interval (minutes)', 'number of buses');

../_images/08.01_Logistics_analysis_27_0.png

the distributions may be affected by non-constant scheduled arrival intervals. Let’s understand the different intervals that are scheduled. We will repeat the graphs logic above and plot the schedule interval instead of the actual ones.

g = (
    sns
    .FacetGrid(
        fixed_trip_intervals, 
        row="direction", 
        col="route"
    )
)
g.map(plt.hist, "scheduled_interval", bins=np.arange(20) - 0.5)
g.set_titles('{col_name} {row_name}')
g.set_axis_labels('scheduled interval (minutes)', 'frequency');

../_images/08.01_Logistics_analysis_29_0.png

Heat Maps¶

We can see the most of the lines have 7, 12 and 15 minutes interval for most of the trips. If we to see the heat map of the interval plans, we can add the hour and the day-of-week (dow) to the table using eval:

Start with the fixed trip intervals table above
Add a column with the hour in the day of each trip
Add a column with the name of the day in the week
Convert the name of the day to be an ordered category based on the day names from the calendar (it will make the order of the days in the graph to be as expected and not alphabetic)

from pandas.api.types import CategoricalDtype
import calendar
trip_intervals_with_time = (
    fixed_trip_intervals
    .eval("hour = scheduled.dt.hour")
    .assign(dow = lambda x : x.scheduled.dt.day_name().astype(
        CategoricalDtype(
            ordered=True, 
            categories=list(calendar.day_name))
        )
    )
)
trip_intervals_with_time

	scheduled_interval	actual_interval	route	direction	scheduled	actual	minutes_late	hour	dow
19512	0.0	0.0	C	northbound	2016-03-26 00:00:25	2016-03-26 00:05:01	4.600000	0	Saturday
79	0.0	0.0	C	southbound	2016-03-26 00:04:57	2016-03-26 00:05:18	0.350000	0	Saturday
19725	0.0	0.0	D	northbound	2016-03-26 00:05:25	2016-03-26 00:05:06	-0.316667	0	Saturday
155	0.0	0.0	E	southbound	2016-03-26 00:27:59	2016-03-26 00:28:04	0.083333	0	Saturday
19569	0.0	0.0	E	northbound	2016-03-26 00:28:10	2016-03-26 00:29:46	1.600000	0	Saturday
...	...	...	...	...	...	...	...	...	...
19462	15.0	13.0	D	southbound	2016-05-27 23:44:58	2016-05-27 23:45:36	0.633333	23	Friday
4609	15.0	16.0	C	southbound	2016-05-27 23:49:57	2016-05-27 23:49:35	-0.366667	23	Friday
9735	30.0	27.0	E	southbound	2016-05-27 23:53:59	2016-05-27 23:57:15	3.266667	23	Friday
29313	30.0	30.0	E	northbound	2016-05-27 23:55:10	2016-05-27 23:54:09	-1.016667	23	Friday
19426	15.0	15.0	D	southbound	2016-05-27 23:59:58	2016-05-27 00:01:28	1.500000	23	Friday

38917 rows × 9 columns

To plot the heatmap we will use the built-in heatmap graph of Seaborn library:

Start with the trip intervals table with the hour and dow above
Take only the hour, dow and scheduled interval columns
Create a pivot table with hour as the index and dow as columns (with mean as default aggregation function)
Clip values that are larger than 60 minutes to 60
Remove the first level of the columns names (the name of the calculated column scheduled interval)

sns.heatmap(
    trip_intervals_with_time
    [['hour','dow','scheduled_interval']]
    .pivot_table(index='hour', columns='dow')
    .clip(upper=60)
    .droplevel(0, axis='columns'),
    cmap="YlOrRd"
);

../_images/08.01_Logistics_analysis_33_0.png

As expected the larger intervals are during the night (between midnight and 5 AM) and the shortest are during the rush hours of the morning and afternoon. We can also see the Saturday and Sunday have longer interval as expected.

Let’s check how the actual arrival times behave across the week and create a similar heatmap of the minute late column that we calculated above.

sns.heatmap(
    trip_intervals_with_time
    [['hour','dow','minutes_late']]
    .pivot_table(index='hour', columns='dow')
    .clip(upper=60)
    .droplevel(0, axis='columns'),
    cmap="YlOrRd"
);

../_images/08.01_Logistics_analysis_35_0.png

Most of the problems are at 5 PM mostly on Monday, Thursday and Friday. We can also see some higher delay on the last shift of the night at 4 AM, however, it is probably not affecting that many people like the delays of the rush hours of the afternoon.

Let’s break it down by route:

minutes_late_heatmap_by_route = (
    trip_intervals_with_time
    [['hour','dow','minutes_late','route']]
    .pivot_table(index=['route','hour'], columns=['dow'])
    .clip(upper=60)
    .droplevel(0, axis='columns')
)

Heatmap as a table with values¶

This time we will create the heat map more similar to the way it will be presented in Excel using style (similar to Conditional Formatting in Excel):

Start with the pivot table per route
Apply style to table
Format the values to be with up to 2 digits of precision
Apply conditional backgroud to each cell with gradient from yellow to orange to red

(
    minutes_late_heatmap_by_route
    .style
    .format("{:.2}", na_rep="-")
    .background_gradient(cmap='YlOrRd')
)

	dow	Monday	Tuesday	Wednesday	Thursday	Friday	Saturday	Sunday
route	hour
C	0	-0.017	-1.4	-1.0	-0.67	-0.62	1.6	0.98
	1	0.33	-0.21	-1.3	-0.67	0.34	-0.25	-1.9
	2	0.14	-0.68	1.2	1.3	0.4	2.1	2.1
	3	0.36	0.91	-2.1	1.5	-	-	0.15
	4	3.6	-1.6	5.0	1.5	-	-	4.0
	5	1.2	0.61	0.39	0.51	-0.0023	-0.55	-0.37
	6	0.33	0.42	0.16	0.04	0.28	-0.7	0.018
	7	1.7	2.2	2.1	1.6	2.0	0.36	-0.51
	8	2.3	2.5	2.6	3.7	4.3	-0.58	-1.7
	9	1.0	2.0	2.7	2.5	2.8	0.73	-2.3
	10	-0.33	-0.44	-0.46	1.1	0.98	1.7	-1.9
	11	-1.4	-0.93	-0.11	0.23	1.4	3.1	-1.7
	12	-0.57	-1.2	-0.23	-0.19	0.96	0.88	-1.2
	13	-2.0	-1.8	-1.0	-0.83	-0.082	2.3	-0.51
	14	0.01	0.12	0.69	0.052	1.1	2.0	-1.2
	15	0.52	0.19	0.54	1.3	2.1	1.9	-1.9
	16	2.1	1.6	2.0	2.8	4.8	1.6	-1.8
	17	4.1	3.4	3.4	6.7	6.5	0.51	-1.2
	18	2.6	1.2	1.7	4.8	5.3	-0.00075	-0.74
	19	0.24	-0.1	0.4	1.8	2.3	2.0	0.27
	20	2.1	1.5	0.52	2.6	2.7	2.5	1.1
	21	1.6	0.88	1.8	2.9	2.7	3.4	0.044
	22	0.34	0.39	-0.1	1.5	0.99	2.1	0.52
	23	-0.37	-0.67	-0.91	0.27	-0.28	1.1	-1.2
D	0	0.11	1.0	0.088	-0.11	1.7	1.6	2.4
	1	3.6	0.68	0.12	1.1	4.0	3.0	3.1
	2	2.2	1.7	3.6	2.2	1.8	3.1	3.1
	3	2.0	1.8	1.7	2.1	-0.5	3.1	2.5
	4	2.6	1.4	1.1	1.8	0.071	2.0	3.2
	5	0.4	0.12	0.28	0.69	1.3	-1.8	-1.4
	6	0.82	1.2	0.91	1.3	1.2	-0.77	-0.38
	7	0.55	1.1	1.1	0.76	0.51	-1.5	-0.3
	8	2.7	2.4	2.8	2.1	1.3	0.0045	-0.95
	9	0.42	1.2	1.8	2.0	1.1	-0.64	-0.46
	10	0.52	0.28	0.19	0.84	1.7	0.94	0.36
	11	-0.0036	1.1	0.93	1.3	1.9	1.2	0.2
	12	0.57	0.82	0.78	1.7	2.2	0.65	-0.02
	13	0.45	1.3	1.2	0.89	1.2	1.7	0.78
	14	0.19	0.9	1.3	1.5	1.5	0.77	0.44
	15	2.5	1.9	2.3	1.7	3.3	1.3	0.17
	16	4.7	0.43	0.62	2.2	3.1	1.5	0.086
	17	5.2	1.9	2.1	6.8	6.5	2.4	0.4
	18	1.5	1.4	1.3	4.8	3.8	2.5	1.5
	19	1.4	0.43	0.93	4.2	2.2	1.4	0.65
	20	2.3	1.3	2.3	3.4	3.2	1.7	1.2
	21	2.5	2.4	2.6	4.0	2.9	3.4	2.5
	22	1.9	1.1	1.2	1.7	3.2	3.4	2.3
	23	2.0	1.3	2.0	2.0	3.7	3.4	2.3
E	0	1.7	1.4	1.6	0.81	2.5	0.15	-0.89
	1	0.92	1.9	1.5	0.17	2.8	-0.23	-0.33
	2	2.8	1.2	0.9	-1.2	4.3	1.9	1.1
	3	2.3	1.3	2.3	2.8	-	3.3	2.8
	5	0.85	1.3	0.56	1.4	1.6	2.8	-0.47
	6	0.89	1.9	1.3	1.5	1.6	-0.85	1.0
	7	1.1	2.2	1.7	1.5	1.4	0.28	0.29
	8	2.2	1.9	2.8	2.2	1.1	0.77	1.8
	9	1.0	1.1	1.2	1.3	1.3	0.93	1.1
	10	1.1	0.85	2.0	2.1	2.5	1.6	1.5
	11	0.75	1.2	0.76	1.4	2.1	-0.39	0.74
	12	1.2	1.4	2.2	1.9	2.5	1.5	0.25
	13	0.21	0.53	0.58	0.88	2.3	1.0	0.98
	14	1.0	2.1	1.2	2.5	4.6	0.93	1.1
	15	1.8	1.7	2.5	2.6	4.2	0.9	1.3
	16	2.8	2.3	3.1	4.4	4.9	-0.32	1.1
	17	2.7	3.1	2.9	3.9	4.3	0.61	1.2
	18	0.62	0.36	0.69	0.7	2.1	-0.35	-0.39
	19	0.29	0.77	1.5	0.85	2.0	0.066	-1.0
	20	1.3	1.7	1.8	2.4	2.1	-0.27	-0.68
	21	1.5	3.0	2.7	1.4	2.5	-0.66	-1.3
	22	2.2	2.4	2.5	2.3	2.3	-0.48	-0.57
	23	2.3	2.2	3.2	1.9	3.5	-0.37	-1.2

Now it is easy to see the 5-7 minutes delays during the rush hours of the C and D routes. We can also explore other problemaic times on Saturday, etc.

From Excel to Pandas

Logistics Analysis

Contents

Logistics Analysis¶

Loading the data¶

Data Exploration¶

Data Cleanup¶

Data Visualization¶

Accuracy Histogram¶

Fancier Visualization¶

Scheduled and Actual Arrival Intervals¶

Data Visualization¶

Actual Intervals Distribution¶

Heat Maps¶

Heatmap as a table with values¶