Merge/Join Tables (VLOOKUP)¶

The merge function mimics the functionality of JOIN in SQL queries and replaces the VLOOKUP functionality in Excel. It is one of the most powerful and useful functions for dataframes in Pandas. The main idea is to:

join two (or more) dataframe table using similar keys in each of the tables.
enrich tables with loopup data
validate the match values of the key columns in the tables

import pandas as pd

Loading lookup data¶

For this example, we will take one of the common lookup data, zip code. The loopup is not trivial as there are a few tousands (more than 33,000 as you can see below) of values and a simple Excel file will struggle to do it efficently and quickly.

We will start with loading the loopup data from Simple Maps, which I’ve downloaded to a public S3 bucket.

url = 'https://mlguy-public.s3-eu-west-1.amazonaws.com/excel2pandas/chapter3/simplemaps_uszips_basicv1/uszips.csv'

import requests
from io import StringIO

response = requests.get(url)
response

<Response [200]>

The CSV file is read through a URL and therefore, we need to convert the text of the response to a simple string as we get when we read a local file. For that we will use the StringIO functionality as follows:

zip_lookup = (
    pd
    .read_csv(
        StringIO(
            response.text
        )
    )
)

zip_lookup

	zip	lat	lng	city	state_id	state_name	zcta	parent_zcta	population	density	county_fips	county_name	county_weights	county_names_all	county_fips_all	imprecise	military	timezone
0	601	18.18004	-66.75218	Adjuntas	PR	Puerto Rico	True	NaN	17242	111.4	72001	Adjuntas	{'72001':99.43,'72141':0.57}	Adjuntas\|Utuado	72001\|72141	False	False	America/Puerto_Rico
1	602	18.36073	-67.17517	Aguada	PR	Puerto Rico	True	NaN	38442	523.5	72003	Aguada	{'72003':100}	Aguada	72003	False	False	America/Puerto_Rico
2	603	18.45439	-67.12202	Aguadilla	PR	Puerto Rico	True	NaN	48814	667.9	72005	Aguadilla	{'72005':100}	Aguadilla	72005	False	False	America/Puerto_Rico
3	606	18.16724	-66.93828	Maricao	PR	Puerto Rico	True	NaN	6437	60.4	72093	Maricao	{'72093':94.88,'72121':1.35,'72153':3.78}	Maricao\|Yauco\|Sabana Grande	72093\|72153\|72121	False	False	America/Puerto_Rico
4	610	18.29032	-67.12243	Anasco	PR	Puerto Rico	True	NaN	27073	312.0	72011	AÃ±asco	{'72003':0.55,'72011':99.45}	AÃ±asco\|Aguada	72011\|72003	False	False	America/Puerto_Rico
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
33092	99923	55.97796	-130.03671	Hyder	AK	Alaska	True	NaN	15	2.1	2198	Prince of Wales-Hyder	{'02198':100}	Prince of Wales-Hyder	02198	False	False	America/Sitka
33093	99925	55.55767	-132.97627	Klawock	AK	Alaska	True	NaN	927	5.7	2198	Prince of Wales-Hyder	{'02198':100}	Prince of Wales-Hyder	02198	False	False	America/Sitka
33094	99926	55.12617	-131.48928	Metlakatla	AK	Alaska	True	NaN	1635	4.2	2198	Prince of Wales-Hyder	{'02198':100}	Prince of Wales-Hyder	02198	False	False	America/Metlakatla
33095	99927	56.25100	-133.37571	Point Baker	AK	Alaska	True	NaN	38	0.2	2198	Prince of Wales-Hyder	{'02198':100}	Prince of Wales-Hyder	02198	False	False	America/Sitka
33096	99929	56.36950	-131.93648	Wrangell	AK	Alaska	True	NaN	2484	0.4	2275	Wrangell	{'02275':100}	Wrangell	02275	False	False	America/Sitka

33097 rows × 18 columns

Lookup Data Exploration¶

The table above shows us the type of data that we can get from enrichment based on the zip code, including city, state, latitude, longitude, population count and density.

We can explore the values that we have in this table, before we start to use it for enrichment.

Counting Values¶

The simplest aggregation function for each group is the size. How many zip codes do we have in each state?

(
    zip_lookup
    .groupby('state_name')
    .size()
)

state_name
Alabama                  642
Alaska                   238
Arizona                  405
Arkansas                 591
California              1761
Colorado                 525
Connecticut              282
Delaware                  67
District of Columbia      52
Florida                  981
Georgia                  735
Hawaii                    94
Idaho                    277
Illinois                1383
Indiana                  775
Iowa                     934
Kansas                   697
Kentucky                 767
Louisiana                515
Maine                    432
Maryland                 468
Massachusetts            537
Michigan                 986
Minnesota                884
Mississippi              423
Missouri                1022
Montana                  361
Nebraska                 581
Nevada                   175
New Hampshire            248
New Jersey               595
New Mexico               368
New York                1794
North Carolina           808
North Dakota             383
Ohio                    1195
Oklahoma                 648
Oregon                   417
Pennsylvania            1795
Puerto Rico              131
Rhode Island              77
South Carolina           424
South Dakota             371
Tennessee                628
Texas                   1935
Utah                     287
Vermont                  255
Virginia                 896
Washington               596
West Virginia            706
Wisconsin                772
Wyoming                  178
dtype: int64

Sorting Values¶

To sort the values is also simple with sort_value() function, and see the population size of each state (based on the zip code lookup data):

Start with the zip lookup data above
Group the row by state_name and take only the population values
Sum up all the population values for each zip code area in the state Group
Sort the states by the accumulated population value in descending order

us_population_distribution = (
    zip_lookup
    .groupby('state_name')['population']
    .sum()
    .sort_values(ascending=False)
)

us_population_distribution

state_name
California              39140219
Texas                   27884135
Florida                 20594164
New York                19618323
Illinois                12821487
Pennsylvania            12790950
Ohio                    11639989
Georgia                 10297534
North Carolina          10155624
Michigan                 9957465
New Jersey               8881845
Virginia                 8414042
Washington               7280023
Arizona                  6949259
Massachusetts            6830133
Tennessee                6644470
Indiana                  6637220
Missouri                 6090358
Maryland                 6003435
Wisconsin                5777892
Colorado                 5531233
Minnesota                5527473
South Carolina           4955925
Alabama                  4864630
Louisiana                4663243
Kentucky                 4446667
Oregon                   4081732
Oklahoma                 3918473
Connecticut              3581504
Puerto Rico              3384779
Iowa                     3133061
Utah                     3043806
Arkansas                 2990472
Mississippi              2988710
Nevada                   2923171
Kansas                   2908448
New Mexico               2089570
Nebraska                 1904901
West Virginia            1828786
Idaho                    1687214
Hawaii                   1422019
New Hampshire            1343673
Maine                    1332721
Rhode Island             1056611
Montana                  1042359
Delaware                  949495
South Dakota              864012
North Dakota              751250
Alaska                    737979
District of Columbia      684390
Vermont                   624977
Wyoming                   582091
Name: population, dtype: int64

Visualization of the data¶

(
    us_population_distribution
    .plot
    .bar(figsize=(13,5))
);

Loading the main data¶

We will take the data about brewries in the US that we used in one of the previous lessson, and enrich it with the data from the zip codes table. We will load it from the local file after we retried the data from the API before.

breweries_data= pd.read_csv('../data/us_breweries.csv')

breweries_data.head()

	id	name	brewery_type	street	address_2	address_3	city	state	county_province	postal_code	country	longitude	latitude	phone	website_url	updated_at	created_at
0	2	Avondale Brewing Co	micro	201 41st St S	NaN	NaN	Birmingham	Alabama	NaN	35222-1932	United States	-86.774322	33.524521	2.057775e+09	http://www.avondalebrewing.com	2018-08-23T23:19:57.825Z	2018-07-24T01:32:47.255Z
1	44	Trim Tab Brewing	micro	2721 5th Ave S	NaN	NaN	Birmingham	Alabama	NaN	35233-3401	United States	-86.791400	33.512849	2.057031e+09	http://www.trimtabbrewing.com	2018-08-23T23:20:31.423Z	2018-07-24T01:32:47.815Z
2	46	Yellowhammer Brewery	micro	2600 Clinton Ave W	NaN	NaN	Huntsville	Alabama	NaN	35805-3046	United States	-86.593201	34.727752	2.569756e+09	http://www.yellowhammerbrewery.com	2018-08-23T23:20:33.102Z	2018-07-24T01:32:47.838Z
3	55	Bearpaw River Brewing Co	micro	4605 E Palmer Wasilla Hwy	NaN	NaN	Wasilla	Alaska	NaN	99654-7679	United States	-149.412710	61.575269	NaN	http://bearpawriverbrewing.com	2018-08-23T23:20:40.743Z	2018-07-24T01:32:47.967Z
4	76	King Street Brewing Co	micro	9050 King Street	NaN	NaN	Anchorage	Alaska	NaN	99515	United States	-149.879076	61.138489	9.073365e+09	http://www.kingstreetbrewing.com	2018-08-23T23:20:57.179Z	2018-07-24T01:32:48.301Z

We can see that the postal code is sometimes in the longer format (for example, 35222-1932) compare to the zip codes that we have in our lookup table (for example, 35222). We will convert them to the shorter format by taking the first 5 characters ([:5]) of the string of the postal_code column.

breweries_data_with_zip = (
    breweries_data
    .assign(zip_code = lambda x : x.postal_code.str[:5])
)

breweries_data_with_zip.head()

	id	name	brewery_type	street	address_2	address_3	city	state	county_province	postal_code	country	longitude	latitude	phone	website_url	updated_at	created_at	zip_code
0	2	Avondale Brewing Co	micro	201 41st St S	NaN	NaN	Birmingham	Alabama	NaN	35222-1932	United States	-86.774322	33.524521	2.057775e+09	http://www.avondalebrewing.com	2018-08-23T23:19:57.825Z	2018-07-24T01:32:47.255Z	35222
1	44	Trim Tab Brewing	micro	2721 5th Ave S	NaN	NaN	Birmingham	Alabama	NaN	35233-3401	United States	-86.791400	33.512849	2.057031e+09	http://www.trimtabbrewing.com	2018-08-23T23:20:31.423Z	2018-07-24T01:32:47.815Z	35233
2	46	Yellowhammer Brewery	micro	2600 Clinton Ave W	NaN	NaN	Huntsville	Alabama	NaN	35805-3046	United States	-86.593201	34.727752	2.569756e+09	http://www.yellowhammerbrewery.com	2018-08-23T23:20:33.102Z	2018-07-24T01:32:47.838Z	35805
3	55	Bearpaw River Brewing Co	micro	4605 E Palmer Wasilla Hwy	NaN	NaN	Wasilla	Alaska	NaN	99654-7679	United States	-149.412710	61.575269	NaN	http://bearpawriverbrewing.com	2018-08-23T23:20:40.743Z	2018-07-24T01:32:47.967Z	99654
4	76	King Street Brewing Co	micro	9050 King Street	NaN	NaN	Anchorage	Alaska	NaN	99515	United States	-149.879076	61.138489	9.073365e+09	http://www.kingstreetbrewing.com	2018-08-23T23:20:57.179Z	2018-07-24T01:32:48.301Z	99515

Joining the tables¶

The joining of the table is based on a joined key. In this case we want to use the 5-digits zip code as the lookup or join key. In the previous step with shorten the longer zip codes to the shorter 5 digits format, and now we will make sure that the lookup table also have it in the same format.

Start with the zip loopup table
Add a column zip_code that is based on the value of the column zip
Convert the numeric value into string (astype(str)),
Pad the string with zeros when the number is shorter than 5 digits (zfill(5)).

zip_lookup_as_string = (
    zip_lookup
    .assign(
        zip_code = lambda x : x.zip
        .astype(str)
        .str
        .zfill(5)
    )
)

The join itself is simple.

Start with the breweries table that you want to enrich
Join using merge with the second zip loopup table
Define the key column with the same name (zip_code), and we use it using the on argument
Lastly, we want to have all breweries, even if we don’t find the zip code in the lookup table. Therefore, we are using LEFT join using the how argument.

enriched_breweries_data = (
    breweries_data_with_zip
    .merge(
        zip_lookup_as_string, 
        on='zip_code', 
        how='left'
    )
)

We have now many more columns as all the columns of both tables are joined to the enriched table. We will ask the Jupyter notebook to show us all the columns by removing the default maximum number of columns to display.

In the table below we can see all the columns. If the same column name is found in both tables and it wasn’t the column that was used for the join or merge, the column of first (“left”) table will be appended with x (city_x, for example), and the column of the second (“right”) table will be appended with y (city_y, for example).

pd.set_option('display.max_columns', None)
enriched_breweries_data

	id	name	brewery_type	street	address_2	address_3	city_x	state	county_province	postal_code	country	longitude	latitude	phone	website_url	updated_at	created_at	zip_code	zip	lat	lng	city_y	state_id	state_name	zcta	parent_zcta	population	density	county_fips	county_name	county_weights	county_names_all	county_fips_all	imprecise	military	timezone
0	2	Avondale Brewing Co	micro	201 41st St S	NaN	NaN	Birmingham	Alabama	NaN	35222-1932	United States	-86.774322	33.524521	2.057775e+09	http://www.avondalebrewing.com	2018-08-23T23:19:57.825Z	2018-07-24T01:32:47.255Z	35222	35222.0	33.52409	-86.77025	Birmingham	AL	Alabama	True	NaN	9033.0	815.8	1073.0	Jefferson	{'01073':100}	Jefferson	01073	False	False	America/Chicago
1	44	Trim Tab Brewing	micro	2721 5th Ave S	NaN	NaN	Birmingham	Alabama	NaN	35233-3401	United States	-86.791400	33.512849	2.057031e+09	http://www.trimtabbrewing.com	2018-08-23T23:20:31.423Z	2018-07-24T01:32:47.815Z	35233	35233.0	33.50876	-86.80194	Birmingham	AL	Alabama	True	NaN	1715.0	300.7	1073.0	Jefferson	{'01073':100}	Jefferson	01073	False	False	America/Chicago
2	46	Yellowhammer Brewery	micro	2600 Clinton Ave W	NaN	NaN	Huntsville	Alabama	NaN	35805-3046	United States	-86.593201	34.727752	2.569756e+09	http://www.yellowhammerbrewery.com	2018-08-23T23:20:33.102Z	2018-07-24T01:32:47.838Z	35805	35805.0	34.70858	-86.62125	Huntsville	AL	Alabama	True	NaN	21200.0	949.7	1089.0	Madison	{'01089':100}	Madison	01089	False	False	America/Chicago
3	55	Bearpaw River Brewing Co	micro	4605 E Palmer Wasilla Hwy	NaN	NaN	Wasilla	Alaska	NaN	99654-7679	United States	-149.412710	61.575269	NaN	http://bearpawriverbrewing.com	2018-08-23T23:20:40.743Z	2018-07-24T01:32:47.967Z	99654	99654.0	61.45483	-149.90045	Wasilla	AK	Alaska	True	NaN	62135.0	40.4	2170.0	Matanuska-Susitna	{'02170':100}	Matanuska-Susitna	02170	False	False	America/Anchorage
4	76	King Street Brewing Co	micro	9050 King Street	NaN	NaN	Anchorage	Alaska	NaN	99515	United States	-149.879076	61.138489	9.073365e+09	http://www.kingstreetbrewing.com	2018-08-23T23:20:57.179Z	2018-07-24T01:32:48.301Z	99515	99515.0	61.11739	-149.88889	Anchorage	AK	Alaska	True	NaN	23402.0	818.8	2020.0	Anchorage	{'02020':100}	Anchorage	02020	False	False	America/Anchorage
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7445	345	Barrel Brothers Brewing Company	micro	399 Business Park Ct Ste 506	NaN	NaN	Windsor	California	NaN	95492-6652	United States	NaN	NaN	7.076969e+09	http://www.barrelbrothersbrewing.com	2018-08-11T21:35:55.480Z	2018-07-24T01:32:52.148Z	95492	95492.0	38.53034	-122.81716	Windsor	CA	California	True	NaN	29920.0	579.0	6097.0	Sonoma	{'06097':100}	Sonoma	06097	False	False	America/Los_Angeles
7446	352	Bay Bridge Brewing Co	micro	688 Marsat Ct Ste B	NaN	NaN	Chula Vista	California	NaN	91911-4697	United States	NaN	NaN	6.199347e+09	http://www.baybridgebrewing.com	2018-08-11T21:35:55.704Z	2018-07-24T01:32:52.238Z	91911	91911.0	32.60738	-117.05405	Chula Vista	CA	California	True	NaN	85365.0	2736.2	6073.0	San Diego	{'06073':100}	San Diego	06073	False	False	America/Los_Angeles
7447	367	Benchmark Brewing Co	micro	6190 Fairmount Ave Ste G	NaN	NaN	San Diego	California	NaN	92120-3428	United States	NaN	NaN	6.197953e+09	http://www.benchmarkbrewing.com	2018-08-11T21:35:56.157Z	2018-07-24T01:32:52.420Z	92120	92120.0	32.79468	-117.07114	San Diego	CA	California	True	NaN	29142.0	1501.7	6073.0	San Diego	{'06073':100}	San Diego	06073	False	False	America/Los_Angeles
7448	477	Chino Valley Brewery	micro	1609 E. Grove Ave, Unit 109	NaN	NaN	Ontario	California	NaN	91761-5786	United States	NaN	NaN	9.512917e+09	http://www.chinovalleybrewery.com	2018-08-11T21:35:59.331Z	2018-07-24T01:32:53.774Z	91761	91761.0	34.03459	-117.59212	Ontario	CA	California	True	NaN	61124.0	754.0	6071.0	San Bernardino	{'06071':100}	San Bernardino	06071	False	False	America/Los_Angeles
7449	509	Crooked Goat Brewing	micro	120 Morris St Ste 120	NaN	NaN	Sebastopol	California	NaN	95472-3867	United States	NaN	NaN	7.078274e+09	http://www.crookedgoatbrewing.com	2018-08-11T21:36:00.652Z	2018-07-24T01:32:54.205Z	95472	95472.0	38.39817	-122.86574	Sebastopol	CA	California	True	NaN	30762.0	157.8	6097.0	Sonoma	{'06097':100}	Sonoma	06097	False	False	America/Los_Angeles

7450 rows × 36 columns

Now, that we have the enriched table, we can analyze the joined data and explore the different counties across the US:

Start with the enriched breweries data above
Group the breweries by county_name
For each group add the State ID
count the number of breweries,
sum up the populations of each zip code area in each group
and calculate the average population density in the county
Sort that list by the population size

(
    enriched_breweries_data
    .groupby('county_name')
    .agg(
        state=('state_id', 'min'), 
        brewry_count=('id', 'count'),
        population_sum=('population', 'sum'),
        density_average=('density', 'mean')
        )
    .sort_values(by='population_sum', ascending=False)
)

	state	brewry_count	population_sum	density_average
county_name
San Diego	CA	154	6059002.0	1670.298052
Los Angeles	CA	126	4529555.0	2834.088095
Cook	IL	98	4510406.0	5011.613265
King	WA	117	3519428.0	2407.258120
Orange	CA	88	3261503.0	1574.422727
...	...	...	...	...
Clearwater	ID	1	164.0	5.800000
Koochiching	MN	1	151.0	334.400000
Harding	NM	1	139.0	0.100000
Wabasha	MN	1	88.0	29.500000
Keweenaw	MI	1	87.0	1.000000

959 rows × 4 columns

We can see that san Diego is the largest county in terms of population and also a beer county with 154 different brewries in it.

Let’s change the sort key to find the counties with the most breweries

(
    enriched_breweries_data
    .groupby('county_name')
    .agg(
        state=('state_id', 'min'), 
        brewry_count=('id', 'count'),
        population_sum=('population', 'sum'),
        density_average=('density', 'mean')
        )
    .sort_values(by='brewry_count', ascending=False)
)

	state	brewry_count	population_sum	density_average
county_name
San Diego	CA	154	6059002.0	1670.298052
Los Angeles	CA	126	4529555.0	2834.088095
King	WA	117	3519428.0	2407.258120
Cook	IL	98	4510406.0	5011.613265
Jefferson	AL	93	1951501.0	770.348387
...	...	...	...	...
Flagler	FL	1	43813.0	247.300000
Fond du Lac	WI	1	10579.0	37.900000
Pitkin	CO	1	10653.0	8.900000
Pine	MN	1	9609.0	21.600000
Accomack	VA	1	2899.0	58.800000

959 rows × 4 columns

If you remember the map of the US that we saw in the previous section, it is now more clear the Cook county in IL is likely the dark area we saw in our hexbin visualization of the data, as a dense population of breweries.

We can now ask ourselves where do we have the most brewries per population and add this column to our table

(
    enriched_breweries_data
    .groupby('county_name')
    .agg(
        state=('state_id', 'min'), 
        brewry_count=('id', 'count'),
        population_sum=('population', 'sum'),
        density_average=('density', 'mean')
        )
    .assign(brewry_per_population = lambda x : x.brewry_count / x.population_sum * 1000)
    .sort_values(by='brewry_per_population', ascending=False)
)

	state	brewry_count	population_sum	density_average	brewry_per_population
county_name
Keweenaw	MI	1	87.0	1.000000	11.494253
Wabasha	MN	1	88.0	29.500000	11.363636
Harding	NM	1	139.0	0.100000	7.194245
Koochiching	MN	1	151.0	334.400000	6.622517
Clearwater	ID	1	164.0	5.800000	6.097561
...	...	...	...	...	...
Cabarrus	NC	5	299927.0	256.800000	0.016671
Fort Bend	TX	6	371516.0	472.583333	0.016150
Chesapeake	VA	1	65603.0	155.100000	0.015243
Guadalupe	TX	7	494106.0	113.828571	0.014167
Hidalgo	TX	4	287487.0	798.050000	0.013914

959 rows × 5 columns

We can see, as it is often the case, that we have on the top and the bottom of the table counties with very few brewries. The top of the table of brewries per 1000 people is held by Keweenaw that was last on our previous table of counties by population. To avoid this, we can filter (using query) our table to counties that have more than 5 brewries.

(
    enriched_breweries_data
    .groupby('county_name')
    .agg(
        state=('state_id', 'min'), 
        brewry_count=('id', 'count'),
        population_sum=('population', 'sum'),
        density_average=('density', 'mean')
        )
    .assign(brewry_per_population = lambda x : x.brewry_count / x.population_sum * 1000)
    .query("brewry_count > 5")
    .sort_values(by='brewry_per_population', ascending=False)
)

	state	brewry_count	population_sum	density_average	brewry_per_population
county_name
Schuyler	NY	7	15117.0	26.285714	0.463055
Grafton	NH	7	25969.0	28.714286	0.269552
Windsor	VT	7	28073.0	36.228571	0.249350
Waldo	ME	7	28697.0	28.357143	0.243928
Grand	CO	7	30992.0	3.485714	0.225865
...	...	...	...	...	...
St. Charles	MO	7	356115.0	426.171429	0.019657
Kings	CA	30	1706298.0	14137.416667	0.017582
Gwinnett	GA	6	349573.0	737.716667	0.017164
Fort Bend	TX	6	371516.0	472.583333	0.016150
Guadalupe	TX	7	494106.0	113.828571	0.014167

299 rows × 5 columns

And the winner is Schuyler county in New York, with 7 brewries and a 0.46 brewry for every 1,000 people

Geographic Mapping¶

The merge that we did allows us now to have geo location information for all the breweries, including the ones that didn’t have it in the original data set, as we have the geo location of the zip code area.

We can also add layers on real maps using the library GeoPandas. Let’s start with installing the library to our environment.

pip install geopandas

Collecting geopandas

  Downloading geopandas-0.10.2-py2.py3-none-any.whl (1.0 MB)
?25l     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/1.0 MB ? eta -:--:--

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━ 0.9/1.0 MB 27.2 MB/s eta 0:00:01
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 MB 24.7 MB/s eta 0:00:00
?25hRequirement already satisfied: pandas>=0.25.0 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from geopandas) (1.3.5)

Collecting pyproj>=2.2.0
  Downloading pyproj-3.2.1-cp37-cp37m-manylinux2010_x86_64.whl (6.3 MB)
?25l     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/6.3 MB ? eta -:--:--

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━╸━━━━━━━━━━━━ 4.4/6.3 MB 132.0 MB/s eta 0:00:01

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 6.3/6.3 MB 129.8 MB/s eta 0:00:01
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 85.6 MB/s eta 0:00:00
?25h

Collecting shapely>=1.6
  Downloading Shapely-1.8.2-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (2.0 MB)
?25l     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/2.0 MB ? eta -:--:--

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 109.6 MB/s eta 0:00:00
?25h

Collecting fiona>=1.8
  Downloading Fiona-1.8.21-cp37-cp37m-manylinux2014_x86_64.whl (16.7 MB)
?25l     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/16.7 MB ? eta -:--:--

     ━━━━━━━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.7/16.7 MB 173.1 MB/s eta 0:00:01

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━╸━━━━━━━━━━━ 11.9/16.7 MB 175.4 MB/s eta 0:00:01

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 16.7/16.7 MB 178.2 MB/s eta 0:00:01
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 16.7/16.7 MB 178.2 MB/s eta 0:00:01
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.7/16.7 MB 94.9 MB/s eta 0:00:00
?25h

Collecting munch
  Downloading munch-2.5.0-py2.py3-none-any.whl (10 kB)
Requirement already satisfied: certifi in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from fiona>=1.8->geopandas) (2022.5.18.1)
Requirement already satisfied: six>=1.7 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from fiona>=1.8->geopandas) (1.16.0)
Requirement already satisfied: click>=4.0 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from fiona>=1.8->geopandas) (8.1.3)
Requirement already satisfied: attrs>=17 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from fiona>=1.8->geopandas) (21.4.0)

Collecting cligj>=0.5
  Downloading cligj-0.7.2-py3-none-any.whl (7.1 kB)
Requirement already satisfied: setuptools in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from fiona>=1.8->geopandas) (47.1.0)
Collecting click-plugins>=1.0

  Downloading click_plugins-1.1.1-py2.py3-none-any.whl (7.5 kB)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from pandas>=0.25.0->geopandas) (2.8.2)
Requirement already satisfied: numpy>=1.17.3 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from pandas>=0.25.0->geopandas) (1.21.6)
Requirement already satisfied: pytz>=2017.3 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from pandas>=0.25.0->geopandas) (2022.1)

Requirement already satisfied: importlib-metadata in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from click>=4.0->fiona>=1.8->geopandas) (4.11.4)

Requirement already satisfied: typing-extensions>=3.6.4 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from importlib-metadata->click>=4.0->fiona>=1.8->geopandas) (4.2.0)
Requirement already satisfied: zipp>=0.5 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from importlib-metadata->click>=4.0->fiona>=1.8->geopandas) (3.8.0)

Installing collected packages: shapely, pyproj, munch, cligj, click-plugins, fiona, geopandas

Successfully installed click-plugins-1.1.1 cligj-0.7.2 fiona-1.8.21 geopandas-0.10.2 munch-2.5.0 pyproj-3.2.1 shapely-1.8.2

WARNING: There was an error checking the latest version of pip.

Note: you may need to restart the kernel to use updated packages.

We also need to install a library to handle geo coordinates

pip install descartes

Collecting descartes
  Downloading descartes-1.1.0-py3-none-any.whl (5.8 kB)

Requirement already satisfied: matplotlib in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from descartes) (3.5.2)
Requirement already satisfied: fonttools>=4.22.0 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from matplotlib->descartes) (4.33.3)
Requirement already satisfied: packaging>=20.0 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from matplotlib->descartes) (21.3)
Requirement already satisfied: cycler>=0.10 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from matplotlib->descartes) (0.11.0)
Requirement already satisfied: pillow>=6.2.0 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from matplotlib->descartes) (9.1.1)
Requirement already satisfied: numpy>=1.17 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from matplotlib->descartes) (1.21.6)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from matplotlib->descartes) (1.4.2)
Requirement already satisfied: python-dateutil>=2.7 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from matplotlib->descartes) (2.8.2)
Requirement already satisfied: pyparsing>=2.2.1 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from matplotlib->descartes) (3.0.9)

Requirement already satisfied: typing-extensions in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib->descartes) (4.2.0)
Requirement already satisfied: six>=1.5 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from python-dateutil>=2.7->matplotlib->descartes) (1.16.0)

Installing collected packages: descartes

Successfully installed descartes-1.1.0

WARNING: There was an error checking the latest version of pip.

Note: you may need to restart the kernel to use updated packages.

We will import the library

import geopandas

Create a data frame that is designed for geo data
Start with the enriched breweries data above
Define the geometry of the data from
x as the longitude (lng column)
y as the latitude (lat column)

gdf = (
    geopandas
    .GeoDataFrame(
    enriched_breweries_data, 
    geometry=geopandas
    .points_from_xy(
        enriched_breweries_data.lng, 
        enriched_breweries_data.lat)
    )
)

Create a geo dataframe for the world map from the built-in dataset of GeoPandas library

world = (
    geopandas
    .read_file(
        geopandas
        .datasets
        .get_path('naturalearth_lowres')
    )
)

Start with the world data frame
Filter it to use only the USA parts
Focus on the geometry boundaries of the map
Plot the map
using black color
and thin lines (0.2)
Now, plot the breweries data
on the above map
using red dots
and one pixel for each

# We restrict to USA
ax = (
    world
    [world.iso_a3 == 'USA']
    ['geometry']
    .boundary
    .plot(
        color='black', 
        edgecolor='black', 
        linewidth=0.2
    )
)

( 
    gdf
    .plot(
        ax=ax, 
        color='red',
        markersize = 1
    )
);

This is a good start, but we can’t really use it. Let’s make it more usable to know where is the beer hub in the US.

Create a 2D histogram based on the enriched breweries data
using x as latitude (lat column),
y as longtitude (lng column),
and 1000 bins for the histogram
Focus only on the main land using its coordinates
Then, create a smooth heat map from the histogram

import numpy as np
from scipy import ndimage

heatmap, xedges, yedges = np.histogram2d(
    enriched_breweries_data['lat'], 
    enriched_breweries_data['lng'],
    bins=1000, 
    range=[
        [25, 50],  # North-South extent of US
        [-125, -65]  # East-West extent of US, 
    ]
)
extent = [yedges[0], yedges[-1], xedges[-1], xedges[0]]
logheatmap = np.log(heatmap)
logheatmap[np.isneginf(logheatmap)] = 0
logheatmap = ndimage.filters.gaussian_filter(logheatmap, 30, mode='nearest')  # smooth out peaks

/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/ipykernel_launcher.py:14: RuntimeWarning: divide by zero encountered in log
  

ax = (
    world
    [world.iso_a3 == 'USA']
    ['geometry']
    .boundary
    .plot(
        color='k', 
        edgecolor='black', 
        linewidth=0.5
    )
)

ax.imshow(logheatmap, cmap='coolwarm', extent=extent)
ax.invert_yaxis()
ax.set(xlim=[-125, -65], ylim=[25, 50]);

We can clearly see the hubs in San Diego, New York, Illinois, and another one in Colorado.

From Excel to Pandas

Merge/Join Tables (VLOOKUP)

Contents