Adding New Columns¶

In the second chapter we will learn how to add new columns to the tables based on a set of functions. This is similar to the Functions that are available in Excel with =FUNC(A1)

We will start with loading a csv file that is hosted in data.world on the “Median Value Per Sq ft per zip code in the US”

import pandas as pd

us_median_sq_ft_value = (
    pd
    .read_csv('https://query.data.world/s/xrfy7fb7oq55gpzh6bvs6jtonv32lk')
)

us_median_sq_ft_value.head()

	RegionID	RegionName	City	State	Metro	CountyName	SizeRank	1996-04	1996-05	1996-06	...	2016-12	2017-01	2017-02	2017-03	2017-04	2017-05	2017-06	2017-07	2017-08	2017-09
0	61639	10025	New York	NY	New York	New York	1	NaN	NaN	NaN	...	1306.0	1290	1277	1272	1271	1277	1281	1279	1274	1270
1	84654	60657	Chicago	IL	Chicago	Cook	2	134.0	134.0	133.0	...	291.0	294	296	295	293	294	294	294	294	295
2	61637	10023	New York	NY	New York	New York	3	NaN	NaN	NaN	...	1602.0	1597	1580	1567	1569	1570	1560	1543	1529	1521
3	84616	60614	Chicago	IL	Chicago	Cook	4	149.0	150.0	150.0	...	330.0	333	333	331	330	330	329	328	329	329
4	93144	79936	El Paso	TX	El Paso	El Paso	5	50.0	51.0	50.0	...	81.0	81	81	81	81	81	82	82	82	82

5 rows × 265 columns

Constant Value¶

The simplest way to add a column is to put a constant value. Since we know that this data is only for USA we can add a column with the value USA to a new column called country. We will be able to later merge this table with data from other countries and then this new column will be useful.

For setting a value in a column we will use the assign function of Pandas. You will need to scroll all the way to the right of the output to see the new country column.

(
    us_median_sq_ft_value
    .assign(country="USA")
)

	RegionID	RegionName	City	State	Metro	CountyName	SizeRank	1996-04	1996-05	1996-06	...	2017-01	2017-02	2017-03	2017-04	2017-05	2017-06	2017-07	2017-08	2017-09	country
0	61639	10025	New York	NY	New York	New York	1	NaN	NaN	NaN	...	1290	1277	1272	1271	1277	1281	1279	1274	1270	USA
1	84654	60657	Chicago	IL	Chicago	Cook	2	134.0	134.0	133.0	...	294	296	295	293	294	294	294	294	295	USA
2	61637	10023	New York	NY	New York	New York	3	NaN	NaN	NaN	...	1597	1580	1567	1569	1570	1560	1543	1529	1521	USA
3	84616	60614	Chicago	IL	Chicago	Cook	4	149.0	150.0	150.0	...	333	333	331	330	330	329	328	329	329	USA
4	93144	79936	El Paso	TX	El Paso	El Paso	5	50.0	51.0	50.0	...	81	81	81	81	81	82	82	82	82	USA
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
14750	73940	36564	Fairhope	AL	Daphne	Baldwin	14751	NaN	NaN	NaN	...	230	238	243	245	244	245	246	244	240	USA
14751	59107	3293	Woodstock	NH	Claremont	Grafton	14752	58.0	59.0	61.0	...	129	129	128	129	130	130	130	130	131	USA
14752	82396	55713	Buhl	MN	Duluth	Saint Louis	14753	NaN	NaN	NaN	...	72	73	74	74	73	73	74	77	79	USA
14753	66881	21405	Annapolis	MD	Baltimore	Anne Arundel	14754	146.0	149.0	151.0	...	470	462	457	457	455	454	455	458	460	USA
14754	94815	85220	Apache Junction	AZ	Phoenix	Pinal	14755	NaN	NaN	NaN	...	117	119	119	119	120	121	122	124	124	USA

14755 rows × 266 columns

Excel Functions Equivalent¶

Most of the functions that you are used to use in Excel have a direct Equivalent in Pands. We will try a few of them in the following examples. We will only look at the City column in the dataset.

For example: =TRIM('cell for trimming') can be replaced by strip. We will:

Take only the City column from the table
Add a column that strips spaces from around the string (str) of the City column
Show the head (first 5 lines) of the new sub-table

(
    us_median_sq_ft_value[['City']]
    .assign(trimmed = us_median_sq_ft_value.City.str.strip())
    .head()
)

	City	trimmed
0	New York	New York
1	Chicago	Chicago
2	New York	New York
3	Chicago	Chicago
4	El Paso	El Paso

or to remove spaces between words, you will need to run ‘Find and Replace’ option in Excel and in Pandas we can use the replace function:

Take only the City column from the table
Add a column that is removes every space (‘ ‘) from the string of the City column
Show the head of the new sub-table

(
    us_median_sq_ft_value[['City']]
    .assign(no_space = us_median_sq_ft_value.City.str.replace(' ',''))
    .head()
)

	City	no_space
0	New York	NewYork
1	Chicago	Chicago
2	New York	NewYork
3	Chicago	Chicago
4	El Paso	ElPaso

More Textual Functions¶

Another commonly used Excel functions are LEFT and RIGHT (=LEFT(Cell, number of digits)). In Pandas, we can use the ‘:’, which is used for until or from:

Take only the City column from the table
Add a column that takes the first 5 characters (=LEFT) from the string of the City column
Add a column that takes the last 5 characters (=RIGHT) from the string of the City column
Show the head of the new sub-table

(
    us_median_sq_ft_value[['City']]
    .assign(city_left = us_median_sq_ft_value.City.str[:5]) # until the 5th character
    .assign(city_right = us_median_sq_ft_value.City.str[-5:]) # from the 5th characters from the end (=right)
    .head()
)

	City	city_left	city_right
0	New York	New Y	York
1	Chicago	Chica	icago
2	New York	New Y	York
3	Chicago	Chica	icago
4	El Paso	El Pa	Paso

Using Lambda¶

Lambda is a bit confusing at first glance, however, it is used to define a the function that we want to apply on a each row or columns of the table. For example, let’s define two new columns: one that is calculating the quantile of the region compare to all the regions, and one that is checking if the state is one of [‘NY’,’NJ’]

We will focus only on the State and the last month in the data and add the two new columns to it.

Take only the State and the values of the last period columns from the table
Add a column that checks if the State is either ‘NY’ or ‘NJ’
Add a column that calculates the ratio of the average price of the row to the maximum price (quantile)

(
    us_median_sq_ft_value[['State','2017-09']]
    .assign(is_NY_or_NJ=lambda x : x.State.isin(['NY','NJ']))
    .assign(last_quantile=lambda x : x['2017-09']/max(x['2017-09']))    
)

	State	2017-09	is_NY_or_NJ	last_quantile
0	NY	1270	True	0.702434
1	IL	295	False	0.163164
2	NY	1521	True	0.841261
3	IL	329	False	0.181969
4	TX	82	False	0.045354
...	...	...	...	...
14750	AL	240	False	0.132743
14751	NH	131	False	0.072456
14752	MN	79	False	0.043695
14753	MD	460	False	0.254425
14754	AZ	124	False	0.068584

14755 rows × 4 columns

Multiple Data Sources¶

We can do a more complicated analysis that is based on additional external sources. For example, we want to calculate the differences between red states and blue states. We can then ask various questions regarding the impacts or correlation between the house prices and the election results and voting patterns.

First we will load data regarding the voting in various states across the years. A quick internet search finds an interesting table in the following Wikipedia page. Since there are multiple tables on that page, we can filter them using the match option, and using the word Year as the filter.

Read the HTML tables in the wikipedia entry on the topic of red and blue states
retrive on the ones that have the word ‘Year’

red_blue_states_wikipedia_entry = 'https://en.wikipedia.org/wiki/Red_states_and_blue_states'
wikipedia_page_tables = (
    pd
    .read_html(
        red_blue_states_wikipedia_entry, 
        match='Year'
    )
)

As we saw in the loading data from web site, the list can includes multiple tables. However, we filtered it and only have one table, and we can access it with the [0] modifier.

red_blue_states = wikipedia_page_tables[0]
red_blue_states.head()

	Year	1972	1976	1980	1984	1988	1992	1996	2000	2004	2008	2012	2016	2020
	Democratic Republican (lighter shading indicates win ≤5%) Winner received plurality of the vote but did not receive an outright majority of the popular vote Winner did not receive a plurality of the vote and lost the popular vote Winner chosen by the House of Representatives Electoral votes invalidated	Democratic Republican (lighter shading indicates win ≤5%) Winner received plurality of the vote but did not receive an outright majority of the popular vote Winner did not receive a plurality of the vote and lost the popular vote Winner chosen by the House of Representatives Electoral votes invalidated	Democratic Republican (lighter shading indicates win ≤5%) Winner received plurality of the vote but did not receive an outright majority of the popular vote Winner did not receive a plurality of the vote and lost the popular vote Winner chosen by the House of Representatives Electoral votes invalidated	Democratic Republican (lighter shading indicates win ≤5%) Winner received plurality of the vote but did not receive an outright majority of the popular vote Winner did not receive a plurality of the vote and lost the popular vote Winner chosen by the House of Representatives Electoral votes invalidated	Democratic Republican (lighter shading indicates win ≤5%) Winner received plurality of the vote but did not receive an outright majority of the popular vote Winner did not receive a plurality of the vote and lost the popular vote Winner chosen by the House of Representatives Electoral votes invalidated	Democratic Republican (lighter shading indicates win ≤5%) Winner received plurality of the vote but did not receive an outright majority of the popular vote Winner did not receive a plurality of the vote and lost the popular vote Winner chosen by the House of Representatives Electoral votes invalidated	Democratic Republican (lighter shading indicates win ≤5%) Winner received plurality of the vote but did not receive an outright majority of the popular vote Winner did not receive a plurality of the vote and lost the popular vote Winner chosen by the House of Representatives Electoral votes invalidated	Democratic Republican (lighter shading indicates win ≤5%) Winner received plurality of the vote but did not receive an outright majority of the popular vote Winner did not receive a plurality of the vote and lost the popular vote Winner chosen by the House of Representatives Electoral votes invalidated	Democratic Republican (lighter shading indicates win ≤5%) Winner received plurality of the vote but did not receive an outright majority of the popular vote Winner did not receive a plurality of the vote and lost the popular vote Winner chosen by the House of Representatives Electoral votes invalidated	Democratic Republican (lighter shading indicates win ≤5%) Winner received plurality of the vote but did not receive an outright majority of the popular vote Winner did not receive a plurality of the vote and lost the popular vote Winner chosen by the House of Representatives Electoral votes invalidated	Democratic Republican (lighter shading indicates win ≤5%) Winner received plurality of the vote but did not receive an outright majority of the popular vote Winner did not receive a plurality of the vote and lost the popular vote Winner chosen by the House of Representatives Electoral votes invalidated	Democratic Republican (lighter shading indicates win ≤5%) Winner received plurality of the vote but did not receive an outright majority of the popular vote Winner did not receive a plurality of the vote and lost the popular vote Winner chosen by the House of Representatives Electoral votes invalidated	Democratic Republican (lighter shading indicates win ≤5%) Winner received plurality of the vote but did not receive an outright majority of the popular vote Winner did not receive a plurality of the vote and lost the popular vote Winner chosen by the House of Representatives Electoral votes invalidated	Democratic Republican (lighter shading indicates win ≤5%) Winner received plurality of the vote but did not receive an outright majority of the popular vote Winner did not receive a plurality of the vote and lost the popular vote Winner chosen by the House of Representatives Electoral votes invalidated
0	Democratic candidate	George McGovern	Jimmy Carter	Jimmy Carter	Walter Mondale	Michael Dukakis	Bill Clinton	Bill Clinton	Al Gore	John Kerry	Barack Obama	Barack Obama	Hillary Clinton	Joe Biden
1	Republican candidate	Richard Nixon	Gerald Ford	Ronald Reagan	Ronald Reagan	George H. W. Bush	George H. W. Bush	Bob Dole	George W. Bush	George W. Bush	John McCain	Mitt Romney	Donald Trump	Donald Trump
2	National popular vote	Nixon	Carter	Reagan	Reagan	Bush	Clinton	Clinton	Gore	Bush	Obama	Obama	Clinton	Biden
3	Alabama	Nixon	Carter	Reagan	Reagan	Bush	Bush	Dole	Bush	Bush	McCain	Romney	Trump	Trump
4	Alaska	Nixon	Ford	Reagan	Reagan	Bush	Bush	Dole	Bush	Bush	McCain	Romney	Trump	Trump

We will start with creating a new list of states that voted for a specific candidate in a specific electio. First, let’s use Trump in the 2016 elections.

We are using here a few new functions such as rename, query and iloc, which we will discuss in more details in the next sections.

red_candidate = 'Trump'

We will:

Take the large table of the multiple elections results
Remove the complicated multi level header from the table (droplevel)
Filter to only states that have the name of the red candidate, we defined above, in the column of the election year (2020, for example). Note, we need to surround the name of the column with ` as it starts with a number
Take only the first column from the result table, which holds the name of the state (iloc[:,0]).

red_states_2020 = (
    red_blue_states
    .droplevel(1, axis=1) 
    .query("`2020` == @red_candidate")
    .iloc[:,0]
)
red_states_2020

          Alabama
           Alaska
         Arkansas
         Florida
           Idaho
         Indiana
            Iowa
          Kansas
        Kentucky
       Louisiana
     Mississippi
        Missouri
         Montana
  North Carolina
    North Dakota
            Ohio
        Oklahoma
  South Carolina
    South Dakota
       Tennessee
           Texas
            Utah
   West Virginia
         Wyoming
Name: Year, dtype: object

Now that we have the list of the red states of 2020, we need to match them to the states that we have in our house prices data set. We see that we have to match the short version in the data set with the long version in the red state list.

We will use a another option to create a data frame using a Dictionary. It is not hard to find a list of mapping of states names and abbreviations.

us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}

and we can use the dictionary to create a table (DataFrame) from it:

Create a new dataframe table
Using the dictionary of the us state abbreviation above
with row for each entry
and name the column of the value ‘Abbreviation’

us_state_abbrev_df = (
    pd
    .DataFrame
    .from_dict(us_state_abbrev, 
                orient='index',
                columns=['Abbreviation'])
)

Before we continue, we can keep this table as file to be used in future analyses. We could also create this file in Excel and load it as we will do in the future with this file.

us_state_abbrev_df.to_csv('../data/us_state_abbrev.csv')

Next, we will:

Start with the abbreviation list of all the US states
Take only the ones that also appear in the list of red states in the 2020 election
Take all the rows left, and the index column (country name) with the abbreviation column

red_state_abbrev = (
    us_state_abbrev_df
    .loc[red_states_2020]
    .loc[:,'Abbreviation']
)
red_state_abbrev

Alabama           AL
Alaska            AK
Arkansas          AR
Florida           FL
Idaho             ID
Indiana           IN
Iowa              IA
Kansas            KS
Kentucky          KY
Louisiana         LA
Mississippi       MS
Missouri          MO
Montana           MT
North Carolina    NC
North Dakota      ND
Ohio              OH
Oklahoma          OK
South Carolina    SC
South Dakota      SD
Tennessee         TN
Texas             TX
Utah              UT
West Virginia     WV
Wyoming           WY
Name: Abbreviation, dtype: object

Finally:

Start with the table of the median value of sqr feet per zip code
Add a new column (is_red) that checks if the State in each row is in the list of red states that we calculated above.
Create two groups for red and blue states and the a value in the last period in the table (September 2017)
Calculate the average (mean) price for each of the two groups
Plot the two values
Title the graph with “Average square feet value between blue and red states”

(
    us_median_sq_ft_value
    .assign(is_red=lambda x : x.State.isin(red_state_abbrev))
    .groupby('is_red')['2017-09']
    .mean()
    .plot
    .bar(title="Average square feet value between blue and red states")
);

../_images/02.01_Adding_New_Columns_32_0.png

We can now see that the average value of square feet in the blue states (is_red = False) is about twice the average of red states (is_red = True)

From Excel to Pandas

Adding New Columns

Contents