Reading Data from Web Sites¶

Many times you want to add external data to your analysis. External data is often found in external web sites, usually in the format of an HTML table. You can import HTML directly into Excel using “Import”.

In this notebook, we will learn how to use pandas read_html to load tables from web site with ease, including cases where the data is complex to retrieve.

%pip install lxml requests

Requirement already satisfied: lxml in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (4.8.0)
Requirement already satisfied: requests in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (2.27.1)

Requirement already satisfied: idna<4,>=2.5 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from requests) (3.3)
Requirement already satisfied: certifi>=2017.4.17 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from requests) (2022.5.18.1)
Requirement already satisfied: charset-normalizer~=2.0.0 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from requests) (2.0.12)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from requests) (1.26.9)

WARNING: There was an error checking the latest version of pip.

Note: you may need to restart the kernel to use updated packages.

Loading HTML data¶

We will load a few tables from simple sites as well as complex ones

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np

Loading from simple web sites¶

If you have a simple web page with a few tables, you can pass the URL of the page to Pandas and call read_html.

url = 'https://en.wikipedia.org/wiki/The_Championships,_Wimbledon'
dfs = (
    pd
    .read_html(url)
)

read_html is returning a list of all the tables in the page. We can scroll through the content of the tables on that page, by changing the index in the list. We can print the second element (index 1 in the list), we will get the table with the prize amounts for the Wimbeldon Tennis competition.

dfs[1]

	0	1	2	3	4	5	6	7	8	9	10	11
0	2021 Event	W	F	SF	QF	Round of 16	Round of 32	Round of 64	Round of 1281	Q3	Q2	Q1
1	Singles	£1,700,000	£900,000	£465,000	£300,000	£181,000	£115,000	£75,000	£48,000	£25,500	£15,500	£8,500
2	Doubles	£480,000	£240,000	£120,000	£60,000	£30,000	£19,000	£12,000	NaN	NaN	NaN	NaN

Loading from complex web sites¶

Many times pages will be more complex and we can still extract the relevant data.

url = 'https://ncov2019.live/data'
table_id = 'sortable_table_world'

Here we will look like a browser to the web site, as some web sites block their context to bots and crawlers. Since we are not hitting the web site a lot, pretending to be a browser is considered as acceptable usage.

We will send to the web site a header that a browser is sending and get the reply of the page as text. The text (r.text) will be parsed by the read_html function and create the dataframe. We will also add the HTML id of the table that we want. This id (‘sortable_table_world’) can be found when using the inspect option in Chrome, Safari, Firefox and other browsers.

import requests

header = {
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
  "X-Requested-With": "XMLHttpRequest"
}

response = requests.get(url, headers=header)

df_list = (
  pd
  .read_html(
    response.text, 
    attrs={'id': table_id}
  )
)

df_list[0]

	Name	Confirmed	Per Million	Deceased	Per Million.1	Tests	Active	Recovered	Per Million.2	Vaccinated	Population
0	TOTAL	527,450,948 +453,711	66535.988	6,300,159 +398	794.742	6,356,859,452 +521,215	37819979	483,231,537 +443,455	60957.873	5173118206	7927303146
1	★ Afghanistan	179,716 +42	4428.960	7,698 +1	189.710	958454	9277	162741	4010.630	5359930	40577445
2	★ Albania	275,864 +26	96056.770	3497	1217.670	1822887	242	272125	94754.840	1302093	2871885
3	★ Algeria	265,854 +3	5863.480	6875	151.630	230861	80593	178386	3934.350	7840131	45340682
4	★ Andorra	42572	0.000	153	0.000	249838	662	41757	0.000	57874	77498
...	...	...	...	...	...	...	...	...	...	...	...
228	★ Western Sahara	10	0.000	1	0.000	Unknown	Unknown	9	0.000	Unknown	624963
229	★ Niue	9	0.000	Unknown	0.000	Unknown	Unknown	9	0.000	1650	1647
230	★ Nauru	8	0.000	Unknown	0.000	Unknown	Unknown	5	0.000	9464	10952
231	★ Saint Helena	2	0.000	Unknown	0.000	Unknown	Unknown	2	0.000	4361	6111
232	★ China	223,145 +169	155.030	5,222 +3	3.630	160000000	4405	213,518 +476	148.350	1287195000	1439323776

233 rows × 11 columns

covid_19_status = df_list[0]

Simple Data Visualization¶

External data is often “messy”, and it is a best practice to clean it up before staring to work with it. In the next cell we will make sure that the numeric columns are indeed numeric (removing “Unknown”, for example). We will also resolve the conflict of having multiple columns with the same name (“Per Million”, in this exammple), and translate to more meaningful names.

Start with the Covid-19 status table
Convert the first ‘Per Million’ column from the table to be numeric
Convert the second ‘Per Million’ column from the table to be numeric
Fill missing values in the table with 0
Sort the table in decreasing order by number of deceased cases
Take the first top 20 country rows
Plot the results
as bar graph with the name of the country and the number of deceased cases per million people

(
    covid_19_status
    .assign(confirmed_cases_per_m=lambda x: pd.to_numeric(x['Per Million'],  errors='coerce'))
    .assign(deceased_cases_per_m=lambda x: pd.to_numeric(x['Per Million.1'],  errors='coerce'))
    .fillna(0)
    .sort_values(by='deceased_cases_per_m',ascending=False)
    .iloc[:20,:]
    .plot
    .bar(
        x='Name', 
        y='deceased_cases_per_m',
        title='Death Cases by COVID-19 by Country')
);

../_images/01.02 - loading data from HTML_16_0.png

More examples¶

Using Functions¶

Functions are a more complicated topic and you are not expected to write functions most of the times. Functions are also available in Excel as Macros. You should be familiar with using functions, read them to understand what they do, and also modifying them to fit your needs.

We often find the a set of commands is repeating itself many times in our analyses. This is the time to wrap these commands as a function. In the next cell, we will create a function that is wrapping a few of the commands that we used in the previous cells such as choosing a specific index of a table from the table list and setting the first row as header.

Read the HTML content of the given URL
Parse the HTML and extract all the tables in it
Take the first (of other given index) table from that list
Set the header of the table from the values in the first row
Skip the first (header) row
Return that table

def get_table_from_html(url, attrs=None, index=0):
    response = requests.get(url, headers=header)
    df_list = pd.read_html(response.text, attrs=attrs)
    df = df_list[index]
    df.columns = df.iloc[0]
    df = df[1:]
    return df

Once we have the function defined (def), we can try it in a simple way. Let’s say that we want to add to our CRM (Customer Relationship Management) system a new target market, as the consumer good companies (CPG). We find in the web a table with the top CPG companies for 2019.

url_top_companies = 'https://consumergoods.com/top-100-consumer-goods-companies-2019'
df = get_table_from_html(url_top_companies)

Let’s look at the table

df.style

0	Rank/Company	Net Revenue ($M)	1-Year Sales Growth	Key Product Categories
1	1. Nestle SA*	$92,085	2.10%	Food, Beverage, Confectionery
2	2. Procter & Gamble	$66,832	2.70%	Household Goods, Health & Beauty Aids, OTC Pharma
3	3. PepsiCo	$64,661	1.80%	Food, Beverage
4	4. Unilever N.V.*	$56,188	-5.10%	Household Goods, Food, Health & Beauty Aids
5	5. Anheuser-Busch InBev	$54,619	-3.20%	Wine & Spirits
6	6. Christian Dior*	$51,607	7.20%	Apparel/Footwear/Accessories, Wine & Spirits, Health & Beauty Aids
7	7. LVMH Moët Hennessy Louis Vuitton*	$51,607	9.80%	Apparel/Footwear/Accessories, Wine & Spirits, Health & Beauty Aids
8	8. JBS S.A.*	$44,587	11.30%	Food
9	9. Tyson Foods	$40,052	4.70%	Food
10	10. Nike Inc.	$36,397	6.00%	Apparel/Footwear/Accessories
11	11. Imperial Brands PLC*	$33,641	0.90%	Tobacco
12	12. 3M Co.	$32,765	3.50%	Household Goods
13	13. Coca-Cola Co.	$31,856	-10.00%	Beverage, Food
14	14. L’Oreal*	$29,687	3.50%	Health & Beauty Aids
15	15. Philip Morris International	$29,625	3.10%	Tobacco
16	16. Danone*	$27,168	-0.60%	Food, Beverage
17	17. British American Tobacco PLC*	$26,993	25.20%	Tobacco
18	18. Kraft Heinz	$26,259	0.70%	Food
19	19. Mondelez International	$25,938	0.20%	Food, Confectionary
20	20. Haier Smart Home Co.	$25,755	12.20%	Housewares/Appliances
21	21. Altria Group	$25,364	-0.80%	Tobacco
22	22. Heineken Holding N.V.*	$24,765	3.90%	Wine & Spirits
23	23. Adidas AG*	$24,153	3.30%	Apparel/Footwear/Accessories
24	24. WH Group Ltd.	$22,605	1.00%	Food
25	25. Henkel AG*	$21,931	-0.60%	Household Goods, Health & Beauty Aids
26	26. Whirlpool Corp.	$21,037	-1.00%	Housewares/Appliances
27	27. Japan Tobacco*	$20,565	3.60%	Tobacco
28	28. Fonterra Cooperative Group	$20,438	6.30%	Food, Beverage
29	29. Asahi Group Holdings*	$19,677	1.70%	Wine & Spirits, Beverages, Food
30	30. BSH Hausgerate*	$19,667	-3.00%	Housewares/Appliances
31	31. San Miguel Corp.*	$19,652	24.10%	Wine & Spirits, Food
32	32. Kimberly-Clark Corp.	$18,486	0.80%	Household Goods
33	33. Kirin Holdings*	$17,916	3.60%	Wine & Spirits
34	34. Associated British Foods*	$17,164	1.40%	Food
35	35. General Mills	$15,740	0.80%	Food
36	36. Colgate-Palmolive Co.	$15,544	0.60%	Household Goods
37	37. Kering*	$15,060	26.30%	Apparel/Footwear/Accessories
38	38. Grupo Bimbo*	$14,765	7.80%	Food
39	39. Kao Corp.*	$13,995	1.30%	Household Goods, Health & Beauty Aids
40	40. Stanley Black & Decker	$13,982	7.80%	Housewares/Appliances
41	41. Johnson & Johnson (Consumer)	$13,853	1.80%	OTC Pharma, Household Goods, Health & Beauty Aids
42	42. VF Corp.	$13,849	17.30%	Apparel/Footwear/Accessories
43	43. Uni-President Enterprises*	$13,845	7.90%	Food
44	44. Estee Lauder Companies	$13,683	15.70%	Health & Beauty Aids
45	45. Kellogg Co.	$13,547	5.40%	Food
46	46. Diageo PLC*	$13,405	0.90%	Wine & Spirits
47	47. AB Electrolux*	$12,821	2.80%	Housewares/Appliances
48	48. RB*	$12,597	-0.20%	Household Goods
49	49. Essity*	$12,239	8.40%	Household Goods
50	50. Compagnie Financiere Richemont SA*	$12,100	3.10%	Apparel/Footwear/Accessories
51	51. Nipponham Group*	$11,779	5.60%	Food
52	52. Keurig Dr Pepper	$11,024	2.30%	Beverages
53	53. MolsonCoors Brewing Co.	$10,770	-2.10%	Wine & Spirits
54	54. Shiseido Co.*	$10,160	8.90%	Health & Beauty Aids
55	55. Pernod Ricard*	$9,905	-0.30%	Wine & Spirits
56	56. Nintendo Co.*	$9,797	115.80%	Toys & Games
57	57. China Mengniu Dairy Co.*	$9,691	14.70%	Food
58	58. PVH Corp.	$9,657	8.30%	Apparel/Footwear/Accessories
59	59. Hormel Foods	$9,546	4.10%	Food
60	60. Coty, Inc.	$9,398	22.80%	Health & Beauty Aids
61	61. Carlsberg A/S*	$9,233	3.00%	Wine & Spirits
62	62. Saputo Inc.*	$8,766	3.40%	Food
63	63. Campbell Soup Co.	$8,685	10.10%	Food, Beverages
64	64. Newell Brands	$8,631	-9.60%	Household Goods, Housewares/Appliances
65	65. Swatch Group SA*	$8,535	6.10%	Apparel/Footwear/Accessories
66	66. BRF - Brasil Foods*	$8,474	3.20%	Food
67	67. GlaxoSmithKline Consumer Healthcare*	$8,440	-1.20%	OTC Pharma
68	68. Beiersdorf AG*	$7,972	2.50%	Household Goods, Health & Beauty Aids
69	69. Conagra Brands	$7,938	1.40%	Food
70	70. Hershey Co.	$7,791	3.70%	Food, Confectionery
71	71. Dean Foods Co.	$7,755	-0.50%	Food, Beverages
72	72. Constellation Brands	$7,585	3.50%	Wine & Spirits
73	73. Groupe SEB*	$7,508	5.10%	Housewares/Appliances
74	74. Thai Beverage Public Co.*	$7,496	20.90%	Wine & Spirits, Food, Beverages
75	75. J.M. Smucker Co.	$7,357	-0.50%	Food
76	76. PT Gudang Garam*	$6,805	14.90%	Tobacco
77	77. Hanesbrands	$6,804	5.10%	Apparel/Footwear/Accessories
78	78. Hermes International*	$6,575	7.50%	Apparel/Footwear/Accessories
79	79. Unicharm Corp.*	$6,388	7.30%	Household Goods
80	80. Bandai Namco Holdings*	$6,295	9.40%	Toys & Games
81	81. Post Holdings	$6,257	19.80%	Food
82	82. Ralph Lauren Corp.	$6,182	-7.10%	Apparel/Footwear/Accessories
83	83. Clorox Co.	$6,124	2.50%	Household Goods
84	84. Bayer Consumer Health*	$6,006	-7.00%	OTC Pharma
85	85. Tapestry Inc.	$5,880	31.00%	Apparel/Footwear/Accessories
86	86. McCormick & Co.	$5,409	11.90%	Food
87	87. Savencia SA	$5,360	0.20%	Food
88	88. Kewpie Corp.*	$5,323	2.10%	Food
89	89. Avon Products	$5,248	-5.70%	Health & Beauty Aids, Apparel/Footwear/Accessories, Household Goods
90	90. Electronic Arts	$5,150	6.30%	Toys & Games
91	91. First Pacific Co.	$5,136	-1.90%	Food
92	92. ITC Ltd.*	$5,048	-22.00%	Tobacco, Food, Health & Beauty Aids
93	93. Herbalife Ltd.	$4,892	10.50%	Food, Household Goods, Health & Beauty Aids
94	94. Sapporo Holdings Ltd.*	$4,843	-5.40%	Wine & Spirits, Food, Beverages
95	95. Arcelik A.S.*	$4,665	29.10%	Housewares/Appliances
96	96. Hasbro Inc.	$4,580	-12.10%	Toys & Games
97	97. Mattel Inc.	$4,511	-7.60%	Toys & Games
98	98. Husqvarna AB*	$4,244	4.30%	Housewares/Appliances
99	99. Church & Dwight Co.	$4,146	9.80%	Household Goods, Health & Beauty Aids
100	100. Spectrum Brands Holdings	$3,809	2.80%	Household Goods, Housewares/Appliances

Cleaning the data¶

We see that the table has a column with both the rank and the name of the company, and we want to have only the name of company in a column (as we have the rank in the index of the table). There are a fwe options to do it using string functions such as RIGHT or LEFT. In this case, you could use:

RIGHT(cell,LEN(cell)-SEARCH("char", cell))

In this example, we will use a powerful tool of regular expression (regex). We will not dive into the syntax of regex now, and it time you will find it easier to read or even write.

clean_names_df = (
    df
    .assign(company_name = lambda x : x['Rank/Company'].str.extract(r'\d+. ([^*]*)[*]?'))
)

The next manipulation that we want to do is to have a numeric column with the annual revenues. The data in the table is in millions (which is better for presentation purposes). We will run two manipulations on the data:

Start with the clean verion of the table data frame above
Remove the dollar sign ($) and the comma (,) from the ‘Net Revenue’ Column
Convert the above value to number and scale it to dollars by multipling by a million

clean_names_df = (
    clean_names_df
    .assign(annual_num = lambda x : x['Net Revenue ($M)'].replace('[\$,]', '', regex=True))
    .assign(annual_revenues = lambda x : pd.to_numeric(x.annual_num)*10**6)
)

clean_names_df.head()

	Rank/Company	Net Revenue ($M)	1-Year Sales Growth	Key Product Categories	company_name	annual_num	annual_revenues
1	1. Nestle SA*	$92,085	2.10%	Food, Beverage, Confectionery	Nestle SA	92085	92085000000
2	2. Procter & Gamble	$66,832	2.70%	Household Goods, Health & Beauty Aids, OTC Pharma	Procter & Gamble	66832	66832000000
3	3. PepsiCo	$64,661	1.80%	Food, Beverage	PepsiCo	64661	64661000000
4	4. Unilever N.V.*	$56,188	-5.10%	Household Goods, Food, Health & Beauty Aids	Unilever N.V.	56188	56188000000
5	5. Anheuser-Busch InBev	$54,619	-3.20%	Wine & Spirits	Anheuser-Busch InBev	54619	54619000000

Enriching the data¶

The next column we want to add to our table before importing it to our CRM system is domain of the company. We will need to use for it a search engine such as Google and a python library that is wrapping it to make it easy for us to use it. First, let’s install the python library:

pip install google

Collecting google

  Downloading google-3.0.0-py2.py3-none-any.whl (45 kB)
?25l     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/45.3 kB ? eta -:--:--
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.3/45.3 kB 5.5 MB/s eta 0:00:00
?25hRequirement already satisfied: beautifulsoup4 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from google) (4.11.1)
Requirement already satisfied: soupsieve>1.2 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from beautifulsoup4->google) (2.3.2.post1)

Installing collected packages: google

Successfully installed google-3.0.0

WARNING: There was an error checking the latest version of pip.

Note: you may need to restart the kernel to use updated packages.

Now, let’s try it on one of the company names

from googlesearch import search
for url in search('Nestle', stop=1):
    print(url)

https://www.nestle.com/

The above cell is using a powerful concept in programming called for-loop or list-comprehesion. However, we will try to minimize this programming concept and convert it to a simpler command that is returning the first value

next(search('Hasbro Inc.', stop=1),'Nope')

'https://shop.hasbro.com/en-us'

Search Domain Function¶

Now we can see the following function that will return to us the domain name of every company we will call it. The function is using various complicated concepts such as regex that we already saw, and try-except part which is used to handle errors (what happened if we don’t find a domain name). The function is also waiting 20 seconds before calling the search engine to avoid exceeding some limits that systems are putting against bots (which we just created). The sleep will make the actual run of the data much longer, but it is still automated and we don’t need to run this repeative task manually.

For now, we don’t need to understand these details:

Wait 20 seconds between calls to avoid throttling by Google
Search for the website of the name of the company and return the first result
Extract the format of the http or https of the above result
Take the first match
Return the match or ‘NA’ if not found

import re
import time

def search_domain(company_name):
    time.sleep(20)
    try: 
        domain = next(search(str(company_name)+' website', stop=1))
        m = re.search('https?://.*?\.?([0-9a-z-]+\.co[^/]+)/.*', domain)
        answer = m.group(1) 
    except:
        answer = 'NA'
    return answer

Let’s test it with one of the compamy names

search_domain('3M co.')

'3m.com'

Now it is time to call our search_domain function on all the rows in our table. This will take an hour or so, due to our sleep time between the calls. For now, let’s run it only on the head (first 5) of the table.

clean_names_df_head = clean_names_df.head()
clean_names_df_head['domain'] = (
    clean_names_df_head
    .company_name
    .apply(search_domain)
)

clean_names_df_head

	Rank/Company	Net Revenue ($M)	1-Year Sales Growth	Key Product Categories	company_name	annual_num	annual_revenues	domain
1	1. Nestle SA*	$92,085	2.10%	Food, Beverage, Confectionery	Nestle SA	92085	92085000000	nestle.com
2	2. Procter & Gamble	$66,832	2.70%	Household Goods, Health & Beauty Aids, OTC Pharma	Procter & Gamble	66832	66832000000	pg.com
3	3. PepsiCo	$64,661	1.80%	Food, Beverage	PepsiCo	64661	64661000000	pepsico.com
4	4. Unilever N.V.*	$56,188	-5.10%	Household Goods, Food, Health & Beauty Aids	Unilever N.V.	56188	56188000000	unilever.com
5	5. Anheuser-Busch InBev	$54,619	-3.20%	Wine & Spirits	Anheuser-Busch InBev	54619	54619000000	ab-inbev.com

Writing to CSV/Excel file¶

Once we finished our table creation, we can write it as a file that we can later import to our CRM system.

(
    clean_names_df
    .to_csv(
        '../data/top_100_cpg_companies.csv',
        index=False
        )
)

From Excel to Pandas

Reading Data from Web Sites

Contents