Reading Data from Web Sites

Many times you want to add external data to your analysis. External data is often found in external web sites, usually in the format of an HTML table. You can import HTML directly into Excel using “Import”.

In this notebook, we will learn how to use pandas read_html to load tables from web site with ease, including cases where the data is complex to retrieve.

%pip install lxml requests
Requirement already satisfied: lxml in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (4.8.0)
Requirement already satisfied: requests in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (2.27.1)
Requirement already satisfied: idna<4,>=2.5 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from requests) (3.3)
Requirement already satisfied: certifi>=2017.4.17 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from requests) (2022.5.18.1)
Requirement already satisfied: charset-normalizer~=2.0.0 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from requests) (2.0.12)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from requests) (1.26.9)
WARNING: There was an error checking the latest version of pip.

Note: you may need to restart the kernel to use updated packages.

Loading HTML data

We will load a few tables from simple sites as well as complex ones

import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np

Loading from simple web sites

If you have a simple web page with a few tables, you can pass the URL of the page to Pandas and call read_html.

url = 'https://en.wikipedia.org/wiki/The_Championships,_Wimbledon'
dfs = (
    pd
    .read_html(url)
)

read_html is returning a list of all the tables in the page. We can scroll through the content of the tables on that page, by changing the index in the list. We can print the second element (index 1 in the list), we will get the table with the prize amounts for the Wimbeldon Tennis competition.

dfs[1]
0 1 2 3 4 5 6 7 8 9 10 11
0 2021 Event W F SF QF Round of 16 Round of 32 Round of 64 Round of 1281 Q3 Q2 Q1
1 Singles £1,700,000 £900,000 £465,000 £300,000 £181,000 £115,000 £75,000 £48,000 £25,500 £15,500 £8,500
2 Doubles £480,000 £240,000 £120,000 £60,000 £30,000 £19,000 £12,000 NaN NaN NaN NaN

Loading from complex web sites

Many times pages will be more complex and we can still extract the relevant data.

url = 'https://ncov2019.live/data'
table_id = 'sortable_table_world'

Here we will look like a browser to the web site, as some web sites block their context to bots and crawlers. Since we are not hitting the web site a lot, pretending to be a browser is considered as acceptable usage.

We will send to the web site a header that a browser is sending and get the reply of the page as text. The text (r.text) will be parsed by the read_html function and create the dataframe. We will also add the HTML id of the table that we want. This id (‘sortable_table_world’) can be found when using the inspect option in Chrome, Safari, Firefox and other browsers.

import requests

header = {
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
  "X-Requested-With": "XMLHttpRequest"
}

response = requests.get(url, headers=header)

df_list = (
  pd
  .read_html(
    response.text, 
    attrs={'id': table_id}
  )
)
df_list[0]
Name Confirmed Per Million Deceased Per Million.1 Tests Active Recovered Per Million.2 Vaccinated Population
0 TOTAL 527,450,948 +453,711 66535.988 6,300,159 +398 794.742 6,356,859,452 +521,215 37819979 483,231,537 +443,455 60957.873 5173118206 7927303146
1 ★ Afghanistan 179,716 +42 4428.960 7,698 +1 189.710 958454 9277 162741 4010.630 5359930 40577445
2 ★ Albania 275,864 +26 96056.770 3497 1217.670 1822887 242 272125 94754.840 1302093 2871885
3 ★ Algeria 265,854 +3 5863.480 6875 151.630 230861 80593 178386 3934.350 7840131 45340682
4 ★ Andorra 42572 0.000 153 0.000 249838 662 41757 0.000 57874 77498
... ... ... ... ... ... ... ... ... ... ... ...
228 ★ Western Sahara 10 0.000 1 0.000 Unknown Unknown 9 0.000 Unknown 624963
229 ★ Niue 9 0.000 Unknown 0.000 Unknown Unknown 9 0.000 1650 1647
230 ★ Nauru 8 0.000 Unknown 0.000 Unknown Unknown 5 0.000 9464 10952
231 ★ Saint Helena 2 0.000 Unknown 0.000 Unknown Unknown 2 0.000 4361 6111
232 ★ China 223,145 +169 155.030 5,222 +3 3.630 160000000 4405 213,518 +476 148.350 1287195000 1439323776

233 rows × 11 columns

covid_19_status = df_list[0]

Simple Data Visualization

External data is often “messy”, and it is a best practice to clean it up before staring to work with it. In the next cell we will make sure that the numeric columns are indeed numeric (removing “Unknown”, for example). We will also resolve the conflict of having multiple columns with the same name (“Per Million”, in this exammple), and translate to more meaningful names.

  • Start with the Covid-19 status table

  • Convert the first ‘Per Million’ column from the table to be numeric

  • Convert the second ‘Per Million’ column from the table to be numeric

  • Fill missing values in the table with 0

  • Sort the table in decreasing order by number of deceased cases

  • Take the first top 20 country rows

  • Plot the results

  • as bar graph with the name of the country and the number of deceased cases per million people

(
    covid_19_status
    .assign(confirmed_cases_per_m=lambda x: pd.to_numeric(x['Per Million'],  errors='coerce'))
    .assign(deceased_cases_per_m=lambda x: pd.to_numeric(x['Per Million.1'],  errors='coerce'))
    .fillna(0)
    .sort_values(by='deceased_cases_per_m',ascending=False)
    .iloc[:20,:]
    .plot
    .bar(
        x='Name', 
        y='deceased_cases_per_m',
        title='Death Cases by COVID-19 by Country')
);
../_images/01.02 - loading data from HTML_16_0.png

More examples

Using Functions

Functions are a more complicated topic and you are not expected to write functions most of the times. Functions are also available in Excel as Macros. You should be familiar with using functions, read them to understand what they do, and also modifying them to fit your needs.

We often find the a set of commands is repeating itself many times in our analyses. This is the time to wrap these commands as a function. In the next cell, we will create a function that is wrapping a few of the commands that we used in the previous cells such as choosing a specific index of a table from the table list and setting the first row as header.

  • Read the HTML content of the given URL

  • Parse the HTML and extract all the tables in it

  • Take the first (of other given index) table from that list

  • Set the header of the table from the values in the first row

  • Skip the first (header) row

  • Return that table

def get_table_from_html(url, attrs=None, index=0):
    response = requests.get(url, headers=header)
    df_list = pd.read_html(response.text, attrs=attrs)
    df = df_list[index]
    df.columns = df.iloc[0]
    df = df[1:]
    return df

Once we have the function defined (def), we can try it in a simple way. Let’s say that we want to add to our CRM (Customer Relationship Management) system a new target market, as the consumer good companies (CPG). We find in the web a table with the top CPG companies for 2019.

url_top_companies = 'https://consumergoods.com/top-100-consumer-goods-companies-2019'
df = get_table_from_html(url_top_companies)

Let’s look at the table

df.style
0 Rank/Company Net Revenue ($M) 1-Year Sales Growth Key Product Categories
1 1. Nestle SA* $92,085 2.10% Food, Beverage, Confectionery
2 2. Procter & Gamble $66,832 2.70% Household Goods, Health & Beauty Aids, OTC Pharma
3 3. PepsiCo $64,661 1.80% Food, Beverage
4 4. Unilever N.V.* $56,188 -5.10% Household Goods, Food, Health & Beauty Aids
5 5. Anheuser-Busch InBev $54,619 -3.20% Wine & Spirits
6 6. Christian Dior* $51,607 7.20% Apparel/Footwear/Accessories, Wine & Spirits, Health & Beauty Aids
7 7. LVMH Moët Hennessy Louis Vuitton* $51,607 9.80% Apparel/Footwear/Accessories, Wine & Spirits, Health & Beauty Aids
8 8. JBS S.A.* $44,587 11.30% Food
9 9. Tyson Foods $40,052 4.70% Food
10 10. Nike Inc. $36,397 6.00% Apparel/Footwear/Accessories
11 11. Imperial Brands PLC* $33,641 0.90% Tobacco
12 12. 3M Co. $32,765 3.50% Household Goods
13 13. Coca-Cola Co. $31,856 -10.00% Beverage, Food
14 14. L’Oreal* $29,687 3.50% Health & Beauty Aids
15 15. Philip Morris International $29,625 3.10% Tobacco
16 16. Danone* $27,168 -0.60% Food, Beverage
17 17. British American Tobacco PLC* $26,993 25.20% Tobacco
18 18. Kraft Heinz $26,259 0.70% Food
19 19. Mondelez International $25,938 0.20% Food, Confectionary
20 20. Haier Smart Home Co. $25,755 12.20% Housewares/Appliances
21 21. Altria Group $25,364 -0.80% Tobacco
22 22. Heineken Holding N.V.* $24,765 3.90% Wine & Spirits
23 23. Adidas AG* $24,153 3.30% Apparel/Footwear/Accessories
24 24. WH Group Ltd. $22,605 1.00% Food
25 25. Henkel AG* $21,931 -0.60% Household Goods, Health & Beauty Aids
26 26. Whirlpool Corp. $21,037 -1.00% Housewares/Appliances
27 27. Japan Tobacco* $20,565 3.60% Tobacco
28 28. Fonterra Cooperative Group $20,438 6.30% Food, Beverage
29 29. Asahi Group Holdings* $19,677 1.70% Wine & Spirits, Beverages, Food
30 30. BSH Hausgerate* $19,667 -3.00% Housewares/Appliances
31 31. San Miguel Corp.* $19,652 24.10% Wine & Spirits, Food
32 32. Kimberly-Clark Corp. $18,486 0.80% Household Goods
33 33. Kirin Holdings* $17,916 3.60% Wine & Spirits
34 34. Associated British Foods* $17,164 1.40% Food
35 35. General Mills $15,740 0.80% Food
36 36. Colgate-Palmolive Co. $15,544 0.60% Household Goods
37 37. Kering* $15,060 26.30% Apparel/Footwear/Accessories
38 38. Grupo Bimbo* $14,765 7.80% Food
39 39. Kao Corp.* $13,995 1.30% Household Goods, Health & Beauty Aids
40 40. Stanley Black & Decker $13,982 7.80% Housewares/Appliances
41 41. Johnson & Johnson (Consumer) $13,853 1.80% OTC Pharma, Household Goods, Health & Beauty Aids
42 42. VF Corp. $13,849 17.30% Apparel/Footwear/Accessories
43 43. Uni-President Enterprises* $13,845 7.90% Food
44 44. Estee Lauder Companies $13,683 15.70% Health & Beauty Aids
45 45. Kellogg Co. $13,547 5.40% Food
46 46. Diageo PLC* $13,405 0.90% Wine & Spirits
47 47. AB Electrolux* $12,821 2.80% Housewares/Appliances
48 48. RB* $12,597 -0.20% Household Goods
49 49. Essity* $12,239 8.40% Household Goods
50 50. Compagnie Financiere Richemont SA* $12,100 3.10% Apparel/Footwear/Accessories
51 51. Nipponham Group* $11,779 5.60% Food
52 52. Keurig Dr Pepper $11,024 2.30% Beverages
53 53. MolsonCoors Brewing Co. $10,770 -2.10% Wine & Spirits
54 54. Shiseido Co.* $10,160 8.90% Health & Beauty Aids
55 55. Pernod Ricard* $9,905 -0.30% Wine & Spirits
56 56. Nintendo Co.* $9,797 115.80% Toys & Games
57 57. China Mengniu Dairy Co.* $9,691 14.70% Food
58 58. PVH Corp. $9,657 8.30% Apparel/Footwear/Accessories
59 59. Hormel Foods $9,546 4.10% Food
60 60. Coty, Inc. $9,398 22.80% Health & Beauty Aids
61 61. Carlsberg A/S* $9,233 3.00% Wine & Spirits
62 62. Saputo Inc.* $8,766 3.40% Food
63 63. Campbell Soup Co. $8,685 10.10% Food, Beverages
64 64. Newell Brands $8,631 -9.60% Household Goods, Housewares/Appliances
65 65. Swatch Group SA* $8,535 6.10% Apparel/Footwear/Accessories
66 66. BRF - Brasil Foods* $8,474 3.20% Food
67 67. GlaxoSmithKline Consumer Healthcare* $8,440 -1.20% OTC Pharma
68 68. Beiersdorf AG* $7,972 2.50% Household Goods, Health & Beauty Aids
69 69. Conagra Brands $7,938 1.40% Food
70 70. Hershey Co. $7,791 3.70% Food, Confectionery
71 71. Dean Foods Co. $7,755 -0.50% Food, Beverages
72 72. Constellation Brands $7,585 3.50% Wine & Spirits
73 73. Groupe SEB* $7,508 5.10% Housewares/Appliances
74 74. Thai Beverage Public Co.* $7,496 20.90% Wine & Spirits, Food, Beverages
75 75. J.M. Smucker Co. $7,357 -0.50% Food
76 76. PT Gudang Garam* $6,805 14.90% Tobacco
77 77. Hanesbrands $6,804 5.10% Apparel/Footwear/Accessories
78 78. Hermes International* $6,575 7.50% Apparel/Footwear/Accessories
79 79. Unicharm Corp.* $6,388 7.30% Household Goods
80 80. Bandai Namco Holdings* $6,295 9.40% Toys & Games
81 81. Post Holdings $6,257 19.80% Food
82 82. Ralph Lauren Corp. $6,182 -7.10% Apparel/Footwear/Accessories
83 83. Clorox Co. $6,124 2.50% Household Goods
84 84. Bayer Consumer Health* $6,006 -7.00% OTC Pharma
85 85. Tapestry Inc. $5,880 31.00% Apparel/Footwear/Accessories
86 86. McCormick & Co. $5,409 11.90% Food
87 87. Savencia SA $5,360 0.20% Food
88 88. Kewpie Corp.* $5,323 2.10% Food
89 89. Avon Products $5,248 -5.70% Health & Beauty Aids, Apparel/Footwear/Accessories, Household Goods
90 90. Electronic Arts $5,150 6.30% Toys & Games
91 91. First Pacific Co. $5,136 -1.90% Food
92 92. ITC Ltd.* $5,048 -22.00% Tobacco, Food, Health & Beauty Aids
93 93. Herbalife Ltd. $4,892 10.50% Food, Household Goods, Health & Beauty Aids
94 94. Sapporo Holdings Ltd.* $4,843 -5.40% Wine & Spirits, Food, Beverages
95 95. Arcelik A.S.* $4,665 29.10% Housewares/Appliances
96 96. Hasbro Inc. $4,580 -12.10% Toys & Games
97 97. Mattel Inc. $4,511 -7.60% Toys & Games
98 98. Husqvarna AB* $4,244 4.30% Housewares/Appliances
99 99. Church & Dwight Co. $4,146 9.80% Household Goods, Health & Beauty Aids
100 100. Spectrum Brands Holdings $3,809 2.80% Household Goods, Housewares/Appliances

Cleaning the data

We see that the table has a column with both the rank and the name of the company, and we want to have only the name of company in a column (as we have the rank in the index of the table). There are a fwe options to do it using string functions such as RIGHT or LEFT. In this case, you could use:

RIGHT(cell,LEN(cell)-SEARCH("char", cell))

In this example, we will use a powerful tool of regular expression (regex). We will not dive into the syntax of regex now, and it time you will find it easier to read or even write.

clean_names_df = (
    df
    .assign(company_name = lambda x : x['Rank/Company'].str.extract(r'\d+. ([^*]*)[*]?'))
)

The next manipulation that we want to do is to have a numeric column with the annual revenues. The data in the table is in millions (which is better for presentation purposes). We will run two manipulations on the data:

  • Start with the clean verion of the table data frame above

  • Remove the dollar sign ($) and the comma (,) from the ‘Net Revenue’ Column

  • Convert the above value to number and scale it to dollars by multipling by a million

clean_names_df = (
    clean_names_df
    .assign(annual_num = lambda x : x['Net Revenue ($M)'].replace('[\$,]', '', regex=True))
    .assign(annual_revenues = lambda x : pd.to_numeric(x.annual_num)*10**6)
)
clean_names_df.head()
Rank/Company Net Revenue ($M) 1-Year Sales Growth Key Product Categories company_name annual_num annual_revenues
1 1. Nestle SA* $92,085 2.10% Food, Beverage, Confectionery Nestle SA 92085 92085000000
2 2. Procter & Gamble $66,832 2.70% Household Goods, Health & Beauty Aids, OTC Pharma Procter & Gamble 66832 66832000000
3 3. PepsiCo $64,661 1.80% Food, Beverage PepsiCo 64661 64661000000
4 4. Unilever N.V.* $56,188 -5.10% Household Goods, Food, Health & Beauty Aids Unilever N.V. 56188 56188000000
5 5. Anheuser-Busch InBev $54,619 -3.20% Wine & Spirits Anheuser-Busch InBev 54619 54619000000

Enriching the data

The next column we want to add to our table before importing it to our CRM system is domain of the company. We will need to use for it a search engine such as Google and a python library that is wrapping it to make it easy for us to use it. First, let’s install the python library:

pip install google
Collecting google
  Downloading google-3.0.0-py2.py3-none-any.whl (45 kB)
?25l     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/45.3 kB ? eta -:--:--
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.3/45.3 kB 5.5 MB/s eta 0:00:00
?25hRequirement already satisfied: beautifulsoup4 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from google) (4.11.1)
Requirement already satisfied: soupsieve>1.2 in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (from beautifulsoup4->google) (2.3.2.post1)
Installing collected packages: google
Successfully installed google-3.0.0
WARNING: There was an error checking the latest version of pip.

Note: you may need to restart the kernel to use updated packages.

Now, let’s try it on one of the company names

from googlesearch import search
for url in search('Nestle', stop=1):
    print(url)
https://www.nestle.com/

The above cell is using a powerful concept in programming called for-loop or list-comprehesion. However, we will try to minimize this programming concept and convert it to a simpler command that is returning the first value

next(search('Hasbro Inc.', stop=1),'Nope')
'https://shop.hasbro.com/en-us'

Search Domain Function

Now we can see the following function that will return to us the domain name of every company we will call it. The function is using various complicated concepts such as regex that we already saw, and try-except part which is used to handle errors (what happened if we don’t find a domain name). The function is also waiting 20 seconds before calling the search engine to avoid exceeding some limits that systems are putting against bots (which we just created). The sleep will make the actual run of the data much longer, but it is still automated and we don’t need to run this repeative task manually.

For now, we don’t need to understand these details:

  • Wait 20 seconds between calls to avoid throttling by Google

  • Search for the website of the name of the company and return the first result

  • Extract the format of the http or https of the above result

  • Take the first match

  • Return the match or ‘NA’ if not found

import re
import time

def search_domain(company_name):
    time.sleep(20)
    try: 
        domain = next(search(str(company_name)+' website', stop=1))
        m = re.search('https?://.*?\.?([0-9a-z-]+\.co[^/]+)/.*', domain)
        answer = m.group(1) 
    except:
        answer = 'NA'
    return answer

Let’s test it with one of the compamy names

search_domain('3M co.')
'3m.com'

Now it is time to call our search_domain function on all the rows in our table. This will take an hour or so, due to our sleep time between the calls. For now, let’s run it only on the head (first 5) of the table.

clean_names_df_head = clean_names_df.head()
clean_names_df_head['domain'] = (
    clean_names_df_head
    .company_name
    .apply(search_domain)
)
clean_names_df_head
Rank/Company Net Revenue ($M) 1-Year Sales Growth Key Product Categories company_name annual_num annual_revenues domain
1 1. Nestle SA* $92,085 2.10% Food, Beverage, Confectionery Nestle SA 92085 92085000000 nestle.com
2 2. Procter & Gamble $66,832 2.70% Household Goods, Health & Beauty Aids, OTC Pharma Procter & Gamble 66832 66832000000 pg.com
3 3. PepsiCo $64,661 1.80% Food, Beverage PepsiCo 64661 64661000000 pepsico.com
4 4. Unilever N.V.* $56,188 -5.10% Household Goods, Food, Health & Beauty Aids Unilever N.V. 56188 56188000000 unilever.com
5 5. Anheuser-Busch InBev $54,619 -3.20% Wine & Spirits Anheuser-Busch InBev 54619 54619000000 ab-inbev.com

Writing to CSV/Excel file

Once we finished our table creation, we can write it as a file that we can later import to our CRM system.

(
    clean_names_df
    .to_csv(
        '../data/top_100_cpg_companies.csv',
        index=False
        )
)