Reading Data from External APIs

Often you want to add external data to your analysis. External data can be retrieved from an API (Application Programming Interface) provided by various services such as CRM (Customer Relationship Management) systems. While tools such as Excel can import HTML directly using “Import”, here we will retrieve API data programmatically.

In this notebook, we will learn how to use pandas functions such as json_normalize and from_dict to load data from these external APIs.

Loading API data

We will load a few tables from a simple API as well as from more complex ones.

import warnings
warnings.filterwarnings('ignore')  # silence library warnings in the notebook output
import pandas as pd
import numpy as np

Loading from a simple API

If you have a simple API with no authentication, you can pass the URL to the requests library that we used in our previous tutorial for loading HTML.

For this exercise we will use the API of the Wikipedia page views. We will use it to test the idea that people who are considering buying a stock visit the Wikipedia page of that company, and therefore, if we see an increase in the views of a page, we should also see an increase in the price of that share.

We will check the format of the API call from the technical page. We see that we need to plug in the name of the stock ticker, and we will use the f-string syntax, which replaces {page} (or any other variable in curly brackets) with the value of that variable (for example, AMZN). Note that the URL also encodes the requested date range (March 1–30, 2020) in YYYYMMDDHH format.

page = 'AMZN'

# identify the client; the Wikimedia API may reject requests without a User-Agent
header = {
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
  "X-Requested-With": "XMLHttpRequest"
}

page_views_api_url = f'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/{page}/daily/2020030100/2020033000'

import requests
response = requests.get(page_views_api_url, headers=header)
response
<Response [200]>

We can use two methods to extract the data that we want from the JSON format: the first uses json_normalize and the second uses from_dict. Both produce the same table here; json_normalize can additionally flatten nested JSON structures.

df = (
    pd
    .json_normalize(
        response.json(), 
        'items'
        )
)
df.head(5)
project article granularity timestamp access agent views
0 en.wikipedia AMZN daily 2020030100 all-access all-agents 2
1 en.wikipedia AMZN daily 2020030200 all-access all-agents 6
2 en.wikipedia AMZN daily 2020030300 all-access all-agents 16
3 en.wikipedia AMZN daily 2020030400 all-access all-agents 1
4 en.wikipedia AMZN daily 2020030500 all-access all-agents 1
page_view_data = (
    requests
    .get(page_views_api_url, headers=header)
    .json()['items']
)
page_views_table = (
    pd
    .DataFrame
    .from_dict(page_view_data)
)
page_views_table.head(5)
project article granularity timestamp access agent views
0 en.wikipedia AMZN daily 2020030100 all-access all-agents 2
1 en.wikipedia AMZN daily 2020030200 all-access all-agents 6
2 en.wikipedia AMZN daily 2020030300 all-access all-agents 16
3 en.wikipedia AMZN daily 2020030400 all-access all-agents 1
4 en.wikipedia AMZN daily 2020030500 all-access all-agents 1

Simple Data Visualization

  • Start with the page view table from the previous API call

  • Convert the timestamp column to datetime using the %Y%m%d%H format

  • Extract the day of the month from the above timestamp

  • Plot the results

  • as a bar chart with the day of the month as x and the number of views as y

  • and title the graph ‘Page views for …’ with the name of the page

(
    page_views_table
    .assign(timestamp = lambda x: pd.to_datetime(x.timestamp, format='%Y%m%d%H', errors='coerce'))
    .assign(day = lambda x: x.timestamp.dt.day)
    .plot
    .bar(
        x='day', y='views',
        title=f'Page views for {page}'
    )
);
[Figure: bar chart of daily page views for AMZN in March 2020]

Loading the page views of all NASDAQ-100 companies

We saw in the previous section how to load the values of a table from Wikipedia (or other HTML pages). Let’s use that now to load all the NASDAQ-100 index companies and compare their page views.

%pip install lxml
Requirement already satisfied: lxml in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (4.8.0)

Note: you may need to restart the kernel to use updated packages.
nasdaq_100_wikipedia_entry = 'https://en.wikipedia.org/wiki/NASDAQ-100'
nasdaq_100_tickers = (
    pd
    .read_html(
        nasdaq_100_wikipedia_entry, 
        match='Ticker'
    )
)
nasdaq_100_tickers[0]
Company Ticker GICS Sector GICS Sub-Industry
0 Activision Blizzard ATVI Communication Services Interactive Home Entertainment
1 Adobe ADBE Information Technology Application Software
2 ADP ADP Information Technology Data Processing & Outsourced Services
3 Airbnb ABNB Consumer Discretionary Internet & Direct Marketing Retail
4 Align ALGN Health Care Health Care Supplies
... ... ... ... ...
97 Walgreens Boots Alliance WBA Consumer Staples Drug Retail
98 Workday WDAY Information Technology Application Software
99 Xcel Energy XEL Utilities Multi-Utilities
100 Zoom ZM Information Technology Application Software
101 Zscaler ZS Information Technology Application Software

102 rows × 4 columns
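
read_html returns a list of tables; the match='Ticker' argument keeps only the tables whose text contains that word. As an aside, if you just want the ticker symbols themselves, a quick sketch:

# take the Ticker column of the first matched table as a plain list
tickers = nasdaq_100_tickers[0]['Ticker'].tolist()
tickers[:5]  # ['ATVI', 'ADBE', 'ADP', 'ABNB', 'ALGN']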

Now we want to get all the views of each company on our list. We will define a simple function that takes the name of the company and returns the views of this company’s page.

  • Generate the URL of the wikipedia page view using the name of the company

  • Load the API response from above URL

  • Parse the JSON format of the response using the key ‘items’

  • Return the newly created dataframe, or an empty dataframe if the call failed

def get_views(company):
    # company is the one-row dataframe that groupby passes in;
    # company.iat[0, 0] is the company name in its first (only) cell
    company_page_url = f'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/{company.iat[0,0]}/daily/2020090100/2020093000'
    response = requests.get(company_page_url, headers=header)
    try:
        df = (
            pd
            .json_normalize(
                response.json(), 
                'items'
            )
        )
    except Exception:
        # the page may not exist or the response may lack 'items'
        df = pd.DataFrame()
    
    return df
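
Note that get_views expects a one-row dataframe (the group that groupby will pass to apply below). A quick standalone check might look like this (a sketch; sample is a hypothetical one-row frame mimicking such a group):

sample = pd.DataFrame({'Company': ['Adobe']})  # hypothetical one-row group
get_views(sample).head()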

Now we can apply the function above on each of the rows from the NASDAQ-100 tickers list:

  • Start with the first table from the NASDAQ-100 list above

  • Group the tickers by the company name (in case we have multiple tickers per company)

  • Apply the get_views function on each of the company names and concatenate all the data frames together

nasdaq_100_views = (
    nasdaq_100_tickers[0]
    .head(2)  # limit to the first two companies to keep the demo fast
    .groupby('Company')[['Company']]
    .apply(get_views)
)
nasdaq_100_views
project article granularity timestamp access agent views
Company
Activision Blizzard 0 en.wikipedia Activision_Blizzard daily 2020090100 all-access all-agents 1437
1 en.wikipedia Activision_Blizzard daily 2020090200 all-access all-agents 1715
2 en.wikipedia Activision_Blizzard daily 2020090300 all-access all-agents 1989
3 en.wikipedia Activision_Blizzard daily 2020090400 all-access all-agents 1628
4 en.wikipedia Activision_Blizzard daily 2020090500 all-access all-agents 1260
5 en.wikipedia Activision_Blizzard daily 2020090600 all-access all-agents 1253
6 en.wikipedia Activision_Blizzard daily 2020090700 all-access all-agents 1218
7 en.wikipedia Activision_Blizzard daily 2020090800 all-access all-agents 1331
8 en.wikipedia Activision_Blizzard daily 2020090900 all-access all-agents 1537
9 en.wikipedia Activision_Blizzard daily 2020091000 all-access all-agents 1431
10 en.wikipedia Activision_Blizzard daily 2020091100 all-access all-agents 1352
11 en.wikipedia Activision_Blizzard daily 2020091200 all-access all-agents 1042
12 en.wikipedia Activision_Blizzard daily 2020091300 all-access all-agents 1066
13 en.wikipedia Activision_Blizzard daily 2020091400 all-access all-agents 1327
14 en.wikipedia Activision_Blizzard daily 2020091500 all-access all-agents 1346
15 en.wikipedia Activision_Blizzard daily 2020091600 all-access all-agents 1267
16 en.wikipedia Activision_Blizzard daily 2020091700 all-access all-agents 1289
17 en.wikipedia Activision_Blizzard daily 2020091800 all-access all-agents 1241
18 en.wikipedia Activision_Blizzard daily 2020091900 all-access all-agents 1117
19 en.wikipedia Activision_Blizzard daily 2020092000 all-access all-agents 1137
20 en.wikipedia Activision_Blizzard daily 2020092100 all-access all-agents 4108
21 en.wikipedia Activision_Blizzard daily 2020092200 all-access all-agents 3319
22 en.wikipedia Activision_Blizzard daily 2020092300 all-access all-agents 2992
23 en.wikipedia Activision_Blizzard daily 2020092400 all-access all-agents 2312
24 en.wikipedia Activision_Blizzard daily 2020092500 all-access all-agents 1806
25 en.wikipedia Activision_Blizzard daily 2020092600 all-access all-agents 1557
26 en.wikipedia Activision_Blizzard daily 2020092700 all-access all-agents 1549
27 en.wikipedia Activision_Blizzard daily 2020092800 all-access all-agents 1564
28 en.wikipedia Activision_Blizzard daily 2020092900 all-access all-agents 1585
29 en.wikipedia Activision_Blizzard daily 2020093000 all-access all-agents 1703
Adobe 0 en.wikipedia Adobe daily 2020090100 all-access all-agents 2003
1 en.wikipedia Adobe daily 2020090200 all-access all-agents 886
2 en.wikipedia Adobe daily 2020090300 all-access all-agents 844
3 en.wikipedia Adobe daily 2020090400 all-access all-agents 874
4 en.wikipedia Adobe daily 2020090500 all-access all-agents 708
5 en.wikipedia Adobe daily 2020090600 all-access all-agents 663
6 en.wikipedia Adobe daily 2020090700 all-access all-agents 795
7 en.wikipedia Adobe daily 2020090800 all-access all-agents 827
8 en.wikipedia Adobe daily 2020090900 all-access all-agents 754
9 en.wikipedia Adobe daily 2020091000 all-access all-agents 1080
10 en.wikipedia Adobe daily 2020091100 all-access all-agents 697
11 en.wikipedia Adobe daily 2020091200 all-access all-agents 679
12 en.wikipedia Adobe daily 2020091300 all-access all-agents 755
13 en.wikipedia Adobe daily 2020091400 all-access all-agents 806
14 en.wikipedia Adobe daily 2020091500 all-access all-agents 719
15 en.wikipedia Adobe daily 2020091600 all-access all-agents 850
16 en.wikipedia Adobe daily 2020091700 all-access all-agents 778
17 en.wikipedia Adobe daily 2020091800 all-access all-agents 779
18 en.wikipedia Adobe daily 2020091900 all-access all-agents 642
19 en.wikipedia Adobe daily 2020092000 all-access all-agents 863
20 en.wikipedia Adobe daily 2020092100 all-access all-agents 843
21 en.wikipedia Adobe daily 2020092200 all-access all-agents 775
22 en.wikipedia Adobe daily 2020092300 all-access all-agents 750
23 en.wikipedia Adobe daily 2020092400 all-access all-agents 784
24 en.wikipedia Adobe daily 2020092500 all-access all-agents 659
25 en.wikipedia Adobe daily 2020092600 all-access all-agents 717
26 en.wikipedia Adobe daily 2020092700 all-access all-agents 634
27 en.wikipedia Adobe daily 2020092800 all-access all-agents 740
28 en.wikipedia Adobe daily 2020092900 all-access all-agents 993
29 en.wikipedia Adobe daily 2020093000 all-access all-agents 832

Since we limited the run to the first two companies with head(2), the concatenated dataframe has 60 rows, where each row is the view count of a company on a specific day in September 2020. Running over the full list of 102 companies would produce a few thousand rows.
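
To run over the full index, drop the head(2) call. A sketch (assuming you are willing to wait for roughly 100 API calls; the pause length is an assumption, adjust it to the API’s rate limits):

import time

def get_views_politely(company):
    time.sleep(0.1)  # assumed pause between requests to stay polite toward the API
    return get_views(company)

nasdaq_100_views_full = (
    nasdaq_100_tickers[0]
    .groupby('Company')[['Company']]
    .apply(get_views_politely)
)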

Let’s look at one of the companies (for example, Activision Blizzard):

nasdaq_100_views.loc['Activision Blizzard']
project article granularity timestamp access agent views
0 en.wikipedia Activision_Blizzard daily 2020090100 all-access all-agents 1437
1 en.wikipedia Activision_Blizzard daily 2020090200 all-access all-agents 1715
2 en.wikipedia Activision_Blizzard daily 2020090300 all-access all-agents 1989
3 en.wikipedia Activision_Blizzard daily 2020090400 all-access all-agents 1628
4 en.wikipedia Activision_Blizzard daily 2020090500 all-access all-agents 1260
5 en.wikipedia Activision_Blizzard daily 2020090600 all-access all-agents 1253
6 en.wikipedia Activision_Blizzard daily 2020090700 all-access all-agents 1218
7 en.wikipedia Activision_Blizzard daily 2020090800 all-access all-agents 1331
8 en.wikipedia Activision_Blizzard daily 2020090900 all-access all-agents 1537
9 en.wikipedia Activision_Blizzard daily 2020091000 all-access all-agents 1431
10 en.wikipedia Activision_Blizzard daily 2020091100 all-access all-agents 1352
11 en.wikipedia Activision_Blizzard daily 2020091200 all-access all-agents 1042
12 en.wikipedia Activision_Blizzard daily 2020091300 all-access all-agents 1066
13 en.wikipedia Activision_Blizzard daily 2020091400 all-access all-agents 1327
14 en.wikipedia Activision_Blizzard daily 2020091500 all-access all-agents 1346
15 en.wikipedia Activision_Blizzard daily 2020091600 all-access all-agents 1267
16 en.wikipedia Activision_Blizzard daily 2020091700 all-access all-agents 1289
17 en.wikipedia Activision_Blizzard daily 2020091800 all-access all-agents 1241
18 en.wikipedia Activision_Blizzard daily 2020091900 all-access all-agents 1117
19 en.wikipedia Activision_Blizzard daily 2020092000 all-access all-agents 1137
20 en.wikipedia Activision_Blizzard daily 2020092100 all-access all-agents 4108
21 en.wikipedia Activision_Blizzard daily 2020092200 all-access all-agents 3319
22 en.wikipedia Activision_Blizzard daily 2020092300 all-access all-agents 2992
23 en.wikipedia Activision_Blizzard daily 2020092400 all-access all-agents 2312
24 en.wikipedia Activision_Blizzard daily 2020092500 all-access all-agents 1806
25 en.wikipedia Activision_Blizzard daily 2020092600 all-access all-agents 1557
26 en.wikipedia Activision_Blizzard daily 2020092700 all-access all-agents 1549
27 en.wikipedia Activision_Blizzard daily 2020092800 all-access all-agents 1564
28 en.wikipedia Activision_Blizzard daily 2020092900 all-access all-agents 1585
29 en.wikipedia Activision_Blizzard daily 2020093000 all-access all-agents 1703
  • Start with the page view table of all the NASDAQ-100 companies

  • Filter to include only the views of the focused company (Activision Blizzard)

  • Convert the timestamp column to datetime using the %Y%m%d%H format

  • Extract the day of the month from the above timestamp

  • Plot the results

  • as a bar chart with the day of the month as x and the number of views as y

  • and title the graph ‘Page views for …’ with the name of the focused company

focused_company = 'Activision Blizzard'
(
    nasdaq_100_views
    .loc[focused_company]
    .assign(timestamp=lambda x: pd.to_datetime(x.timestamp, format='%Y%m%d%H', errors='coerce'))
    .assign(day = lambda x: x.timestamp.dt.day)
    .plot
    .bar(
        x='day', y='views',
        title=f'Page views for {focused_company}'
    )
);
[Figure: bar chart of daily page views for Activision Blizzard in September 2020]

Loading from a complex API

Many times APIs are more complex; nevertheless, we can still extract the relevant data.

Let’s get some stock data from alphavantage.co

This web service provides useful investment information through an API; however, like many other API providers, it requires an API key to identify the request (and sometimes charges for it).

To allow usage of secrets such as the API key for the service call, we will use a simple mechanism based on a .env file that each user creates and updates with their private keys.

Once we have the module that handles secrets installed, we can import it into the notebook and use it. You can get your free API key from alphavantage.co. The next time you execute the cell (or the whole notebook), the API key will be retrieved from the local file.
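
For example, a minimal .env file placed next to the notebook would look like this (the value is a placeholder; keep this file out of version control):

# .env -- do not commit this file
ALPHAVANTAGE_API_KEY=your_key_here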

%pip install python-dotenv
Requirement already satisfied: python-dotenv in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (0.20.0)

Note: you may need to restart the kernel to use updated packages.
import os
from dotenv import load_dotenv
load_dotenv()  # read key-value pairs from the local .env file into environment variables

ALPHAVANTAGE_API_KEY = os.getenv("ALPHAVANTAGE_API_KEY")

From the API documentation we can learn how to extract time-series stock information for a specific symbol. Let’s use Amazon (AMZN) as an example.

symbol = 'AMZN'

request_url = f'https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol={symbol}&interval=1min&apikey={ALPHAVANTAGE_API_KEY}'
df = (
    pd
    .read_json(request_url)
)
df
Meta Data Time Series (1min)
1. Information Intraday (1min) open, high, low, close prices ... NaN
2. Symbol AMZN NaN
3. Last Refreshed 2022-05-20 19:59:00 NaN
4. Interval 1min NaN
5. Output Size Compact NaN
... ... ...
2022-05-20 15:07:00 NaN {'1. open': '2108.8392', '2. high': '2111.6600...
2022-05-20 15:06:00 NaN {'1. open': '2111.4100', '2. high': '2111.4100...
2022-05-20 15:05:00 NaN {'1. open': '2111.1683', '2. high': '2111.1683...
2022-05-20 15:04:00 NaN {'1. open': '2113.0875', '2. high': '2114.8200...
2022-05-20 15:03:00 NaN {'1. open': '2117.0000', '2. high': '2117.0100...

106 rows × 2 columns

Extracting the relevant data

We see that the data that we are looking for (open, close, high and low prices, and volume in each minute) is available in a column of the dataframe. We will filter the table to include only the rows with the price data and split it into separate columns.

  • Start with the dataframe table that was returned from the API call

  • Rename the long ‘Time Series…’ column name to short ‘ts’

  • Filter out (using query) all the rows where ‘ts’ column has no value (null)

  • Select (using loc) only the ‘ts’ column

  • Parse each value in the column as a series of key-value pairs and create a column for each pair

json_series = (
    df
    .rename(columns={'Time Series (1min)':'ts'})
    .query('ts.notnull()')
    .loc[:,'ts']
    .apply(pd.Series)
)
json_series
1. open 2. high 3. low 4. close 5. volume
2022-05-20 19:59:00 2157.0000 2157.0000 2157.0000 2157.0000 463
2022-05-20 19:55:00 2157.0000 2157.0000 2157.0000 2157.0000 227
2022-05-20 19:45:00 2157.1000 2157.1000 2157.1000 2157.1000 529
2022-05-20 18:43:00 2154.4000 2154.4000 2154.4000 2154.4000 435
2022-05-20 18:36:00 2151.8200 2151.8200 2151.8200 2151.8200 134
... ... ... ... ... ...
2022-05-20 15:07:00 2108.8392 2111.6600 2108.7900 2109.9900 8308
2022-05-20 15:06:00 2111.4100 2111.4100 2108.5600 2109.3198 4978
2022-05-20 15:05:00 2111.1683 2111.1683 2107.8600 2108.1201 9070
2022-05-20 15:04:00 2113.0875 2114.8200 2110.0200 2111.6524 7870
2022-05-20 15:03:00 2117.0000 2117.0100 2112.5700 2112.6300 13826

100 rows × 5 columns
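
As an aside, a sketch of an equivalent approach is to fetch the raw JSON with requests and build the table with from_dict, as we did earlier (assuming the response keeps the 'Time Series (1min)' key):

raw = requests.get(request_url).json()
prices = (
    pd.DataFrame
    .from_dict(raw['Time Series (1min)'], orient='index')  # one row per timestamp
)

Note that either way the values arrive as strings, which is why the next step converts them with astype('float').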

Quick Visualization

Now that we have the prices from the API we can plot them.

  • Start with the parsed data table above

  • Slice the table to include only the high and low values (filter)

  • Convert the values to numeric form (astype(‘float’))

  • Reverse the order of the time-series to be from oldest to newest (iloc[::-1])

  • Plot the data

  • with the title of the ticker

  • and with 45° rotation.

(
    json_series
    .filter(['2. high','3. low'])
    .astype('float')
    .iloc[::-1]
    .plot(
        title=symbol, 
        rot=45
    )
)
<AxesSubplot:title={'center':'AMZN'}>
[Figure: line chart of AMZN 1-minute high and low prices]