{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Split a Table into Groups\n", "\n", "The _groupby_ function is one of the most powerful and useful functions for dataframes in Pandas. The main flow of the _groupby_ function is as follows:\n", "- **split** a large dataframe table into groups based on the values or categories in one or more columns, and then \n", "- **apply** an aggregation or other function to each group, and then \n", "- **combine** the results back together into a single table. \n", "\n", "[![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/aiola-lab/from-excel-to-pandas/blob/master/notebooks/03.02_group_by.ipynb)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading data\n", "\n", "We will now load another dataset, this time from the open-source [brewery-DB API](https://www.openbrewerydb.org/). We can analyze it to answer questions about states or other groups in the data." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "open_brewery_api_url = 'https://api.openbrewerydb.org/breweries?per_page=50'\n", "\n", "import requests\n", "response = requests.get(open_brewery_api_url)\n", "response" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idnamebrewery_typestreetaddress_2address_3citystatecounty_provincepostal_codecountrylongitudelatitudephonewebsite_urlupdated_atcreated_at
010-56-brewing-company-knox10-56 Brewing Companymicro400 Brown CirNoneNoneKnoxIndianaNone46534United States-86.62795441.2897156308165790None2021-10-23T02:24:55.243Z2021-10-23T02:24:55.243Z
110-barrel-brewing-co-bend-110 Barrel Brewing Colarge62970 18th StNoneNoneBendOregonNone97701-9847United StatesNoneNone5415851007http://www.10barrel.com2021-10-23T02:24:55.243Z2021-10-23T02:24:55.243Z
210-barrel-brewing-co-bend-210 Barrel Brewing Colarge1135 NW Galveston Ave Ste BNoneNoneBendOregonNone97703-2465United StatesNoneNone5415851007None2021-10-23T02:24:55.243Z2021-10-23T02:24:55.243Z
310-barrel-brewing-co-bend-pub-bend10 Barrel Brewing Co - Bend Publarge62950 NE 18th StNoneNoneBendOregonNone97701United States-121.280953644.09121095415851007None2021-10-23T02:24:55.243Z2021-10-23T02:24:55.243Z
410-barrel-brewing-co-boise-boise10 Barrel Brewing Co - Boiselarge826 W Bannock StNoneNoneBoiseIdahoNone83702-5857United States-116.20292943.6185162083445870http://www.10barrel.com2021-10-23T02:24:55.243Z2021-10-23T02:24:55.243Z
\n", "
" ], "text/plain": [ " id name \\\n", "0 10-56-brewing-company-knox 10-56 Brewing Company \n", "1 10-barrel-brewing-co-bend-1 10 Barrel Brewing Co \n", "2 10-barrel-brewing-co-bend-2 10 Barrel Brewing Co \n", "3 10-barrel-brewing-co-bend-pub-bend 10 Barrel Brewing Co - Bend Pub \n", "4 10-barrel-brewing-co-boise-boise 10 Barrel Brewing Co - Boise \n", "\n", " brewery_type street address_2 address_3 city \\\n", "0 micro 400 Brown Cir None None Knox \n", "1 large 62970 18th St None None Bend \n", "2 large 1135 NW Galveston Ave Ste B None None Bend \n", "3 large 62950 NE 18th St None None Bend \n", "4 large 826 W Bannock St None None Boise \n", "\n", " state county_province postal_code country longitude \\\n", "0 Indiana None 46534 United States -86.627954 \n", "1 Oregon None 97701-9847 United States None \n", "2 Oregon None 97703-2465 United States None \n", "3 Oregon None 97701 United States -121.2809536 \n", "4 Idaho None 83702-5857 United States -116.202929 \n", "\n", " latitude phone website_url updated_at \\\n", "0 41.289715 6308165790 None 2021-10-23T02:24:55.243Z \n", "1 None 5415851007 http://www.10barrel.com 2021-10-23T02:24:55.243Z \n", "2 None 5415851007 None 2021-10-23T02:24:55.243Z \n", "3 44.0912109 5415851007 None 2021-10-23T02:24:55.243Z \n", "4 43.618516 2083445870 http://www.10barrel.com 2021-10-23T02:24:55.243Z \n", "\n", " created_at \n", "0 2021-10-23T02:24:55.243Z \n", "1 2021-10-23T02:24:55.243Z \n", "2 2021-10-23T02:24:55.243Z \n", "3 2021-10-23T02:24:55.243Z \n", "4 2021-10-23T02:24:55.243Z " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "brewery_list = (\n", " pd\n", " .json_normalize(\n", " response\n", " .json()\n", " )\n", ")\n", "\n", "brewery_list.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading all pages\n", "\n", "If the number of results from an API is high, the API often uses pagination and returns a constant number of results in each page. 
We can loop over the pages and collect all of the results:\n", "* Start with an empty list of values\n", "* For each page from 1 to 149 (`range(1, 150)` excludes its end value):\n", "  * Read the data from the API with the page number\n", "  * Add the data to the growing list\n", "* Finally, create a dataframe from the whole list of breweries " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "api_dataset = [] \n", "\n", "# Loop through the pages and append each page's results to api_dataset\n", "for page in range(1, 150):\n", "    response = requests.get(open_brewery_api_url + f\"&page={page}\").json()\n", "    api_dataset.extend(response)\n", "\n", "breweries_data = pd.json_normalize(api_dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Saving the dataset\n", "\n", "We can save the dataset for later use, in either Excel or CSV format. Here we use CSV." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "breweries_data.to_csv('../data/us_breweries.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Counting Values\n", "\n", "The simplest aggregation function for each group is the _size_. 
How many breweries do we have in each state?\n", "* Start with the brewery table above\n", "* Group the rows by _state_\n", "* Count the size of each group" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "state\n", "Alabama 44\n", "Alaska 51\n", "Arizona 116\n", "Arkansas 43\n", "Bouche du Rhône 2\n", "California 855\n", "Colorado 391\n", "Connecticut 86\n", "Delaware 26\n", "District of Columbia 16\n", "Florida 290\n", "Georgia 96\n", "Hawaii 21\n", "Idaho 57\n", "Illinois 232\n", "Indiana 148\n", "Iowa 83\n", "Kansas 40\n", "Kentucky 53\n", "Louisiana 40\n", "MIssouri 1\n", "Maine 109\n", "Maryland 105\n", "Massachusetts 152\n", "Michigan 355\n", "Minnesota 161\n", "Mississippi 15\n", "Missouri 128\n", "Montana 87\n", "Nebraska 53\n", "Nevada 47\n", "New Hampshire 69\n", "New Jersey 110\n", "New Mexico 82\n", "New York 384\n", "North Carolina 277\n", "North Dakota 26\n", "Ohio 279\n", "Oklahoma 39\n", "Oregon 264\n", "Pennsylvania 317\n", "Rhode Island 21\n", "South Carolina 69\n", "South Dakota 40\n", "Tennessee 102\n", "Texas 315\n", "Utah 34\n", "Vermont 54\n", "Virginia 237\n", "Washington 437\n", "Washington 1\n", "West Virginia 28\n", "Wisconsin 199\n", "Wyoming 29\n", "dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " breweries_data\n", " .groupby('state')\n", " .size()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sorting Values\n", "\n", "Sorting the values is also simple with the _sort_values()_ function:\n", "* Start with the brewery table above\n", "* Group the rows by _state_\n", "* Count the size of each group\n", "* Sort the states by their brewery count in descending order" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "state\n", "California 855\n", "Washington 437\n", "Colorado 391\n", "New York 384\n", "Michigan 355\n", 
"Pennsylvania 317\n", "Texas 315\n", "Florida 290\n", "Ohio 279\n", "North Carolina 277\n", "Oregon 264\n", "Virginia 237\n", "Illinois 232\n", "Wisconsin 199\n", "Minnesota 161\n", "Massachusetts 152\n", "Indiana 148\n", "Missouri 128\n", "Arizona 116\n", "New Jersey 110\n", "Maine 109\n", "Maryland 105\n", "Tennessee 102\n", "Georgia 96\n", "Montana 87\n", "Connecticut 86\n", "Iowa 83\n", "New Mexico 82\n", "South Carolina 69\n", "New Hampshire 69\n", "Idaho 57\n", "Vermont 54\n", "Kentucky 53\n", "Nebraska 53\n", "Alaska 51\n", "Nevada 47\n", "Alabama 44\n", "Arkansas 43\n", "Louisiana 40\n", "South Dakota 40\n", "Kansas 40\n", "Oklahoma 39\n", "Utah 34\n", "Wyoming 29\n", "West Virginia 28\n", "Delaware 26\n", "North Dakota 26\n", "Rhode Island 21\n", "Hawaii 21\n", "District of Columbia 16\n", "Mississippi 15\n", "Bouche du Rhône 2\n", "Washington 1\n", "MIssouri 1\n", "dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " breweries_data\n", " .groupby('state')\n", " .size()\n", " .sort_values(ascending=False)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualization of the data\n", "\n", "Let's do a simple visualization of the data without any map, just using the longitude and latitude values of each brewery. \n", "* Start with the brewery table above\n", "* Convert the longitude and latitude to be numeric values \n", "* Plot the results\n", "* as hexbin with longitude and latitudes as the coordinates \n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "(\n", " breweries_data\n", " .assign(longitude = lambda x : pd.to_numeric(x.longitude))\n", " .assign(latitude = lambda x : pd.to_numeric(x.latitude))\n", " .plot\n", " .hexbin(\n", " x='longitude', \n", " y='latitude', \n", " gridsize=20\n", " )\n", ");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The graph above resembles a map of the US and shows the geographic distribution of breweries. We can see the deep green in the NY area, which we already know is fourth on the list of states (after California, Washington, and Colorado), but we also see an even deeper blue area around Illinois. We will dive deeper into these questions when we merge the data with zip-code-level data and find where you should live or visit if you want to taste beer from as many breweries as possible. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More advanced Aggregation Functions\n", "\n", "Our dataset doesn't have many numeric fields, so we will load a different dataset to examine the more advanced aggregation options of _groupby_. We will use one of the famous datasets used for learning to build machine learning models. The last column, _annual_income_, has only two values, _<=50K_ and _>50K_, and the usual task is to predict in which bucket each person lands based on the other parameters, such as age, gender, and years of education. \n", "\n", "* Define the column names for the data\n", "* Read the CSV format of the data from the dataset URL\n", "* Set the names of the columns as defined above" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ageworkclassideducationeducation_nummarital_statusoccupationrelationshipracegendercapital_gaincapital_losshours_per_weeknative_countryannual_income
039State-gov77516Bachelors13Never-marriedAdm-clericalNot-in-familyWhiteMale2174040United-States<=50K
150Self-emp-not-inc83311Bachelors13Married-civ-spouseExec-managerialHusbandWhiteMale0013United-States<=50K
238Private215646HS-grad9DivorcedHandlers-cleanersNot-in-familyWhiteMale0040United-States<=50K
353Private23472111th7Married-civ-spouseHandlers-cleanersHusbandBlackMale0040United-States<=50K
428Private338409Bachelors13Married-civ-spouseProf-specialtyWifeBlackFemale0040Cuba<=50K
................................................
3255627Private257302Assoc-acdm12Married-civ-spouseTech-supportWifeWhiteFemale0038United-States<=50K
3255740Private154374HS-grad9Married-civ-spouseMachine-op-inspctHusbandWhiteMale0040United-States>50K
3255858Private151910HS-grad9WidowedAdm-clericalUnmarriedWhiteFemale0040United-States<=50K
3255922Private201490HS-grad9Never-marriedAdm-clericalOwn-childWhiteMale0020United-States<=50K
3256052Self-emp-inc287927HS-grad9Married-civ-spouseExec-managerialWifeWhiteFemale15024040United-States>50K
\n", "

32561 rows × 15 columns

\n", "
" ], "text/plain": [ " age workclass id education education_num \\\n", "0 39 State-gov 77516 Bachelors 13 \n", "1 50 Self-emp-not-inc 83311 Bachelors 13 \n", "2 38 Private 215646 HS-grad 9 \n", "3 53 Private 234721 11th 7 \n", "4 28 Private 338409 Bachelors 13 \n", "... ... ... ... ... ... \n", "32556 27 Private 257302 Assoc-acdm 12 \n", "32557 40 Private 154374 HS-grad 9 \n", "32558 58 Private 151910 HS-grad 9 \n", "32559 22 Private 201490 HS-grad 9 \n", "32560 52 Self-emp-inc 287927 HS-grad 9 \n", "\n", " marital_status occupation relationship race \\\n", "0 Never-married Adm-clerical Not-in-family White \n", "1 Married-civ-spouse Exec-managerial Husband White \n", "2 Divorced Handlers-cleaners Not-in-family White \n", "3 Married-civ-spouse Handlers-cleaners Husband Black \n", "4 Married-civ-spouse Prof-specialty Wife Black \n", "... ... ... ... ... \n", "32556 Married-civ-spouse Tech-support Wife White \n", "32557 Married-civ-spouse Machine-op-inspct Husband White \n", "32558 Widowed Adm-clerical Unmarried White \n", "32559 Never-married Adm-clerical Own-child White \n", "32560 Married-civ-spouse Exec-managerial Wife White \n", "\n", " gender capital_gain capital_loss hours_per_week native_country \\\n", "0 Male 2174 0 40 United-States \n", "1 Male 0 0 13 United-States \n", "2 Male 0 0 40 United-States \n", "3 Male 0 0 40 United-States \n", "4 Female 0 0 40 Cuba \n", "... ... ... ... ... ... \n", "32556 Female 0 0 38 United-States \n", "32557 Male 0 0 40 United-States \n", "32558 Female 0 0 40 United-States \n", "32559 Male 0 0 20 United-States \n", "32560 Female 15024 0 40 United-States \n", "\n", " annual_income \n", "0 <=50K \n", "1 <=50K \n", "2 <=50K \n", "3 <=50K \n", "4 <=50K \n", "... ... 
\n", "32556 <=50K \n", "32557 >50K \n", "32558 <=50K \n", "32559 <=50K \n", "32560 >50K \n", "\n", "[32561 rows x 15 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adult_names = ['age','workclass','id','education','education_num','marital_status',\n", " 'occupation', 'relationship','race','gender','capital_gain',\n", " 'capital_loss','hours_per_week','native_country','annual_income']\n", "adults_data = (\n", " pd\n", " .read_csv(\n", " 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', \n", " names = adult_names\n", " )\n", ")\n", "adults_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculating mean\n", "\n", "First, we can calculate the average age in each income group and see if there is any difference.\n", "* Start with the adults data that was loaded above\n", "* Group using the _annual\_income_ value\n", "* Take only the _age_ value\n", "* Calculate the average of each group" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "annual_income\n", " <=50K 36.783738\n", " >50K 44.249841\n", "Name: age, dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " adults_data\n", " .groupby('annual_income')\n", " ['age']\n", " .mean()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the average age of people who earned more than 50K is just over 44, while the average age of the lower-income group (50K or less) is younger, just under 37." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Group by multiple keys\n", "\n", "Let's see if women earn less than men by grouping by both the income group (_annual\_income_) and the _gender_ column.\n", "\n", "* Start with the adults data that was loaded above\n", "* Group using the _annual\_income_ and the _gender_ values \n", "* Calculate the 
size of each group" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "annual_income gender \n", " <=50K Female 9592\n", " Male 15128\n", " >50K Female 1179\n", " Male 6662\n", "dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " adults_data\n", " .groupby(['annual_income','gender'])\n", " .size()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is hard to evaluate the difference between the groups from the raw sizes alone. Let's calculate the percentage of each gender within each income group and format the output to make the results easier to read.\n", "\n", "* Start with the adults data that was loaded above\n", "* Group using the _annual\_income_ and _gender_ values\n", "* Take only the _id_ value\n", "* Count the rows in each group\n", "* Group again by the first level (_annual\_income_)\n", "* For each row in each group, calculate the ratio of the row's count to the total count of the group\n", "* Apply style to the results\n", "* Format the results as percentages with two-digit precision. " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
  id
annual_incomegender 
<=50K Female38.80%
Male61.20%
>50K Female15.04%
Male84.96%
\n" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " adults_data\n", " .groupby(['annual_income','gender'])\n", " [['id']]\n", " .count()\n", " .groupby(level=0)\n", " .transform(lambda x: x / x.sum())\n", " .style\n", " .format('{:.2%}')\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now easily see that the share of women in the higher-income group (15.04%) is less than half their share in the lower-income group (38.80%). Hopefully, more recent data (this data was extracted from the 1994 Census database) would show improved gender equality in this respect." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Advanced Group Functions\n", "\n", "We saw before how to apply a function to each row in a dataframe table, but many times we want to apply a function to a group of rows based on the value of one or more columns.\n", "\n", "We can also take each of the groups and calculate more advanced functions such as correlation. Let's see if we find a correlation within each group between years of education and the number of hours worked per week (which we use as a rough indicator of hard work)\n", "\n", "* Start with the adults data that was loaded above\n", "* Take the _annual\_income_ and _gender_ values with the _education\_num_ and _hours\_per\_week_ values\n", "* Group by the _annual\_income_ and _gender_ values\n", "* Calculate the correlation of the values in each group\n", "* Replace the self-correlation values of 1 with 0 \n", "* Apply style to the results\n", "* Format the results with three significant digits. 
\n", "* Color the background of the cells according to the correlation values" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
   education_numhours_per_week
annual_incomegender   
<=50K Femaleeducation_num0.00.151
hours_per_week0.1510.0
Maleeducation_num0.00.069
hours_per_week0.0690.0
>50K Femaleeducation_num0.00.187
hours_per_week0.1870.0
Maleeducation_num0.00.0477
hours_per_week0.04770.0
\n" ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " adults_data\n", " [['annual_income','gender','education_num','hours_per_week']]\n", " .groupby(['annual_income','gender'])\n", " .corr()\n", " .replace(1,0)\n", " .style\n", " .format('{:.3}')\n", " .background_gradient(cmap='coolwarm')\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that within each income group, the correlation between years of education and hours worked per week (our rough indicator of \"hard work\") is higher for women (0.151 and 0.187) than for men (0.069 and 0.0477). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": [], "nbformat": 4, "nbformat_minor": 4 }