{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Linear Programming (Optimization) \n", "\n", "In this example, we will use an common analysis that is used to optimize decision based on constrains, and it is based on the great blog post [How to Solve Optimization Problems with Python](https://towardsdatascience.com/how-to-solve-optimization-problems-with-python-9088bf8d48e5)\n", "\n", "[![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/aiola-lab/from-excel-to-pandas/blob/master/notebooks/08.03_Linear_optimization.ipynb)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the data\n", "\n", "We will get the data from HTML table that are available from [Rotoguru](http://rotoguru.net), that is gather stats data from fantasy league. We will focus on the data of the NBA players. To keep the consistency of the notebook, we will check the data for a specific week ('2021-03-02')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "year = 2021\n", "month = 3\n", "day = 2" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "data_url = f'http://rotoguru1.com/cgi-bin/hyday.pl?mon={month}&day={day}&year={year}&game=fd'\n", "dfs = (\n", " pd\n", " .read_html(data_url)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a few HTML table in the page, and after some scorlling we can see that the table with the players data is the 6th one (index=5):\n", "* Start with the dataframe at index 5 of the list of table from the HTML page above\n", "* Rename the column to be: Position, Name,FD Points, Salary, Team, Opp., Score, Min, Stats\n", "* Filter out rows with Position string longer than 2 characters (header lines)\n", "* Add a column called _Points_ with the float value of the points to be used as the target of the optimization \n", "* Remove the dollar sign and commas (\\[$,\\]) from the Salary column and convert it to its integer value\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PositionNameFD PointsSalaryTeamOpp.ScoreMinStatsposition_lengthPoints
2PGMorant, Ja^557100mem@ was125-11133:0235pt 5rb 10as 1st 4to 2trey 11-18fg 11-14ft255.0
3PGWestbrook, Russell^53.79600wasv mem111-12535:3123pt 6rb 15as 3st 8to 3trey 8-16fg 4-9ft253.7
4SGGeorge, Paul^457900lac@ bos112-11738:3932pt 5rb 4as 1st 2to 5trey 12-26fg 3-3ft245.0
5SGDeRozan, DeMar^43.38200sasv nyk119-9329:4410pt 4rb 11as 4st 2-5fg 6-6ft243.3
6PGMurray, Jamal^41.68500den@ mil128-9737:5924pt 3rb 6as 2st 1to 1trey 10-17fg 3-3ft241.6
....................................
199CRobinson, Mitchell04500nyk@ sas93-119NaNNaN10.0
200CBryant, Thomas06200wasv mem111-125NaNNaN10.0
201CLeonard, Meyers03500miav atl80-94NaNNaN10.0
202CGasol, Marc04800lalv pho104-114NaNNaN10.0
203COturu, Daniel03500lac@ bos112-117NaNNaN10.0
\n", "

198 rows × 11 columns

\n", "
" ], "text/plain": [ " Position Name FD Points Salary Team Opp. Score \\\n", "2 PG Morant, Ja^ 55 7100 mem @ was 125-111 \n", "3 PG Westbrook, Russell^ 53.7 9600 was v mem 111-125 \n", "4 SG George, Paul^ 45 7900 lac @ bos 112-117 \n", "5 SG DeRozan, DeMar^ 43.3 8200 sas v nyk 119-93 \n", "6 PG Murray, Jamal^ 41.6 8500 den @ mil 128-97 \n", ".. ... ... ... ... ... ... ... \n", "199 C Robinson, Mitchell 0 4500 nyk @ sas 93-119 \n", "200 C Bryant, Thomas 0 6200 was v mem 111-125 \n", "201 C Leonard, Meyers 0 3500 mia v atl 80-94 \n", "202 C Gasol, Marc 0 4800 lal v pho 104-114 \n", "203 C Oturu, Daniel 0 3500 lac @ bos 112-117 \n", "\n", " Min Stats position_length \\\n", "2 33:02 35pt 5rb 10as 1st 4to 2trey 11-18fg 11-14ft 2 \n", "3 35:31 23pt 6rb 15as 3st 8to 3trey 8-16fg 4-9ft 2 \n", "4 38:39 32pt 5rb 4as 1st 2to 5trey 12-26fg 3-3ft 2 \n", "5 29:44 10pt 4rb 11as 4st 2-5fg 6-6ft 2 \n", "6 37:59 24pt 3rb 6as 2st 1to 1trey 10-17fg 3-3ft 2 \n", ".. ... ... ... \n", "199 NaN NaN 1 \n", "200 NaN NaN 1 \n", "201 NaN NaN 1 \n", "202 NaN NaN 1 \n", "203 NaN NaN 1 \n", "\n", " Points \n", "2 55.0 \n", "3 53.7 \n", "4 45.0 \n", "5 43.3 \n", "6 41.6 \n", ".. ... \n", "199 0.0 \n", "200 0.0 \n", "201 0.0 \n", "202 0.0 \n", "203 0.0 \n", "\n", "[198 rows x 11 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_players = (\n", " dfs[5]\n", " .rename(\n", " columns={\n", " 0:'Position',\n", " 1:'Name',\n", " 2:'FD Points',\n", " 3:'Salary',\n", " 4:'Team',\n", " 5:'Opp.',\n", " 6:'Score',\n", " 7:'Min',\n", " 8:'Stats'\n", " }\n", " )\n", " .assign(position_length = lambda x : x.Position.str.len())\n", " .query('position_length <= 2')\n", " .assign(Points = lambda x : x['FD Points'].astype(float))\n", " .assign(Salary = lambda x : x.Salary.str.replace('[$,]','', regex=True).astype(int))\n", ")\n", "all_players" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installing the LP Library\n", "\n", "Many times you can find a python library that is implementing many of the fucntions that you need. This is another major benefit of Python over close tools such as Excel." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/opt/conda/lib/python3.7/site-packages/secretstorage/dhcrypto.py:16: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead\n", " from cryptography.utils import int_from_bytes\n", "/opt/conda/lib/python3.7/site-packages/secretstorage/util.py:25: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead\n", " from cryptography.utils import int_from_bytes\n", "Requirement already satisfied: pulp in /opt/conda/lib/python3.7/site-packages (2.6.0)\n", "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "pip install pulp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Defining the problem variables\n", "\n", "The most complicated part of the analysis is to understand how to map it into the domain of the algorithm. \n", "\n", "The current problem that we want to solve using Linear Programming (LP) is what is the best roaster of NBA players we want to choose for our fantasy league. The output is this case is the list of the players names. \n", "\n", "We will create a list of all the players and assign them either 1 (selected to the team) or 0 (not selected)\n", "* Create a Linear Programming variables\n", "* from a dictionary\n", "* with no prefix to the names\n", "* and index of $x_i$ as the player names\n", "* and the value is an Integer\n", "* that is either 0 (not selected)\n", "* or 1 (selected)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from pulp import *\n", "\n", "player_vars = (\n", " LpVariable\n", " .dicts(\n", " '', \n", " list(all_players['Name']), \n", " cat='Integer',\n", " lowBound=0, \n", " upBound=1 \n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Problem target\n", "\n", "In the fantasy league games, we want to maximize the number of points that the player that we choose will score in the next game. \n", "\n", "Let's start with defining the LP problem as a maximization problem:\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "total_score = LpProblem('Fantasy Points Problem', LpMaximize)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Variable Dictionaries\n", "\n", "The library is expecting to get a set of dictionaries each with the name of the player as the common index and the numeric value for each. We will do that for the Points (the target to maximize), the Positions (that we need to pick a couple of each position), and the Salaries (that we need to cap to a specific limit)." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "points = (\n", " all_players\n", " .set_index('Name')\n", " .to_dict()\n", " ['Points']\n", ")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "salaries = (\n", " all_players\n", " .set_index('Name')\n", " .to_dict()\n", " ['Salary']\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Linear Programming (LP) Definition\n", "\n", "The LP problem is defined as a set of linear expressions of the form:\n", "\n", "$𝑎_1𝑥_1+𝑎_2𝑥_2+𝑎_3𝑥_3+...𝑎_𝑛𝑥_𝑛 {<=,=,>=} 𝑏$\n", "\n", "Where $x_1, x_2 ... x_n$ are the _decision_ variables we want to choose (the player names), and $a_1, a_2, ... a_n$ are the _objective_ and _constrains_ variables we have in our data.\n", "\n", "First, we add the LP sum of the points where $a_i$ is the points each player scored, and $x_i$ is the player:\n", "\n", "total score = $\\sum p_i*x_i$\n", "\n", "- $p_i$ - points scored by player$_i$\n", "- $ x_i = \\begin{cases} \n", " 1 & \\text{if player in team} \\\\\n", " 0 & \\text{if player not in team} \n", " \\end{cases}\n", "$\n", "\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "total_score += (\n", " lpSum(\n", " [\n", " points[player] * player_vars[player] \n", " for player in player_vars\n", " ]\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a real application we will want to have the projected number of points and not the sum of past points, and we will calculated a forecasting model that can identify trends or recent injuries. But for this simple example, we will take that sum as a good estimator for their expected points in the next game. \n", "\n", "Second, we add the constrain the salaries where $a_i$ is the salary of each player, and $x_i$ is the player, and the constrain value for the sum of all the salaries of the players we choose not to exceed $60,000:\n", "\n", "$\\sum s_i*x_i <= 60,000$\n", "\n", "- $s_i$ - salary of player$_i$\n", "- $ x_i = \\begin{cases} \n", " 1 & \\text{if player in team} \\\\\n", " 0 & \\text{if player not in team} \n", " \\end{cases}\n", "$\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "total_score += (\n", " lpSum(\n", " [\n", " salaries[player] * player_vars[player] \n", " for player in player_vars\n", " ]\n", " ) <= 60_000\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Last, we want to make sure that we have the right number of players in each position in the fantasy team. We will create a list of players for each of the positions: Point Guard (PG), Shooting Guard (SG), Small Forward (SF), Power Forward (PF), and Center (C)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "pg = (\n", " all_players\n", " .query(\"Position == 'PG'\")\n", " ['Name']\n", " .to_list()\n", ")\n", "sg = (\n", " all_players\n", " .query(\"Position == 'SG'\")\n", " ['Name']\n", " .to_list()\n", ")\n", "sf = (\n", " all_players\n", " .query(\"Position == 'SF'\")\n", " ['Name']\n", " .to_list()\n", ")\n", "pf = (\n", " all_players\n", " .query(\"Position == 'PF'\")\n", " ['Name']\n", " .to_list()\n", ")\n", "c = (\n", " all_players\n", " .query(\"Position == 'C'\")\n", " ['Name']\n", " .to_list()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And add the constrains to have exactly two PG, two SG, two SF, two PF and one C:\n", "\n", "$\\sum x_i = 2 , \\text{if player is in PG, SG, SF, PF} \\\\\n", " \\sum x_i = 1 , \\text{if player is in C} \n", " $\n", " - $ x_i = \\begin{cases} \n", " 1 & \\text{if player in team} \\\\\n", " 0 & \\text{if player not in team} \n", " \\end{cases}\n", "$" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# Set Constraints\n", "total_score += lpSum([player_vars[player] for player in pg]) == 2\n", "total_score += lpSum([player_vars[player] for player in sg]) == 2\n", "total_score += lpSum([player_vars[player] for player in sf]) == 2\n", "total_score += lpSum([player_vars[player] for player in pf]) == 2\n", "total_score += lpSum([player_vars[player] for player in c]) == 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### LP Solution\n", "\n", "Finally, we can solve the set of linear expressions: " ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "total_score.solve()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And filter the list of the possible players to the ones that were selected (their variable value is 1):" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[_Barton,_Will^,\n", " _George,_Paul^,\n", " _James,_LeBron^,\n", " _Jokic,_Nikola^,\n", " _Lyles,_Trey^,\n", " _Melton,_De'Anthony,\n", " _Morant,_Ja^,\n", " _Quickley,_Immanuel,\n", " _Saric,_Dario]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "optimized_roster = ( \n", " list(\n", " filter(\n", " lambda x : x.varValue == 1, \n", " total_score.variables()\n", " )\n", " )\n", ")\n", "optimized_roster" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will extract the name of the player from the variables of the linear optimizer, and clean up the \"strange\" characters." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "optimized_roster_names = [\n", " name\n", " .name\n", " .replace('_',' ')\n", " .strip() \n", " for name in optimized_roster\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The list of the players above is changing every week based on their recent performance, and we can execute this analysis every time before we define our fantasy team for the next round. \n", "\n", "We can also get the table that we used as the input for the optmization process and filter it show more details on each of our players. We will also calculate the sum of the salaries to verify that we are below the maximum budget of 60,000." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Salary
NamePositionFD Points
Barton, Will^SF35.14900
George, Paul^SG457900
James, LeBron^SF5910800
Jokic, Nikola^C69.511000
Lyles, Trey^PF334000
Melton, De'AnthonyPG38.34600
Morant, Ja^PG557100
Quickley, ImmanuelSG37.85100
Saric, DarioPF32.54100
Total59500
\n", "
" ], "text/plain": [ " Salary\n", "Name Position FD Points \n", "Barton, Will^ SF 35.1 4900\n", "George, Paul^ SG 45 7900\n", "James, LeBron^ SF 59 10800\n", "Jokic, Nikola^ C 69.5 11000\n", "Lyles, Trey^ PF 33 4000\n", "Melton, De'Anthony PG 38.3 4600\n", "Morant, Ja^ PG 55 7100\n", "Quickley, Immanuel SG 37.8 5100\n", "Saric, Dario PF 32.5 4100\n", "Total 59500" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " all_players\n", " .query('Name in @optimized_roster_names')\n", " [['Position','Name','FD Points','Salary']]\n", " .pivot_table(\n", " index=['Name','Position','FD Points'],\n", " margins=True,\n", " margins_name='Total', \n", " aggfunc=sum\n", " )\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }