Linear Programming (Optimization)¶

In this example, we will use an common analysis that is used to optimize decision based on constrains, and it is based on the great blog post How to Solve Optimization Problems with Python

import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

Loading the data¶

We will get the data from HTML table that are available from Rotoguru, that is gather stats data from fantasy league. We will focus on the data of the NBA players. To keep the consistency of the notebook, we will check the data for a specific week (‘2021-03-02’)

year = 2021
month = 3
day = 2

data_url = f'http://rotoguru1.com/cgi-bin/hyday.pl?mon={month}&day={day}&year={year}&game=fd'
dfs = (
    pd
    .read_html(data_url)
)

There are a few HTML table in the page, and after some scorlling we can see that the table with the players data is the 6th one (index=5):

Start with the dataframe at index 5 of the list of table from the HTML page above
Rename the column to be: Position, Name,FD Points, Salary, Team, Opp., Score, Min, Stats
Filter out rows with Position string longer than 2 characters (header lines)
Add a column called Points with the float value of the points to be used as the target of the optimization
Remove the dollar sign and commas ([$,]) from the Salary column and convert it to its integer value

all_players = (
    dfs[5]
    .rename(
        columns={
            0:'Position',
            1:'Name',
            2:'FD Points',
            3:'Salary',
            4:'Team',
            5:'Opp.',
            6:'Score',
            7:'Min',
            8:'Stats'
            }
    )
    .assign(position_length = lambda x : x.Position.str.len())
    .query('position_length <= 2')
    .assign(Points = lambda x : x['FD Points'].astype(float))
    .assign(Salary = lambda x : x.Salary.str.replace('[$,]','', regex=True).astype(int))
)
all_players

	Position	Name	FD Points	Salary	Team	Opp.	Score	Min	Stats	position_length	Points
2	PG	Morant, Ja^	55	7100	mem	@ was	125-111	33:02	35pt 5rb 10as 1st 4to 2trey 11-18fg 11-14ft	2	55.0
3	PG	Westbrook, Russell^	53.7	9600	was	v mem	111-125	35:31	23pt 6rb 15as 3st 8to 3trey 8-16fg 4-9ft	2	53.7
4	SG	George, Paul^	45	7900	lac	@ bos	112-117	38:39	32pt 5rb 4as 1st 2to 5trey 12-26fg 3-3ft	2	45.0
5	SG	DeRozan, DeMar^	43.3	8200	sas	v nyk	119-93	29:44	10pt 4rb 11as 4st 2-5fg 6-6ft	2	43.3
6	PG	Murray, Jamal^	41.6	8500	den	@ mil	128-97	37:59	24pt 3rb 6as 2st 1to 1trey 10-17fg 3-3ft	2	41.6
...	...	...	...	...	...	...	...	...	...	...	...
199	C	Robinson, Mitchell	0	4500	nyk	@ sas	93-119	NaN	NaN	1	0.0
200	C	Bryant, Thomas	0	6200	was	v mem	111-125	NaN	NaN	1	0.0
201	C	Leonard, Meyers	0	3500	mia	v atl	80-94	NaN	NaN	1	0.0
202	C	Gasol, Marc	0	4800	lal	v pho	104-114	NaN	NaN	1	0.0
203	C	Oturu, Daniel	0	3500	lac	@ bos	112-117	NaN	NaN	1	0.0

198 rows × 11 columns

Installing the LP Library¶

Many times you can find a python library that is implementing many of the fucntions that you need. This is another major benefit of Python over close tools such as Excel.

pip install pulp

Requirement already satisfied: pulp in /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages (2.6.0)

WARNING: There was an error checking the latest version of pip.

Note: you may need to restart the kernel to use updated packages.

Defining the problem variables¶

The most complicated part of the analysis is to understand how to map it into the domain of the algorithm.

The current problem that we want to solve using Linear Programming (LP) is what is the best roaster of NBA players we want to choose for our fantasy league. The output is this case is the list of the players names.

We will create a list of all the players and assign them either 1 (selected to the team) or 0 (not selected)

Create a Linear Programming variables
from a dictionary
with no prefix to the names
and index of $x_i$ as the player names
and the value is an Integer
that is either 0 (not selected)
or 1 (selected)

from pulp import *

player_vars = (
    LpVariable
    .dicts(
        '', 
        list(all_players['Name']), 
        cat='Integer',
        lowBound=0, 
        upBound=1 
    )
)

Problem target¶

In the fantasy league games, we want to maximize the number of points that the player that we choose will score in the next game.

Let’s start with defining the LP problem as a maximization problem:

total_score = LpProblem('Fantasy Points Problem', LpMaximize)

Variable Dictionaries¶

The library is expecting to get a set of dictionaries each with the name of the player as the common index and the numeric value for each. We will do that for the Points (the target to maximize), the Positions (that we need to pick a couple of each position), and the Salaries (that we need to cap to a specific limit).

points = (
    all_players
    .set_index('Name')
    .to_dict()
    ['Points']
)

salaries = (
    all_players
    .set_index('Name')
    .to_dict()
    ['Salary']
)

Linear Programming (LP) Definition¶

The LP problem is defined as a set of linear expressions of the form:

$𝑎_1𝑥_1+𝑎_2𝑥_2+𝑎_3𝑥_3+...𝑎_𝑛𝑥_𝑛 {<=,=,>=} 𝑏$

Where $x_1, x_2 ... x_n$ are the decision variables we want to choose (the player names), and $a_1, a_2, ... a_n$ are the objective and constrains variables we have in our data.

First, we add the LP sum of the points where $a_i$ is the points each player scored, and $x_i$ is the player:

total score = $\sum p_i*x_i$

$p_i$ - points scored by player$_i$
$ x_i = \begin{cases} 1 & \text{if player in team} \\ 0 & \text{if player not in team} \end{cases} $

total_score += (
    lpSum(
        [
            points[player] * player_vars[player] 
            for player in player_vars
        ]
    )
)

In a real application we will want to have the projected number of points and not the sum of past points, and we will calculated a forecasting model that can identify trends or recent injuries. But for this simple example, we will take that sum as a good estimator for their expected points in the next game.

Second, we add the constrain the salaries where $a_i$ is the salary of each player, and $x_i$ is the player, and the constrain value for the sum of all the salaries of the players we choose not to exceed $60,000:

$\sum s_i*x_i <= 60,000$

$s_i$ - salary of player$_i$
$ x_i = \begin{cases} 1 & \text{if player in team} \\ 0 & \text{if player not in team} \end{cases} $

total_score += (
    lpSum(
        [
            salaries[player] * player_vars[player] 
            for player in player_vars
        ]
    ) <= 60_000
)

Last, we want to make sure that we have the right number of players in each position in the fantasy team. We will create a list of players for each of the positions: Point Guard (PG), Shooting Guard (SG), Small Forward (SF), Power Forward (PF), and Center (C)

pg = (
    all_players
    .query("Position == 'PG'")
    ['Name']
    .to_list()
)
sg = (
    all_players
    .query("Position == 'SG'")
    ['Name']
    .to_list()
)
sf = (
    all_players
    .query("Position == 'SF'")
    ['Name']
    .to_list()
)
pf = (
    all_players
    .query("Position == 'PF'")
    ['Name']
    .to_list()
)
c = (
    all_players
    .query("Position == 'C'")
    ['Name']
    .to_list()
)

And add the constrains to have exactly two PG, two SG, two SF, two PF and one C:

$\sum x_i = 2 , \text{if player is in PG, SG, SF, PF} \\ \sum x_i = 1 , \text{if player is in C} $

$ x_i = \begin{cases} 1 & \text{if player in team} \\ 0 & \text{if player not in team} \end{cases} $

# Set Constraints
total_score += lpSum([player_vars[player] for player in pg]) == 2
total_score += lpSum([player_vars[player] for player in sg]) == 2
total_score += lpSum([player_vars[player] for player in sf]) == 2
total_score += lpSum([player_vars[player] for player in pf]) == 2
total_score += lpSum([player_vars[player] for player in c]) == 1

LP Solution¶

Finally, we can solve the set of linear expressions:

total_score.solve()

Welcome to the CBC MILP Solver 
Version: 2.10.3 
Build Date: Dec 15 2019 

command line - /opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/pulp/apis/../solverdir/cbc/linux/64/cbc /tmp/b3c16ae7c75f40999454a5bc4373562a-pulp.mps max timeMode elapsed branch printingOptions all solution /tmp/b3c16ae7c75f40999454a5bc4373562a-pulp.sol (default strategy 1)
At line 2 NAME          MODEL
At line 3 ROWS
At line 11 COLUMNS
At line 927 RHS
At line 934 BOUNDS
At line 1133 ENDATA
Problem MODEL has 6 rows, 198 columns and 396 elements
Coin0008I MODEL read with 0 errors
Option for timeMode changed from cpu to elapsed
Continuous objective value is 406.74 - 0.00 seconds
Cgl0003I 0 fixed, 2 tightened bounds, 0 strengthened rows, 0 substitutions
Cgl0003I 0 fixed, 1 tightened bounds, 0 strengthened rows, 0 substitutions
Cgl0004I processed model has 6 rows, 145 columns (145 integer (139 of which binary)) and 290 elements
Cutoff increment increased from 1e-05 to 0.0999
Cbc0038I Initial state - 2 integers unsatisfied sum - 0.2
Cbc0038I Solution found of -405.2
Cbc0038I Cleaned solution of -405.2
Cbc0038I Before mini branch and bound, 143 integers at bound fixed and 0 continuous
Cbc0038I Full problem 6 rows 145 columns, reduced to 0 rows 0 columns
Cbc0038I Mini branch and bound did not improve solution (0.00 seconds)
Cbc0038I Round again with cutoff of -405.5
Cbc0038I Reduced cost fixing fixed 138 variables on major pass 2
Cbc0038I Pass   1: suminf.    0.03896 (2) obj. -405.5 iterations 1
Cbc0038I Pass   2: suminf.    0.20000 (2) obj. -406.74 iterations 4
Cbc0038I Pass   3: suminf.    0.20000 (2) obj. -406.74 iterations 1
Cbc0038I Pass   4: suminf.    0.20000 (2) obj. -406.74 iterations 1
Cbc0038I Pass   5: suminf.    0.20000 (2) obj. -406.74 iterations 1
Cbc0038I Pass   6: suminf.    0.03896 (2) obj. -405.5 iterations 1
Cbc0038I Pass   7: suminf.    0.20000 (2) obj. -406.74 iterations 1
Cbc0038I Pass   8: suminf.    0.03896 (2) obj. -405.5 iterations 1
Cbc0038I Pass   9: suminf.    0.03896 (2) obj. -405.5 iterations 0
Cbc0038I Pass  10: suminf.    0.20000 (2) obj. -406.74 iterations 2
Cbc0038I Pass  11: suminf.    0.20000 (2) obj. -406.74 iterations 1
Cbc0038I Pass  12: suminf.    0.03896 (2) obj. -405.5 iterations 1
Cbc0038I Pass  13: suminf.    0.03896 (2) obj. -405.5 iterations 0
Cbc0038I Pass  14: suminf.    0.03896 (2) obj. -405.5 iterations 0
Cbc0038I Pass  15: suminf.    0.03896 (2) obj. -405.5 iterations 0
Cbc0038I Pass  16: suminf.    0.03896 (2) obj. -405.5 iterations 0
Cbc0038I Pass  17: suminf.    0.20000 (2) obj. -406.74 iterations 2
Cbc0038I Pass  18: suminf.    0.03896 (2) obj. -405.5 iterations 2
Cbc0038I Pass  19: suminf.    0.20000 (2) obj. -406.74 iterations 1
Cbc0038I Pass  20: suminf.    0.20000 (2) obj. -406.74 iterations 1
Cbc0038I Pass  21: suminf.    0.03896 (2) obj. -405.5 iterations 2
Cbc0038I Pass  22: suminf.    0.20000 (2) obj. -406.74 iterations 1
Cbc0038I Pass  23: suminf.    0.03896 (2) obj. -405.5 iterations 2
Cbc0038I Pass  24: suminf.    0.03896 (2) obj. -405.5 iterations 1
Cbc0038I Pass  25: suminf.    0.03896 (2) obj. -405.5 iterations 0
Cbc0038I Pass  26: suminf.    0.03896 (2) obj. -405.5 iterations 0
Cbc0038I Pass  27: suminf.    0.03896 (2) obj. -405.5 iterations 0
Cbc0038I Pass  28: suminf.    0.03896 (2) obj. -405.5 iterations 0
Cbc0038I Pass  29: suminf.    0.03896 (2) obj. -405.5 iterations 0
Cbc0038I Pass  30: suminf.    0.03896 (2) obj. -405.5 iterations 0
Cbc0038I No solution found this major pass
Cbc0038I Before mini branch and bound, 143 integers at bound fixed and 0 continuous
Cbc0038I Mini branch and bound did not improve solution (0.01 seconds)
Cbc0038I After 0.01 seconds - Feasibility pump exiting with objective of -405.2 - took 0.00 seconds
Cbc0012I Integer solution of -405.2 found by feasibility pump after 0 iterations and 0 nodes (0.01 seconds)
Cbc0001I Search completed - best objective -405.2, took 0 iterations and 0 nodes (0.01 seconds)
Cbc0035I Maximum depth 0, 131 variables fixed on reduced cost
Cuts at root node changed objective from -406.74 to -405.2
Probing was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Gomory was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Knapsack was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Clique was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
MixedIntegerRounding2 was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
FlowCover was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
TwoMirCuts was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
ZeroHalf was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)

Result - Optimal solution found

Objective value:                405.20000000
Enumerated nodes:               0
Total iterations:               0
Time (CPU seconds):             0.00
Time (Wallclock seconds):       0.01

Option for printingOptions changed from normal to all
Total time (CPU seconds):       0.00   (Wallclock seconds):       0.01

And filter the list of the possible players to the ones that were selected (their variable value is 1):

optimized_roster = ( 
    list(
        filter(
            lambda x : x.varValue == 1, 
            total_score.variables()
        )
    )
)
optimized_roster

[_Barton,_Will^,
 _George,_Paul^,
 _James,_LeBron^,
 _Jokic,_Nikola^,
 _Lyles,_Trey^,
 _Melton,_De'Anthony,
 _Morant,_Ja^,
 _Quickley,_Immanuel,
 _Saric,_Dario]

We will extract the name of the player from the variables of the linear optimizer, and clean up the “strange” characters.

optimized_roster_names = [
    name
    .name
    .replace('_',' ')
    .strip() 
    for name in optimized_roster
]

The list of the players above is changing every week based on their recent performance, and we can execute this analysis every time before we define our fantasy team for the next round.

We can also get the table that we used as the input for the optmization process and filter it show more details on each of our players. We will also calculate the sum of the salaries to verify that we are below the maximum budget of 60,000.

(
    all_players
    .query('Name in @optimized_roster_names')
    [['Position','Name','FD Points','Salary']]
    .pivot_table(
        index=['Name','Position','FD Points'],
        margins=True,
        margins_name='Total', 
        aggfunc=sum
    )
)

			Salary
Name	Position	FD Points
Barton, Will^	SF	35.1	4900
George, Paul^	SG	45	7900
James, LeBron^	SF	59	10800
Jokic, Nikola^	C	69.5	11000
Lyles, Trey^	PF	33	4000
Melton, De'Anthony	PG	38.3	4600
Morant, Ja^	PG	55	7100
Quickley, Immanuel	SG	37.8	5100
Saric, Dario	PF	32.5	4100
Total			59500

From Excel to Pandas

Linear Programming (Optimization)

Contents