---
title: "NBA Daily Fantasy Exploration"
author: "Ian Whitestone"
date: "January 17, 2017"
output:
  html_document:
    fig_width: 9
    toc: true
    toc_depth: 3
    toc_float: true
---
```{r setup, include=FALSE}
```
## <a name="introduction"></a>Introduction
[Daily fantasy sports](https://en.wikipedia.org/wiki/Daily_fantasy_sports) (DFS) are a subset of fantasy sports in which participants construct lineups based on the games occurring on a given day. Lineups are subject to various constraints, such as a salary cap and a minimum number of players at each position. As with other fantasy sports, players' fantasy points are based on their actual performance in real life. As a result, a key component of being a successful DFS player is the ability to project a player's total points for a given night.
This study focuses on NBA DFS, with a particular emphasis on examining factors that influence a player's score. The NBA data used in the study is sourced from Erik Berg's [API](https://erikberg.com/api) and covers the 2012-2013 season through the 2015-2016 season. Fantasy points are calculated using Fanduel's scoring system, described below. Unlike other sports, all positions in the NBA are scored using the same system.
* Point = 1pt
* Rebound = 1.2pts
* Assist = 1.5pts
* Block = 2pts
* Steal = 2pts
* Turnover = -1pt
[Source](https://www.fanduel.com/rules)
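As a quick worked example of the scoring system above, the sketch below wraps the weights in a small helper. The function name `fd_points` and the stat line are hypothetical and used only for illustration; the weights are the ones listed above.

```r
# Hypothetical helper: applies the scoring weights listed above to one stat line
fd_points <- function(points, rebounds, assists, blocks, steals, turnovers) {
  1 * points + 1.2 * rebounds + 1.5 * assists +
    2 * blocks + 2 * steals - 1 * turnovers
}

# Made-up stat line: 20 points, 10 rebounds, 5 assists, 1 block, 2 steals, 3 turnovers
example_fd <- fd_points(20, 10, 5, 1, 2, 3)
example_fd  # 20 + 12 + 7.5 + 2 + 4 - 3 = 42.5
```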
## <a name="data_import"></a>Data Import & Cleaning
Two csv files were created using Erik Berg's API. The data used in this project was extracted using several Python modules I created. The Python modules make the necessary API requests to retrieve the NBA events for each day and then the corresponding boxscore data for each event. The data is parsed, cleaned and formatted for insertion into a locally hosted Postgres database. The Python modules and corresponding database schema can be viewed on my github page under the [XML Data Repo](https://github.com/ian-whitestone/xml-data).
The player_data file contains a record for each player who recorded any statistics during a game. The event_data file contains a record for each game, identified by a unique gameID. This gameID can be used to join the two data sources.
Before producing any plots or analysis, the data is imported and cleaned.
```{r eval=TRUE, message=FALSE, warning=FALSE,results='hide'}
##import packages
library(data.table)
library(dplyr)
library(dtplyr)
library(tidyr)
library(ggplot2)
source("dlin.R")
library(RColorBrewer)
library(reshape2)
library(corrplot)
source("roll_variable.R")
library(RcppRoll)
library(rms)
source("multiplot.R")
library(plyr)
palette = brewer.pal("YlGnBu", n=9)
##read in data
player_data = read.csv("data/player_data.csv")
event_data = read.csv("data/event_data.csv")
##convert to data.tables
event_data = as.data.table(event_data)
player_data = as.data.table(player_data)
#rename and drop unnecessary columns
player_data[ ,c("X","sport") := NULL]
event_data[,X:=NULL]
setnames(player_data,old = c('X3FGA','X3FGM'),new = c('3FGA','3FGM'))
#get rid of duplicate rows in event_data
setkey(event_data,gameID)
event_data = unique(event_data)
#convert positions coded as 'F' to 'SF', 'G' to 'SG'
player_data[position == 'F',position:= 'SF']
player_data[position == 'G',position:= 'SG']
```
## <a name="features"></a>Feature Engineering
Good predictions rely not only on a good model, but also on informative features that capture the underlying effects.
As the original data only contains base-level statistics, the first step involves creating a variable for the Fanduel fantasy points. Other data cleaning and merging steps are documented below.
```{r eval=TRUE, message=FALSE, warning=FALSE,results='hide'}
##calculate fanduel points
player_data[,fd:=1*points+1.2*rebounds+1.5*assists+2*blocks+2*steals-1*turnovers]
##create a season variable that can be used to distinguish between different seasons
player_data[,date:=as.Date(date)] ##convert string date to actual date
player_data[,date_num:=as.numeric(date)]
player_data[,season_code:=20122013]
player_data[date >= '2013-10-28' & date <= '2014-06-16', season_code:= 20132014]
player_data[date >= '2014-10-28' & date <= '2015-06-16', season_code:= 20142015]
player_data[date >= '2015-10-27' & date <= '2016-06-19', season_code:= 20152016]
event_data = event_data %>% join(player_data[,.N,by=.(gameID,date)][,.(gameID,date)])
event_data[,season_code:=20122013]
event_data[date >= '2013-10-28' & date <= '2014-06-16', season_code:=20132014]
event_data[date >= '2014-10-28' & date <= '2015-06-16', season_code:=20142015]
event_data[date >= '2015-10-27' & date <= '2016-06-19', season_code:=20152016]
##filter out players who didn't play
player_data = filter(player_data,minutes > 0)
##get team info
team_data=event_data[,.(gameID,home_team,away_team)]
setkey(player_data,gameID)
setkey(team_data,gameID)
player_data = player_data[team_data,nomatch = 0] ##merge the team data to the player_data table
##create home/away variable
player_data[,homeaway:= 1]
player_data[team==away_team,homeaway:= 0]
```
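Next, a function is created to calculate rolling averages for any statistic. The function takes in a data frame, a target field for the rolling averages, and a window variable in the form seq(starting_window,max_window,increment). Each average is lagged by one game, so it only uses games played before the current one. Where a longer window is not yet available, the value from the next shorter window is used instead.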
```{r eval=TRUE, message=FALSE, warning=FALSE,results='hide'}
roll_variable_mean = function(d, target, windows) {
  require(dplyr)
  require(lazyeval)
  ##build one lagged rolling-mean expression per window size
  exprl = list()
  i = 1
  for (x in windows) {
    exprl[[i]] = interp(~ lag(roll_mean(tar, w, align = 'right', fill = NA), 1),
                        tar=as.name(target), w=x)
    i = i + 1
  }
  names = paste(target, windows, sep="_")
  exprl = setNames(exprl, names)
  d = mutate_(d, .dots = exprl)
  ##where a longer window is still NA, fall back to the next shorter window
  ##(seq_along(names)[-1] is empty when only one window is supplied)
  for (n in seq_along(names)[-1]) {
    expr = interp(~ ifelse(is.na(long), short, long), long=as.name(names[n]),
                  short=as.name(names[n-1]))
    exprl = setNames(list(expr), names[n])
    d = mutate_(d, .dots = exprl)
  }
  return(d)
}
##add rolling variables
window_size = seq(5,55,10)
player_data = player_data %>% group_by(player) %>% arrange(date) %>%
roll_variable_mean(., 'fd', window_size)
player_data = player_data %>% group_by(player) %>% arrange(date) %>%
roll_variable_mean(., 'minutes', window_size)
player_data = player_data %>% group_by(player) %>% arrange(date) %>%
roll_variable_mean(., 'FGA', window_size)
player_data = player_data %>% group_by(player) %>% arrange(date) %>%
roll_variable_mean(., 'FTA', window_size)
window_size = seq(1,3,1)
player_data = player_data %>% group_by(player) %>% arrange(date) %>%
roll_variable_mean(., 'fd', window_size)
player_data = player_data %>% group_by(player) %>% arrange(date) %>%
roll_variable_mean(., 'minutes', window_size)
player_data = player_data %>% group_by(player) %>% arrange(date) %>%
roll_variable_mean(., 'FGA', window_size)
player_data = player_data %>% group_by(player) %>% arrange(date) %>%
roll_variable_mean(., 'FTA', window_size)
```
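To make the behaviour of the rolling features concrete, the following base-R sketch reproduces the two steps of roll_variable_mean on a toy vector: a rolling mean lagged by one game, followed by back-filling the longer window from the shorter one. The helper `lagged_roll_mean` and the sample values are hypothetical, and the loop stands in for the RcppRoll call used in the report.

```r
# Hypothetical helper: value at game i is the mean of the w games before game i
lagged_roll_mean <- function(x, w) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n > w) {
    for (i in (w + 1):n) {
      out[i] <- mean(x[(i - w):(i - 1)])
    }
  }
  out
}

fd <- c(10, 20, 30, 40, 50, 60)     # made-up FD points for six consecutive games
short <- lagged_roll_mean(fd, 2)    # NA NA 15 25 35 45
long  <- lagged_roll_mean(fd, 4)    # NA NA NA NA 25 35

# Back-fill: where the longer window is still NA, use the shorter window
long_filled <- ifelse(is.na(long), short, long)   # NA NA 15 25 25 35
```

Note that the shortest window is never back-filled, so a player's first few games still carry NA values for every rolling feature.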
A practical hypothesis is that players on teams playing on consecutive days will perform worse as a result of fatigue. To test this hypothesis, a binary back-to-back feature is created, which signals whether the team also played on the previous day. A variable is also created to signal whether the team's opponent is playing a back-to-back game.
```{r eval=TRUE, message=FALSE, warning=FALSE,results='hide'}
##BACK TO BACK GAME VARIABLE
days_rest = player_data[,.N,by = .(team,date_num)][,.(team,date_num)][
order(team,date_num)]
days_rest$date_diff = ave(days_rest$date_num, days_rest$team,
FUN = function(x) c(10, diff(x)))
days_rest$b2b = ifelse(days_rest$date_diff == 1,1,0)
##add opponent field
player_data[,opponent:= home_team]
player_data[team == home_team,opponent:= away_team]
##join b2b,opp_b2b
setkey(player_data,team,date_num)
setkey(days_rest,team,date_num)
player_data = player_data[days_rest[,.(team,date_num,b2b)],nomatch = 0]
days_rest[,opp_b2b:= b2b][,b2b:= NULL]
setkey(player_data,opponent,date_num)
player_data = player_data[days_rest[,.(team,date_num,opp_b2b)],nomatch = 0]
```
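The date-difference logic behind the b2b flag can be checked on a toy schedule. The sketch below uses a hypothetical `schedule` data frame with made-up dates; as in the chunk above, each team's first game receives a placeholder gap of 10 days so that it is never flagged.

```r
# Hypothetical schedule: one row per team per game date
schedule <- data.frame(
  team = c("BOS", "BOS", "BOS", "MIA", "MIA"),
  date = as.Date(c("2015-11-01", "2015-11-02", "2015-11-05",
                   "2015-11-01", "2015-11-03")),
  stringsAsFactors = FALSE
)
schedule <- schedule[order(schedule$team, schedule$date), ]
schedule$date_num <- as.numeric(schedule$date)

# Days since the team's previous game; the first game gets a placeholder of 10
schedule$date_diff <- ave(schedule$date_num, schedule$team,
                          FUN = function(x) c(10, diff(x)))

# A gap of exactly one day marks the second game of a back-to-back
schedule$b2b <- ifelse(schedule$date_diff == 1, 1, 0)
schedule$b2b  # 0 1 0 0 0
```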
In order to calculate team based statistics, the event_data table must be manipulated to have one record per team per game.
```{r eval=TRUE, message=FALSE, warning=FALSE}
###TEAM BASED DATA
##currently the event data table has one record per game, with statistics for each team
colnames(event_data)
```
As shown in the column names of event_data, team variables are preceded by 'home_' or 'away_'. After splitting the table into one record per team per game, these variable names must be standardized by removing the 'home_'/'away_' prefix.
```{r eval=TRUE, message=FALSE, warning=FALSE,results='hide'}
##calculate final scores for each game
event_data[,home_score:= sum(home_Q1,home_Q2,home_Q3,home_Q4,home_Q5,home_Q6,
home_Q7,home_Q8,na.rm= TRUE),
by = 1:NROW(event_data)]
event_data[,away_score:=sum(away_Q1,away_Q2,away_Q3,away_Q4,away_Q5,away_Q6,
away_Q7,away_Q8,na.rm= TRUE),
by = 1:NROW(event_data)]
##to calculate team based statistics, a data frame/table with 2 records per game -
##[cont'd] one for each team, is ideal
##the code below splits the event_data table into two tables, one for each team,
##[cont'd] standardizes the variable names, and then joins the two tables back together
team_variables = c('3FGA','3FGM','FGM','FGA','FTA','FTM',
'Q1','Q2','Q3','Q4','Q5','Q6','Q7',
'Q8','assists','blocks','fouls',
'points','rebounds','steals','turnovers')
away_variables = paste0('away_',team_variables)
home_variables = paste0('home_',team_variables)
##data table for the home team of each game
event_data_1 = event_data[,team:= home_team][,
setdiff(colnames(event_data),away_variables),with = FALSE]
##data table for the away team of each game
event_data_2 = event_data[,team:= away_team][,
setdiff(colnames(event_data),home_variables),with = FALSE]
##change the column names to generic names (i.e. away_FGM --> FGM,home_FTA --> FTA)
setnames(event_data_1,old = home_variables,new = team_variables)
setnames(event_data_2,old = away_variables,new = team_variables)
##join the two tables back together
team_data = rbind(event_data_1,event_data_2)
```
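The reshaping step can be illustrated on a two-game toy table. This is a minimal base-R sketch rather than the data.table code used in the report; the `events` data frame, the `one_side` helper and the single `points` statistic are hypothetical.

```r
# Hypothetical event table: one row per game, with home_/away_ prefixed statistics
events <- data.frame(
  gameID = c(1, 2),
  home_team = c("MIA", "BOS"),
  away_team = c("BOS", "NYK"),
  home_points = c(101, 95),
  away_points = c(99, 110),
  stringsAsFactors = FALSE
)
stat_cols <- c("points")  # statistics whose prefix should be removed

# Keep one side's statistics, drop the other side's, and strip the prefix
one_side <- function(d, side) {
  other <- setdiff(c("home", "away"), side)
  out <- d[, !(names(d) %in% paste0(other, "_", stat_cols)), drop = FALSE]
  out$team <- d[[paste0(side, "_team")]]
  names(out)[match(paste0(side, "_", stat_cols), names(out))] <- stat_cols
  out
}

# Two records per game: one for the home team and one for the away team
teams <- rbind(one_side(events, "home"), one_side(events, "away"))
```

Each game now appears twice in `teams`, once per team, with a generic `points` column.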
Using the new team_data table, team based features can be calculated. The total FD points for each team and their opponent are calculated below. These features are then merged back to the player_data table.
```{r eval=TRUE, message=FALSE, warning=FALSE,results='hide'}
##add date to team_data
setkey(team_data,gameID)
setkey(player_data,gameID)
team_data= team_data[player_data[,.N,by = .(gameID,date_num)][,.(gameID,date_num)],
nomatch = 0][order(date_num)]
##define opponent in team_data table
team_data[,opponent:= home_team]
team_data[team == home_team,opponent:= away_team]
###team total FD points
team_tot_fd = player_data[,.(team_fd = sum(fd)),by = .(gameID,team)]
setkey(team_tot_fd,gameID,team)
setkey(player_data,gameID,team)
setkey(team_data,gameID,team)
player_data = player_data[team_tot_fd,nomatch = 0]
team_data = team_data[team_tot_fd,nomatch = 0]
###opponent total FD points
team_tot_fd[,`:=` (opponent = team, team = NULL,opp_fd = team_fd,team_fd = NULL)]
setkey(team_tot_fd,gameID,opponent)
setkey(player_data,gameID,opponent)
setkey(team_data,gameID,opponent)
player_data = player_data[team_tot_fd,nomatch = 0]
team_data = team_data[team_tot_fd,nomatch = 0]
```
Another potentially powerful feature is the amount of FD points a team gives up to certain positions. For example, if a team has weak defense against centers/power forwards, it should be reflected in how many FD points they give up to the opposing teams' centers/PFs (on average).
The code below aggregates the FD points by game, team and position. These results are then joined back to team_data, and computed for the opponents. Rolling averages are then calculated, and finally merged back to player_data.
The final result in player_data will look like:
| player | position | team | opponent | ... | opp_fd_5 | opp_p_fd_5 | opp_g_fd_5 | ... |
|------------|----------|------|----------|-----|----------|------------|------------|-----|
| Chris Bosh | PF | MIA | BOS | ... | 3.221 | 5.662 | 6.123 | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
where the last three fields are the average FD points given up by Boston to opposing teams, C/PFs, and PG/SG/SFs respectively.
```{r eval=TRUE, message=FALSE, warning=FALSE,results='hide'}
##position points summary
##calculate the total FD points for each team by position
##in some cases, teams played w/o a center. for those, use the post statistic (p_fd)
posn_sum = player_data[,.(posn_points = sum(fd)),by = .(gameID,team,position)]
posn_sum = posn_sum[,.(pg_fd = sum(ifelse(position == 'PG',posn_points,0)),
sg_fd = sum(ifelse(position == 'SG',posn_points,0)),
sf_fd = sum(ifelse(position == 'SF',posn_points,0)),
pf_fd = sum(ifelse(position == 'PF',posn_points,0)),
c_fd = sum(ifelse(position == 'C',posn_points,0)),
g_fd = sum(ifelse(position %in% c('PG','SG','SF'),posn_points,0)),
p_fd = sum(ifelse(position %in% c('PF','C'),posn_points,0))),
by = .(gameID,team)]
##merge positional points with team_data
setkey(posn_sum,gameID,team)
setkey(team_data,gameID,team)
team_data=team_data[posn_sum,nomatch=0]
##get team opponent data
team_data_opp = team_data[,.(gameID,team,pg_fd,sg_fd,sf_fd,pf_fd,c_fd,g_fd,p_fd,team_fd)]
setnames(team_data_opp,old = c("team","pg_fd","sg_fd","sf_fd","pf_fd","c_fd","g_fd","p_fd","team_fd"),
new = c("opponent","opp_pg_fd","opp_sg_fd","opp_sf_fd","opp_pf_fd","opp_c_fd","opp_g_fd",
"opp_p_fd","opp_fd"))
setkey(team_data,gameID,opponent)
setkey(team_data_opp,gameID,opponent)
team_data = team_data[team_data_opp,nomatch = 0]
##rolling team statistics
##the following stat describes how many fantasy points a team has been giving up to opposing teams,
##...[cont'd] expressed as rolling averages over the last X games
team_data=team_data %>% group_by(team) %>% arrange(date_num) %>%
roll_variable_mean(., 'opp_fd', window_size)
team_data=team_data %>% group_by(team) %>% arrange(date_num) %>%
roll_variable_mean(., 'opp_g_fd', window_size)
team_data=team_data %>% group_by(team) %>% arrange(date_num) %>%
roll_variable_mean(., 'opp_p_fd', window_size)
##join these features to player_data
##join team_data "team" on player_data "opponent"
rolling_team_variables = colnames(team_data)[grepl('fd_\\w*\\d', colnames(team_data))]
player_data = merge(player_data, team_data[,append(rolling_team_variables,
c("gameID","team")),with = FALSE], by.x = c('gameID','opponent'),
by.y = c('gameID','team'), all = FALSE)
```
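The idea of "points allowed by position group" can be sketched on a single toy game. The example below is a simplified base-R illustration, not the report's code: the `box` data frame is made up, and only the two position groups (PG/SG/SF versus PF/C) are computed. What a team allows is simply what its opponent scored in the same game.

```r
# Hypothetical box score for one game
box <- data.frame(
  gameID = 1,
  team = c("MIA", "MIA", "MIA", "BOS", "BOS", "BOS"),
  position = c("PG", "PF", "C", "SG", "SF", "C"),
  fd = c(30, 20, 25, 18, 22, 35),
  stringsAsFactors = FALSE
)
# Same grouping as above: PF/C form the post group, everyone else the guard/wing group
box$group <- ifelse(box$position %in% c("PF", "C"), "p_fd", "g_fd")

# FD points scored per game, team and position group
scored <- aggregate(fd ~ gameID + team + group, data = box, FUN = sum)

# Pair each team with the other team in the same game and position group
allowed <- merge(scored, scored, by = c("gameID", "group"),
                 suffixes = c("", "_opp"))
allowed <- allowed[allowed$team != allowed$team_opp, ]
names(allowed)[names(allowed) == "fd_opp"] <- "fd_allowed"
```

In this toy game Miami allowed 40 FD points to Boston's guards and wings and 35 to Boston's post players.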
## <a name="uni_analysis"></a>Univariate Analysis
Prior to exploring the different variables, a quick overview of the dataset is provided.
```{r echo=TRUE, message=FALSE, warning=FALSE}
NROW(player_data)
NROW(event_data)
player_data[, .N, by = .(season_code, gameID)][,.N, by=season_code]
```
Data for a total of 4916 NBA games is provided, spanning the four seasons from 2012-2013 through 2015-2016. Each season contains ~1200 games.
When exploring the impacts on a player's FD points, the first question that comes to mind is about the distribution of FD points.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(player_data,aes(fd)) + geom_histogram(binwidth = 1) + theme_dlin() +
labs(title = 'NBA Fanduel Points Histogram')
summary(player_data$fd)
```
For the NBA, the FD points distribution is positively skewed. There is a large number of occurrences where players score 0 FD points, despite having played more than 0 minutes in the game. This is due to players recording no statistics during their time on the court. The mean and median FD points for all players are 18.8 and 17.1, respectively.
Another interesting observation is that a few players actually get negative FD points, due to a high number of turnovers and lack of other offensive production.
Next, we examine a breakdown of the different player positions.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(player_data[position %in% c('PG','SG','SF','PF','C'),],aes(position)) +
geom_bar() + theme_dlin() + labs(title = 'NBA Positions Histogram')
##comments - fewer centers compared to other positions
player_data[position %in% c('PG','SG','SF','PF','C'),][,
.(avg_per_game=.N/(NROW(event_data) *2)),by = position]
```
In general, there is an approximately even number of players across all positions, with slightly fewer centers. On average, 1.7 centers play per game per team, compared to 2.0-2.3 players at each of the other positions.
Next, we examine the length of each game. Games that run for extended periods allow players to accumulate more points through increased playing time, leading to higher fantasy production.
```{r echo=FALSE, message=FALSE, warning=FALSE}
# How many games go to OT?
dat = melt(event_data[ , `:=` (Regular = sum(ifelse(is.na(home_Q4),0,1)),
Single_OT = sum(ifelse(is.na(home_Q5),0,1)),
Double_OT = sum(ifelse(is.na(home_Q6),0,1)),
Triple_OT = sum(ifelse(is.na(home_Q7),0,1)),
Quad_OT = sum(ifelse(is.na(home_Q8),0,1)))][,
.(Regular,Single_OT,Double_OT,Triple_OT,Quad_OT)][0:1])
colnames(dat) = c("Game_Length", "count")
dat$fraction = dat$count / sum(dat$count)
dat = dat[order(dat$fraction), ]
dat$ymax = cumsum(dat$fraction)
dat$ymin = c(0, head(dat$ymax, n=-1))
dat$percentage = round(dat$count / sum(dat$count)* 100,2)
ggplot(dat, aes(fill=Game_Length, ymax=ymax, ymin=ymin, xmax=4, xmin=3)) +
geom_rect(colour="grey30") +
coord_polar(theta="y") +
xlim(c(0, 4)) +
theme_dlin() +
theme(panel.grid=element_blank()) +
theme(axis.text=element_blank()) +
theme(axis.ticks=element_blank()) +
labs(title="Fraction of NBA Games in OT: 2012-2015")
dat[,.(Game_Length,percentage)]
```
As shown, only a small fraction of games, approximately 7%, go to overtime.
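The counts in the chunk above are cumulative: a game that reaches double overtime also has non-missing home_Q4 and home_Q5 values, so it is counted under Regular and Single_OT as well. If mutually exclusive categories are preferred, one possible approach is to count the overtime periods played in each game. The sketch below assumes, as the chunk above does, that unplayed periods are stored as NA; the `games` data frame is made up.

```r
# Hypothetical per-period home scores; NA means the period was not played
games <- data.frame(
  home_Q4 = c(25, 30, 22, 28),
  home_Q5 = c(NA, 10, NA, 12),
  home_Q6 = c(NA, NA, NA, 9),
  home_Q7 = rep(NA_real_, 4),
  home_Q8 = rep(NA_real_, 4)
)

# Number of overtime periods played in each game (non-missing Q5..Q8)
ot_cols <- c("home_Q5", "home_Q6", "home_Q7", "home_Q8")
n_ot <- rowSums(!is.na(games[, ot_cols]))

# Each game falls into exactly one category
labels <- c("Regular", "Single_OT", "Double_OT", "Triple_OT", "Quad_OT")
game_length <- factor(labels[n_ot + 1], levels = labels)
counts <- table(game_length)
pct <- round(100 * prop.table(counts), 2)
```

With this tabulation the percentages sum to 100 and the overtime share is the sum of the four overtime categories.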
Rolling variables over different time windows were computed for multiple variables. These rolling variables should follow very similar distributions to the underlying, un-rolled variable. As a sanity check, the distributions of minutes and of average minutes over the past X games are shown below.
```{r echo=FALSE, message=FALSE, warning=FALSE}
##facet grid one of the rolling variable features to show its similar to the un-rolled feature
p1 = ggplot(player_data,aes(minutes_5)) + geom_histogram(binwidth = 1) + theme_dlin()
p2 = ggplot(player_data,aes(minutes_15)) + geom_histogram(binwidth = 1) + theme_dlin()
p3 = ggplot(player_data,aes(minutes_25)) + geom_histogram(binwidth = 1) + theme_dlin()
p4 = ggplot(player_data,aes(minutes)) + geom_histogram(binwidth = 1) + theme_dlin() +
labs(title = 'NBA Rolling Minutes Played Histogram')
multiplot(p4, p1, p2, p3, cols=1)
```
The distributions for the rolling variables look similar to the base minutes distribution. They differ in that the rolling minutes have fewer occurrences of low-minute games. Low-minute games (i.e. <5 min) are presumably due to cases where a player was injured mid-game, a player was brought in to close a blow-out game, or the player was pulled due to poor play. Intuitively, these cases should not occur over consecutive games.
## <a name="bi_analysis"></a>Bivariate Analysis
Through the univariate analysis, an understanding of the underlying distributions and structure of the dataset was achieved. Next, the relationships between various feature variables and a player's FD points are explored.
A logical starting point is to look at how well a player's previous FD points predict his points on a given night. The plot below examines one of the rolling features created, the mean of the FD points from the last 5 games.
```{r echo=FALSE, message=FALSE, warning=FALSE}
r2 = summary(lm(player_data$fd ~ player_data$fd_5))$r.squared
p = ggplot(player_data,aes(x = fd_5, y = fd)) + geom_point(colour=palette[5]) +
theme_dlin() +labs(title = 'NBA Fanduel Points versus Previous Points',
x = 'fd points mean of last 5 games', y = 'fanduel points') +
geom_smooth(method = "lm", se = FALSE,colour='black')
p + annotate("text", x = 10, y = 75, parse=TRUE,
label=paste('R^2: ',substr(as.character(r2),0,4)), family="serif",
fontface="italic", colour="black", size=5)
```
There appears to be a strong trend between FD points and the average of previous FD points. In general, players who put up a lot of points continue to do so, with the opposite holding true as well.
While previous FD points is a strong feature itself, it doesn't account for any factors that may impact a player's performance on a nightly basis. A powerful model would be able to explain deviations from a player's mean. To explore one of these factors, a player's playing time is plotted against his FD points.
```{r echo=FALSE, message=FALSE, warning=FALSE}
r2 = summary(lm(player_data$fd ~ player_data$minutes))$r.squared
p = ggplot(player_data,aes(x = minutes, y = fd)) + geom_point(colour=palette[5]) +
theme_dlin() + labs(title = 'NBA Fanduel Points versus minutes', x = 'minutes',
y = 'fanduel points') + geom_smooth(method = "lm",
se = FALSE,colour='black')
p + annotate("text", x = 15, y = 55, parse=TRUE,
label=paste('R^2: ',substr(as.character(r2),0,4)), family="serif",
fontface="italic", colour="black", size=5)
```
The plot above shows that there is a very strong relationship between minutes played and FD points, with an R-squared of approximately 0.65. Minutes played correlates much more strongly with FD points than past FD points do. This is intuitive, as more time spent on the court allows players to produce more, via shots, rebounds, assists etc. It also does a better job of explaining variation from a player's mean, as some nights a player will play more or less as dictated by the flow of the game.
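When comparing the R-squared values on these scatter plots with the correlation coefficients reported elsewhere in this report, it helps to recall that for a linear regression with a single predictor and an intercept, R-squared equals the squared Pearson correlation. The toy vectors below are made up purely to illustrate the identity.

```r
# Made-up values standing in for minutes played (x) and FD points (y)
x <- c(10, 20, 25, 30, 36)
y <- c(12, 25, 28, 41, 50)

fit <- lm(y ~ x)
r2_lm <- summary(fit)$r.squared   # R-squared from the fitted model
r2_cor <- cor(x, y)^2             # squared Pearson correlation

isTRUE(all.equal(r2_lm, r2_cor))  # the two quantities agree
```

Under this identity, an R-squared of 0.65 corresponds to a correlation of roughly 0.81.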
Some people would argue that a more informative variable is a player's FD points per minute, a measure of efficiency.
```{r echo=TRUE, message=FALSE, warning=FALSE, results = 'hide'}
##define player efficiency feature
player_data[, eff := fd/minutes]
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(player_data,aes(x = eff, y = fd)) + geom_point(colour=palette[5]) + theme_dlin() +
labs(title = 'NBA Fanduel Points versus Efficiency',
x = 'FD Points Per Minute (Efficiency)', y = 'fanduel points')
```
While there appears to be some trend, the data is relatively noisy because of a number of very high efficiency values.
```{r echo=TRUE, message=FALSE, warning=FALSE}
#examine some outliers
glimpse(player_data[eff>3,.(gameID,player,minutes,fd,`3FGM`,FGM,FTM,rebounds,
assists,blocks,steals,turnovers)][0:10])
```
All of these outliers appear to be due to scenarios where players recorded multiple points, assists or rebounds in under 3 minutes, which is clearly unsustainable production.
```{r echo=FALSE, message=FALSE, warning=FALSE}
##filter and re-plot
ggplot(player_data[minutes>5,],aes(x = eff, y = fd)) +
geom_point(colour=palette[5]) + theme_dlin() +
labs(title = 'NBA Fanduel Points versus Efficiency',
x = 'FD Points Per Minute (Efficiency)', y = 'fanduel points') +
geom_smooth(method = "lm", se = FALSE,colour='black')
```
After filtering on minutes>5 and re-plotting, the positive trend between efficiency and FD points becomes clearer.
When building a predictive model, neither minutes played nor efficiency will be available before the game. As a result, these variables cannot be used in a model. In order to get a quick glimpse of the predictive power of various features, a correlation matrix is used.
```{r echo=FALSE, message=FALSE, warning=FALSE}
##corrplot - general
col2 <- colorRampPalette(c("#67001F", "#B2182B", "#D6604D", "#F4A582", "#FDDBC7",
"#FFFFFF", "#D1E5F0", "#92C5DE", "#4393C3", "#2166AC", "#053061"))
M=cor(player_data[,.(fd,minutes,starter,homeaway,fouls,FGA)])
corrplot.mixed(M,lower="number",upper='pie',col=col2(4))
```
The homeaway variable has the weakest correlation with FD points. As shown below, the mean FD score is only 0.6 points higher for players at home.
```{r echo=TRUE, message=FALSE, warning=FALSE}
##mean fd points home versus away
player_data[,.(mean_score=mean(fd)),by = homeaway]
```
An additional feature is created to help predict minutes played. Playing time can be impacted by injuries, fouls, blowout games or games that go to overtime. Predicting such instances would require more advanced models. A simple way to account for a potential increase or decrease in playing time is to measure positional depth, defined here as the number of other players on the same team who played the same position in that game.
```{r echo=TRUE, message=FALSE, warning=FALSE}
##build position depth feature
team_depth = player_data[, .(pos_depth=.N - 1) ,by=.(gameID,team,position)]
player_data = player_data %>% inner_join(team_depth)
```
If a starting or backup point guard is injured, the depth at PG will decrease by 1 and signal a potential increase in playing time for the remaining point guards.
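A toy example makes the pos_depth definition concrete. The sketch below uses base R's ave() in place of the data.table grouping used in the report, and the `box` data frame is made up.

```r
# Hypothetical box score: three point guards and one center on the same team
box <- data.frame(
  gameID = c(1, 1, 1, 1),
  team = c("MIA", "MIA", "MIA", "MIA"),
  player = c("A", "B", "C", "D"),
  position = c("PG", "PG", "PG", "C"),
  stringsAsFactors = FALSE
)

# Count the players sharing the same game, team and position,
# then subtract 1 so that the player himself is excluded
box$pos_depth <- ave(seq_along(box$player),
                     box$gameID, box$team, box$position,
                     FUN = length) - 1
box$pos_depth  # 2 2 2 0
```

If point guard C sat out this game, the remaining two point guards would each have a pos_depth of 1 instead of 2.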
```{r echo=FALSE, message=FALSE, warning=FALSE}
##minutes vs depth at pos
ggplot(player_data,aes(factor(pos_depth),minutes)) + geom_boxplot() +
  theme_dlin() + labs(title = 'NBA Minutes Played versus Positional Depth',
  x = 'pos_depth', y = 'minutes') + geom_smooth(method = "lm", se = FALSE, colour='black')
```
There appears to be a significant trend between positional depth and minutes played. As expected, greater positional depth is associated with less playing time. There appear to be some outliers at pos_depth>5, as it is highly unlikely that a team would carry more than 5 players at a single position.
```{r echo=TRUE, message=FALSE, warning=FALSE}
##INSPECT pos_depth > 5
glimpse(player_data[pos_depth > 5,.(team,player)][order(player)]) ##-->duplicate player records
```
Upon further inspection, these instances appear to be cases where there were duplicate player records in the data.
```{r echo=TRUE, message=FALSE, warning=FALSE}
##get all dupes
glimpse(player_data[, .(count = .N), by = .(gameID,team,player)][count>1, .(gameID,player)])
```
As shown above, a total of 109 duplicates are present. This isn't expected to have a large effect on the data exploration due to the relatively small number of occurrences.
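Should the duplicates need to be removed before modelling, one possible approach is to keep the first record for each game, team and player combination. The sketch below uses base R's duplicated() on a made-up `records` data frame; it assumes that repeated rows for the same key are true duplicates rather than distinct stat lines.

```r
# Hypothetical player records; player A appears twice in game 1
records <- data.frame(
  gameID = c(1, 1, 1, 2),
  team = c("MIA", "MIA", "MIA", "MIA"),
  player = c("A", "A", "B", "A"),
  fd = c(20, 20, 15, 30),
  stringsAsFactors = FALSE
)

# Flag every repeat of a (gameID, team, player) combination after its first occurrence
key_cols <- c("gameID", "team", "player")
is_dupe <- duplicated(records[, key_cols])
deduped <- records[!is_dupe, ]
```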
Next, the correlation of different rolling minutes played is examined.
```{r echo=FALSE, message=FALSE, warning=FALSE}
M3=cor(player_data[is.na(minutes_1) == FALSE & is.na(minutes_2) == FALSE &
is.na(minutes_3) == FALSE & is.na(minutes_5) == FALSE &
is.na(minutes_15) == FALSE & is.na(minutes_25) == FALSE &
is.na(minutes_35) == FALSE & is.na(minutes_55) == FALSE,
.(minutes,minutes_1,minutes_2,minutes_3,minutes_5,minutes_15,
minutes_25,minutes_35,minutes_55)])
corrplot.mixed(M3,lower="number",upper='circle')
```
The average minutes played over the last 3-5 games and over the last 15 games appear to be the strongest predictors of minutes played on a nightly basis. Minutes_5 and minutes_15 have correlations of 0.77 and 0.76, respectively, with actual minutes played.
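The chain of is.na() filters used before calling cor() can also be written more compactly. As a sketch on a made-up `toy` data frame, base R's complete.cases() and the `use = "complete.obs"` argument of cor() both restrict the calculation to rows with no missing values and give the same matrix.

```r
# Made-up minutes with lagged rolling features that start out as NA
toy <- data.frame(
  minutes   = c(30, 32, 28, 35, 31, 29),
  minutes_1 = c(NA, 30, 32, 28, 35, 31),
  minutes_3 = c(NA, NA, NA, 30, 31.67, 31.33)
)

# Option 1: filter to complete rows first, then correlate
M_filter <- cor(toy[complete.cases(toy), ])

# Option 2: let cor() drop incomplete rows itself
M_use <- cor(toy, use = "complete.obs")

isTRUE(all.equal(M_filter, M_use))
```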
Next, we examine if certain positions tend to record more minutes than others.
```{r echo=FALSE, message=FALSE, warning=FALSE}
##minutes played by position
ggplot(player_data,aes(factor(position),minutes)) + geom_boxplot() + theme_dlin() +
labs(title = 'NBA Minutes Played By Position')
player_data[,.(avg_min_played = mean(minutes)),by=position]
```
Guards & small forwards play more than centers & power forwards on average.
Similar to the rolling minutes variables, we examine which rolling FD points variable correlates most strongly with FD points on a nightly basis.
```{r echo=FALSE, message=FALSE, warning=FALSE}
##corrplot - rolling variables
M2=cor(player_data[is.na(fd_1) == FALSE & is.na(fd_2) == FALSE &
is.na(fd_3) == FALSE & is.na(fd_5) == FALSE &
is.na(fd_15) == FALSE & is.na(fd_25) == FALSE &
is.na(fd_35) == FALSE & is.na(fd_55) == FALSE,
.(fd,fd_1,fd_2,fd_3,fd_5,fd_15,fd_25,fd_35,fd_55)])
corrplot.mixed(M2,lower="number",upper='circle')
```
Again, we see that rolling variables over the last 5+ games have the highest correlation with nightly performance.
Next, we see which positions tend to record the most FD points.
```{r echo=FALSE, message=FALSE, warning=FALSE}
##points scored by position
ggplot(player_data,aes(reorder(factor(position),-fd,median),fd)) +
geom_boxplot() + theme_dlin() +
labs(title = 'NBA Fanduel Points By Position',x='position',y='fd')
player_data[,.(mean_fd=mean(fd)),by=position][order(mean_fd)]
```
Point guards and centers tend to record the most FD points on a nightly basis. For the Fanduel site, lineups must have a fixed number of players at each position. However, other DFS sites allow for flex positions, so targeting point guards & centers in those spots could be an advantageous strategy.
The total fanduel points scored by each team last season is shown below.
```{r echo=FALSE, message=FALSE, warning=FALSE}
##points scored by team
ggplot(player_data[season_code==20152016,.N,by=.(gameID,team,team_fd)],
aes(reorder(factor(team),-team_fd,median),team_fd)) + geom_boxplot() + theme_dlin() +
labs(title = 'NBA Fanduel Points By Team for the 2015-2016 Season',x='team',
y='team total fanduel points') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
Golden State & Oklahoma City scored the most fantasy points per game. Targeting players on these teams is a possible strategy.
The total fanduel points allowed by each team in the 2012-2013 season is shown below.
```{r echo=FALSE, message=FALSE, warning=FALSE}
##points allowed by team
ggplot(team_data[season_code==20122013,.N,by=.(gameID,team,opp_fd)],
aes(reorder(factor(team),-opp_fd,median),opp_fd)) + geom_boxplot() + theme_dlin() +
labs(title = 'FD Points Allowed by Team for the 2012-2013 Season',
x='team',y='team total allowed fanduel points') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
Sacramento & Charlotte had the most porous defenses from a fantasy standpoint. As a result, players competing against these teams could be targeted.
### Stacking
"Stacking" is a popular concept in daily fantasy sports that involves putting players from the same team on your lineup. The idea is that if one player is performing well, his teammates could also benefit through assists or additional opportunities (via rebounds, steals etc.). To explore whether stacking is a viable strategy in the NBA, it is necessary to examine the correlations between players and their teams.
First, we examine how fantasy points correlate at each position.
```{r echo=FALSE, message=FALSE, warning=FALSE}
col1 <- colorRampPalette(c("#7F0000","red","#FF7F00","yellow","white",
"cyan", "#007FFF", "blue","#00007F"))
col2 <- colorRampPalette(c("#67001F", "#B2182B", "#D6604D", "#F4A582", "#FDDBC7",
"#FFFFFF", "#D1E5F0", "#92C5DE", "#4393C3", "#2166AC", "#053061"))
col3 <- colorRampPalette(c("red", "white", "blue"))
col4 <- colorRampPalette(c("#7F0000","red","#FF7F00","yellow","#7FFF7F",
"cyan", "#007FFF", "blue","#00007F"))
#C1 = cor(player_data[,.(fd,team_fd,opp_fd)])
#corrplot.mixed(C1,lower="number",upper='circle',col=col2(4))
C2 = cor(team_data[,.(pg_fd,sg_fd,sf_fd,pf_fd,c_fd,team_fd,opp_fd)])
corrplot.mixed(C2,lower="number",upper='circle',col=col2(4))
```
The cell with -0.38 represents the correlation between total point guard fantasy points and total shooting guard fantasy points. In general, players on the same team are negatively correlated with each other. Intuitively this makes sense: the shot attempts one player takes away from a teammate outweigh the fantasy points generated by assists. The negative correlation is weaker between perimeter & post players (i.e. PG/C or SG/C).
```{r echo=FALSE, message=FALSE, warning=FALSE}
C3 = cor(team_data[,.(pg_fd,sg_fd,sf_fd,pf_fd,c_fd,opp_pg_fd,
opp_sg_fd,opp_sf_fd,opp_pf_fd,opp_c_fd)])
corrplot.mixed(C3,lower="number",upper='circle',col=col2(4))
```
The correlations between opposing players at each position are weaker.
Next, correlation among the starting players is examined (as opposed to the bench and starting players combined, as above).
```{r echo=TRUE, message=FALSE, warning=FALSE, results = 'hide'}
starter_data = dcast(player_data[starter == 1,], gameID + team ~ position,
fun.aggregate = mean, value.var = c('fd'), na.rm=T)
starter_data = as.data.table(starter_data)
```
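To make the reshaping step concrete, the sketch below applies the same long-to-wide `dcast` to a small made-up table. The values, the team code `AAA` and the object names `toy` and `wide` are hypothetical and only illustrate the shape of the result; it assumes `data.table::dcast` rather than the `reshape2` version.

```r
library(data.table)

# Toy starter rows (hypothetical values): one row per player per game
toy <- data.table(
  gameID   = rep(1:3, each = 3),
  team     = "AAA",
  position = rep(c("PG", "SG", "C"), times = 3),
  fd       = c(30, 20, 25, 40, 10, 35, 35, 18, 28)
)

# Long to wide: one row per game and team, one column per position
wide <- dcast(toy, gameID + team ~ position,
              fun.aggregate = mean, value.var = "fd")

# Pairwise correlations between the position columns
cor(wide[, .(PG, SG, C)], use = "complete.obs")
```

Each row of `wide` is one team-game, so the correlation matrix measures how the positions move together within the same game.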
```{r echo=FALSE, message=FALSE, warning=FALSE}
C5 = cor(starter_data[,.(PG,SG,SF,PF,C)],use = 'complete.obs')
corrplot.mixed(C5,lower="number",upper='circle',col=col2(4))
```
Again, the same trend is shown, with perimeter players being negatively correlated with each other, as are post players.
Next, I examine the impact of high scoring games. First, the team scores (i.e. actual total points scored) are joined to the player data table.
```{r echo=TRUE, message=FALSE, warning=FALSE, results = 'hide'}
player_data = player_data %>% inner_join(event_data[,.(gameID,home_score,away_score)])
player_data[, team_score := home_score]
player_data[team == away_team, team_score := away_score]
player_data[, score_bucket := 1]
player_data[team_score > 75, score_bucket := 2]
player_data[team_score > 100, score_bucket := 3]
player_data[team_score > 125, score_bucket := 4]
```
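As an aside, the four sequential assignments above could also be expressed as a single call to base R's `cut()`. The sketch below is a minimal illustration on a made-up vector (`toy_scores` is hypothetical, not taken from the data set) and assumes the same break points of 75, 100 and 125.

```r
# Toy team scores (hypothetical values, for illustration only)
toy_scores <- c(70, 75, 76, 100, 101, 125, 126, 140)

# right = TRUE closes each interval on the right, e.g. (75, 100],
# which matches the strict ">" comparisons used in the chunk above
score_bucket <- cut(toy_scores,
                    breaks = c(-Inf, 75, 100, 125, Inf),
                    labels = FALSE, right = TRUE)

data.frame(team_score = toy_scores, score_bucket = score_bucket)
```

With `labels = FALSE`, `cut()` returns integer codes 1 to 4, so the result can be used directly as the `score_bucket` column.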
The average FD points in each score bucket are shown below.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(player_data[,.(mean_fd = mean(fd)),by = score_bucket],
aes(x = score_bucket, y = mean_fd)) +
geom_point(colour=palette[5]) + theme_dlin() +
scale_x_continuous(breaks=1:4,labels = c("<=75","76-100","101-125",">125")) +
labs(title = 'NBA Fanduel Points - Effect of High Scoring Games',
x = 'total points scored', y = 'average fanduel points') +
geom_smooth(method = "lm", se = FALSE,colour='black')
```
As expected, high scoring games correlate directly with higher fantasy point production. Oddsmakers publish projected point totals for each NBA game, so players in games with high projected totals should be targeted.
Lastly, I examine the effect of weak rebounding teams and teams with lots of turnovers.
```{r echo=TRUE, message=FALSE, warning=FALSE,results='hide'}
###opponent rebounds and turnovers
opp_stats = team_data[, .(gameID,team,rebounds,turnovers)]
setnames(opp_stats, old=c('team','rebounds','turnovers'),
new=c('opponent','opp_rebounds','opp_turnovers'))
player_data = player_data %>% inner_join(opp_stats)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(player_data,aes(x = opp_turnovers, y = fd)) +
geom_point(colour=palette[5]) + theme_dlin() +
labs(title = 'NBA Fanduel Points - Effect of Opponent Turnovers',
x = 'opponent turnovers', y = 'fanduel points') +
geom_smooth(method = "lm", se = FALSE,colour='black')
```
When opposing teams turn the ball over more frequently, it does not necessarily lead to higher fantasy production.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(player_data[position %in% c('C','PF')],aes(x = opp_rebounds, y = fd)) +
geom_point(colour=palette[5]) + theme_dlin() +
labs(title = 'NBA Fanduel Points - Effect of Opponent Rebounds on Centers and Power Forwards',
x = 'opponent rebounds', y = 'fanduel points') +
geom_smooth(method = "lm", se = FALSE,colour='black')
```
Centers and power forwards are expected to benefit from weak rebounding teams due to more fantasy points from rebounds and second-chance baskets. The plot above shows a slightly negative trend (i.e. more rebounds by the opponent corresponds to lower fantasy point production for PFs/Cs).
The formula for FD points was shown earlier. The plots below show how the distributions of relevant statistics vary between positions.
```{r echo=TRUE, message=FALSE, warning=FALSE}
base_stats = player_data[, .(`3FGM` = sum(`3FGM`)/.N , FGM = sum(FGM)/.N,
FTM = sum(FTM)/.N, rebounds = sum(rebounds)/.N,
assists = sum(assists)/.N, blocks = sum(blocks)/.N,
steals = sum(steals)/.N, turnovers = sum(turnovers)/.N,
fouls = sum(fouls)/.N), by = .(gameID, position)]
d = melt(base_stats, id.vars = c("gameID", "position"))
ggplot(d, aes(x = value, y = ..count.., colour = position)) +
facet_wrap(~variable,scales = "free_x") + geom_density() + theme_dlin() +
scale_y_continuous(limits = c(0, 10000)) + theme(legend.position="bottom")
```
Several conclusions can be drawn from the plot above, all intuitive to someone who has watched the sport. Starting at the top left, centers make the fewest three-pointers, followed by power forwards, as indicated by their positively skewed distributions. Field goals and free throws are similar across all positions. Point guards have the highest number of assists, whereas centers have the highest number of fouls.
## <a name="multi_analysis"></a>Multivariate Analysis
As an extension of what was shown above, I plot the average FD points scored and allowed by each team in each season.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(team_data[,.(mean_opp_fd = mean(opp_fd)),by=.(season_code,team)],
aes(factor(team),mean_opp_fd,colour=factor(season_code))) + geom_point() +
theme_dlin() + facet_grid(season_code~.) +
labs(title = 'FD Points Allowed by Team',x='team',y='average fanduel points') +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_colour_discrete(name="Season") + theme(legend.position="bottom")
head(team_data[,.(mean_opp_fd = mean(opp_fd)),by=.(team)][order(mean_opp_fd)],5)
tail(team_data[,.(mean_opp_fd = mean(opp_fd)),by=.(team)][order(mean_opp_fd)],5)
```
The plot above helps identify teams that are consistently weak defensively. Over the past 4 seasons, the Lakers have been one of the weakest defensive teams (from a fantasy points perspective), whereas Memphis has been the strongest.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(team_data[,.(mean_team_fd = mean(team_fd)),by=.(season_code,team)],
aes(factor(team),mean_team_fd,colour=factor(season_code))) + geom_point() +
theme_dlin() + facet_grid(season_code~.) +
labs(title = 'FD Points Scored by Team',x='team',y='average fanduel points') +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_colour_discrete(name="Season") + theme(legend.position="bottom")
head(team_data[,.(mean_team_fd = mean(team_fd)),by=.(team)][order(-mean_team_fd)],5)
tail(team_data[,.(mean_team_fd = mean(team_fd)),by=.(team)][order(-mean_team_fd)],5)
```
Golden State and Oklahoma City have been the top fantasy point scoring teams whereas New York & Brooklyn have been the worst across each season.
Next, we see which teams give up the most points to each position.
```{r echo=FALSE, message=FALSE, warning=FALSE}
##points allowed by team to each position, do one coloured dot for each position
pa_pos = team_data[,.(opp_pg_fd = mean(opp_pg_fd), opp_sg_fd = mean(opp_sg_fd),
opp_sf_fd = mean(opp_sf_fd), opp_pf_fd = mean(opp_pf_fd),
opp_c_fd = mean(opp_c_fd)), by = .(season_code,team)]
pa_pos = melt(pa_pos,id.vars = c('season_code','team'))
ggplot(pa_pos[season_code==20152016,],aes(factor(team),value,colour=factor(variable))) +
geom_point() + theme_dlin() +
labs(title = 'NBA Fanduel Points Allowed by Team to Each Position for the 2015-2016 Season',
x='team',y='team mean total allowed fanduel points') +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_colour_discrete(name="Position",labels=c("opp_pg_fd" = "PG",
"opp_sg_fd" = "SG","opp_sf_fd" = "SF","opp_pf_fd"="PF","opp_c_fd"="C")) +
theme(legend.position="bottom")
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
tail(pa_pos[season_code==20152016 & variable == 'opp_pg_fd',][order(value)],2)
tail(pa_pos[season_code==20152016 & variable == 'opp_c_fd',][order(value)],2)
```
The Lakers gave up the most points to opposing point guards and centers, followed closely by Phoenix (point guards) and Philadelphia (centers).
```{r echo=FALSE, message=FALSE, warning=FALSE}
head(pa_pos[season_code==20152016 & variable == 'opp_pg_fd',][order(value)],2)
head(pa_pos[season_code==20152016 & variable == 'opp_c_fd',][order(value)],2)
```
Atlanta & Boston gave up the fewest points to opposing point guards, whereas Golden State and Cleveland were the best at containing opposing centers.
As an extension of the figure above, the plot below shows the average fantasy points allowed to each position over all seasons present in the data set.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(pa_pos,aes(factor(team),value,colour=factor(variable))) + geom_point() +
facet_grid(season_code~.) + theme_dlin() +
labs(title = 'NBA Fanduel Points Allowed by Team to Each Position',x='team',
y='team mean total allowed fanduel points') +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_colour_discrete(name="Position",labels=c("opp_pg_fd" = "PG",
"opp_sg_fd" = "SG","opp_sf_fd" = "SF","opp_pf_fd"="PF","opp_c_fd"="C")) +
theme(legend.position="bottom")
```
Over all seasons, teams give up the most fantasy points to point guards and power forwards.
The same analysis can be applied at the individual player level. LeBron's performance against each team in each season is shown below.
```{r echo=FALSE, message=FALSE, warning=FALSE}
##lebrons points against certain teams
lebron = player_data[player == 'LeBron James',.(mean_fd = mean(fd), count = .N),
by = .(opponent,season_code)]
ggplot(lebron,aes(reorder(factor(opponent),-mean_fd),mean_fd,colour=factor(season_code))) +
geom_bar(position="dodge", stat="identity") +
theme_dlin() + facet_grid(season_code ~ ., scales="free") +
labs(title = 'LeBron James Average Fanduel Points Against Each Team',
x='team',y='average fanduel points') +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_colour_discrete(name="Season") +
theme(legend.position="bottom")
head(lebron[!(opponent %in% c('MIA','CLE')),.(mean_fd=mean(mean_fd)),
by=opponent][order(mean_fd)],5)
tail(lebron[!(opponent %in% c('MIA','CLE')),.(mean_fd=mean(mean_fd)),
by=opponent][order(mean_fd)],5)
```
LeBron scored the most against Charlotte and OKC across his past 4 seasons, and was held to his lowest average fantasy points against Portland and Brooklyn.
## <a name="final"></a>Final Plots and Summary
The first final plot chosen was created using one of the engineered feature variables. The variable created was the amount of fantasy points allowed by each team to each opposing position for every game. The plot below shows the average fantasy points allowed to each position over all seasons present in the data set.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(pa_pos,aes(factor(team),value,colour=factor(variable))) + geom_point() +
facet_grid(season_code~.) + theme_dlin() +
labs(title = 'NBA Fanduel Points Allowed by Team to Each Position',x='team',
y='team mean total allowed fanduel points') +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_colour_discrete(name="Position",labels=c("opp_pg_fd" = "PG",
"opp_sg_fd" = "SG","opp_sf_fd" = "SF","opp_pf_fd"="PF","opp_c_fd"="C")) +
theme(legend.position="bottom")
```
Multiple insights can be drawn from this plot. All teams allow the most fantasy points to opposing point guards, which was also reflected in a previous plot that showed average fantasy points at each position. The plot also allows for identification of both friendly and unfriendly teams to target in each matchup. For example, Memphis & San Antonio have been two of the strongest defensive teams, allowing the fewest fantasy points to all positions each season. On the other hand, the LA Lakers have been one of the most porous teams.
Based on this, a potential feature variable in a predictive model could be a rolling average of fantasy points allowed to each position.
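One way such a feature might be built is sketched below with `data.table`. This is a minimal illustration under stated assumptions, not the feature pipeline used in this report: the toy values, the team codes `AAA` and `BBB`, the `game_no` ordering column, the 3-game window and the output column name `opp_pg_fd_3` are all hypothetical.

```r
library(data.table)

# Toy game log (hypothetical values): FD points allowed to point guards
toy <- data.table(
  team      = rep(c("AAA", "BBB"), each = 5),
  game_no   = rep(1:5, times = 2),
  opp_pg_fd = c(40, 50, 60, 70, 80, 30, 30, 45, 45, 60)
)
setorder(toy, team, game_no)

# Rolling mean over a 3-game window, then lagged by one game so the
# feature for a given night only uses games that were already played
toy[, opp_pg_fd_3 := shift(frollmean(opp_pg_fd, n = 3), n = 1), by = team]
toy
```

The lag matters for modeling: without `shift()`, the rolling window would include the game being predicted and leak the target into the feature.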
The second plot chosen was a combination of the correlation plots shown earlier.
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.height=10}
##corrplot - multiple variables
M=cor(player_data[is.na(fd_15) == FALSE & is.na(minutes_15) == FALSE &
is.na(FGA_15) == FALSE,
.(fd,fd_15,minutes_15,FGA_15,pos_depth,starter,homeaway,fouls)])
corrplot.mixed(M,lower="number",upper='circle',col=col2(4),
title='Correlation of Potential FD Features',mar=c(0,0,1,0))
```
The plot above does a good job of summarizing the variables that have strong positive or negative relationships with fantasy points.
The third plot chosen was another correlation matrix, this time between various positions. The plot is particularly important for coming up with a stacking strategy.
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.height=10}
C3 = cor(team_data[,.(pg_fd,sg_fd,sf_fd,pf_fd,c_fd,opp_pg_fd,opp_sg_fd,
opp_sf_fd,opp_pf_fd,opp_c_fd)])
corrplot.mixed(C3,lower="number",upper='circle',col=col2(4),
title="Correlation Between Various Positions",mar=c(0,0,1,0))
```
In general, players on the same team are negatively correlated with each other. The negative correlation is weaker between perimeter & post players (i.e. PG/C or SG/C). The correlations between opposing players at each position are weaker still. As a result, choosing players on opposite teams should not have a negative impact on fantasy production, unlike other matchup-based fantasy sports where opposing players are more negatively correlated (i.e. hockey goalies & opposing skaters, baseball pitchers and opposing batters).
Overall, many of the feature variables explored show strong linear correlations with fantasy points, laying the groundwork for a powerful, predictive model to be built.
## <a name="reflection"></a>Reflection
Overall, using R for the exploratory data analysis was a good experience, as the language and the RStudio development environment make work quick and easy. Once I learned the syntax of data.table and ggplot, and the useful library reshape, data cleaning, massaging and aggregating became extremely quick and efficient. Being able to quickly group, select and aggregate to produce stats along with plots made the analysis more thorough and informative. Most difficulties encountered were related to formatting with ggplot2 (e.g. changing label titles, creating facet grids, custom x-axis labels etc.). Despite these minor difficulties, I find plotting with R & RStudio a much better experience than Python.
For future work, it is recommended to add other data sources to explore additional variables that may impact fantasy points. For example, Vegas odds variables such as win probability or total projected points could be useful. The analysis showed that games that go to overtime or are high scoring yield higher average fantasy point totals, so games with narrow win margins or high projected point totals could be targeted.