-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathRmd_code
1140 lines (879 loc) · 53 KB
/
Rmd_code
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Predicting 2023 Formula One World Constructors' Championship Standings"
subtitle: "PSTAT 131 Final Project"
author: "Sarah Liang"
date: "2/2/2023"
output:
html_document:
theme: simplex
toc: true
toc_float: true
toc_depth: 4
code_folding: show
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE,
warning = FALSE)
```
## Introduction
The goal of this project is to build a machine learning model to predict the Formula One World Constructors' Championship Standings for the upcoming 2023 season.
\
\
### What is Formula One? What are constructors?
Formula One, or shorthand F1, is open wheel, single-seater car racing at the highest level. There are a total of ten constructors, or racing teams, that are tasked with the goal of constructing a F1 car prior to the annual racing season. During each race season, ten constructors compete with their newly launched F1 cars for a position from first to tenth in the World Constructors' Championship Standings. Each constructor employs two drivers. These drivers also compete for a place from first, i.e. World Champion, to twentieth, in the World Drivers' Championship Standings. The F1 race season is overlooked and regulated by the Fédération Internationale de l'Automobile (FIA).\
\
Racing occurs over a weekend such that Fridays are for free practice, Saturdays are for qualifying, and Sundays are race days. During a race, finishing positions first to tenth earn points from 25, 18, 15, 12, 10, 8, 6, 4, 2, and 1 respectively. The number of races, or grand prix, may vary every year. The upcoming 2023 season has 23 races on the calendar, as opposed to the 2022 F1 calendar consisting of 22 races. Some races are new, and some drivers are new. This may pose some difficulty later on, so best we mention it now.
\
__What is Free Practice?__
\
_On Fridays, constructor teams put out their two drivers on the circuit to collect data about the car, get a feel of how the car handles the track, and in general obtain as much practice and knowledge to prepare and perform during qualifying and race day._
\
__How Qualifying Works:__
\
_On Saturdays, drivers will have allotted time to record a fast lap time for Q1, in which the five slowest drivers are cut off (positions 16 to 20). Then Q2 begins and then the remaining fifteen drivers have allotted time to record another fast lap time, during which the five slowest drivers of the remaining fifteen are cut off (positions 11 to 15). During Q3, the remaining ten drivers will record a final fast lap time to earn a position from 1 to 10. The last position a driver earned before cut off or the end of Q1 determines the position they will start from during Sunday race day._
\
\
\
### Why F1?
Formula One, an inherently European-rooted sport, has gained immense popularity recently in the United States, partially due to a Netflix show "Drive to Survive" covering the motor sport. While the show itself is highly dramatized and an inaccurate representation of actual race events at times, I developed an interest like many others in the sport and began watching archived races, qualifying sessions, and more. It is known amongst fans and followers of the sport that racing teams with a higher budget and access to rich resources tend to fare better in Constructors' Championship Standings. For many years, wealthy and heavily-sponsored constructors like Mercedes, Ferrari, and Red Bull have occupied the higher ranks of the World Championship. However, in recent years the FIA has instated a budget cap and other regulations in hopes of leveling the playing field. With the upcoming 2023 season kicking off, I am inclined to predict the results for the 2023 season, not only for my own fun, but also to see if the FIA's regulations are contributing to the beginning of a new era of competition in F1 racing.
\
\
\
## Exploring and Modifying Data
We are going to begin this project by just taking a general look at the data and consider what we need to do to construct a data set we can work with, i.e. missing data, useful and/or not useful variables, necessary key/ID values, manipulating and/or converting variables, etc.
\
\
### Data Source
To obtain the data, I will be using [Ergast Developer API](http://ergast.com/mrd/db/) which includes documentation of F1 stats since the beginning of World Championships in 1950. However, I will be downloading the [csv files](http://ergast.com/mrd/db/#csv) from the API website instead of pulling data with the actual API, for convenience purposes. I may also get further information for certain variables from the official [F1 website archive](https://www.formula1.com/en/results/archive-1950-2016.html).\
\
The different csv files are all data sets individually, some of which act as lookup tables for others. We are tasked to combine the data sets efficiently into a single one before any analyzing or model building.
\
\
### Loading Packages and Exploring the Data
<font size="3"> __Code in file read_data.rda__ </font>
\
Let's begin by loading our relevant packages, `set.seed`, reading in our csv files that we downloaded from the API website, and taking a quick look at some of our variables.
```{r collapse=TRUE}
library(tidyverse)
library(tidymodels)
library(kknn)
library(ggplot2)
library(corrplot)
library(ggthemes)
library(kableExtra)
library(parsnip)
library(recipes)
library(magrittr)
library(workflows)
library(glmnet)
library(themis)
library(ranger)
library(vip)
library(naniar)
library(visdat)
library(dplyr)
library(ISLR)
set.seed(2536)
load("C://Users//waliang//Documents//UCSB//third year//pstat 131//read_data.rda")
```
\
Here is a general summary of what each data set involves:
\
- `circuit` - uniquely identified circuits with their locations\
- `races` - uniquely identified races with their round number and locations since 1950\
- `seasons` - list of years of race seasons\
- `results` - collection of results for each driver at each race\
- `drivers` - uniquely identified drivers since 1950\
- `driver_standings` - uniquely identified record of position in Drivers' Championship Standings at the result of a certain race\
- `constructors` - uniquely identified constructor teams since 1950\
- `constructor_results` - uniquely identified record of points earned for some constructor at a certain race\
- `constructor_standings` - uniquely identified record of position in Constructors' Championship Standings at the result of a certain race\
- `status` - lookup table data set of status key value pairs\
- `lap_times` - all lap times for each driver at each race\
- `pit_stops` - pit stop times for each driver at each race\
- `qualifying` - qualifying results for each race\
- `sprint_results` - sprint qualifying results for each driver at certain races that included sprint qualifying
\
\
Here are some things to consider:
\
1. Since a constructor's standing is determined by how many points they've accumulated throughout the race season relative to other teams, it would be ideal to include a `points` variable as our response. Also, the amount of points a constructor obtains is dependent on how many point their drivers are able to obtain at races. So, the response variable can come from and/or be calculated from the `results`, `constructor_results`, `constructor_standings`, or `driver_standings` data sets.\
2. There may be missing data because there will be three rookie drivers entering the 2023 F1 season: Oscar Piastri, Nyck De Vries, and Logan Sargeant. So, it would be hard to predict the next season with no F1 data regarding these three drivers. I will probably take data from previous Formula Two seasons for these drivers, because F2 races concurrently with the F1 season on the same circuits. We can consider `number of points earned for the drivers` as a measurement of how "good" these drivers are. However, since the sum of number of points two drivers earn is the number of points those drivers' constructor earns, we may have some structural collinearity between driver points and constructor points. Though some drivers often change constructors as well, so it is worth being cautious here.\
3. Some constructor teams have changed their names over the years (ex. Force India changed to Racing Point Force India then changed to Racing Point then changed to Aston Martin), so data will be scattered around the data set. Despite the different names, we are grouping these "different, but same" teams because important factors like car part suppliers, important partnerships, etc. are usually the same regardless of the official name change. So, we will have to align the different `team names`. all under one name, preferably the most recent and relevant one.\
\
\
\
```{r}
levels(factor(constructor_results$status))
```
As we can see the `status` variable of the `constructor_results` data set reflects two levels, `\\N` missing and `D` disqualified.
```{r}
# merge constructor names w/ results data set on Id variable
merge_test <- full_join(constructors[,c(1,3)], constructor_results, by="constructorId")
merge_test <- full_join(races[,c(1:3)], merge_test, by="raceId")
merge_test %>%
filter(status == "D")
```
Here we discover this disqualification is because McLaren was disqualified in the 2007 Championship because of the [2007 Espionage Controversy](https://en.wikipedia.org/wiki/2007_Formula_One_espionage_controversy). This will come into play later.
\
\
### Modifying Data Sets
<font size="3"> __Code in file modify_data.rda__ </font>
\
The first thing we do in `modify_data.rda` is remove the `url` column from the following data sets: `constructors`, `drivers`, `circuit`, `races`, and `season`.
\
#### Constructors Related Data Sets
Let's begin by focusing on the constructors information (points at certain races, position in standings at the result of certain races, etc.). Since the `constructors` data set is largely a lookup table, we will focus on merging `constructor_results` and `constructor_standings` first. Let's do the following modifications:\
1. rename the `points` variables to clarify whether they are points earned at each race or points accumulated from each race so far in the season.\
2. remove the 2007 races of McLaren that were labeled as disqualified due to the 2007 Espionage Controversy, as it can be an outlier denoting performance higher than expected for McLaren during those years.
\
After doing so, our `constructor_results` and `constructor_standings` are now `con_res` and `con_stand` respectively.
```{r collapse=TRUE}
load("C://Users//waliang//Documents//UCSB//third year//pstat 131//modify_data.rda")
head(con_res)
head(con_stand)
```
We will `full_join()` merge `con_res` and `con_stand` by `raceId` and `constructorId`. It should not remove anything, only add/join together rows from both data sets. I am also going to remove the `constructorResultsId` and `constructorStandingsId` columns because there is no need for these identification variables after merging all information together.
\
We finish the constructor's section up by adding in the constructor team name `con_name`, reference `constructorRef`, and nationality `con_nation`, using the `constructors`data set as a lookup table.
\
We also add race information, like circuit name and country, race season year, race name, and more, using the `races` and `circuit` data sets similarly.
```{r collapse=TRUE}
# finished constructors related data set
head(merge_con[order(merge_con$raceId),])
```
\
\
#### Drivers Related Data Sets
Let's look at the `drivers`, `driver_standings`, `results`, `lap_times`, `pit_stops`, `qualifiying`, and `sprint_results` data sets.
\
Let's begin with a break down of the `qualifying` data set:
```{r collapse=TRUE}
dim(qualifying)[1]
dim(results)[1]
```
The number of observations in `qualifying` is much less than that of `results` because the `qualifying` data is only provided starting in 2003. If I were to join the two data sets together, I would have a lot of missing data for the `qualifying` variables.
```{r collapse=TRUE}
# 139 missing q1 times (and q2 and q3 given that these drivers were cut off from making it into q2 and q3)
quali_na
# other than single digit minute timestamps and missing times, we have a double digit q1 timestamp 16:42.640 and more blank missing values
rest_of_q1
# we just have missing q2 values
rest_of_q2
# just missing q3 values
rest_of_q3
```
\
There are some missing `q1` times. This could be due to a car issue that interrupted drivers from finishing their laps. Other than the missing values, we can work with the times. We had to check `q2` and `q3` similarly because I want to work with the three variables in one go. We will want to:\
1. convert the timestamps into numeric variables\
2. only keep the fastest time `q_time` out of `q1`, `q2`, and `q3` to streamline the data and avoid the missing values within the `qualifying` data set\
By obtaining the fastest time out of three, it gives us a better idea of the maximum potential of the car and/or driver at a certain weekend and circuit. After some renaming of variables for distinction, here is our `quali` data set with our `q_time` variable.
```{r collapse=TRUE}
head(quali) # Looks nice!
```
\
Before I merge `quali` to `results`, I want to talk about `sprint_results`.
\
__Sprint qualifying__ is another form of qualifying, except the traditional qualifying with Q1, Q2, and Q3 is done on Friday along with free practice to determine the starting grid positions for a sprint race. A sprint race is essentially a shorter, 100 kilometer race of which the finishing positions for each driver will determine the starting grid positions for the actual race.
\
Since `sprint_results` can affect a race result as much as normal qualifying weekends without a sprint race, I would include `sprint_results` as well.
```{r collapse=TRUE}
races %>%
filter(raceId %in% as.numeric(levels(factor(sprint_results$raceId)))) %>%
select(raceId, year, round, name, date)
```
Here, we can see that sprint qualifying was a relatively new procedure enacted for the first time in the 2021 season and is only recorded at six races. So, I would like to combine the `start_pos` from `quali` with the `positionOrder` from `sprint_results` to complete the drivers' race start position variable. There are also `points` awarded for the sprint races, but I would prefer to use the overall `points` from the `driver_standings` data set as a better representation of the total points each driver accumulates with the progress of each race weekend.
```{r}
# new driv_start_pos after merging quali and sprint_results
head(merge_quali)
```
\
The `results` data set holds a lot of race information especially about individual drivers. The most useful variables (other than the identification variables) seem to be `positionText`, `positionOrder`, `points`, `laps`, `grid` (i.e. starting position), `fastestLapTime`, `rank` (i.e. rank of `fastestLapTime`), and `statusId`. This is the majority of variables in `results`, so I will merge the data sets and combine `grid` and `driv_start_pos` from `merge_quali` as they are the same variable. I will address other missing values later.
\
```{r collapse=TRUE}
count(res, res$time=="\\N")
```
The `time` variable has more missing than not, which can be due to not finishing a race because of car failure, etc. So, I think `statusId` and `positionText` would be a better representation of the position at which drivers finished their race and in what conditions. Let's remove `time` and `milliseconds` (`time` measured in milliseconds).
\
Now let's merge our modified `merge_quali` data set and a modified `results` data set.
```{r collapse=TRUE}
# remove time, driver number, position, convert lap time, rename variables to be driver specific, convert positionText into factor for results data set
# this is now our merged data set
head(merge_res)
```
\
Now let's look at the `lap_times` and `pit_stops` data sets. _* Know that `pit_stops` is only from 2012 onward. `lap_times` is from 1996 onward._
```{r collapse=TRUE}
# example of McLaren at raceId==1
lap_times %>%
filter(raceId==1 & driverId==1) %>%
head()
# example of McLaren pit stops
pit_stops %>%
filter(driverId==1) %>%
head()
```
The `lap_times` data set has the lap time of every single lap driven in every race by each driver. `pit_stops` has a record of every pit stop made at each race. Keep in mind, the pit stop duration recorded in the `pit_stops` data set is the __entire amount of time a car spends in the pit lane__, not just the amount of time it takes to change tyres. In general, a car usually spends twenty-something seconds in the pit lane. Thus, the pit stop duration can have quite an impact on the outcome of a driver's race, especially if a team struggles to pit quickly, which happens quite often.
\
To modify these timestamps, I want to create a variable for __average lap time__, __average position__ (rounded), and __average pit stop duration__ for each race given.
\
_* At each race, drivers are required to pit (typically changing four tyres) at least once. Many times, drivers have to pit more than once. We will be taking the average pit stop duration among all the pits a driver makes at a certain race._
```{r collapse=TRUE}
# lap and pit w/ averages
head(lap)
head(pit)
```
Let's merge `lap` and `pit` to `merge_res`.
```{r}
head(merge_res)
vis_miss(merge_res)
```
\
For now, I won't mess with the missing data. It's due to certain data (ex. qualifying times `q_time`, lap times `avg_lap_time`, pit stop durations `avg_pit`) only being available after a certain date.
\
Let's look at one last data set, `driver_standings`.
```{r collapse=TRUE}
# not too different in dimensions is a good thing
dim(driver_standings)[1]
dim(merge_res)[1]
```
Let's begin with renaming:\
1. `position` and `positionText` to `driv_standing` because it conflicts with race result `position` from before, whereas now this is about the drivers' positions in the Drivers' Championship Standings when accumulating the progress from each race.\
2. `points` to `sum_driv_pnts` as these are points not earned at each race, but points earned overall from that race as well as past races in the season.
```{r collapse=TRUE}
count(races, races$time=="\\N")
```
_* I would have liked distinguishing between day and night races using `time` in `races` as that could have a huge impact on tyre degradation and race result, but there is only about 30% non-missing data which can't help us much._
```{r collapse=TRUE}
head(driver_stand)
# nothing else to change about standings, let's merge our modified drivers related data set
# as well as remove unnecessary variables, rename for more distinction, add information from status and drivers data sets (basically using the following as lookup tables)
head(race)
head(circ)
head(stat)
# new merged driver related data set
head(merge_driv)
```
\
\
#### What about Piastri, Sargeant, and De Vries?
Like I mentioned before, we do have three new rookies drivers for the 2023 F1 season. If we take a look at our data, we will notice that our drivers actually are inputted in the `drivers` data set, with their own `driverId` and other values as well. So, they are also in `merge_driv`, albeit mostly NA missing values.
```{r}
drivers %>%
filter(driverRef == "piastri" | driverRef == "sargeant" | forename=="Nyck")
# Nyck de Vries completed one race as a reserve driver for Williams
merge_driv %>%
filter(driverId %in% c(856:858))
```
\
I did consider inputting F2 and F3 data into our data set to fill out these three drivers. However, I am now against the idea because while there may be similarities between F2, F3, and F1, the race weekends are quite different and I'm afraid adding certain values, but being forced to omit others that don't exist in F2 and F3, would create even more missing data.
\
There isn't any resource that I could pull data neatly and have it combine nicely with the data set we have now, so I think I will continue without adding extra data, for the sake of time.
\
\
#### Change of Plans?
Now let's take a look at our two merged data sets: `merge_con` and `merge_driv`.
```{r collapse=TRUE}
dim(merge_con)[1]
dim(merge_driv)[1]
```
Considering the 20,000+ difference in observations, which is expected as there are more unique drivers than constructors, I am against merging `merge_con` and `merge_driv`. At this point of the project, my plan of action is changing a little. Instead of trying to consolidate all my data into a single data set and modelling of of that, I am seriously considering modelling off of __two__ data sets:\
1. Use `merge_con` to predict response variable `con_pos`, using predictors specific to the effect of each individual constructor\
2. Use `merge_driv` to predict `driv_pnts`, using predictors specific to the effect of each individual driver. We can predict `driv_pnts` of our 2023 drivers and sum the drivers corresponding to their 2023 constructor teams and rank the constructors accordingly.
\
I think this plan is definitely doable, and is not entirely too far off from our original plan. We could just use `merge_con` and proceed with our modelling, but I feel like there is so much potential with the data in `merge_driv` that it would be a shame not to use it. However, with the lack of data around our three rookie drivers, perhaps the `merge_con` method would fare better for our prediction goal.
\
\
### Finalized Data Set(s) and Codebook(s)
Here is our two finalized data sets, as well as a codebook for both:
```{r}
head(merge_con)
```
<font size="3"> __merge_con Codebook__ </font>\
`raceId` - unique race identifier; numeric\
`constructorId` - unique constructor identifier; numeric\
`con_pnts` - number of points a constructor earns at a certain race; numeric\
`con_sum_pnts` - total number of points a constructor earned so far in the race season including that certain race; numeric\
`con_pos` - a constructor's position in the Constructors' Championship Standings at the point of that certain race during the race season; numeric\
`con_posText` - `con_pos`; character\
`con_wins` - number of wins a constructor earned so far in the race season including that certain race; numeric\
`constructorRef` - constructor reference name; character\
`con_name` - constructor name; character\
`con_nation` - nationality of constructor team; character\
`season` - year of race season; numeric\
`round` - race order (1 = first race, 2 = second, etc.) in the race season; numeric\
`circuitId` - unique circuit identifier; numeric\
`race_name` - name of grand prix; character\
`circ_country` - country circuit is located in; character\
`circuitRef` - circuit reference name; character\
`circ_name` - name of circuit; character
\
```{r}
head(merge_driv)
```
<font size="3"> __merge_driv Codebook__ </font>\
`raceId` - unique race identifier; numeric\
`driverId` - unique driver identifier; numeric\
`constructorId` - unique constructor identifier; numeric\
`driv_positionText` - a driver's finishing race position at that certain race; factor; includes the following:\
`D` - disqualified\
`E` - excluded\
`F` - failed to qualify\
`N` - not classified\
`R` - retired\
`W` - withdrawn\
`driv_positionOrder` - a driver's place in the finishing order of that certain race; numeric\
`driv_pnts` - number of points a driver earns at a certain race; numeric\
`laps` - number of laps a driver completed in the race; numeric\
`fastestLap` - the lap number during which a driver drove their fastest lap during the race; character\
`fastestLapTime_rank` - the rank of a driver's fastest lap time among that of the other drivers during the race; character\
`fastestLapTime` - a driver's fastest lap time during the race in seconds; numeric\
`fastestLapSpeed` - fastest speed reached during a driver's fastest lap; character\
`driv_statusId` - unique status identifier for the condition in which a driver finished their race; numeric\
`q_time` - fastest qualifying fast lap time (out of Q1, Q2, and Q3) that a driver completed during that race weekend's qualifying session; numeric\
`driv_start_pos` - a driver's starting grid position for that certain race; numeric\
`avg_lap_time` - average of all of a driver's lap times during a race; numeric\
`avg_lap_pos` - rounded average of all positions a driver was at during a race; numeric\
`avg_pit` - average pit stop duration of a driver during a race; numeric\
`sum_driv_pnts` - total number of points a driver earned so far in the race season including that certain race; numeric\
`driv_standing` - a driver's position in the Drivers' Championship Standings at the point of that certain race during the race season; numeric\
`driv_standingText` - `driv_standing`; character; including the following:\
`D` - disqualified\
`driv_wins` - number of wins a driver earned so far in the race season including that certain race; numeric\
`status` - the condition in which a driver finished their race; factor
\
\
## Exploratory Data Analysis
<font size="3"> __Code in file eda.rda__ </font>
\
### Missing Data
Let's take a look at the missing data in `merge_con`.
```{r}
load("C://Users//waliang//Documents//UCSB//third year//pstat 131//eda.rda")
vis_miss(merge_con)
```
\
Let's arrange each constructor's result in order throughout each race season and fill in `con_pnts` and `con_sum_pnts` using each other (since `con_sum_pnts` is an accumulation of `con_pnts` throughout the season).
```{r eval=FALSE}
merge_con1 <- merge_con %>%
# arrange each constructor's result in order throughout season
arrange(constructorId, season, raceId, round, con_sum_pnts) %>%
# replace con_sum_pnts missing values with the actual value using the other con_pnts column
mutate(con_sum_pnts = case_when((is.na(con_sum_pnts) & season==lag(season) & is.na(con_pnts)!=TRUE) ~ lag(con_sum_pnts)+con_pnts,
(is.na(con_sum_pnts) & season!=lag(season) & is.na(con_pnts)!=TRUE) ~ con_pnts,
is.na(con_sum_pnts)!=TRUE ~ con_sum_pnts))
# vice versa
merge_con2 <- merge_con1 %>%
arrange(constructorId, season, raceId, round, con_sum_pnts) %>%
mutate(con_pnts = case_when((is.na(con_pnts) & season==lag(season) & is.na(con_sum_pnts)!=TRUE) ~ con_sum_pnts-lag(con_sum_pnts),
(is.na(con_pnts) & season!=lag(season) & is.na(con_sum_pnts)!=TRUE) ~ con_sum_pnts,
is.na(con_pnts)!=TRUE ~ con_pnts))
# and again the first direction for good measure
merge_con3 <- merge_con2 %>%
arrange(constructorId, season, raceId, round, con_sum_pnts) %>%
mutate(con_sum_pnts = case_when((is.na(con_sum_pnts) & season==lag(season) & is.na(con_pnts)!=TRUE) ~ lag(con_sum_pnts)+con_pnts,
(is.na(con_sum_pnts) & season!=lag(season) & is.na(con_pnts)!=TRUE) ~ con_pnts,
is.na(con_sum_pnts)!=TRUE ~ con_sum_pnts))
```
```{r}
vis_miss(merge_con3)
merge_con3$constructorRef <- factor(merge_con3$constructorRef)
```
\
After fixing `con_pnts` and `con_sum_pnts`, we have a lot less missing data. The missing values in `con_pos`, `con_posText`, and `con_wins` is somewhat minimal. We can deal with these missing values during recipe creation using a linear regression of other variables that measure a constructors progress throughout the race season, like `con_pnts` and `con_sum_pnts`.
\
Other than this, we don't have a lot of missing data and the data set is pretty usable.
\
\
Now let's work on `merge_driv` similarly.
```{r}
vis_miss(merge_driv, warn_large_data = FALSE)
```
\
That's a lot of missing data. There are some drivers with no constructor team! Let's look at those first.
```{r collapse=TRUE}
dim(merge_driv %>%
filter(is.na(constructorId)))[1]
dim(merge_driv)[1]
```
There are 8646 out of 34496 observation with missing values from `constructorId` to `avg_pit` and `status`. My guess is that when I was joining data sets that had information available only after a certain year, like `quali`, `pit`, and even `results`, then that conflicted with `driver_standings` that had a lot of data starting from the beginning of F1, leaving empty spots in variables only available after a certain date.
\
I don't want to just remove these observations, since they still have data for `driv_standing` and `driv_wins`. Some of the missing variables that I would prefer to have, like `driv_positionOrder`, `driv_pnts`, and `driv_start_pos`, are measures of the success of each driver at each race and throughout the season, similar to `driv_standing` and `driv_wins`. However, I think I will be forced to remove the rows with missing `constructorId` because I will have trouble imputing a categorical variable (`constructorId` is a factor).
```{r}
vis_miss(merge_driv1)
```
\
Deal with missing `driv_standing` and `driv_wins` so we can easily impute the other variables in the recipe later on.
```{r collapse=TRUE}
dim(merge_driv1 %>%
filter(is.na(driv_standing)))[1]
```
```{r eval=FALSE}
merge_driv2 <- merge_driv1 %>%
group_by(driverId, round) %>%
# replace missing driv_standing with the median of the standings the driver gets in that round across seasons
summarise(driv_standing_na = as.integer(median(driv_standing,na.rm=TRUE)))
merge_driv3 <- full_join(merge_driv1, merge_driv2, by=c("driverId","round"))
merge_driv3 <- merge_driv3 %>%
mutate(driv_standing =
case_when(is.na(driv_standing) ~ driv_standing_na,
!is.na(driv_standing) ~ driv_standing))
```
```{r collapse=TRUE}
dim(merge_driv3 %>%
filter(is.na(driv_standing)))[1]
```
By replacing missing `driv_standing` with the median of such in that round number across all seasons, we reduce the missing standing values from 469 to 91. This is a pretty good effort to save some of our data. Then we can omit the 91 missing values.
\
```{r eval=FALSE}
# omit the rest of missing driv_standing
merge_driv4 <- merge_driv3 %>%
filter(!is.na(driv_standing))
```
```{r}
vis_miss(merge_driv4, warn_large_data = FALSE)
```
\
We can work with this. It'll get better when imputing during recipe creation.
\
\
### Visual EDA
Let's look at some of the relationships between our variables.
\
#### Constructor and Driver Points
Let's take a look at our possible response variables `con_pos` and `driv_pnts`.
```{r collapse=TRUE}
grid.arrange(plot_con, plot_driv, ncol=2)
```
\
```{r collapse=TRUE}
range((merge_con3 %>%
filter(!is.na(con_pos))
)$con_pos)
range((merge_con3 %>%
filter(con_pos==tail(unique(sort((merge_con3 %>%
filter(!is.na(con_pos))
)$con_pos)))) %>%
select(con_pos,con_name, season, race_name, circ_name))$season)
range((merge_con3 %>%
filter(con_pos==tail(unique(sort((merge_con3 %>%
filter(!is.na(con_pos))
)$con_pos)))) %>%
select(con_pos,con_name, season, race_name, circ_name))$season)
range((merge_driv4 %>%
filter(!is.na(driv_pnts))
)$driv_pnts)
```
The `con_pos` data is from 1 to 22. The constructors with very high positions are pre-2000 (from 1963 to 1991), because now there are only ten constructors each season. The range of `driv_pnts` for the data and possibility is from 0 to 50 (first place at 2014 Abu Dhabi).
\
_* At the 2014 Abu Dhabi Grand Prix, the FIA rewarded double points. So, instead of the usual 25 to 1 point range, first place to tenth place earned 50, 36, 30, 24, 20, 16, 12, 8, 4, and 2 respectively. Understandably, double points was never done again._
\
\
#### Correlation Plots
Let's look at some possible relationships between our variables.
```{r}
# remove na, identification variables, only numeric
corrplot(cor(merge_con3 %>%
select_if(is.numeric)%>%
na.omit()), method="shade", diag=FALSE,addCoef.col=1,number.cex = 1, type="lower")
```
\
In the `merge_con3` data set, we have a positive correlation between `con_wins` and `con_pnts` of which makes sense, the more points a constructor earns, the more likely it is the constructor's drivers won races. There is also a negative correlation between `con_wins` and `con_pos` of, which is also obvious because the more positions a constructor is away from the top of the standings, the less likely that constructor is to win races. `round` and `con_sum_pnts` are positively correlated at, so the more races a constructor completes and the more of the race season that passes, the more points a constructor accumulates. On the other hand, `round` and `con_pnts` are not as positively correlated at because actually making quick progress and improvement during the race season to earn more and more points consistently as the season goes on is a difficult thing to do as an entire constructor team, even if it is what every team desires. The positive correlation between `season` and `con_pnts`/`con_sum_pnts` is obvious because over time the FIA increased not only the number of points rewarded, but also the number of race finishing positions they reward points to. We also have a negative correlation of between `con_pos` and `con_pnts`/`con_sum_pnts`. This is due to similar reasons as with `con_wins` and `con_pos`.
```{r out.width = '110%'}
# remove na, q_time has too much na, and identification variables, only numeric
corrplot(cor(merge_driv4 %>%
select_if(is.numeric) %>%
na.omit()), method="shade",addCoef.col = 1,number.cex = 0.5, type="lower",diag=FALSE)
```
\
Most of the correlations between the `merge_driv` variables follow similar interpretations as those of the `merge_con3` data set. What's nice to look at is the variables I modified myself. For example, `avg_lap_pos` is very positively correlated at with `driv_start_pos` which tells me that the average position a driver spends during a race is likely to be the position they started at. This is intuitively accurate as some races, like street races, can make overtaking difficult, so a driver's start position can have a huge impact on how their race plays out. This also applies to `driv_standing`. The very positive correlation of between `driv_standing` and `avg_lap_pos` is because the more races a driver spends as a back-marker (i.e. higher race position), the more drivers there are between them and the top of the standings.
\
\
#### Constructor Position in Standings vs. Constructor Points
```{r}
ggplot(merge_con3, aes(x=con_pnts, y=con_pos)) +
geom_jitter(width=.5, size=1)+
geom_smooth(col="red", stat="smooth", level=FALSE)+
labs(title="Constructor Position in Standings vs. Constructor Points", y="Constructor position in standings throughout season", x="Constructor points earned at each race")
```
\
Understandably so, the lower the points a constructor earns at a race, the more constructors are between your position and the top of the standings. The relationship here also does not seem linear. Once a constructor begins earning points, it is a bit easier to climb positions in the standings, as earning more points for yourself means your competition is earning less points. The trend here looks inverse exponential, but maybe I'm getting ahead of myself.
\
\
#### Accumulated Constructor Points vs. Round
Let's take a closer look at the relationship between `con_sum_pnts` and `round`, something I found interesting as it has to do with the progress and improvement of a constructor team throughout a race season.
```{r}
ggplot(merge_con3, aes(x=round, y=con_sum_pnts)) +
geom_jitter(width=.5, size=1)+
geom_smooth(method="lm", col="red")+
labs(title="Accumulated Constructor Points vs. Round", y="Constructor points accumulated throughout season", x="Round (number of race in order of season schedule)")
```
\
It is interesting to see the majority of points to be somewhat plateauing in `con_sum_pnts` as `round` increases. You can also see the behavior of the handful of championship winning constructor teams that manage to have a steady increase in their accumulated points.
\
\
#### Driver Points vs. Average Lap Position
```{r}
ggplot(merge_driv4, aes(x=avg_lap_pos, y=driv_pnts)) +
geom_jitter(width=.5, size=1)+
geom_smooth(method="lm", col="green")+
labs(title="Driver Points vs. Average Lap Position", x="Average lap position", y="Driver's points earned at each race")
```
\
We can see that the back-markers, i.e. the drivers that happened to spend most of their race at I would say positions fifteen and greater, have very little chance with earning points. Most of the drivers with average lap position of fifteen and greater are lined up at zero points earned. Reasons drivers spend most of their time out of the points can vary from lacking race pace, having issues with the car, to poor tyre management, which are all valid reasons that keep them from making moves and overtaking to get into the points.
\
What's also interesting is that that thick pile of zero point earners extends all the way to the higher positions. It is not a shock for drivers who are consistently in top positions, even race leaders, to suddenly encounter a car failure and DNF from the race. It is totally possible and is part of the charm of F1.
\
\
#### Average Lap Position vs. Driver Start Position
```{r}
ggplot(merge_driv4, aes(x=driv_start_pos, y=avg_lap_pos)) +
geom_jitter(width=.5, size=1)+
geom_smooth(method="lm", col="blue")+
labs(title="Average Lap Position vs. Driver Start Position", x="Driver's starting position", y="Average lap position")
```
\
Like the correlation plot suggested, we have a clear positive relationship between average lap position and driver start position. Since, qualifying sessions determine driver start position, this also supports the idea that qualifying and sprint qualifying have incredible influence on the course of a race.
\
\
## Model Set-Up
<font size="3"> __Code in file model_building.rda__ </font>
\
Now we can start fitting models to see if we can predict `con_pos` and `driv_pnts`.
\
### Data Split
We will be splitting our data sets into a training set and testing set. A majority of the data should be in the training set, which is what we build our model off of. Having separate training and testing sets allows us to see a truer test of how well our model performs. We need to test our model on data it has not been trained on to obtain a realistic idea of where our model is at. This also avoids over-fitting. If we were to test our model on data its been trained on, the results will be very accurate, but falsely so because it is not a true prediction at that point. Our testing data should be akin to a blind taste test for our model, but we build the model's connections and understandings of how to determine what it will be tasting.
\
We will split our data so that 70% of it goes to training and 30% of it is testing. We also stratify on the outcome variables `con_pos` and `driv_pnts` to make sure our data is split with an equal distribution of the response variables in our training and testing data.
```{r eval=FALSE}
# split into training and testing, strata on response variables
con_split <- initial_split(merge_con3, prop = 0.7, strata = con_pos)
con_train <- training(con_split)
con_test <- testing(con_split)
driv_split <- initial_split(merge_driv4, prop = 0.7, strata =driv_pnts)
driv_train <- training(driv_split)
driv_test <- testing(driv_split)
# check split proportions
nrow(con_train)/nrow(merge_con3)
nrow(con_test)/nrow(merge_con3)
nrow(driv_train)/nrow(merge_driv4)
nrow(driv_test)/nrow(merge_driv4)
```
\
\
### Recipe
We have to decide on a recipe, or formula, made up of our response variable and predictors, that can be used for many types of models later. From our `merge_con3` data set, we are using `con_pos` as a response variable. I want to include measures of points, home races (nationality), race order, year, and the specific circuit/track have influence on `con_pos`. So, for predictors, I have decided on: `constructorRef`, `con_pnts`, `con_wins`, `con_nation`, `season`, `round`, `circuitRef`, and `circ_country`.
\
For `merge_driv`, our response variable is `driv_pnts`. For predictors, I want to include: `driverRef`, `constructorRef`, `driv_positionText`, `driv_positionOrder`, `laps`, `fastestLapTime_rank`, `fastestLapSpeed`, `driv_start_pos`, `avg_lap_time`, `avg_lap_pos`, `driv_standing`, `driv_wins`, `season`, `round`, `circuitRef`, `circ_country`, `dob`, `nationality`, and `status`.
\
I made sure to include only one of `pnts` or `sum_pnts` variables because they are calculated dependently. Same thing with `fastestLapTime_rank` and `fastestLap` since one variable is a ranking of the other. Since `driv_start_pos` is inherently a result of qualifying sessions, I feel that `q_time` is less useful. Especially since `q_time` is a decimal amount of seconds and is more missing than not, I think excluding `q_time` is preferred.
\
We will impute the missing values in continuous and dummy coded categorical variables.
```{r eval=FALSE}
# omit the small amount of missing values left
merge_con3 <- merge_con3 %>%
na.omit()
# CONSTRUCTOR RECIPE
con_rec <-
recipe(con_pos ~ constructorRef+con_pnts+con_wins+con_nation+season+
round+circuitRef+circ_country, data = con_train) %>%
# omit the small amount of missing values left
step_naomit(all_predictors())%>%
# there are a hundreds of different constructors and circuits, so we need to collapse the less common ones into an other category
step_other(constructorRef, threshold=.1) %>%
step_other(con_nation, threshold=.1) %>%
step_other(circuitRef, threshold=.1) %>%
step_other(circ_country, threshold=.1) %>%
step_naomit()%>%
# dummy code categorical variables
step_dummy(all_nominal_predictors()) %>%
# remove variables (likely the dummy coded ones) that only contain a single value
step_zv(all_predictors())%>%
# step_novel(all_nominal_predictors())%>%
# step_unknown(all_nominal_predictors())%>%
# normalizing
step_center(all_predictors()) %>%
step_scale(all_predictors())
# prep and bake
prep(con_rec) %>%
bake(new_data = con_train)
```
\
\
### K-fold Cross Validation
We will also use k-fold cross validation. K-fold cross-validation is a process of setting aside a portion of the training data to split into "k folds" or k number of sections, which within each k fold takes a turn in acting as the validation set while the rest of the k folds are training data. This is done in a manner so that each k times we train and assess, each training set and validation set is mutually exclusive. So, the process should be fitting, then assessing k times, before one final assessment on the actual testing split portion of data. Let's do 10 folds.
```{r eval=FALSE}
con_folds <- vfold_cv(con_train, v = 10)
con_folds2 <- vfold_cv(con_train, v = 5)
# for random forest for the sake of time
```
\
\
## Model Building
At this point I decided to only work on the constructor data set for the sake of time.
\
Our process for building our models follows the same steps. The process is as follows:
\
1. set up model with tuning parameters, the engine, and mode (either regression or classification)
```{r eval=FALSE}
# LINEAR MODEL
lm_model<- linear_reg(engine="lm")
# POLYNOMIAL REGRESSION
# tune degree on the numerical variables (degree on these variables will amplify their effect)
# ex. a change in points and wins will affect constructor position more than normal
con_polyrec <- con_rec %>%
step_poly(con_pnts,con_wins,season,round,degree=tune())
poly_model <- linear_reg(mode="regression",
engine="lm")
# K NEAREST NEIGHBORS
# tune neighbors
knn_model <- nearest_neighbor(neighbors = tune(),
mode="regression",
engine="kknn")
# ELASTIC NET LINEAR REGRESSION
# tune penalty and mixture
en_model <- linear_reg(mixture=tune(),
penalty=tune(),
mode="regression",
engine="glmnet")
# ELASTIC NET W/ LASSO
# tune penalty, set mixture to 1 for lasso penalty
en_lasso <- linear_reg(penalty=tune(),
mixture=1,
mode="regression",
engine="glmnet")
# ELASTIC NET W/ RIDGE
# tune penalty, set mixture to 0 for ridge penalty
en_ridge <- linear_reg(penalty=tune(),
mixture=0,
mode="regression",
engine="glmnet")
# RANDOM FOREST
# tune number of predictors (mtry), trees, and minimum number of values in each node (min_n)
rf_model <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
```
\
2. set up work flow with model and recipe
```{r eval=FALSE}
# LINEAR MODEL
con_lmwflow <- workflow() %>%
add_model(lm_model) %>%
add_recipe(con_rec)
# POLYNOMIAL REGRESSION
con_polywflow <- workflow() %>%
add_model(poly_model) %>%
add_recipe(con_polyrec)
# K NEAREST NEIGHBORS
con_knnwflow <- workflow() %>%
add_model(knn_model) %>%
add_recipe(con_rec)
# ELASTIC NET LINEAR REGRESSION
con_enwflow <- workflow() %>%
add_model(en_model) %>%
add_recipe(con_rec)
# ELASTIC NET W/ LASSO
con_lassowflow <- workflow() %>%
add_model(en_lasso) %>%
add_recipe(con_rec)
# ELASTIC NET W/ RIDGE
con_ridgewflow <- workflow() %>%
add_model(en_ridge) %>%
add_recipe(con_rec)
# RANDOM FOREST
con_rfwflow <- workflow() %>%
add_model(rf_model) %>%
add_recipe(con_rec)
```
\
3. set up tuning grid with ranges and levels for the tuning of parameters
```{r eval=FALSE}
# POLYNOMIAL REGRESSION
# range degree from 1 to 5
poly_grid <- grid_regular(degree(range=c(1,5)),
levels=5)
# K NEAREST NEIGHBORS
# range neighbors from 1 to 15
knn_grid <- grid_regular(neighbors(range=c(1,15)),
levels=5)
# ELASTIC NET LINEAR REGRESSION
en_grid <- grid_regular(penalty(range=c(0.01,3), trans=identity_trans()),
mixture(range=c(0,1)),
levels=10)
# ELASTIC NET W/ LASSO and
# ELASTIC NET W/ RIDGE
lasso_ridge_grid <- grid_regular(penalty(range=c(0.01,3),
trans=identity_trans()), levels=10)
# RANDOM FOREST
# predictors (mtry) range depend on the recipe
con_rfgrid <- grid_regular(mtry(range=c(1,9)),
trees(range=c(200,400)),
min_n(range=c(10,20)),
levels=5)
```
\
4. tune model using workflow with k-fold cross validation, and tuning grid
```{r eval=FALSE}
# POLYNOMIAL REGRESSION
con_polytune <- tune_grid(
con_polywflow,
resamples = con_folds,
grid = poly_grid)
# K NEAREST NEIGHBORS
con_knntune <- tune_grid(
con_knnwflow,
resamples = con_folds,
grid = knn_grid)
# ELASTIC NET
con_entune <- tune_grid(
con_enwflow,
resamples = con_folds,
grid = en_grid)
# RIDGE REGRESSION
con_ridgetune <- tune_grid(
con_ridgewflow,
resamples = con_folds,
grid = lasso_ridge_grid)
# LASSO REGRESSION
con_lassotune <- tune_grid(
con_lassowflow,
resamples = con_folds,
grid = lasso_ridge_grid)
# RANDOM FOREST
con_rftune <- tune_grid(
con_rfwflow,
resamples = con_folds2,
grid = con_rfgrid)
```
\
5. Collect metrics of tuned models. We want to minimize __root mean squared error (RMSE)__, so find the lowest RMSE for each tuned model.
```{r}
load("C://Users//waliang//Documents//UCSB//third year//pstat 131//model_building_final.rda")
load("C://Users//waliang//Documents//UCSB//third year//pstat 131//model_results_final.rda")
```
```{r eval=FALSE}
lm_rmse <- collect_metrics(con_lmfit) %>%
slice(1)
ridge_rmse <- collect_metrics(con_ridgetune) %>%
arrange(mean) %>%
filter(.metric=="rmse") %>%
slice(1)
# LASSO REGRESSION
lasso_rmse <- collect_metrics(con_lassotune) %>%
arrange(mean) %>%
filter(.metric=="rmse") %>%
slice(1)
# POLYNOMIAL REGRESSION
poly_rmse <- collect_metrics(con_polytune) %>%
arrange(mean) %>%
filter(.metric=="rmse") %>%
slice(1)