Using R to study tennis players
Looking at winning records of ATP top players and Brazilian tennists
In the previous post about tennis, we studied how changes in ball’s composition in hard and grass courts affected the game back in 2000. In this post, we will analyse a different dataset from the same repository and look at the players winning records in ATP matches.
The data
I’m again using the great repository of tennis data of Jeff Sackmann. In this case, however, I’m using the ATP repository that contains ATP match data since 1968 until today. Again, I thank Jeff Sackmann for making this dataset publicly available.
First, let’s download and unzip the dataset. This is a large file with 35MB and it takes some time.
zip.file <- 'TennisData_ATP.zip'
# download file
if (!file.exists(zip.file)){
download.file('https://github.com/JeffSackmann/tennis_atp/archive/master.zip', destfile = zip.file)
}
# unzip it
dir.out <- 'tennis_atp-master/'
if (!dir.exists(dir.out)) unzip(zip.file)
Lets have a look at the contents of the zip file.
my.f <- list.files(dir.out, pattern = '*.csv', full.names = T)
print(my.f)
## [1] "tennis_atp-master//atp_matches_1968.csv"
## [2] "tennis_atp-master//atp_matches_1969.csv"
## [3] "tennis_atp-master//atp_matches_1970.csv"
## [4] "tennis_atp-master//atp_matches_1971.csv"
## [5] "tennis_atp-master//atp_matches_1972.csv"
## [6] "tennis_atp-master//atp_matches_1973.csv"
## [7] "tennis_atp-master//atp_matches_1974.csv"
## [8] "tennis_atp-master//atp_matches_1975.csv"
## [9] "tennis_atp-master//atp_matches_1976.csv"
## [10] "tennis_atp-master//atp_matches_1977.csv"
## [11] "tennis_atp-master//atp_matches_1978.csv"
## [12] "tennis_atp-master//atp_matches_1979.csv"
## [13] "tennis_atp-master//atp_matches_1980.csv"
## [14] "tennis_atp-master//atp_matches_1981.csv"
## [15] "tennis_atp-master//atp_matches_1982.csv"
## [16] "tennis_atp-master//atp_matches_1983.csv"
## [17] "tennis_atp-master//atp_matches_1984.csv"
## [18] "tennis_atp-master//atp_matches_1985.csv"
## [19] "tennis_atp-master//atp_matches_1986.csv"
## [20] "tennis_atp-master//atp_matches_1987.csv"
## [21] "tennis_atp-master//atp_matches_1988.csv"
## [22] "tennis_atp-master//atp_matches_1989.csv"
## [23] "tennis_atp-master//atp_matches_1990.csv"
## [24] "tennis_atp-master//atp_matches_1991.csv"
## [25] "tennis_atp-master//atp_matches_1992.csv"
## [26] "tennis_atp-master//atp_matches_1993.csv"
## [27] "tennis_atp-master//atp_matches_1994.csv"
## [28] "tennis_atp-master//atp_matches_1995.csv"
## [29] "tennis_atp-master//atp_matches_1996.csv"
## [30] "tennis_atp-master//atp_matches_1997.csv"
## [31] "tennis_atp-master//atp_matches_1998.csv"
## [32] "tennis_atp-master//atp_matches_1999.csv"
## [33] "tennis_atp-master//atp_matches_2000.csv"
## [34] "tennis_atp-master//atp_matches_2001.csv"
## [35] "tennis_atp-master//atp_matches_2002.csv"
## [36] "tennis_atp-master//atp_matches_2003.csv"
## [37] "tennis_atp-master//atp_matches_2004.csv"
## [38] "tennis_atp-master//atp_matches_2005.csv"
## [39] "tennis_atp-master//atp_matches_2006.csv"
## [40] "tennis_atp-master//atp_matches_2007.csv"
## [41] "tennis_atp-master//atp_matches_2008.csv"
## [42] "tennis_atp-master//atp_matches_2009.csv"
## [43] "tennis_atp-master//atp_matches_2010.csv"
## [44] "tennis_atp-master//atp_matches_2011.csv"
## [45] "tennis_atp-master//atp_matches_2012.csv"
## [46] "tennis_atp-master//atp_matches_2013.csv"
## [47] "tennis_atp-master//atp_matches_2014.csv"
## [48] "tennis_atp-master//atp_matches_2015.csv"
## [49] "tennis_atp-master//atp_matches_2016.csv"
## [50] "tennis_atp-master//atp_matches_2017.csv"
## [51] "tennis_atp-master//atp_matches_2018.csv"
## [52] "tennis_atp-master//atp_matches_2019.csv"
## [53] "tennis_atp-master//atp_matches_futures_1991.csv"
## [54] "tennis_atp-master//atp_matches_futures_1992.csv"
## [55] "tennis_atp-master//atp_matches_futures_1993.csv"
## [56] "tennis_atp-master//atp_matches_futures_1994.csv"
## [57] "tennis_atp-master//atp_matches_futures_1995.csv"
## [58] "tennis_atp-master//atp_matches_futures_1996.csv"
## [59] "tennis_atp-master//atp_matches_futures_1997.csv"
## [60] "tennis_atp-master//atp_matches_futures_1998.csv"
## [61] "tennis_atp-master//atp_matches_futures_1999.csv"
## [62] "tennis_atp-master//atp_matches_futures_2000.csv"
## [63] "tennis_atp-master//atp_matches_futures_2001.csv"
## [64] "tennis_atp-master//atp_matches_futures_2002.csv"
## [65] "tennis_atp-master//atp_matches_futures_2003.csv"
## [66] "tennis_atp-master//atp_matches_futures_2004.csv"
## [67] "tennis_atp-master//atp_matches_futures_2005.csv"
## [68] "tennis_atp-master//atp_matches_futures_2006.csv"
## [69] "tennis_atp-master//atp_matches_futures_2007.csv"
## [70] "tennis_atp-master//atp_matches_futures_2008.csv"
## [71] "tennis_atp-master//atp_matches_futures_2009.csv"
## [72] "tennis_atp-master//atp_matches_futures_2010.csv"
## [73] "tennis_atp-master//atp_matches_futures_2011.csv"
## [74] "tennis_atp-master//atp_matches_futures_2012.csv"
## [75] "tennis_atp-master//atp_matches_futures_2013.csv"
## [76] "tennis_atp-master//atp_matches_futures_2014.csv"
## [77] "tennis_atp-master//atp_matches_futures_2015.csv"
## [78] "tennis_atp-master//atp_matches_futures_2016.csv"
## [79] "tennis_atp-master//atp_matches_futures_2017.csv"
## [80] "tennis_atp-master//atp_matches_futures_2018.csv"
## [81] "tennis_atp-master//atp_matches_futures_2019.csv"
## [82] "tennis_atp-master//atp_matches_qual_chall_1978.csv"
## [83] "tennis_atp-master//atp_matches_qual_chall_1979.csv"
## [84] "tennis_atp-master//atp_matches_qual_chall_1980.csv"
## [85] "tennis_atp-master//atp_matches_qual_chall_1981.csv"
## [86] "tennis_atp-master//atp_matches_qual_chall_1982.csv"
## [87] "tennis_atp-master//atp_matches_qual_chall_1983.csv"
## [88] "tennis_atp-master//atp_matches_qual_chall_1984.csv"
## [89] "tennis_atp-master//atp_matches_qual_chall_1985.csv"
## [90] "tennis_atp-master//atp_matches_qual_chall_1986.csv"
## [91] "tennis_atp-master//atp_matches_qual_chall_1987.csv"
## [92] "tennis_atp-master//atp_matches_qual_chall_1988.csv"
## [93] "tennis_atp-master//atp_matches_qual_chall_1989.csv"
## [94] "tennis_atp-master//atp_matches_qual_chall_1990.csv"
## [95] "tennis_atp-master//atp_matches_qual_chall_1991.csv"
## [96] "tennis_atp-master//atp_matches_qual_chall_1992.csv"
## [97] "tennis_atp-master//atp_matches_qual_chall_1993.csv"
## [98] "tennis_atp-master//atp_matches_qual_chall_1994.csv"
## [99] "tennis_atp-master//atp_matches_qual_chall_1995.csv"
## [100] "tennis_atp-master//atp_matches_qual_chall_1996.csv"
## [101] "tennis_atp-master//atp_matches_qual_chall_1997.csv"
## [102] "tennis_atp-master//atp_matches_qual_chall_1998.csv"
## [103] "tennis_atp-master//atp_matches_qual_chall_1999.csv"
## [104] "tennis_atp-master//atp_matches_qual_chall_2000.csv"
## [105] "tennis_atp-master//atp_matches_qual_chall_2001.csv"
## [106] "tennis_atp-master//atp_matches_qual_chall_2002.csv"
## [107] "tennis_atp-master//atp_matches_qual_chall_2003.csv"
## [108] "tennis_atp-master//atp_matches_qual_chall_2004.csv"
## [109] "tennis_atp-master//atp_matches_qual_chall_2005.csv"
## [110] "tennis_atp-master//atp_matches_qual_chall_2006.csv"
## [111] "tennis_atp-master//atp_matches_qual_chall_2007.csv"
## [112] "tennis_atp-master//atp_matches_qual_chall_2008.csv"
## [113] "tennis_atp-master//atp_matches_qual_chall_2009.csv"
## [114] "tennis_atp-master//atp_matches_qual_chall_2010.csv"
## [115] "tennis_atp-master//atp_matches_qual_chall_2011.csv"
## [116] "tennis_atp-master//atp_matches_qual_chall_2012.csv"
## [117] "tennis_atp-master//atp_matches_qual_chall_2013.csv"
## [118] "tennis_atp-master//atp_matches_qual_chall_2014.csv"
## [119] "tennis_atp-master//atp_matches_qual_chall_2015.csv"
## [120] "tennis_atp-master//atp_matches_qual_chall_2016.csv"
## [121] "tennis_atp-master//atp_matches_qual_chall_2017.csv"
## [122] "tennis_atp-master//atp_matches_qual_chall_2018.csv"
## [123] "tennis_atp-master//atp_matches_qual_chall_2019.csv"
## [124] "tennis_atp-master//atp_players.csv"
## [125] "tennis_atp-master//atp_rankings_00s.csv"
## [126] "tennis_atp-master//atp_rankings_10s.csv"
## [127] "tennis_atp-master//atp_rankings_70s.csv"
## [128] "tennis_atp-master//atp_rankings_80s.csv"
## [129] "tennis_atp-master//atp_rankings_90s.csv"
## [130] "tennis_atp-master//atp_rankings_current.csv"
Again, same as with the other post, we see a lot of files. The names are quite suggesting and it is clear that not all files contains matches data. Let’s restrict the analysis just for the files with the string atp_matches
. This includes main matches, qualifications and futures games.
library(stringr)
# restrict just for main atp matches
my.f <- my.f[str_detect(my.f, 'matches')]
my.f
## [1] "tennis_atp-master//atp_matches_1968.csv"
## [2] "tennis_atp-master//atp_matches_1969.csv"
## [3] "tennis_atp-master//atp_matches_1970.csv"
## [4] "tennis_atp-master//atp_matches_1971.csv"
## [5] "tennis_atp-master//atp_matches_1972.csv"
## [6] "tennis_atp-master//atp_matches_1973.csv"
## [7] "tennis_atp-master//atp_matches_1974.csv"
## [8] "tennis_atp-master//atp_matches_1975.csv"
## [9] "tennis_atp-master//atp_matches_1976.csv"
## [10] "tennis_atp-master//atp_matches_1977.csv"
## [11] "tennis_atp-master//atp_matches_1978.csv"
## [12] "tennis_atp-master//atp_matches_1979.csv"
## [13] "tennis_atp-master//atp_matches_1980.csv"
## [14] "tennis_atp-master//atp_matches_1981.csv"
## [15] "tennis_atp-master//atp_matches_1982.csv"
## [16] "tennis_atp-master//atp_matches_1983.csv"
## [17] "tennis_atp-master//atp_matches_1984.csv"
## [18] "tennis_atp-master//atp_matches_1985.csv"
## [19] "tennis_atp-master//atp_matches_1986.csv"
## [20] "tennis_atp-master//atp_matches_1987.csv"
## [21] "tennis_atp-master//atp_matches_1988.csv"
## [22] "tennis_atp-master//atp_matches_1989.csv"
## [23] "tennis_atp-master//atp_matches_1990.csv"
## [24] "tennis_atp-master//atp_matches_1991.csv"
## [25] "tennis_atp-master//atp_matches_1992.csv"
## [26] "tennis_atp-master//atp_matches_1993.csv"
## [27] "tennis_atp-master//atp_matches_1994.csv"
## [28] "tennis_atp-master//atp_matches_1995.csv"
## [29] "tennis_atp-master//atp_matches_1996.csv"
## [30] "tennis_atp-master//atp_matches_1997.csv"
## [31] "tennis_atp-master//atp_matches_1998.csv"
## [32] "tennis_atp-master//atp_matches_1999.csv"
## [33] "tennis_atp-master//atp_matches_2000.csv"
## [34] "tennis_atp-master//atp_matches_2001.csv"
## [35] "tennis_atp-master//atp_matches_2002.csv"
## [36] "tennis_atp-master//atp_matches_2003.csv"
## [37] "tennis_atp-master//atp_matches_2004.csv"
## [38] "tennis_atp-master//atp_matches_2005.csv"
## [39] "tennis_atp-master//atp_matches_2006.csv"
## [40] "tennis_atp-master//atp_matches_2007.csv"
## [41] "tennis_atp-master//atp_matches_2008.csv"
## [42] "tennis_atp-master//atp_matches_2009.csv"
## [43] "tennis_atp-master//atp_matches_2010.csv"
## [44] "tennis_atp-master//atp_matches_2011.csv"
## [45] "tennis_atp-master//atp_matches_2012.csv"
## [46] "tennis_atp-master//atp_matches_2013.csv"
## [47] "tennis_atp-master//atp_matches_2014.csv"
## [48] "tennis_atp-master//atp_matches_2015.csv"
## [49] "tennis_atp-master//atp_matches_2016.csv"
## [50] "tennis_atp-master//atp_matches_2017.csv"
## [51] "tennis_atp-master//atp_matches_2018.csv"
## [52] "tennis_atp-master//atp_matches_2019.csv"
## [53] "tennis_atp-master//atp_matches_futures_1991.csv"
## [54] "tennis_atp-master//atp_matches_futures_1992.csv"
## [55] "tennis_atp-master//atp_matches_futures_1993.csv"
## [56] "tennis_atp-master//atp_matches_futures_1994.csv"
## [57] "tennis_atp-master//atp_matches_futures_1995.csv"
## [58] "tennis_atp-master//atp_matches_futures_1996.csv"
## [59] "tennis_atp-master//atp_matches_futures_1997.csv"
## [60] "tennis_atp-master//atp_matches_futures_1998.csv"
## [61] "tennis_atp-master//atp_matches_futures_1999.csv"
## [62] "tennis_atp-master//atp_matches_futures_2000.csv"
## [63] "tennis_atp-master//atp_matches_futures_2001.csv"
## [64] "tennis_atp-master//atp_matches_futures_2002.csv"
## [65] "tennis_atp-master//atp_matches_futures_2003.csv"
## [66] "tennis_atp-master//atp_matches_futures_2004.csv"
## [67] "tennis_atp-master//atp_matches_futures_2005.csv"
## [68] "tennis_atp-master//atp_matches_futures_2006.csv"
## [69] "tennis_atp-master//atp_matches_futures_2007.csv"
## [70] "tennis_atp-master//atp_matches_futures_2008.csv"
## [71] "tennis_atp-master//atp_matches_futures_2009.csv"
## [72] "tennis_atp-master//atp_matches_futures_2010.csv"
## [73] "tennis_atp-master//atp_matches_futures_2011.csv"
## [74] "tennis_atp-master//atp_matches_futures_2012.csv"
## [75] "tennis_atp-master//atp_matches_futures_2013.csv"
## [76] "tennis_atp-master//atp_matches_futures_2014.csv"
## [77] "tennis_atp-master//atp_matches_futures_2015.csv"
## [78] "tennis_atp-master//atp_matches_futures_2016.csv"
## [79] "tennis_atp-master//atp_matches_futures_2017.csv"
## [80] "tennis_atp-master//atp_matches_futures_2018.csv"
## [81] "tennis_atp-master//atp_matches_futures_2019.csv"
## [82] "tennis_atp-master//atp_matches_qual_chall_1978.csv"
## [83] "tennis_atp-master//atp_matches_qual_chall_1979.csv"
## [84] "tennis_atp-master//atp_matches_qual_chall_1980.csv"
## [85] "tennis_atp-master//atp_matches_qual_chall_1981.csv"
## [86] "tennis_atp-master//atp_matches_qual_chall_1982.csv"
## [87] "tennis_atp-master//atp_matches_qual_chall_1983.csv"
## [88] "tennis_atp-master//atp_matches_qual_chall_1984.csv"
## [89] "tennis_atp-master//atp_matches_qual_chall_1985.csv"
## [90] "tennis_atp-master//atp_matches_qual_chall_1986.csv"
## [91] "tennis_atp-master//atp_matches_qual_chall_1987.csv"
## [92] "tennis_atp-master//atp_matches_qual_chall_1988.csv"
## [93] "tennis_atp-master//atp_matches_qual_chall_1989.csv"
## [94] "tennis_atp-master//atp_matches_qual_chall_1990.csv"
## [95] "tennis_atp-master//atp_matches_qual_chall_1991.csv"
## [96] "tennis_atp-master//atp_matches_qual_chall_1992.csv"
## [97] "tennis_atp-master//atp_matches_qual_chall_1993.csv"
## [98] "tennis_atp-master//atp_matches_qual_chall_1994.csv"
## [99] "tennis_atp-master//atp_matches_qual_chall_1995.csv"
## [100] "tennis_atp-master//atp_matches_qual_chall_1996.csv"
## [101] "tennis_atp-master//atp_matches_qual_chall_1997.csv"
## [102] "tennis_atp-master//atp_matches_qual_chall_1998.csv"
## [103] "tennis_atp-master//atp_matches_qual_chall_1999.csv"
## [104] "tennis_atp-master//atp_matches_qual_chall_2000.csv"
## [105] "tennis_atp-master//atp_matches_qual_chall_2001.csv"
## [106] "tennis_atp-master//atp_matches_qual_chall_2002.csv"
## [107] "tennis_atp-master//atp_matches_qual_chall_2003.csv"
## [108] "tennis_atp-master//atp_matches_qual_chall_2004.csv"
## [109] "tennis_atp-master//atp_matches_qual_chall_2005.csv"
## [110] "tennis_atp-master//atp_matches_qual_chall_2006.csv"
## [111] "tennis_atp-master//atp_matches_qual_chall_2007.csv"
## [112] "tennis_atp-master//atp_matches_qual_chall_2008.csv"
## [113] "tennis_atp-master//atp_matches_qual_chall_2009.csv"
## [114] "tennis_atp-master//atp_matches_qual_chall_2010.csv"
## [115] "tennis_atp-master//atp_matches_qual_chall_2011.csv"
## [116] "tennis_atp-master//atp_matches_qual_chall_2012.csv"
## [117] "tennis_atp-master//atp_matches_qual_chall_2013.csv"
## [118] "tennis_atp-master//atp_matches_qual_chall_2014.csv"
## [119] "tennis_atp-master//atp_matches_qual_chall_2015.csv"
## [120] "tennis_atp-master//atp_matches_qual_chall_2016.csv"
## [121] "tennis_atp-master//atp_matches_qual_chall_2017.csv"
## [122] "tennis_atp-master//atp_matches_qual_chall_2018.csv"
## [123] "tennis_atp-master//atp_matches_qual_chall_2019.csv"
As you can see, I used string matching function str_detect
from stringr
to find and keep just the csv files with the string matches in its name. Now, let’s load all files into a single dataframe.
library(readr)
library(dplyr)
# set cols (missing some)
my.cols <- cols(
.default = col_integer(),
tourney_id = col_character(),
tourney_name = col_character(),
surface = col_character(),
tourney_level = col_character(),
winner_entry = col_character(),
winner_name = col_character(),
winner_hand = col_character(),
winner_ioc = col_character(),
winner_age = col_double(),
loser_entry = col_character(),
loser_name = col_character(),
loser_hand = col_character(),
loser_ioc = col_character(),
loser_age = col_double(),
score = col_character(),
round = col_character()
)
# load all files with lapply and do.call (some cols don't match in all files)
df.matches <- do.call(bind_rows,lapply(my.f, read_csv, col_types = my.cols))
# create year column
df.matches$Date <- as.Date(as.character(df.matches$tourney_date),'%Y%m%d')
df.matches$Year <- format(df.matches$Date,'%Y')
Let’s see what the data offers.
str(df.matches)
## tibble [752,552 × 51] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ tourney_id : chr [1:752552] "1968-580" "1968-580" "1968-580" "1968-580" ...
## $ tourney_name : chr [1:752552] "Australian Chps." "Australian Chps." "Australian Chps." "Australian Chps." ...
## $ surface : chr [1:752552] "Grass" "Grass" "Grass" "Grass" ...
## $ draw_size : int [1:752552] 64 64 64 64 64 64 64 64 64 64 ...
## $ tourney_level : chr [1:752552] "G" "G" "G" "G" ...
## $ tourney_date : int [1:752552] 19680119 19680119 19680119 19680119 19680119 19680119 19680119 19680119 19680119 19680119 ...
## $ match_num : int [1:752552] 1 2 3 4 5 6 7 8 9 10 ...
## $ winner_id : int [1:752552] 110023 109803 100257 100105 109966 107759 100101 100025 108519 109799 ...
## $ winner_seed : int [1:752552] NA NA NA 5 NA NA 12 3 NA NA ...
## $ winner_entry : chr [1:752552] NA NA NA NA ...
## $ winner_name : chr [1:752552] "Richard Coulthard" "John Brown" "Ross Case" "Allan Stone" ...
## $ winner_hand : chr [1:752552] "R" "R" "R" "R" ...
## $ winner_ht : int [1:752552] NA NA NA NA NA NA NA 173 NA NA ...
## $ winner_ioc : chr [1:752552] "AUS" "AUS" "AUS" "AUS" ...
## $ winner_age : num [1:752552] NA 27.5 16.2 22.3 29.9 ...
## $ winner_rank : int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ winner_rank_points: int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ loser_id : int [1:752552] 107760 106964 110024 110025 110026 110027 110028 108430 110029 110030 ...
## $ loser_seed : int [1:752552] NA NA 15 NA NA NA NA NA NA NA ...
## $ loser_entry : chr [1:752552] NA NA NA NA ...
## $ loser_name : chr [1:752552] "Max Senior" "Ernie Mccabe" "Gondo Widjojo" "Robert Layton" ...
## $ loser_hand : chr [1:752552] "R" "R" "R" "R" ...
## $ loser_ht : int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ loser_ioc : chr [1:752552] "AUS" "AUS" "INA" "AUS" ...
## $ loser_age : num [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ loser_rank : int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ loser_rank_points : int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ score : chr [1:752552] "12-10 7-5 4-6 7-5" "6-3 6-2 6-4" "6-4 3-6 6-3 7-5" "6-4 6-2 6-1" ...
## $ best_of : int [1:752552] 5 5 5 5 5 5 5 5 5 5 ...
## $ round : chr [1:752552] "R64" "R64" "R64" "R64" ...
## $ minutes : int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ w_ace : int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ w_df : int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ w_svpt : int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ w_1stIn : int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ w_1stWon : int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ w_2ndWon : int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ w_SvGms : int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ w_bpSaved : int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ w_bpFaced : int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ l_ace : int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ l_df : int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ l_svpt : int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ l_1stIn : int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ l_1stWon : int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ l_2ndWon : int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ l_SvGms : int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ l_bpSaved : int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ l_bpFaced : int [1:752552] NA NA NA NA NA NA NA NA NA NA ...
## $ Date : Date[1:752552], format: "1968-01-19" "1968-01-19" ...
## $ Year : chr [1:752552] "1968" "1968" "1968" "1968" ...
Again, lots of information. Looks like the games are separated by rows, where the columns have a lot of information about the matches. This dataset is not as detailed as the last one with point by point data, but still quite impressive.
Looking at top players winning percentages
For our first analysis, lets look at the winning percentages of the top 10 players from the ATP rankings. For that, I will use package rvest
to scrape the information about the top ATP players from the ATP website.
library(rvest)
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
##
## guess_encoding
library(stringr)
n.players <- 10
my.url <- 'http://www.atpworldtour.com/en/rankings/singles'
atp.players <- read_html(my.url) %>%
html_nodes(".player-cell") %>%
html_text() %>%
str_replace_all('\\n','') %>%
str_trim()
name.player <- atp.players[1:n.players]
print(name.player)
## [1] "Novak Djokovic" "Rafael Nadal" "Dominic Thiem"
## [4] "Roger Federer" "Daniil Medvedev" "Stefanos Tsitsipas"
## [7] "Alexander Zverev" "Matteo Berrettini" "Gael Monfils"
## [10] "David Goffin"
As you can see, we have to usual suspects. Novak Djokovic is now leading in the first position, which was held previously by Djokovic for a long period of time. I really like to see Dominic Thiem in the top ten as he is the youngest with 23 year, a solid game, and lots of potential. Federer is the oldest of the list, with 35 years old, but the most talented. It is amazing how he can still stay competitive given the age difference.
Back to the R analysis, a problem with this approach of scraping the names from the ATP website is that they don’t necessarily match the names in the ATP files. As an example, the swiss player Stan Wawrinka is named Stanislas Wawrinka in the csv files, while in the ATP website is Stan Wawrinka. This means that directly trying to match the names in name.player
with the names in df.matches
may not work. This is classic case of data from difference sources that don’t share the same unique identifiers.
I usually solve this problem using a lookup table built manually in a csv file, where I store the matching identifiers in the different datasets. If the number of unique cases is small, it is quite doable. For this analysis, however, we have 13491 players, which is definitely too much. Manually creating a lookup table would be very demanding.
The solution I often in this case is to make the computer look for the best imperfect match among the different identifiers (names). This is also known as calculating string distances - the number of require edits so that a string matches other, and can be accomplished in many different ways. In this example I’ll use function amatch
from package stringdist
, with the basic options.
library(stringdist)
# get unique names from losers and winners
unique.players <- unique(c(df.matches$winner_name,
df.matches$loser_name ))
# find the index of the closest string for each name in name.player
idx <- amatch(name.player, unique.players, maxDist = 5)
Lets look at the result:
print(data.frame(name.from.web = name.player,
name.from.atp = unique.players[idx] ))
## name.from.web name.from.atp
## 1 Novak Djokovic Novak Djokovic
## 2 Rafael Nadal Rafael Nadal
## 3 Dominic Thiem Dominic Thiem
## 4 Roger Federer Roger Federer
## 5 Daniil Medvedev Daniil Medvedev
## 6 Stefanos Tsitsipas Stefanos Tsitsipas
## 7 Alexander Zverev Alexander Zverev
## 8 Matteo Berrettini Matteo Berrettini
## 9 Gael Monfils Gael Monfils
## 10 David Goffin David Goffin
Beautiful! For our case of just 10 players, it worked perfectly. The name of Stan Wawrinka_ was matched correctly. Be aware, however, that this procedure of string matching is not guaranteed to work 100% of the cases. You could also set NA values using the options in amatch
but it is still not guaranteed to work well. You should always manually check the result.
Going forward, let’s copy the players name from one to the other and use a loop and dplyr
to calculate yearly winning percentages of each player, also separating by the type of surface. Using a loop here might look odd, but this was the simplest solution I’ve found as you cannot use dplyr
directly. The dataframe df.matches
is unique in the sense that it has two informations in each row, you have a win for someone and a loss to another player (see columns winner_name
and loser_name
). If I use dplyr
directly, I would miss the games where two players in the list played each other. There is probably a way to adjust the dataframe, but I just find it easier to loop over the players.
# copy players name
name.player <- unique.players[idx]
# create table with players
my.tab <- tibble()
for (i.player in name.player) {
temp <- filter(df.matches, (winner_name==i.player)|(loser_name==i.player))
temp$Name <- i.player
temp.tab <- temp %>%
group_by(Name, Year,surface) %>%
summarise(`Percentage of wins` = sum(winner_name == Name)/n(),
`Number of matches` = n()) %>%
filter(`Number of matches` > 0 ) %>%
filter(surface %in% c('Hard','Clay'))
my.tab <- bind_rows(my.tab,
temp.tab)
}
# use atp ranking in names e.g. Andy Murray (1)
my.tab$Name <- paste0(my.tab$Name,
' (',
match(my.tab$Name, name.player),
')' )
my.tab <- my.tab[complete.cases(my.tab), ]
Lets plot it!
library(ggplot2)
p <- ggplot(my.tab, aes(x = as.numeric(Year), y = `Percentage of wins`, color=surface))
p <- p + geom_point() + geom_smooth()
p <- p + facet_grid(surface~Name)
p <- p + ylim(c(0.25,1)) + xlim(c(2005,2016))
print(p)
We can see a few patterns. First, the winning consistency of top ten player is higher in hard courts. For example, look at Murray and Djokovic. It is easier for them to keep the same proportion of wins in hard courts than it is in clay. My best guess is that hard courts are more predictable and the ball is faster and bounces lower, making it easier for top players to hit winners and other point-finishing strokes such as drop shots. Be aware that this could also be due to the different number of matches in each court. We could test the difference of volatility statistically but I rather keep the analysis simple and visual.
Also, not surprisingly, Nadal, the king of clay, presents the highest winning percentage in clay, before 2012. His game strategy uses strength and heavy top spin balls that are difficult to counter. See here for a more detailed analysis on why he is so dominant on clay. The bad side of his strategy and type of play is that it is based heavily on his muscular strength. As his age increases, it is difficult to maintain the same level. This explains the increased proportion of defeats after 2012.
The effect of age is also perceived for Roger Federer. But, given that his style of play is not based just on strength, I think that he can still beat a lot of players. Just look at the his recent title win at the 2017 Australian open.
Now, look how Dojokovic is consistent in clay and hard courts. He is definitely the player with the best stats. I predict that, if he works for it, he will have no problem in taking the #1 spot once again from Murray.
Looking at Brazilian tennis players
In this section we will analyse the stats of Brazilian tennis players in ATP rankings. The following code can be easily executed for any country as the data is scraped from the ATP website. Just change the object my.country
for your case.
my.country <- 'BRA'
n.players <- 5
my.url <- paste0('http://www.atpworldtour.com/en/rankings/singles?rankDate=2017-02-06&rankRange=0-500&countryCode=',my.country)
atp.players <- read_html(my.url) %>%
html_nodes(".player-cell") %>%
html_text() %>%
str_replace_all('\\n','') %>%
str_trim()
name.player <- atp.players[1:n.players]
name.player
## [1] "Thiago Monteiro" "Rogerio Dutra Silva" "Thomaz Bellucci"
## [4] "Joao Souza" "Andre Ghem"
The top five Brazilian players in ATP are Thiago Monteiro, Rogerio Dutra Silva, Thomas Belluci, João Souza and Andre Ghem, in that order. We should be proud of all of them. I can only imagine the dedication and effort necessary to be a professional tennis player. You can see more details about each here.
At this point, it makes sense to create a function to analyse the dataframe, as we will use it again later.
plot.winning.perc <- function(name.player,
df.matches,
min.games.per.year = 0,
min.year = 2005,
max.year = 2016){
require(stringdist)
require(stringr)
require(dplyr)
require(ggplot2)
# create year column
df.matches$Date <- as.Date(as.character(df.matches$tourney_date),'%Y%m%d')
df.matches$Year <- format(df.matches$Date,'%Y')
# get unique names from losers and winners
unique.players <- unique(c(df.matches$winner_name,
df.matches$loser_name ))
# find the index of the closest string for each name in name.player
idx <- amatch(name.player, unique.players, maxDist = 5)
#print(data.frame(name.from.web = name.player,
# name.from.atp = unique.players[idx] ))
my.tab <- tibble()
for (i.player in name.player) {
temp <- filter(df.matches, (winner_name==i.player)|(loser_name==i.player))
temp$Name <- i.player
temp.tab <- temp %>%
group_by(Name, Year,surface) %>%
summarise(`Percentage of wins` = sum(winner_name == Name)/n(),
`Number of matches` = n()) %>%
filter(`Number of matches` > min.games.per.year ) %>%
filter(surface %in% c('Hard','Clay'))
my.tab <- bind_rows(my.tab,
temp.tab)
}
p <- ggplot(my.tab, aes(x = as.numeric(Year), y = `Percentage of wins`, color=surface))
p <- p + geom_point() + geom_smooth()
p <- p + facet_grid(surface~Name)
p <- p + ylim(c(0.25,1)) + xlim(c(min.year,max.year))
return(p)
}
Now, we use the previous function for the new names of players.
# plot winnign percentages
p <- plot.winning.perc(name.player, df.matches)
print(p)
From this list, the youngest Brazilian player, and with most potential in my opinion, is Thiago Monteiro. He is 22 and has lots of time to improve his game. In 2016 he beated Tsonga in a great match here in Brazil.
The records of Guga (Gustavo Kurten)
For last, lets look at stats of the the best and most cherished singles player from Brazil, Gustavo Kuerten - Guga. Guga is internationally famous and reached #1 in the ATP ranking in 2000. A very charismatic player, with a fantastic backhand. It is not far fetched to say that Guga brought Tennis to Brazil, making it a more popular sport.
Guga had a hip injury back in 2001. He tried surgery, but it didn’t worked out and he wasn’t able to maintain the higher level of tennis that is necessary in the professional circuit. He retired in february of 2008. His story is very inspiring, growing up in the state of Santa Catarina, in a time where tennis was not a popular sport in Brazil and funding opportunities were scarce. His biography (in portuguese) is an excellent read.
Let’s have a look in the data.
name.player <- 'Gustavo Kuerten'
p <- plot.winning.perc(name.player, df.matches, min.year = 1990)
print(p)
As expected, a winning journey until 2001, when losses start to accumulate.
Conlusion
The data repository of Jeff Sackmann is great. I really only scratched the surface with some simple graphical analysis. I think there is a lot of space for modeling as well, that is, creating a model that can predict the winning records of a player. I’m thinking about an econometric model using historical statistics of players such as past winning records, forehand/backhand winners, and so on. But I’m sure you can use other approaches as well.
From the computational side, it would be interesting to create a CRAN package to analyse this dataset. Just like in this post, all data could be fetched from the ATP website and the repository of Jeff. One could check the winning records of players with just a simple function call. I just checked and there is nothing like it in CRAN. A shiny app would also be interesting. I’ll keep it in mind for the future.
I hope you liked this post. Any comments or suggestions, please share in the comments section.