Following this very short tutorial (https://decisionstats.com/2012/04/07/cricinfo-statsguru-database-for-statistical-and-graphical-analysis/), I managed to download some cricket data from Cricinfo.

So far, it is limited to the 2006-07 Ashes series. As it turns out, every Test is assigned a particular number in the HTML pages. Luckily, for this Ashes series the numbers are sequential, which makes the extraction easier.

Some problems along the way: the data is quite unstructured. It has empty entries, which were relatively easy to deal with. Thanks to the consistency of the HTML tables, most of the data extraction was rather uniform (i.e. the data frames are roughly the same size, apart from a few exceptions).
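As a rough illustration of that clean-up, here is a minimal sketch on a made-up table. The column names `V1` to `V4` and all values are invented for the example and are not real Cricinfo data; the sketch only assumes that a scraped table arrives with an empty first column and some rows containing NA.

```r
## Toy stand-in for one scraped batting table (all values are invented).
rawTable = data.frame(
  V1 = c("", "", "", ""),
  V2 = c("Batter A", NA, "Batter B", "Extras"),
  V3 = c("c X b Y", NA, "not out", NA),
  V4 = c("12", NA, "45", NA),
  stringsAsFactors = FALSE
)

## Drop the empty first column, then drop every row that contains an NA.
cleaned = na.omit(rawTable[, -1])

print(cleaned)
```

Note that `na.omit` drops a row if any of its cells is NA, so a partially filled row such as the "Extras" line in this toy table disappears as well.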

I will put the R code at the end. The further aim of this project is to be able to extract a larger quantity of cricket data, not limited to a single Ashes series, but hopefully a whole collection of data covering a player’s career or even the entire history of cricket.
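One possible step towards that larger aim is to wrap the download in a function that accepts any vector of match numbers and does not stop when a single page fails. The sketch below is only an assumption about how this could look: `buildMatchUrls` and `fetchMatchTables` are hypothetical helper names, the base path is the one used for this Ashes series (other series may well live under a different path), and `fakeReader` is a stand-in so the example runs without network access. In real use the `reader` argument would be something like `XML::readHTMLTable`.

```r
## Hypothetical helper: build page URLs from match numbers.
buildMatchUrls = function(matchIds,
                          base = "http://www.espncricinfo.com/ausveng/engine/match/") {
  paste0(base, matchIds, ".html")
}

## Hypothetical helper: apply a reader to each URL. A failing page gives NULL
## instead of aborting the whole run.
fetchMatchTables = function(urls, reader) {
  tables = lapply(urls, function(u) tryCatch(reader(u), error = function(e) NULL))
  names(tables) = urls
  tables
}

## Stand-in reader for demonstration only: it fails for one URL and
## returns a dummy list of tables for the others.
fakeReader = function(u) {
  if (grepl("249224", u)) stop("page not available")
  list(data.frame(x = 1))
}

urls = buildMatchUrls(249222:249226)
tables = fetchMatchTables(urls, fakeReader)

## TRUE marks the pages that could not be read.
print(sapply(tables, is.null))
```

Returning NULL for a failed page keeps the result list aligned with the input URLs, so the missing matches can be retried later.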

The input: URLs of pages with Test innings. For example: www.espncricinfo.com/ausveng/engine/match/249222.html

The output: a set of tables with batting and bowling statistics for all players in that Ashes series, separated by Test, Innings and teams.
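To make the shape of that output concrete, the following sketch lists the table names that result from combining five Tests with up to four innings each, batting and bowling. It follows the naming scheme used in the code at the end of the post; `inningsNames` and `outputNames` are names introduced here for illustration, and a Test with fewer than four innings will simply have fewer tables.

```r
testNames = paste0(c("1st", "2nd", "3rd", "4th", "5th"), "Test")
inningsNames = paste0(c("1st", "2nd", "3rd", "4th"), "Innings")

## Each innings contributes one batting table and one bowling table.
tableNames = paste0(rep(inningsNames, each = 2), c("Batting", "Bowling"))

## Every combination of Test and table name, e.g. "1stTest1stInningsBatting".
outputNames = as.vector(outer(testNames, tableNames, paste0))

print(length(outputNames))
print(head(outputNames))
```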

==========================================================================

# Ashes 2006-2007

```{r}
library(XML)

getwd()

## The five Tests of this series have sequential match numbers.
urls = paste0("http://www.espncricinfo.com/ausveng/engine/match/", 249222:249226, ".html")

testMatchTables = lapply(as.list(urls), readHTMLTable, stringsAsFactors = FALSE, as.data.frame = TRUE)
names(testMatchTables) = paste0(c("1st", "2nd", "3rd", "4th", "5th"), "Test")

## Each Test match has a set of tables. The number of tables is listed.
sapply(testMatchTables, length)

tableType = rep(c("Batting", "Bowling"), 4)
tableNames = paste0(rep(paste0(c("1st", "2nd", "3rd", "4th"), "Innings"), rep(2, 4)), c("Batting", "Bowling"))

testMatchTables = lapply(testMatchTables, function(thisTestMatchTables){
  ## Due to the structure of the HTML, the extraction always gives two null tables at the end.
  ## Each innings is constructed from a batting table and a bowling table.
  numInnings = (length(thisTestMatchTables) - 2)/2
  ## Keep only the innings tables and drop the two trailing null tables.
  result = thisTestMatchTables[1:(2*numInnings)]
  ## The order of the tables goes: 1stInnBatting, 1stInnBowling, 2ndInnBatting, etc.
  names(result) = tableNames[1:(2*numInnings)]
  ## Remove all rows that contain NA values.
  result = lapply(result, na.omit)
  return(result)
})

sapply(testMatchTables, length)

testMatchTables[[1]][tableType == "Batting"]
testMatchTables[[1]][tableType == "Bowling"]

battingVarNames = c("Batsman", "Dismissal", "Runs", "Minutes", "Balls", "4s", "6s", "SR")
length(battingVarNames)

bowlingVarNames = c("Bowler", "Over", "Maidens", "Runs", "Wickets", "Econ", "0s", "4s", "6s", "Extras")
length(bowlingVarNames)

##########################################################################################

testMatchByBatting = lapply(testMatchTables, function(thisTestMatch){
  ## All the batting innings in the Test.
  result = thisTestMatch[tableType == "Batting"]
  result = lapply(result, function(thisBattingInnings){
    if(is.null(thisBattingInnings)){
      ## If a batting innings is missing (e.g. when a team won by an innings),
      ## return NULL, since there is no data to process.
      return(NULL)
    } else {
      ## The first column of the tables is always empty.
      formattedTable = thisBattingInnings[, -1]
      ## Names of the batting statistics. Some missing data means the tables
      ## might not all have the same dimensions.
      colnames(formattedTable) = battingVarNames[1:ncol(formattedTable)]
    }
    return(formattedTable) ## Return the fully processed table.
  })
  return(result) ## Return all processed batting tables.
})

##########################################################################################

testMatchByBowling = lapply(testMatchTables, function(thisTestMatch){
  ## All the bowling innings in the Test.
  result = thisTestMatch[tableType == "Bowling"]
  result = lapply(result, function(thisBowlingInnings){
    if(is.null(thisBowlingInnings)){
      ## If a bowling innings is missing (e.g. when a team won by an innings),
      ## return NULL, since there is no data to process.
      return(NULL)
    } else {
      ## The first column of the tables is always empty.
      formattedTable = thisBowlingInnings[, -1]
      ## Names of the bowling statistics. Some missing data means the tables
      ## might not all have the same dimensions.
      colnames(formattedTable) = bowlingVarNames[1:ncol(formattedTable)]
    }
    return(formattedTable) ## Return the fully processed table.
  })
  return(result) ## Return all processed bowling tables.
})

##########################################################################################

## Create the output directory if it does not exist yet, so write.csv does not fail.
dir.create("./cricket_2007Ashes", showWarnings = FALSE)

for(i in seq_along(testMatchByBatting)){
  thisTest = testMatchByBatting[[i]]
  for(j in seq_along(thisTest)){
    write.csv(x = thisTest[[j]],
              file = paste0("./cricket_2007Ashes/", names(testMatchByBatting)[i], names(thisTest)[j], ".csv"),
              row.names = FALSE)
  } ## End for-j-loop
} ## End for-i-loop
```