# Taming PITCHf/x Data with XML2R and pitchRx
PITCHf/x refers to a massive, publicly available baseball dataset hosted on the web in XML and JSON format. Because the data is large, grows daily, and is licensed only for individual use, the __pitchRx__ package does not distribute the data directly. Instead, it provides a simple interface to download, parse, clean, and transform the data from its source. To avoid memory limitations when acquiring large amounts of data, users may divert incoming data in chunks to a database using any valid R database connection [@DBI]. The package also provides a convenient function to update an existing database with the most recently available data without re-downloading anything.
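The database workflow described above might look roughly like the following sketch. It assumes the __pitchRx__, __DBI__, and __RSQLite__ packages are installed and that the data source is reachable; the wrapper names `scrape_to_sqlite` and `update_sqlite`, the file name, and the dates are hypothetical, while `scrape()` and `update_db()` are the __pitchRx__ functions for downloading and updating.

```r
# Sketch: scrape a date range into an SQLite file, then top it up later.
# The two wrapper functions below are hypothetical helpers for illustration.

scrape_to_sqlite <- function(db_path, start, end) {
  con <- DBI::dbConnect(RSQLite::SQLite(), db_path)
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  # With `connect` supplied, scraped data is written to database tables
  # rather than returned as one large in-memory object.
  pitchRx::scrape(start = start, end = end, connect = con)
  DBI::dbListTables(con)
}

update_sqlite <- function(db_path) {
  con <- DBI::dbConnect(RSQLite::SQLite(), db_path)
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  # Appends data that became available after the last scrape.
  pitchRx::update_db(con)
  DBI::dbListTables(con)
}

# Example calls (these need network access, so they are left commented out):
# scrape_to_sqlite("pitchRx.sqlite3", start = "2013-06-01", end = "2013-06-02")
# update_sqlite("pitchRx.sqlite3")
```

Opening and closing the connection inside each helper keeps the example self-contained; a longer analysis session would more likely keep one connection open and pass it to both calls.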
The __openWAR__ package also provides high-level access to PITCHf/x data, but it is currently more limited in the data it can acquire [@openWAR]. It also depends on the difficult-to-install __Sxslt__ package, which impedes portability [@Sxslt]. __openWAR__ uses __Sxslt__ to transform XML files into R data frames via XSL Transformations (XSLT). Without advanced knowledge of XSLT, one must define these transformations by hard-coding assumptions about the XML format, such as the names of the fields of interest. New variables have been added to PITCHf/x several times, and __pitchRx__ picks them up automatically thanks to functionality provided by __XML2R__.
__XML2R__ makes it easy to wrangle relational data stored as a collection of XML files into a list of data frames. Its interface follows principles from pure functional programming: the output of each function is completely determined by its input. The interface is also predictable: XML content is represented as a list of observations, where an observation is a matrix with one row, and each function takes such a list as input and returns one as output. Because every function operates on native R data structures, R programmers should find this representation more intuitive to work with than the non-native XMLDocumentContent. The representation is slightly less computationally efficient in some cases, but it has made it much easier to implement and maintain higher-level interfaces to specific XML data sources, such as __pitchRx__ and __bbscrapeR__ [@bbscrapeR].
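To make the "list of observations" idea concrete, here is a minimal base R sketch of the representation. It is an illustration only, not the __XML2R__ implementation: the helper `collapse_by_name`, the path-like names, and the field values are all made up for the example.

```r
# Each observation is a one-row character matrix. The list name records
# the XML path the observation came from (names and values are invented).
obs <- list(
  "game//inning//atbat" = matrix(
    c("1", "Strikeout"), nrow = 1,
    dimnames = list(NULL, c("num", "event"))
  ),
  "game//inning//atbat//pitch" = matrix(
    c("FF", "95.1"), nrow = 1,
    dimnames = list(NULL, c("pitch_type", "start_speed"))
  ),
  # A later record carries a field the earlier one lacks.
  "game//inning//atbat//pitch" = matrix(
    c("SL", "86.3", "0.91"), nrow = 1,
    dimnames = list(NULL, c("pitch_type", "start_speed", "type_confidence"))
  )
)

# Hypothetical helper: stack observations that share a name into one matrix,
# filling fields that a given observation does not have with NA.
collapse_by_name <- function(obs) {
  lapply(split(obs, names(obs)), function(group) {
    cols <- unique(unlist(lapply(group, colnames)))
    rows <- lapply(group, function(m) {
      out <- setNames(rep(NA_character_, length(cols)), cols)
      out[colnames(m)] <- m[1, ]
      out
    })
    do.call(rbind, unname(rows))
  })
}

tables <- collapse_by_name(obs)
tables[["game//inning//atbat//pitch"]]
```

Because the column set is derived from the observations themselves rather than hard-coded, a field that appears only in newer records simply becomes an extra column, which is the property that lets a higher-level interface pick up new variables without code changes. The helper is also a pure function: its result depends only on the list passed in.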
The full published article, "Taming PITCHf/x Data with XML2R and pitchRx", is available at <http://rjournal.github.io/archive/2014-1/sievert.pdf>.