diff --git a/posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd b/posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd new file mode 100644 index 00000000..d524391f --- /dev/null +++ b/posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd @@ -0,0 +1,246 @@ +--- +title: "Visualizing performance regression of data.table with atime" +author: "Doris Afriyie Amoakohene" +date: "2024-02-04" +output: html_document +--- + + +Since August 2023, I have been working on performance testing, which could be useful for expanding the open-source ecosystem around data.table package in R. +This could increase confidence in code contributions by ensuring the sustained efficiency of the data.table package. + +In data.table, the term “performance regression” usually means an increase in both time metrics and memory metrics, and that is how we will be using the term here. + +It is important that we prevent significant performance regression from reaching the data.table package. Slowness or big memory usage can be frustrating. Any performance regression that makes it into a version release will degrade user experience. In this blog post, I will demonstrate the use of benchmarking techniques to verify whether reported issues on data.table have been successfully resolved. + + +# **Understanding performance in data.table** + +data.table is an extension of R's data.frame, designed to handle large datasets efficiently. It provides a syntax that is both concise and expressive, allowing users to perform complex data manipulations with ease. Its efficiency is particularly evident when dealing with tasks like filtering, grouping, aggregating, and joining data. + +The development team behind data.table is committed to continuously improving its performance. Over the years, several major version changes have been introduced, aiming to enhance speed and efficiency. These changes include algorithmic optimizations, memory management improvements, and enhancements to parallel processing capabilities. Upgrading to the latest version ensures that users can leverage the most recent performance enhancements. + +# **Benchmarking for performance evaluation** + +To evaluate data.table performance, it is essential to employ benchmarking methodologies. The approach I used utilizes the atime_versions function from the atime package, which measures the actual execution time of specific operations. This function allows for accurate comparisons between different versions of the data.table package, by benchmarking against time and memory usage and giving a graphical visualization of the results. + + +# **Why do we run performance tests on commits?** + +Running performance tests on commits helps maintain a high-performance standard for the package, detect and fix performance regressions, optimize code, validate performance improvements, ensure consistent performance over time and to encourage confidence in code contributions from new people. + +It is an essential practice to deliver a performant and reliable package to end-users. + + +# **What are the performance tests?** +Our atime performance tests aims to assess the data.table repository by benchmarking its performance and gathering information on memory and time usage. + +By conducting these tests, we can gain insights into the package's performance efficiency. + +## When using atime_versions, there are six main arguments: + + +1. `pkg.path`: This argument specifies the location on your system where you have stored a git clone of the `data.table` package. + +2. `pkg.edit.fun`: The default behavior of pkg.edit.fun is designed to work with Rcpp packages and involves replacing instances of "PKG" with "PKG.SHA" in the package code. Any occurrences of the string "PKG" within the package code will be replaced with "PKG.SHA", where "SHA" represents the commit SHA/ids associated with the version being installed. + +3. `N`: This argument determines the number of iterations for the benchmarking process. It is a sequence of numbers that define different data sizes to test the performance of the operation. + +4. `setup`: This section contains the setup code for generating the dataset used in the benchmarking process, the setup is determined by the value of `N`. + +5. `expr`: This section contains the expression that represents the operation being benchmarked. It uses the `data.table::`[.data.table`` syntax to perform the operation on the dataset. + +In the given syntax `` data.table::`[.data.table` ``, the first part `data.table::` installs and loads different versions of the data.table package based on the specified commit ids. Hence, data.table:: will be translated to data.table.SHA1:: for some version hash SHA1. +Following that, the expression specified within `` `[.data.table`` `` is executed on each installed version. This process is repeated for all the specified commit IDs in the code. + +For example: +``` +data.table.ec1259af1bf13fc0c96a1d3f9e84d55d8106a9a4:::`[.data.table`(DT, , .(v3=mean(v3, na.rm=TRUE)), by=id3, verbose=TRUE) +``` + +In this example, the expression `[.data.table` is executed on the `DT` dataset using the specified commit ID (`ec1259af1bf13fc0c96a1d3f9e84d55d8106a9a4`) of the data.table package. The expression calculates the mean of the `v3` column (ignoring missing values) grouped by `id3`, and the `verbose=TRUE` argument enables verbose output during the operation. This process is typically repeated for all commit IDs in your code to compare the performance of different versions of the data.table package. + +6. `...` : This specifies the different versions of the data.table packages that will be tested. It includes three versions: "Before," "Regression," and "Fixed." Each version is associated with a specific commit id. + + +## We run the full performance regression: +1. Before the performance regression is made (Before) +2. When the performance regression is first submitted (Regression) +3. Pull Request (PR) which fixes the performance regression (Fixed) + + +# **Approach** +1. To begin, conduct the atime test for the different code branches (before regression, regression, fix regression) to identify potential performance issues. Here is an example of how to perform the [atime test](https://github.com/DorisAmoakohene/Efficiency-and-Performance-Test.RData.table) + +Note: Set up the necessary environment and dependencies, ensuring that the data.table package and the atime package are installed and loaded. + +2. Generate a plot to showcase the fixes made in the data.table package using the atime package. + +3. Utilize the atime_versions function to track the fixes across different versions. + +4. Pass the following named arguments to atime::atime_versions: N, setup, expr, and the different code branches. More documentation of the atime package can be found [here](https://github.com/tdhock/atime). + +5. Use the plot function to visually present the execution times of the expression evaluated across different versions of the data.table package. + + +# Lets run some examples to see how this works. + +The first example we will discuss is an issue reported on a performance regression when performing group computations, specifically when running R's C eval on each group (q7 and q8) in the db-benchmark, indicating a slowness in the implementation of the code. [link to comment that reported Regression](https://github.com/Rdatatable/data.table/issues/4200) +This regression was caused by [The occurrence of this regression can be attributed to the inclusion of the specified code within the `#if` block.](https://github.com/Rdatatable/data.table/commit/c0052964694a4c618ab182aa474f924d40576d94) + +This is the [PR]( https://github.com/Rdatatable/data.table/pull/4558) that discusses the +[The regression was specifically related to the evaluation of C code within each group of data, specifically q7 and q8 in the "db-benchmark"](https://github.com/Rdatatable/data.table/issues/4200#issue-555186870) which appears that the regression occurred during the evaluation of C code within these particular groups, indicating a performance issue or slowness in the implementation of the code. + +[The regression was fixed Regression by the addition of const int nth = getDTthreads](https://github.com/Rdatatable/data.table/pull/4558/files) + +```{r,warning=FALSE,message=FALSE} +library(atime) +library(ggplot2) +library(data.table) +``` + +```{r,warning=FALSE, message=FALSE} + tdir <- tempfile() + dir.create(tdir) + git2r::clone("https://github.com/Rdatatable/data.table", tdir ) +``` + +```{r} +pkg.edit.fun=function(old.Package, new.Package, sha, new.pkg.path){ + pkg_find_replace <- function(glob, FIND, REPLACE){ + atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE) + } + Package_regex <- gsub(".", "_?", old.Package, fixed=TRUE) + Package_ <- gsub(".", "_", old.Package, fixed=TRUE) + new.Package_ <- paste0(Package_, "_", sha) + pkg_find_replace( + "DESCRIPTION", + paste0("Package:\\s+", old.Package), + paste("Package:", new.Package)) + pkg_find_replace( + file.path("src","Makevars.*in"), + Package_regex, + new.Package_) + pkg_find_replace( + file.path("R", "onLoad.R"), + Package_regex, + new.Package_) + pkg_find_replace( + file.path("R", "onLoad.R"), + sprintf('packageVersion\\("%s"\\)', old.Package), + sprintf('packageVersion\\("%s"\\)', new.Package)) + pkg_find_replace( + file.path("src", "init.c"), + paste0("R_init_", Package_regex), + paste0("R_init_", gsub("[.]", "_", new.Package_))) + pkg_find_replace( + "NAMESPACE", + sprintf('useDynLib\\("?%s"?', Package_regex), + paste0('useDynLib(', new.Package_)) + } +``` + +```{r, warning=FALSE, message=FALSE} +atime.list.4200 <- atime::atime_versions( +pkg.path=tdir, +pkg.edit.fun= pkg.edit.fun, + N=10^seq(1,20), + setup={ + set.seed(108) + d <- data.table( + id3 = sample(c(seq.int(N*0.9), sample(N*0.9, N*0.1, TRUE))), + v1 = sample(5L, N, TRUE), + v2 = sample(5L, N, TRUE)) + }, + expr=data.table:::`[.data.table`(d, , (max(v1)-min(v2)), by = id3), + "Before"="793f8545c363d222de18ac892bc7abb80154e724",#parent of the PR that introduced the regression(https://github.com/Rdatatable/data.table/commit/4aadde8f5a51cd7c8f3889964e7280432ec65bbc) as stated here (https://github.com/Rdatatable/data.table/issues/4200#issuecomment-646111420) https://github.com/Rdatatable/data.table/commit/793f8545c363d222de18ac892bc7abb80154e724 + "Regression"="c152ced0e5799acee1589910c69c1a2c6586b95d", #praent of the first commit in the PR (https://github.com/Rdatatable/data.table/commit/15f0598b9828d3af2eb8ddc9b38e0356f42afe4f) + "Fixed"="f750448a2efcd258b3aba57136ee6a95ce56b302")#second commit in the PR that fixes the regression(https://github.com/Rdatatable/data.table/pull/4558/commits) +) +``` + +Please Note: Commit IDs or SHAs ("Before", "Regression", and "Fixed") refer to different versions of the data.table package. These ids are unique identifiers associated with specific changes made to the package's codebase. They allow for comparing performance or behavior differences between versions and identifying and verifying fixes for regression issues. + + +## Title: “Groupby with dogroups (R expression) performance regression.” + +![Plot showing the 3 branches(Regression, Fixed and Before) of the issues#4200](https://github.com/DorisAmoakohene/Efficiency-and-Performance-Test.RData.table/blob/main/atime.list.4200.png) + +The graph above visualizes the groupby operation with dogroups issue. + +Comparison: The graph compares the median time in milliseconds required to execute the operation before and after fixing a regression issue. + +1. X-Axis (N): Represents the size of the data (N) on a logarithmic scale. + +2. Y-Axis: Represents the median time in milliseconds (logarithmic scale). + +*Lines:* + +3. “Before”: Indicates performance before fixing the regression. + +4. “Fixed”: Shows improved performance after fixing. + +5. “Regression”: Represents an ideal or target performance level. + +In the graph, as data size (N) increases, there’s an initial increase in median time, but after addressing the regression issue, there is a significant reduction in the median time, indicating improved performance(fix). +The regression issue was successfully addressed. + + +# Example Two + +In the specific case of issue *#5366*, there was a significant slowdown in the performance of data.table's time-based rolling aggregation compared to pandas' rolling aggregation. The cause of this regression was identified to be related to the addition of the `snprintf` function in the `assign.c` file. To address this issue, a fix was implemented by creating the `targetDesc` function and adding the `snprintf` function in the `assign.c` file. This fix resolved the regression and improved the performance of the time-based rolling aggregation in data.table. + +```{r,warning = FALSE, message = FALSE} +atime.list.5366 <- atime::atime_versions( +pkg.path=tdir, +pkg.edit.fun = pkg.edit.fun, + N= 10^seq(1,4), + setup={ + set.seed(123L) + dt <- data.table( + id = seq_len(N), + val = rnorm(N)) + dt + }, + expr=data.table:::`[.data.table`(dt, , .(vs = (sum(val))), by = .(id)), + Before = "be2f72e6f5c90622fe72e1c315ca05769a9dc854", # Parent of the regression causing commit (https://github.com/Rdatatable/data.table/commit/e793f53466d99f86e70fc2611b708ae8c601a451) in the PR that introduced the issue (https://github.com/Rdatatable/data.table/pull/4491/commits) + Regression = "e793f53466d99f86e70fc2611b708ae8c601a451", # Commit responsible for regression in the PR that introduced the issue (https://github.com/Rdatatable/data.table/pull/4491/commits) + Fixed = "58409197426ced4714af842650b0cc3b9e2cb842") # Last commit in the PR that fixed the regression (https://github.com/Rdatatable/data.table/pull/5463/commits)) +``` + + +## Significantly slower performance time-based rolling + +![Plot showing the 3 branches(Regression, Fixed and Before) of Issues5366](https://github.com/DorisAmoakohene/Efficiency-and-Performance-Test.RData.table/blob/main/atime.list.5366.png) +Graph overview: +The graph depicts the performance improvement over time after fixing a regression issue +Y-axis: Represents time in milliseconds. +X-axis: Represents the sample size N (ranging from 10 to 10,000). + +*Three lines:* + +“Before Fixed”: Indicates performance before fixing the regression. +“Fixed”: Represents improved performance after fixing. +“Regression”: Shows the regression issue. +The Regression line initially increases steeply with larger N, indicating slow performance. +After fixing, line shows significantly reduced processing time. + + +In summary, the graph visually demonstrates how fixing the regression issue led to improved performance in time-based rolling operations. The Fixed line represents the desired outcome, showing faster processing times for larger sample sizes. + +# **Github Action** +The data.table project has implemented a GitHub Action to automatically run performance tests. +The process is automated using a GitHub action implemented by @anirban166. This action runs the `atime` package for every pull request and generates plots of the results in a comment within the pull request. [See an example in this pull request](https://github.com/Rdatatable/data.table/pull/5427#issuecomment-2075471806) +This action allows the maintainers to easily determine if a pull request has any impact on the time or memory usage of the data.table package. To learn more you can visit [Anirban's documentation](https://github.com/Anirban166/Autocomment-atime-results) or this [link](https://github.com/tdhock/atime?tab=readme-ov-file#github-action-for-continuous-performance-testing) + +# **Conclusion** + +In this blog post, we have delved into the use of the atime code to compare the asymptotic time and memory usage of different versions of the data.table package. Specifically, we explored the comparisons between the "Before," "Regression," and "Fixed" versions, as well as different versions implementing the same computation. + +By employing benchmarking methodologies like atime, we gained valuable insights into the performance characteristics of the data.table package. This allowed us to identify and address performance regressions, ensuring that each new version of the package has indeed solved the particular issue reported. + +Consider doing the exercises below, if you want practice using atime. + +Consider visiting this [link](https://github.com/Rdatatable/data.table/issues/4311) to gain a comprehensive understanding of regression and attempt to interpret it using atime. +The issue was fixed [here](https://github.com/Rdatatable/data.table/pull/4440) +By doing so, you will be able to replicate graphs similar to the ones mentioned earlier. diff --git a/posts/2024-03-05-Performance Comparison: Data.table in R and Pandas in Python-Doris Amoakohene/index.qmd b/posts/2024-03-05-Performance Comparison: Data.table in R and Pandas in Python-Doris Amoakohene/index.qmd new file mode 100644 index 00000000..8b137891 --- /dev/null +++ b/posts/2024-03-05-Performance Comparison: Data.table in R and Pandas in Python-Doris Amoakohene/index.qmd @@ -0,0 +1 @@ +