rdatatable-community · kbodwin · Aug 12, 2024 · Jan 31, 2024 · Feb 13, 2024 · Feb 17, 2024
diff --git a/posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd b/posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd
@@ -0,0 +1,248 @@
+---
+title: "Visualizing Performance Regression of data.table with Atime"
+author: "Doris Afriyie Amoakohene"
+date: "2024-02-04"
+output: html_document
+---
+
+
+Since August 2023, I have been working on performance testing, which could be useful for expanding the open-source ecosystem around data.table package in R.
+This could increase confidence in code contributions by ensuring the sustained efficiency of the data.table package. Consequently, it becomes essential to enhance the package's performance and deliver reliable results to its users.
+
+It is important that we prevent significant performance regression from reaching the data.table package. Slowness or big memory usage can be frustrating, any performance regression that makes it into a version release will degrade user experience. In this blog post, I will demonstrate the use of benchmarking techniques to verify whether reported issues on data.table have been successfully resolved.
+
+
+# **Understanding performance in data.table?**
+
+data.table is an extension of R's data.frame, designed to handle large datasets efficiently. It provides a syntax that is both concise and expressive, allowing users to perform complex data manipulations with ease. Its efficiency is particularly evident when dealing with tasks like filtering, grouping, aggregating, and joining data.
+
+The development team behind data.table is committed to continuously improving its performance. Over the years, several major version changes have been introduced, aiming to enhance speed and efficiency. These changes include algorithmic optimizations, memory management improvements, and enhancements to parallel processing capabilities. Upgrading to the latest version ensures that users can leverage the most recent performance enhancements.
+
+# **Benchmarking for performance evaluation**
+
+To evaluate data.table performance, it is essential to employ benchmarking methodologies. The approach I used utilizes the atime_versions function from the atime package, which measures the actual execution time of specific operations. This function allows for accurate comparisons between different versions of the data.table package, by benchmarking against time and memory usage and giving a graphical visualization of the results.
+
+In data.table, the term “performance regression” usually means a drop in both time metrics and memory metrics, and that is how we will be using the term here.
+
+
+# **Why do we run performance tests on commits?**
+
+Running performance tests on commits helps maintain a high-performance standard for the package, detect and fix performance regressions, optimize code, validate performance improvements, ensure consistent performance over time and to encourage confidence in code contributions from new people.
+
+It is an essential practice to deliver a performant and reliable package to end-users.
+
+# **What are the performance tests?**
+Our atime performance tests aims to assess the data.table repository by benchmarking its performance and gathering information on memory and time usage.  
+
+By conducting these tests, we can gain insights into the package's performance efficiency.
+
+## When using atime_versions, there are six main arguments:
+
+1. `pkg.path`: This argument specifies the location on your system where you have stored a git clone of the `data.table` package.
+
+2. `pkg.edit.fun`: The default behavior of pkg.edit.fun is designed to work with Rcpp packages and involves replacing instances of "PKG" with "PKG.SHA" in the package code. Any occurrences of the string "PKG" within the package code will be replaced with "PKG.SHA", where "SHA" represents the commit SHA/ids associated with the version being installed.
+
+3. `N`: This argument determines the number of iterations for the benchmarking process. It is a sequence of numbers that define different data sizes to test the performance of the operation.
+
+4. `setup`: This section contains the setup code for generating the dataset used in the benchmarking process, the setup is determined by the value of `N`. 
+
+5. `expr`: This section contains the expression that represents the operation being benchmarked. It uses the `data.table::`[.data.table`` syntax to perform the  operation on the dataset. 
+
+In the given syntax `data.table::`[.data.table``, the first part `data.table::` installs and loads different versions of the data.table package based on the specified commit IDs. Following that, the expression specified within `[.data.table`` is executed on each installed version. This process is repeated for all the specified commit IDs in the code.
+
+For example:
+
+```
+data.table.ec1259af1bf13fc0c96a1d3f9e84d55d8106a9a4:::`[.data.table`(DT, , .(v3=mean(v3, na.rm=TRUE)), by=id3, verbose=TRUE)
+
+```
+
+In this example, the expression `[.data.table` is executed on the `DT` dataset using the specified commit ID (`ec1259af1bf13fc0c96a1d3f9e84d55d8106a9a4`) of the data.table package. The expression calculates the mean of the `v3` column (ignoring missing values) grouped by `id3`, and the `verbose=TRUE` argument enables verbose output during the operation. This process is typically repeated for all commit IDs in your code to compare the performance of different versions of the data.table package.
+
+6. `...` : This section specifies the different versions of the data.table packages that will be tested. It includes three versions: "Before," "Regression," and "Fixed." Each version is associated with a specific commit id.
+
+
+
+## We run the full performance regression:
+1. Before the performance regression is made (Before)
+2. When the performance regression is first submitted (Regression)
+3. Pull Request (PR) which fixes the performance regression (Fixed)
+
+# **Approach**
+1. To begin, conduct the atime test for the different code branches (before regression, regression, fix regression) to identify potential performance issues. Here is an example of how to perform the [atime test](https://github.com/DorisAmoakohene/Efficiency-and-Performance-Test.RData.table)
+
+Note: Set up the necessary environment and dependencies, ensuring that the data.table package and the atime package are installed and loaded.
+
+2. Generate a plot to showcase the fixes made in the data.table package using the atime package.
+
+3. Utilize the atime_versions function to track the fixes across different versions.
+
+4. Pass the following named arguments to atime::atime_versions: N, setup, expr, and the different code branches. More documentation of the atime package can be found [here](https://github.com/tdhock/atime). 
+
+5. Use the plot function to visually present the execution times of the expression evaluated across different versions of the data.table package.
+
+
+
+# Lets run some examples to see how this works.
+
+The first example we will discuss is an issue reported on a performance regression when performing group computations, specifically when running R's C eval on each group (q7 and q8) in the db-benchmark, indicating a slowness in the implementation of the code. [link to comment that reported Regression](https://github.com/Rdatatable/data.table/issues/4200)
+This regression was caused by [The occurrence of this regression can be attributed to the inclusion of the specified code within the `#if` block.](https://github.com/Rdatatable/data.table/commit/c0052964694a4c618ab182aa474f924d40576d94)
+
+This is the [PR]( https://github.com/Rdatatable/data.table/pull/4558) that discusses the 
+[The regression was specifically related to the evaluation of C code within each group of data, specifically q7 and q8 in the "db-benchmark"](https://github.com/Rdatatable/data.table/issues/4200#issue-555186870) which appears that the regression occurred during the evaluation of C code within these particular groups, indicating a performance issue or slowness in the implementation of the code.
+
+[The regression was fixed Regression by the addition of const int nth = getDTthreads]( https://github.com/Rdatatable/data.table/pull/4558/files)
+
+```{r,warning=FALSE,message=FALSE}
+
+library(atime)
+library(ggplot2)
+library(data.table)
+
+```
+
+```{r,warning=FALSE, message=FALSE}
+  tdir <- tempfile()
+  dir.create(tdir)
+  git2r::clone("https://github.com/Rdatatable/data.table", tdir )
+
+```
+
+
+```{r}
+pkg.edit.fun=function(old.Package, new.Package, sha, new.pkg.path){
+      pkg_find_replace <- function(glob, FIND, REPLACE){
+        atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
+      }
+      Package_regex <- gsub(".", "_?", old.Package, fixed=TRUE)
+      Package_ <- gsub(".", "_", old.Package, fixed=TRUE)
+      new.Package_ <- paste0(Package_, "_", sha)
+      pkg_find_replace(
+        "DESCRIPTION", 
+        paste0("Package:\\s+", old.Package),
+        paste("Package:", new.Package))
+      pkg_find_replace(
+        file.path("src","Makevars.*in"),
+        Package_regex,
+        new.Package_)
+      pkg_find_replace(
+        file.path("R", "onLoad.R"),
+        Package_regex,
+        new.Package_)
+      pkg_find_replace(
+        file.path("R", "onLoad.R"),
+        sprintf('packageVersion\\("%s"\\)', old.Package),
+        sprintf('packageVersion\\("%s"\\)', new.Package))
+      pkg_find_replace(
+        file.path("src", "init.c"),
+        paste0("R_init_", Package_regex),
+        paste0("R_init_", gsub("[.]", "_", new.Package_)))
+      pkg_find_replace(
+        "NAMESPACE",
+        sprintf('useDynLib\\("?%s"?', Package_regex),
+        paste0('useDynLib(', new.Package_))
+    }
+```
+
+
+```{r, warning=FALSE, message=FALSE}
+
+atime.list.4200 <- atime::atime_versions(
+pkg.path=tdir,
+pkg.edit.fun= pkg.edit.fun,
+
+  N=10^seq(1,20),
+  setup={ 
+    set.seed(108)
+    d <- data.table(
+      id3 = sample(c(seq.int(N*0.9), sample(N*0.9, N*0.1, TRUE))),
+      v1 = sample(5L, N, TRUE),
+      v2 = sample(5L, N, TRUE))
+  },
+  expr=data.table:::`[.data.table`(d, , (max(v1)-min(v2)), by = id3),
+"Before"="15f0598b9828d3af2eb8ddc9b38e0356f42afe4f",
+  "Regression"="6f360be0b2a6cf425f6df751ca9a99ec5d35ed93", 
+  "Fixed"="ba32f3cba38ec270587e395f6e6c26a80be36be6")
+```
+
+Please Note: Commit IDs or SHAs ("Before", "Regression", and "Fixed") refer to different versions of the data.table package. These IDs are unique identifiers associated with specific changes made to the package's codebase. They allow for comparing performance or behavior differences between versions and identifying and verifying fixes for regression issues.
+
+## Title: “Groupby with dogroups (R expression) performance regression.”
+
+![Plot showing the 3 branches(Regression, Fixed and Before) of the issues#4200](https://github.com/DorisAmoakohene/Efficiency-and-Performance-Test.RData.table/blob/main/atime.list.4200.png)
+
+The graph above  visualizes the  groupby operation with dogroups  issue. 
+
+Comparison: The graph compares the median time in milliseconds required to execute the operation before and after fixing a regression issue.
+
+1. X-Axis (N): Represents the size of the data (N) on a logarithmic scale.
+
+2.Y-Axis: Represents the median time in milliseconds (logarithmic scale).
+
+*Lines:*
+
+3.“Before”: Indicates performance before fixing the regression.
+
+4.“Fixed”: Shows improved performance after fixing.
+
+5. “Regression”: Represents an ideal or target performance level.
+
+In the graph, as data size (N) increases, there’s an initial increase in median time, but after addressing the regression issue, there is a significant reduction in the median time, indicating improved performance(fix). 
+The regression issue was successfully addressed.
+
+
+# Example Two
+
+In the specific case of issue *#5366*, there was a significant slowdown in the performance of data.table's time-based rolling aggregation compared to pandas' rolling aggregation. The cause of this regression was identified to be related to the addition of the snprintf function in the assign.c file. To address this issue, a fix was implemented by creating the targetDesc function and adding the snprintf function in the assign.c file. This fix resolved the regression and improved the performance of the time-based rolling aggregation in data.table. 
+
+```{r,warning = FALSE, message = FALSE}
+atime.list.5366 <- atime::atime_versions(
+pkg.path=tdir,
+pkg.edit.fun = pkg.edit.fun,
+  N= 10^seq(1,4),
+  setup={ 
+    set.seed(123L)
+    dt <- data.table(
+      id = seq_len(N),
+      val = rnorm(N))
+    dt
+  },
+ expr=data.table:::`[.data.table`(dt, , .(vs = (sum(val))), by = .(id)),
+  "Before"="be2f72e6f5c90622fe72e1c315ca05769a9dc854",
+  "Regression"="e793f53466d99f86e70fc2611b708ae8c601a451", 
+  "Fixed"="58409197426ced4714af842650b0cc3b9e2cb842") 
+```
+
+
+## Significantly slower performance time-based rolling
+
+![Plot showing the 3 branches(Regression, Fixed and Before) of Issues5366](https://github.com/DorisAmoakohene/Efficiency-and-Performance-Test.RData.table/blob/main/atime.list.5366.png)
+Graph overview:
+The graph depicts the performance improvement over time after fixing a regression issue
+Y-axis: Represents time in milliseconds.
+X-axis: Represents the sample size N (ranging from 10 to 10,000).
+
+*Three lines:*
+
+“Before Fixed”: Indicates performance before fixing the regression.
+“Fixed”: Represents improved performance after fixing.
+“Regression”: Shows the regression issue.
+The Regression line initially increases steeply with larger N, indicating slow performance.
+After fixing, line shows significantly reduced processing time.
+
+
+In summary, the graph visually demonstrates how fixing the regression issue led to improved performance in time-based rolling operations. The Fixed line represents the desired outcome, showing faster processing times for larger sample sizes. 
+
+
+
+# **Conclusion**
+
+In this blog post, we have delved into the use of the atime code to compare the asymptotic time and memory usage of different versions of the data.table package. Specifically, we explored the comparisons between the "Before," "Regression," and "Fixed" versions, as well as different versions implementing the same computation.
+
+By employing benchmarking methodologies like atime, we gained valuable insights into the performance characteristics of the data.table package. This allowed us to identify and address performance regressions, ensuring that each new version of the package has indeed solved the particular issue reported.
+
+Consider doing the exercises below, if you want practice using atime.
+
+Consider visiting this [link](https://github.com/Rdatatable/data.table/issues/4311) to gain a comprehensive understanding of regression and attempt to interpret it using atime.
+The issue was fixed [here](https://github.com/Rdatatable/data.table/pull/4440) 
+By doing so, you will be able to replicate graphs similar to the ones mentioned earlier.