Performance Test on data.table Issues.qmd file #16

Merged
Changes from 6 commits
248 changes: 248 additions & 0 deletions posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd
@@ -0,0 +1,248 @@
---
title: "Visualizing Performance Regression of data.table with Atime"
Contributor: I'd have everything without caps except for V / Visualizing since it's the first word

Contributor Author: okay

author: "Doris Afriyie Amoakohene"
date: "2024-02-04"
output: html_document
---


Since August 2023, I have been working on performance testing, which could be useful for expanding the open-source ecosystem around the data.table package in R.
This could increase confidence in code contributions by ensuring the sustained efficiency of the data.table package. Consequently, it becomes essential to enhance the package's performance and deliver reliable results to its users.
Contributor: can you please clarify "Consequently, it becomes essential to enhance the package's performance and deliver reliable results to its users."

Contributor Author: I want to say that its necessary to improve or enhance the performance of the package so that it can consistently provide reliable and satisfactory results to its users, but I have already spoke about it, so i will leave that out. @tdhock


It is important that we prevent significant performance regressions from reaching a data.table release. Slow code or high memory usage can be frustrating. Any performance regression that makes it into a release will degrade the user experience. In this blog post, I will demonstrate the use of benchmarking techniques to verify whether reported issues on data.table have been successfully resolved.
Contributor: This should be broken into two sentences "Slowness or big memory usage can be frustrating, any performance regression that makes it into a version release will degrade user experience."

Contributor Author: sure



# **Understanding performance in data.table**

data.table is an extension of R's data.frame, designed to handle large datasets efficiently. It provides a syntax that is both concise and expressive, allowing users to perform complex data manipulations with ease. Its efficiency is particularly evident when dealing with tasks like filtering, grouping, aggregating, and joining data.

The development team behind data.table is committed to continuously improving its performance. Over the years, several major version changes have been introduced, aiming to enhance speed and efficiency. These changes include algorithmic optimizations, memory management improvements, and enhancements to parallel processing capabilities. Upgrading to the latest version ensures that users can leverage the most recent performance enhancements.

# **Benchmarking for performance evaluation**

To evaluate data.table performance, it is essential to employ benchmarking methodologies. The approach I used utilizes the atime_versions function from the atime package, which measures the actual execution time of specific operations. This function allows for accurate comparisons between different versions of the data.table package, by benchmarking against time and memory usage and giving a graphical visualization of the results.
Contributor: this sentence which introduces atime/versions should be moved closer to the section below where you explain its usage


In data.table, the term “performance regression” usually means an increase in time or memory usage, and that is how we will be using the term here.
Contributor: change drop to increase? drop means go down, and less time/memory is better.

Contributor Author: okay



# **Why do we run performance tests on commits?**

Running performance tests on commits helps maintain a high performance standard for the package, detect and fix performance regressions, optimize code, validate performance improvements, ensure consistent performance over time, and encourage confidence in code contributions from new contributors.
Contributor: isn't this section redudant with the first section?

Contributor: please move sections/sentences together if they are very similar


It is an essential practice for delivering a performant and reliable package to end users.

# **What are the performance tests?**
Our atime performance tests aim to assess the data.table repository by benchmarking its performance and gathering information on memory and time usage.

By conducting these tests, we can gain insights into the package's performance efficiency.
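
To make "time usage as a function of data size" concrete, here is a minimal base R sketch. It is not how atime measures anything internally; it only illustrates the idea of timing one operation at several sizes `N` and keeping the median. The operation (`sort`) and the number of repetitions are arbitrary choices for illustration.

```r
# Data sizes to test: 10, 100, 1000, 10000.
sizes <- 10^seq(1, 4)

# For each size, time the same operation several times and keep the median.
median.seconds <- sapply(sizes, function(N) {
  x <- sample(N)  # random permutation of 1..N
  reps <- replicate(5, system.time(sort(x))[["elapsed"]])
  median(reps)
})

# One row per data size, ready to plot or inspect.
timing.table <- data.frame(N = sizes, median.seconds = median.seconds)
print(timing.table)
```

Plotting `median.seconds` against `N` on log-log axes gives the kind of curve shown in the graphs later in this post, one curve per version.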

## When using atime_versions, there are six main arguments:

1. `pkg.path`: This argument specifies the location on your system where you have stored a git clone of the `data.table` package.

2. `pkg.edit.fun`: A function that edits the package source before installation so that each version can be installed under its own name. The default is designed to work with Rcpp packages and replaces occurrences of "PKG" with "PKG.SHA" in the package code, where "SHA" is the commit SHA of the version being installed. Because data.table is not an Rcpp package, the examples below pass a custom function.
Contributor: please explain why we need a more complicated pkg.edit.fun for data.table

Contributor Author: Does this explain need a more complicated pkg.edit.fun for data.table?

The data.table package needs a complex pkg.edit.fun function due to its use of Rcpp, versioning, and naming considerations. Thus, the pkg.edit.fun function plays a crucial role in addressing these challenges and ensuring the smooth operation of the data.table package.

Contributor: data.table does not use Rcpp

Contributor: Use consistent capitalization - 'ids' here but 'IDs' below like in line 169 (and probably just use 'commit SHA' instead of 'commit SHA/ids'

Also, I agree with Toby that pkg.edit.fun needs more explaining. I do not clearly understand why it is used from that and I would not use 'Thus, the pkg.edit.fun function plays a crucial role in addressing these challenges and ensuring the smooth operation of the data.table package.' without explaining in detail

Contributor: or in comparison to line 51 and 52 below (ids vs IDs)

Contributor Author: alright


3. `N`: A numeric vector of data sizes. The operation is benchmarked once for each size, so `N` controls how large the test datasets get, not how many times a single size is repeated.

4. `setup`: This section contains the setup code for generating the dataset used in the benchmarking process. The setup depends on the value of `N`.

5. `expr`: This section contains the expression that represents the operation being benchmarked. It uses the ``data.table::`[.data.table` `` syntax to perform the operation on the dataset.

In the syntax ``data.table::`[.data.table` ``, the `data.table::` prefix is translated to `data.table.SHA::` for each commit SHA, so the call runs against the version installed from that commit. The expression is then executed once for every installed version, that is, for every commit ID specified in the code.
Contributor: if your code contains backticks, then you need to use one more backticks.
data.table::`[.data.table`
``data.table::`[.data.table` ``

Contributor: and you should write that data.table:: will be translated to data.table.SHA1:: for some version hash SHA1

Contributor Author: okay


For example:

```
data.table.ec1259af1bf13fc0c96a1d3f9e84d55d8106a9a4:::`[.data.table`(DT, , .(v3=mean(v3, na.rm=TRUE)), by=id3, verbose=TRUE)
```

Contributor: Remove the blank line at the end for each code segment (including the ones below)

Contributor Author: sure

In this example, the expression `[.data.table` is executed on the `DT` dataset using the specified commit ID (`ec1259af1bf13fc0c96a1d3f9e84d55d8106a9a4`) of the data.table package. The expression calculates the mean of the `v3` column (ignoring missing values) grouped by `id3`, and the `verbose=TRUE` argument enables verbose output during the operation. This process is typically repeated for all commit IDs in your code to compare the performance of different versions of the data.table package.

6. `...`: Named versions of the data.table package to test. The examples below use three versions: "Before," "Regression," and "Fixed." Each name is associated with a specific commit ID.
Contributor: I'm not sure if this can be labelled as a 'section', as it looks like an ellipsis (used for varying arguments usually) if not just a placeholder for continuation or incomplete code/output (also where is it used below?)

Contributor Author: that is how the named versions is defined in the ?atime::atime_versions(),

... : named versions.
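
The package-name translation described for `expr` can be illustrated with plain string substitution. This is only a sketch of the idea, not atime's actual implementation; `rename_pkg` is a hypothetical helper written for this example, and the expression is stored as text purely for illustration.

```r
# Commit SHA taken from the example above.
sha <- "ec1259af1bf13fc0c96a1d3f9e84d55d8106a9a4"

# The expression as the user writes it, stored as text for illustration.
expr.text <- "data.table:::`[.data.table`(DT, , .(v3=mean(v3, na.rm=TRUE)), by=id3)"

# Hypothetical helper: swap the package prefix for the version-specific one.
rename_pkg <- function(text, pkg, sha) {
  sub(paste0(pkg, ":::"), paste0(pkg, ".", sha, ":::"), text, fixed = TRUE)
}

versioned.text <- rename_pkg(expr.text, "data.table", sha)
cat(versioned.text, "\n")
```

The result has the same shape as the versioned call shown above: the same expression, but resolved in the package installed from that particular commit.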




## We run the performance test on three versions:
1. The commit before the performance regression was introduced (Before)
2. The commit where the performance regression was first introduced (Regression)
3. The Pull Request (PR) which fixes the performance regression (Fixed)

# **Approach**
1. To begin, conduct the atime test for the different code branches (before regression, regression, fix regression) to identify potential performance issues. Here is an example of how to perform the [atime test](https://github.com/DorisAmoakohene/Efficiency-and-Performance-Test.RData.table)

Note: Set up the necessary environment and dependencies, ensuring that the data.table package and the atime package are installed and loaded.

2. Generate a plot to showcase the fixes made in the data.table package using the atime package.

3. Utilize the atime_versions function to track the fixes across different versions.

4. Pass the following named arguments to atime::atime_versions: N, setup, expr, and the different code branches. More documentation of the atime package can be found [here](https://github.com/tdhock/atime).

5. Use the plot function to visually present the execution times of the expression evaluated across different versions of the data.table package.



# Let's run some examples to see how this works.

The first example is a performance regression in group computations, specifically when running R's C eval on each group (q7 and q8 in the db-benchmark), which indicated slowness in the implementation. [Link to the comment that reported the regression](https://github.com/Rdatatable/data.table/issues/4200).

[The occurrence of this regression can be attributed to the inclusion of the specified code within the `#if` block.](https://github.com/Rdatatable/data.table/commit/c0052964694a4c618ab182aa474f924d40576d94)

This is the [PR](https://github.com/Rdatatable/data.table/pull/4558) that discusses the regression. [The regression was specifically related to the evaluation of C code within each group of data, specifically q7 and q8 in the "db-benchmark".](https://github.com/Rdatatable/data.table/issues/4200#issue-555186870)

[The regression was fixed by the addition of `const int nth = getDTthreads`.](https://github.com/Rdatatable/data.table/pull/4558/files)

```{r,warning=FALSE,message=FALSE}
library(atime)
library(ggplot2)
library(data.table)
```

```{r,warning=FALSE, message=FALSE}
# Clone the data.table repository into a temporary directory.
tdir <- tempfile()
dir.create(tdir)
git2r::clone("https://github.com/Rdatatable/data.table", tdir)
```


```{r}
# Edit a copy of the package source so that it installs under a
# version-specific name (new.Package) instead of old.Package.
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path){
  # Helper: regex find and replace in files matching glob under new.pkg.path.
  pkg_find_replace <- function(glob, FIND, REPLACE){
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  # Regex matching the package name with or without an underscore.
  Package_regex <- gsub(".", "_?", old.Package, fixed=TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed=TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  # Rename the package in DESCRIPTION.
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  # Rename the shared library in the Makevars files.
  pkg_find_replace(
    file.path("src","Makevars.*in"),
    Package_regex,
    new.Package_)
  # Update references to the package name in onLoad.R.
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  # Rename the C initialization routine.
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  # Point useDynLib in NAMESPACE at the renamed shared library.
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}
```

Contributor: please explain what is pkg.edit.fun

Contributor Author: done
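
The name handling inside `pkg.edit.fun` can be checked in isolation with base R. The sketch below assumes, for illustration only, that the new package name has the form `data.table.<SHA>` (as in the versioned call shown earlier); the `R_init_` strings are hypothetical inputs used to show what the regex accepts.

```r
old.Package <- "data.table"
sha <- "ba32f3cba38ec270587e395f6e6c26a80be36be6"
new.Package <- paste0(old.Package, ".", sha)  # assumed naming scheme

# Same three lines as in pkg.edit.fun above.
Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)  # "data_?table"
Package_ <- gsub(".", "_", old.Package, fixed = TRUE)        # "data_table"
new.Package_ <- paste0(Package_, "_", sha)

# The regex accepts the name with or without the underscore.
init.names <- c("R_init_datatable", "R_init_data_table")
matches <- grepl(paste0("R_init_", Package_regex), init.names)
print(matches)

# The DESCRIPTION edit, applied to a single example line.
new.line <- sub(
  paste0("Package:\\s+", old.Package),
  paste("Package:", new.Package),
  "Package: data.table")
print(new.line)
```

Dots are replaced by underscores because the names end up in C identifiers and shared library names, where a dot is not allowed.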


```{r, warning=FALSE, message=FALSE}
atime.list.4200 <- atime::atime_versions(
  pkg.path=tdir,
  pkg.edit.fun=pkg.edit.fun,
  N=10^seq(1,20),
  setup={
    set.seed(108)
    d <- data.table(
      id3 = sample(c(seq.int(N*0.9), sample(N*0.9, N*0.1, TRUE))),
      v1 = sample(5L, N, TRUE),
      v2 = sample(5L, N, TRUE))
  },
  expr=data.table:::`[.data.table`(d, , (max(v1)-min(v2)), by = id3),
  "Before"="15f0598b9828d3af2eb8ddc9b38e0356f42afe4f",
  "Regression"="6f360be0b2a6cf425f6df751ca9a99ec5d35ed93",
  "Fixed"="ba32f3cba38ec270587e395f6e6c26a80be36be6")
```
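
To see what the `setup` code generates, here is a small base R sketch for a single size, `N = 100`. It uses a plain `data.frame` and `tapply` in place of data.table, so it only mirrors the shape of the data and of the grouped computation; it is not the benchmarked code itself.

```r
set.seed(108)
N <- 100

# Same id3 construction as in setup: every id in 1..90 appears at least once,
# plus 10 extra ids drawn with replacement, all shuffled.
id3 <- sample(c(seq.int(N*0.9), sample(N*0.9, N*0.1, TRUE)))
d <- data.frame(
  id3 = id3,
  v1 = sample(5L, N, TRUE),
  v2 = sample(5L, N, TRUE))

# Number of groups: many small groups, which stresses per-group overhead.
n.groups <- length(unique(d$id3))
print(n.groups)

# Base R analogue of (max(v1)-min(v2)) computed by id3.
result <- tapply(seq_len(N), d$id3, function(i) max(d$v1[i]) - min(d$v2[i]))
head(result)
```

With about 0.9 × N groups for N rows, most groups contain a single row, so the time is dominated by the cost of evaluating the expression once per group.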

Please note: The names "Before", "Regression", and "Fixed" are each paired with a commit ID (SHA) that refers to a specific version of the data.table package. These IDs are unique identifiers associated with specific changes made to the package's codebase. They allow for comparing performance or behavior differences between versions and for identifying and verifying fixes for regression issues.
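
As a quick sanity check before running a long benchmark, the named versions can be stored in a character vector and validated. This is an optional sketch in base R; the URL pattern is the same one used for the commit link earlier in this post.

```r
# The three named versions used for issue #4200.
versions <- c(
  Before     = "15f0598b9828d3af2eb8ddc9b38e0356f42afe4f",
  Regression = "6f360be0b2a6cf425f6df751ca9a99ec5d35ed93",
  Fixed      = "ba32f3cba38ec270587e395f6e6c26a80be36be6")

# A full git SHA-1 is 40 lowercase hexadecimal characters.
is.full.sha <- setNames(grepl("^[0-9a-f]{40}$", versions), names(versions))
print(is.full.sha)

# Links for inspecting each commit on GitHub.
commit.urls <- paste0(
  "https://github.com/Rdatatable/data.table/commit/", versions)
print(commit.urls)
```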

## Title: “Groupby with dogroups (R expression) performance regression.”

![Plot showing the 3 branches(Regression, Fixed and Before) of the issues#4200](https://github.com/DorisAmoakohene/Efficiency-and-Performance-Test.RData.table/blob/main/atime.list.4200.png)

The graph above visualizes the groupby operation with dogroups issue.
Contributor: Again, use consistent spacing and avoid multiple spaces like these

Contributor Author: done


Comparison: The graph compares the median time in milliseconds required to execute the operation before and after fixing a regression issue.

1. X-Axis (N): Represents the size of the data (N) on a logarithmic scale.

2. Y-Axis: Represents the median time in milliseconds (logarithmic scale).

Contributor: Space between '2.' and 'Y-Axis'

Contributor Author: okay


*Lines:*

3. “Before”: Indicates performance before the regression was introduced.

4. “Fixed”: Shows improved performance after fixing.

5. “Regression”: Shows performance at the commit that introduced the regression.

Contributor: Spaces again for the two above (make it consistent with one space like this line)

Contributor Author: okay


In the graph, median time initially increases as data size (N) increases. After the regression is addressed, the median time is significantly reduced, indicating improved performance (Fixed).
The regression issue was successfully addressed.


# Example Two

In the specific case of issue *#5366*, there was a significant slowdown in the performance of data.table's time-based rolling aggregation compared to pandas' rolling aggregation. The cause of this regression was identified to be related to the addition of the `snprintf` function in the `assign.c` file. To address this issue, a fix was implemented by creating the `targetDesc` function and adding the `snprintf` function in the `assign.c` file. This fix resolved the regression and improved the performance of the time-based rolling aggregation in data.table.
Contributor: I would use ticks for have file and function names, e.g.: assign.c, snprintf

Contributor Author: okay


```{r,warning = FALSE, message = FALSE}
atime.list.5366 <- atime::atime_versions(
  pkg.path=tdir,
  pkg.edit.fun=pkg.edit.fun,
  N=10^seq(1,4),
  setup={
    set.seed(123L)
    dt <- data.table(
      id = seq_len(N),
      val = rnorm(N))
    dt
  },
  expr=data.table:::`[.data.table`(dt, , .(vs = (sum(val))), by = .(id)),
  "Before"="be2f72e6f5c90622fe72e1c315ca05769a9dc854",
  "Regression"="e793f53466d99f86e70fc2611b708ae8c601a451",
  "Fixed"="58409197426ced4714af842650b0cc3b9e2cb842")
```

Contributor: would be good to explain where these commit ids come from

Contributor: need comments or explanation in text


## Significantly slower performance time-based rolling

![Plot showing the 3 branches(Regression, Fixed and Before) of Issues5366](https://github.com/DorisAmoakohene/Efficiency-and-Performance-Test.RData.table/blob/main/atime.list.5366.png)

Graph overview: The graph depicts the performance improvement after fixing a regression issue.

Y-axis: Represents time in milliseconds.

X-axis: Represents the sample size N (ranging from 10 to 10,000).

*Three lines:*

“Before”: Indicates performance before the regression was introduced.

“Fixed”: Represents improved performance after fixing.

“Regression”: Shows the regression issue.

The Regression line initially increases steeply with larger N, indicating slow performance.
After fixing, the line shows significantly reduced processing time.


In summary, the graph visually demonstrates how fixing the regression issue led to improved performance in time-based rolling operations. The Fixed line represents the desired outcome, showing faster processing times for larger sample sizes.


Contributor: Spacing (again)

Overall this post looks better than the other one, but both could do with more attention to small things when writing (especially being consistent). Please convert the PR from draft to regular mode when done with the revisions.

Contributor Author: thanks


# **Conclusion**

In this blog post, we have used the atime package to compare the asymptotic time and memory usage of different versions of the data.table package. Specifically, we compared the "Before," "Regression," and "Fixed" versions, each running the same computation.
Contributor: please add a description of the github action, and how data.table is now using this to review PRs like this Rdatatable/data.table#5427 (comment)

Contributor Author: addressed


By employing benchmarking methodologies like atime, we gained valuable insights into the performance characteristics of the data.table package. This allowed us to identify and address performance regressions, ensuring that each new version of the package has indeed solved the particular issue reported.

Consider doing the exercises below, if you want practice using atime.

Consider visiting this [link](https://github.com/Rdatatable/data.table/issues/4311) to gain a comprehensive understanding of the regression, and attempt to interpret it using atime.
The issue was fixed [here](https://github.com/Rdatatable/data.table/pull/4440).
By doing so, you will be able to replicate graphs similar to the ones shown earlier.