Performance Test on data.table Issues.qmd file #16

DorisAmoakohene · 2024-01-31T16:30:20Z

Working on a blog post about a GitHub Action that explains how to run an action for every pull request.

DorisAmoakohene · 2024-02-13T21:27:40Z

@tdhock @Anirban166 could you please proofread this? Thanks.

tdhock · 2024-02-13T22:44:45Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+output: html_document
+---
+
+Since August 2023, I have been working on Expanding the open-source Ecosystem around data.table In R. data.table has become a widely adopted tool for data manipulation tasks, especially in scenarios involving large datasets. Its popularity stems from its remarkable speed and memory efficiency.


I have been working on expanding... -> I think it would be more appropriate to say that you have been working on performance testing, which could be useful for expanding the ecosystem, because it could increase confidence that code contributions are maintaining the efficiency of data.table

tdhock · 2024-02-13T22:45:51Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+data.table is an extension of R's data.frame, designed to handle large datasets efficiently. It provides a syntax that is both concise and expressive, allowing users to perform complex data manipulations with ease. Its efficiency is particularly evident when dealing with tasks like filtering, grouping, aggregating, and joining data.
+
+The development team behind data.table is committed to continuously improving its performance. Over the years, several major version changes have been introduced, aiming to enhance speed and efficiency. These changes include algorithmic optimizations, memory management improvements, and enhancements to parallel processing capabilities. Upgrading to the latest version ensures that users can leverage the most recent performance enhancements.
+ most recent performance enhancements.


please delete repeated most recent performance enhancements

tdhock · 2024-02-13T22:46:18Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+# **Benchmarking for Performance Evaluation**
+
+To evaluate data.table performance, it is essential to employ benchmarking methodologies. The approach I used utilizes the atime_versions function from the atime package, which measures the actual execution time of specific operations. This function allows for accurate comparisons between different versions of the data.table package, by benchmarking against realistic use cases and giving a graphical visualization of the results.


time and memory usage

tdhock · 2024-02-13T22:47:08Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+# **Why do we run Performance Tests on commits?**
+
+Running performance tests on commits helps maintain a high-performance standard for the package, detect and fix performance regressions, optimize code, validate performance improvements, and ensure consistent performance over time. It is an essential practice to deliver a performant and reliable package to end-users.


and to encourage confidence in code contributions from new people

tdhock · 2024-02-13T22:47:33Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+Running performance tests on commits helps maintain a high-performance standard for the package, detect and fix performance regressions, optimize code, validate performance improvements, and ensure consistent performance over time. It is an essential practice to deliver a performant and reliable package to end-users.
+
+# **What are the Performance Tests?**
+The goal of our atime Performance Tests is to gather memory and responsiveness time metrics while simulating the full range of member interactions with the data.table repository


responsiveness? member interactions?

tdhock · 2024-02-13T22:47:59Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+# **What are the Performance Tests?**
+The goal of our atime Performance Tests is to gather memory and responsiveness time metrics while simulating the full range of member interactions with the data.table repository
+
+## In atime code, there are five main parts:


When using atime_versions, there are five main arguments

thanks, updated

tdhock · 2024-02-13T22:48:05Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+## In atime code, there are five main parts:
+
+1. `pkg.path`: This variable represents the path to the package being benchmarked. It specifies the location of the `data.table` package on your system.


variable -> argument

also it should be a git clone, not just the package

tdhock · 2024-02-13T22:48:49Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+2. `N`: This variable determines the number of iterations for the benchmarking process. It is a sequence of numbers that define different data sizes to test the performance of the operation.
+
+3. `setup`: This section contains the setup code for generating the dataset used in the benchmarking process. 


it should depend on N
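
To illustrate the point about `setup` depending on `N`, here is a minimal base-R sketch. It is not the atime implementation; it only shows the same quoted setup code being evaluated once per data size, so the data it creates grows with `N`. The names `setup.code`, `rows.by.N`, and the column names are hypothetical.

```r
# Sketch (assumption: illustrative only, not atime internals).
# The same setup code is evaluated for several data sizes N.
N.values <- as.integer(10^seq(1, 3))
setup.code <- quote({
  set.seed(1)
  DT <- data.frame(
    id3 = sample(N, N, replace = TRUE),
    v3 = runif(N))
})
rows.by.N <- sapply(N.values, function(N) {
  env <- new.env()       # fresh environment for each data size
  env$N <- N             # setup code reads N from here
  eval(setup.code, env)  # creates DT inside env
  nrow(env$DT)
})
print(data.frame(N = N.values, rows = rows.by.N))
```

If the setup code ignored `N`, every data size would time the same data and the resulting curve would be flat.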

tdhock · 2024-02-13T22:50:04Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+3. `setup`: This section contains the setup code for generating the dataset used in the benchmarking process. 
+
+4. `expr`: This section contains the expression that represents the operation being benchmarked. It uses the `data.table::`[.data.table`` syntax to perform the  operation on the dataset. 


please clarify why data.table::[.data.table is necessary

tdhock · 2024-02-13T22:50:45Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+5. `...` : This section specifies the different versions of the data.table packages  that will be tested. It includes three versions: "Before," "Regression," and "Fixed." Each version is associated with a specific commit id.
+
+The result from running the atime versions will be a list of the seconds.limit (numeric input param) and timings (data table of results).


either delete seconds.limit, or explain it better please
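
For context, my understanding is that `seconds.limit` acts as a stopping rule: once the median time for some `N` exceeds the limit, larger values of `N` are not timed. The base-R loop below is a rough sketch of that idea under that assumption; it does not call atime, and the timed operation (`sort`) and the variable names are hypothetical.

```r
# Sketch of a seconds.limit-style stopping rule (assumption: simplified,
# not the atime implementation).
seconds.limit <- 0.01
N.seq <- as.integer(10^seq(1, 7))
done <- list()
for (N in N.seq) {
  x <- runif(N)  # setup: data size depends on N
  seconds <- replicate(5, system.time(sort(x))[["elapsed"]])
  med <- median(seconds)
  done[[length(done) + 1]] <- data.frame(N = N, median.seconds = med)
  if (med > seconds.limit) break  # stop: larger N would be slower still
}
timings <- do.call(rbind, done)
print(timings)
```

The resulting `timings` table covers only a prefix of `N.seq`, which keeps the total run time bounded.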

tdhock · 2024-02-13T22:51:54Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+The result from running the atime versions will be a list of the seconds.limit (numeric input param) and timings (data table of results).
+
+Lastly, I run a github action. The action defines test.list as a list with names corresponding to different tests. Each element of the test.list should be a list with named arguments N, setup, expr, which was  passed as arguments in your atime::atime_versions test. For further elaboration on the process of performing asymptotic time testing using the atime package, please refer to [this ](https://github.com/marketplace/actions/r-asymptotic-testing)


is the github action working already? on which repo did you set it up?

if not, maybe delete?

Not yet, it's still being worked on, so I will delete it for now and write about it later.

tdhock · 2024-02-13T22:52:12Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+Lastly, I run a github action. The action defines test.list as a list with names corresponding to different tests. Each element of the test.list should be a list with named arguments N, setup, expr, which was  passed as arguments in your atime::atime_versions test. For further elaboration on the process of performing asymptotic time testing using the atime package, please refer to [this ](https://github.com/marketplace/actions/r-asymptotic-testing)
+
+## We run the full performance, Pull Request (PR):
+1. Before the issue is made (Before)


issue -> performance regression

tdhock · 2024-02-13T22:52:29Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+## We run the full performance, Pull Request (PR):
+1. Before the issue is made (Before)
+2. when the PR is first submitted (Regression)


performance regression

tdhock · 2024-02-13T22:52:50Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+## We run the full performance, Pull Request (PR):
+1. Before the issue is made (Before)
+2. when the PR is first submitted (Regression)
+3. when the PR is merged to the destination branch(Fixed)


PR which fixes the performance regression

tdhock · 2024-02-13T22:53:28Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+3. when the PR is merged to the destination branch(Fixed)
+
+# **APPROACH**
+1. To begin, conduct the atime test for the different code branches (before regression, regression, fix regression) to identify potential performance issues. Here is an example of how to perform the [atime test](https://github.com/DorisAmoakohene/Efficiency-and-Preformance-Test.RData.table)


Preformance? typo?

Typo, fixed

tdhock · 2024-02-13T22:53:59Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+# **APPROACH**
+1. To begin, conduct the atime test for the different code branches (before regression, regression, fix regression) to identify potential performance issues. Here is an example of how to perform the [atime test](https://github.com/DorisAmoakohene/Efficiency-and-Preformance-Test.RData.table)
+
+NB: Set up the necessary environment and dependencies, ensuring that the data.table package and the atime package are installed and loaded.


please use full English words instead of abbreviations like NB

tdhock · 2024-02-13T22:54:56Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+3. Utilize the atime_versions function to track the fixes across different versions.
+
+4. Pass the following named arguments to atime::atime_versions: N, setup, expr, and the different code branches. More documentation of the atime package can be found [here](https://github.com/tdhock/atime/tree/compare-dt-tidy). 


please change link to main github https://github.com/tdhock/atime instead of branch

tdhock · 2024-02-13T22:55:26Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+4. Pass the following named arguments to atime::atime_versions: N, setup, expr, and the different code branches. More documentation of the atime package can be found [here](https://github.com/tdhock/atime/tree/compare-dt-tidy). 
+
+5. Use the plot function to visually present the execution times of the expression evaluated across different versions of the data.table package.
+Run the GitHub Action by writing tests in inst/atime/tests.R. 


why inst/atime/tests.R here?

deleted. updated.

tdhock · 2024-02-13T22:55:34Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+Run the GitHub Action by writing tests in inst/atime/tests.R. 
+
+
+# Lets run some examples to see how this work.


tdhock · 2024-02-13T22:56:57Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+# Lets run some examples to see how this work.
+
+The first example we will discuss is an issue reported on a performance regression when performing group computations, specifically when running R's C eval on each group (q7 and q8) in the db-benchmark, indicating a  slowness in the implementation of the code.[link to comment that reported Regression](https://github.com/Rdatatable/data.table/issues/4200)


which PR caused the regression?

tdhock · 2024-02-13T22:57:19Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+atime.list.4200 <- atime::atime_versions(
+pkg.path=tdir,
+pkg.edit.fun=function(old.Package, new.Package, sha, new.pkg.path){


please explain what is pkg.edit.fun

tdhock · 2024-02-13T22:57:44Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+png("atime.list.4200.png")
+plot(atime.list.4200)+
+  labs(title = "groupby with dogroups (R expression) performance regression")
+dev.off()


probably delete for blog

tdhock · 2024-02-13T22:59:41Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+3. Red Line (“Before”): Indicates performance before fixing the regression.
+
+4. Green Line (“Fixed”): Shows improved performance after fixing.


please change colors (red and green should not be used on same plot, for color blind people), try https://r-graph-gallery.com/38-rcolorbrewers-palettes.html or https://colorbrewer2.org/#type=qualitative&scheme=Accent&n=3
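
One possible way to pick colors that avoid the red/green pairing is sketched below. It uses the Okabe-Ito palette that ships with base R (4.0.0 or later) rather than the ColorBrewer palettes linked above, so treat it as an alternative, not as what the post ended up using. The name `version.colors` is hypothetical.

```r
# Sketch: a colorblind-friendly palette for the three versions.
# Okabe-Ito is available through grDevices::palette.colors in R >= 4.0.0.
version.names <- c("Before", "Regression", "Fixed")
version.colors <- setNames(
  unname(palette.colors(n = 3, palette = "Okabe-Ito")),
  version.names)
print(version.colors)
```

If the plot is a ggplot object, a named vector like this could be supplied to `ggplot2::scale_color_manual(values = version.colors)`; whether that combines cleanly with the plot method used in the post is an assumption.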

tdhock · 2024-02-13T23:00:23Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+5. Blue Line (“Regression”): Represents an ideal or target performance level.
+
+In both graphs, as data size (N) increases, there’s an initial increase in median time, but it significantly reduces fix, indicating performance improvement. The regression issue was successfully addressed


what are both graphs? isn't there just one with three lines?
please clarify "significantly reduces fix"

tdhock · 2024-02-13T23:01:05Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+```{r,warning = FALSE, message = FALSE}
+atime.list.5366 <- atime::atime_versions(
+pkg.path=tdir,
+pkg.edit.fun=function(old.Package, new.Package, sha, new.pkg.path){


if this is the same pkg.edit.fun, please define it once above and use it in both calls to atime_versions. If it is different, please explain why

It's the same, so I defined it just once.

tdhock · 2024-02-13T23:01:26Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+  "Before"="be2f72e6f5c90622fe72e1c315ca05769a9dc854",
+  "Regression"="e793f53466d99f86e70fc2611b708ae8c601a451", 
+  "Fixed"="58409197426ced4714af842650b0cc3b9e2cb842") 


would be good to explain where these commit ids come from

need comments or explanation in text

tdhock · 2024-02-26T19:00:22Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+# **Why do we run performance tests on commits?**
+
+Running performance tests on commits helps maintain a high-performance standard for the package, detect and fix performance regressions, optimize code, validate performance improvements, ensure consistent performance over time and to encourage confidence in code contributions from new people.


isn't this section redundant with the first section?

please move sections/sentences together if they are very similar

tdhock · 2024-02-26T19:02:04Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+# **Benchmarking for performance evaluation**
+
+To evaluate data.table performance, it is essential to employ benchmarking methodologies. The approach I used utilizes the atime_versions function from the atime package, which measures the actual execution time of specific operations. This function allows for accurate comparisons between different versions of the data.table package, by benchmarking against time and memory usage and giving a graphical visualization of the results.


This sentence, which introduces atime_versions, should be moved closer to the section below where you explain its usage.

tdhock · 2024-02-26T19:03:58Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+5. `expr`: This section contains the expression that represents the operation being benchmarked. It uses the `data.table::`[.data.table`` syntax to perform the  operation on the dataset. 
+
+In the given syntax `data.table::`[.data.table``, the first part `data.table::` installs and loads different versions of the data.table package based on the specified commit IDs. Following that, the expression specified within `[.data.table`` is executed on each installed version. This process is repeated for all the specified commit IDs in the code.


If your code contains backticks, you need to wrap it in one more backtick than it contains:
data.table::`[.data.table`
``data.table::`[.data.table` ``

and you should write that data.table:: will be translated to data.table.SHA1:: for some version hash SHA1
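
The translation described here can be pictured with a plain string substitution. The sketch below is a simplified illustration and not how atime is implemented; the commit SHAs are the ones quoted elsewhere in this thread and serve only as sample strings, and `expr.text`, `versions`, and `translated` are hypothetical names.

```r
# Sketch: how a package prefix could be rewritten per version
# (assumption: illustrative string substitution, not atime internals).
expr.text <- "data.table:::`[.data.table`(DT, , .(v3=mean(v3, na.rm=TRUE)), by=id3)"
versions <- c(
  Before = "be2f72e6f5c90622fe72e1c315ca05769a9dc854",
  Regression = "e793f53466d99f86e70fc2611b708ae8c601a451",
  Fixed = "58409197426ced4714af842650b0cc3b9e2cb842")
translated <- sapply(versions, function(sha) {
  # data.table::: becomes data.table.SHA:::, so each installed copy is distinct
  sub("data.table:::", paste0("data.table.", sha, ":::"), expr.text, fixed = TRUE)
})
cat(translated, sep = "\n")
```

Each resulting expression refers to a differently named package, which is what lets several versions be installed and timed side by side in one R session.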

tdhock · 2024-02-26T19:06:18Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+1. `pkg.path`: This argument specifies the location on your system where you have stored a git clone of the `data.table` package.
+
+2. `pkg.edit.fun`: The default behavior of pkg.edit.fun is designed to work with Rcpp packages and involves replacing instances of "PKG" with "PKG.SHA" in the package code. Any occurrences of the string "PKG" within the package code will be replaced with "PKG.SHA", where "SHA" represents the commit SHA/ids associated with the version being installed.


please explain why we need a more complicated pkg.edit.fun for data.table

Does this explain why we need a more complicated pkg.edit.fun for data.table?

The data.table package needs a complex pkg.edit.fun function due to its use of Rcpp, versioning, and naming considerations. Thus, the pkg.edit.fun function plays a crucial role in addressing these challenges and ensuring the smooth operation of the data.table package.

data.table does not use Rcpp


Anirban166 · 2024-04-09T19:50:46Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

@@ -0,0 +1,249 @@
+---
+title: "Visualizing Performance Regression of data.table with Atime"


I'd have everything without caps except for V / Visualizing since it's the first word

Anirban166 · 2024-04-09T20:00:45Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+```
+data.table.ec1259af1bf13fc0c96a1d3f9e84d55d8106a9a4:::`[.data.table`(DT, , .(v3=mean(v3, na.rm=TRUE)), by=id3, verbose=TRUE)
+


Remove the blank line at the end for each code segment (including the ones below)

Anirban166 · 2024-04-09T20:17:22Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+1. `pkg.path`: This argument specifies the location on your system where you have stored a git clone of the `data.table` package.
+
+2. `pkg.edit.fun`: The default behavior of pkg.edit.fun is designed to work with Rcpp packages and involves replacing instances of "PKG" with "PKG.SHA" in the package code. Any occurrences of the string "PKG" within the package code will be replaced with "PKG.SHA", where "SHA" represents the commit SHA/ids associated with the version being installed.


Use consistent capitalization: 'ids' here but 'IDs' below, like in line 169 (and probably just use 'commit SHA' instead of 'commit SHA/ids').

Also, I agree with Toby that pkg.edit.fun needs more explaining. I do not clearly understand why it is used from that and I would not use 'Thus, the pkg.edit.fun function plays a crucial role in addressing these challenges and ensuring the smooth operation of the data.table package.' without explaining in detail

or in comparison to line 51 and 52 below (ids vs IDs)

Anirban166 · 2024-04-09T20:23:22Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+In this example, the expression `[.data.table` is executed on the `DT` dataset using the specified commit ID (`ec1259af1bf13fc0c96a1d3f9e84d55d8106a9a4`) of the data.table package. The expression calculates the mean of the `v3` column (ignoring missing values) grouped by `id3`, and the `verbose=TRUE` argument enables verbose output during the operation. This process is typically repeated for all commit IDs in your code to compare the performance of different versions of the data.table package.
+
+6. `...` : This section specifies the different versions of the data.table packages that will be tested. It includes three versions: "Before," "Regression," and "Fixed." Each version is associated with a specific commit id.


I'm not sure if this can be labelled as a 'section', as it looks like an ellipsis (used for varying arguments usually) if not just a placeholder for continuation or incomplete code/output (also where is it used below?)

That is how the named versions are defined in ?atime::atime_versions:

... : named versions.

Anirban166 · 2024-04-09T20:24:40Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+1. X-Axis (N): Represents the size of the data (N) on a logarithmic scale.
+
+2.Y-Axis: Represents the median time in milliseconds (logarithmic scale).


Space between '2.' and 'Y-Axis'

Anirban166 · 2024-04-09T20:25:27Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+4.“Fixed”: Shows improved performance after fixing.
+
+5. “Regression”: Represents an ideal or target performance level.


Spaces again for the two above (make it consistent with one space like this line)

Anirban166 · 2024-04-09T20:26:27Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+# Example Two
+
+In the specific case of issue *#5366*, there was a significant slowdown in the performance of data.table's time-based rolling aggregation compared to pandas' rolling aggregation. The cause of this regression was identified to be related to the addition of the snprintf function in the assign.c file. To address this issue, a fix was implemented by creating the targetDesc function and adding the snprintf function in the assign.c file. This fix resolved the regression and improved the performance of the time-based rolling aggregation in data.table. 


I would use backticks for file and function names, e.g. `assign.c`, `snprintf`

Anirban166 · 2024-04-09T20:34:22Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+In summary, the graph visually demonstrates how fixing the regression issue led to improved performance in time-based rolling operations. The Fixed line represents the desired outcome, showing faster processing times for larger sample sizes. 
+
+


Spacing (again)

Overall this post looks better than the other one, but both could do with more attention to small things when writing (especially being consistent). Please convert the PR from draft to regular mode when done with the revisions.

tdhock · 2024-05-01T16:54:07Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+
+# **Conclusion**
+
+In this blog post, we have delved into the use of the atime code to compare the asymptotic time and memory usage of different versions of the data.table package. Specifically, we explored the comparisons between the "Before," "Regression," and "Fixed" versions, as well as different versions implementing the same computation.


please add a description of the github action, and how data.table is now using this to review PRs like this Rdatatable/data.table#5427 (comment)

tdhock · 2024-06-12T03:42:54Z

hi @DorisAmoakohene @Anirban166 can you please revise and tell me when you think it is ready to merge?

DorisAmoakohene · 2024-06-12T19:49:28Z

yes @tdhock, this is ready for review

tdhock · 2024-06-13T15:08:28Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

 In summary, the graph visually demonstrates how fixing the regression issue led to improved performance in time-based rolling operations. The Fixed line represents the desired outcome, showing faster processing times for larger sample sizes. 

+# **Github Action**
+The data.table project has implemented a GitHub Action to automatically review pull requests.


review -> run performance tests

tdhock · 2024-06-13T15:09:04Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

 In summary, the graph visually demonstrates how fixing the regression issue led to improved performance in time-based rolling operations. The Fixed line represents the desired outcome, showing faster processing times for larger sample sizes. 

+# **Github Action**
+The data.table project has implemented a GitHub Action to automatically review pull requests.
+The process is automated using a GitHub action implemented by @anirban166. This action runs the "atime" package for every pull request and generates plots of the results in a comment within the pull request. [See an example in this pull request](https://github.com/Rdatatable/data.table/pull/5427#issuecomment-2075471806)


"atime" -> `atime`

tdhock · 2024-06-13T15:09:39Z

posts/2024-01-31-Performance-Doris_Amoakohene/index.qmd

+# **Github Action**
+The data.table project has implemented a GitHub Action to automatically review pull requests.
+The process is automated using a GitHub action implemented by @anirban166. This action runs the "atime" package for every pull request and generates plots of the results in a comment within the pull request. [See an example in this pull request](https://github.com/Rdatatable/data.table/pull/5427#issuecomment-2075471806)
+This action allows the maintainers to easily determine if a pull request has any impact on the time or memory usage of the data.table package. To learn more you can visit [Anirban page](https://github.com/Anirban166/Autocomment-atime-results) or this [link](https://github.com/tdhock/atime?tab=readme-ov-file#github-action-for-continuous-performance-testing)


Anirban page -> Anirban's documentation?

tdhock

hi @DorisAmoakohene please proof-read one more time from start to finish
then you can ask Kelly for review / merge

DorisAmoakohene · 2024-06-20T20:03:32Z

@tdhock @Anirban166

hi @DorisAmoakohene please proof-read one more time from start to finish then you can ask Kelly for review / merge

alright

DorisAmoakohene · 2024-06-21T15:06:18Z

@kbodwin Could you please review this blog for merge? This blog provides a performance testing analysis using atime on the performance of different versions of the data.table package.

DorisAmoakohene · 2024-08-01T15:18:41Z

@kbodwin Could you please review this blog for merge? This blog provides a performance testing analysis using atime on the performance of different versions of the data.table package.

DorisAmoakohene and others added 2 commits January 31, 2024 09:25

copied an index.qmd file

0e0d4da

Update index.qmd

b946795

tdhock reviewed Feb 13, 2024


Update index.qmd

429a8b4

tdhock reviewed Feb 26, 2024


DorisAmoakohene added 8 commits February 26, 2024 17:30

Update index.qmd

c1ef562

Update index.qmd

451e160

Update index.qmd

ed48531

Create Performance Comparison: Data.table in R and Pandas in Python

ecb325c

Create index.qmd

e2c63ad

Create index.qmd

e32f245

Delete posts/2024-03-05-Performance Comparison: Data.table in R and P…

6d7028a

…andas in Python directory

Delete posts/Performance Comparison: Data.table in R and Pandas in Py…

469c491

…thon

Anirban166 reviewed Apr 9, 2024


Update index.qmd

e19c952

tdhock reviewed May 1, 2024


Update index.qmd

e68175e

DorisAmoakohene marked this pull request as ready for review June 12, 2024 19:48

tdhock reviewed Jun 13, 2024


Update index.qmd

a2a1398

tdhock approved these changes Jun 20, 2024


Update index.qmd

eec10a7

kbodwin changed the base branch from main to dev August 12, 2024 17:35

kbodwin merged commit f0d6c89 into rdatatable-community:dev Aug 12, 2024


		# Benchmarking for Performance Evaluation

		To evaluate data.table performance, it is essential to employ benchmarking methodologies. The approach I used utilizes the atime_versions function from the atime package, which measures the actual execution time of specific operations. This function allows for accurate comparisons between different versions of the data.table package, by benchmarking against realistic use cases and giving a graphical visualization of the results.


		# Why do we run Performance Tests on commits?

		Running performance tests on commits helps maintain a high-performance standard for the package, detect and fix performance regressions, optimize code, validate performance improvements, and ensure consistent performance over time. It is an essential practice to deliver a performant and reliable package to end-users.


		## In atime code, there are five main parts:

		1. `pkg.path`: This variable represents the path to the package being benchmarked. It specifies the location of the `data.table` package on your system.


		2. `N`: This variable determines the number of iterations for the benchmarking process. It is a sequence of numbers that define different data sizes to test the performance of the operation.

		3. `setup`: This section contains the setup code for generating the dataset used in the benchmarking process.


		3. `setup`: This section contains the setup code for generating the dataset used in the benchmarking process.

		4. `expr`: This section contains the expression that represents the operation being benchmarked. It uses the `data.table::`[.data.table`` syntax to perform the operation on the dataset.


		5. `...` : This section specifies the different versions of the data.table packages that will be tested. It includes three versions: "Before," "Regression," and "Fixed." Each version is associated with a specific commit id.

		The result from running the atime versions will be a list of the seconds.limit (numeric input param) and timings (data table of results).


		The result from running the atime versions will be a list of the seconds.limit (numeric input param) and timings (data table of results).

		Lastly, I run a github action. The action defines test.list as a list with names corresponding to different tests. Each element of the test.list should be a list with named arguments N, setup, expr, which was passed as arguments in your atime::atime_versions test. For further elaboration on the process of performing asymptotic time testing using the atime package, please refer to [this ](https://github.com/marketplace/actions/r-asymptotic-testing)


		3. Utilize the atime_versions function to track the fixes across different versions.

		4. Pass the following named arguments to atime::atime_versions: N, setup, expr, and the different code branches. More documentation of the atime package can be found [here](https://github.com/tdhock/atime/tree/compare-dt-tidy).

		Run the GitHub Action by writing tests in inst/atime/tests.R.


		# Lets run some examples to see how this work.


		# Lets run some examples to see how this work.

		The first example we will discuss is an issue reported on a performance regression when performing group computations, specifically when running R's C eval on each group (q7 and q8) in the db-benchmark, indicating a slowness in the implementation of the code.[link to comment that reported Regression](https://github.com/Rdatatable/data.table/issues/4200)

		3. Red Line (“Before”): Indicates performance before fixing the regression.

		4. Green Line (“Fixed”): Shows improved performance after fixing.


		5. Blue Line (“Regression”): Represents an ideal or target performance level.

		In both graphs, as data size (N) increases, there’s an initial increase in median time, but it significantly reduces fix, indicating performance improvement. The regression issue was successfully addressed


		5. `expr`: This section contains the expression that represents the operation being benchmarked. It uses the `data.table::`[.data.table`` syntax to perform the operation on the dataset.

		In the given syntax `data.table::`[.data.table``, the first part `data.table::` installs and loads different versions of the data.table package based on the specified commit IDs. Following that, the expression specified within `[.data.table`` is executed on each installed version. This process is repeated for all the specified commit IDs in the code.


		1. `pkg.path`: This argument specifies the location on your system where you have stored a git clone of the `data.table` package.

		2. `pkg.edit.fun`: The default behavior of pkg.edit.fun is designed to work with Rcpp packages and involves replacing instances of "PKG" with "PKG.SHA" in the package code. Any occurrences of the string "PKG" within the package code will be replaced with "PKG.SHA", where "SHA" represents the commit SHA/ids associated with the version being installed.

		@@ -0,0 +1,249 @@
		---
		title: "Visualizing Performance Regression of data.table with Atime"


		```
		data.table.ec1259af1bf13fc0c96a1d3f9e84d55d8106a9a4:::`[.data.table`(DT, , .(v3=mean(v3, na.rm=TRUE)), by=id3, verbose=TRUE)


		In this example, the expression `[.data.table` is executed on the `DT` dataset using the specified commit ID (`ec1259af1bf13fc0c96a1d3f9e84d55d8106a9a4`) of the data.table package. The expression calculates the mean of the `v3` column (ignoring missing values) grouped by `id3`, and the `verbose=TRUE` argument enables verbose output during the operation. This process is typically repeated for all commit IDs in your code to compare the performance of different versions of the data.table package.

		6. `...` : This section specifies the different versions of the data.table packages that will be tested. It includes three versions: "Before," "Regression," and "Fixed." Each version is associated with a specific commit id.


		1. X-Axis (N): Represents the size of the data (N) on a logarithmic scale.

		2. Y-Axis: Represents the median time in milliseconds (logarithmic scale).


		4. “Fixed”: Shows improved performance after fixing.

		5. “Regression”: Shows the degraded performance of the version in which the regression was introduced.
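To make the two axes concrete, here is a small base R sketch that computes a median time for several data sizes N. It is only an illustration of the quantities being plotted, not how atime measures them: `median_seconds` is a hypothetical helper written for this example, and sorting random numbers stands in for the data.table operation.

```r
# Hypothetical helper (not part of atime): median elapsed seconds of fun(N) over several runs.
median_seconds <- function(fun, N, times = 5) {
  elapsed <- vapply(seq_len(times), function(i) {
    system.time(fun(N))[["elapsed"]]
  }, numeric(1))
  median(elapsed)
}

# Data sizes spaced evenly on a log scale, as on the X-axis.
sizes <- 10^(3:5)

timings <- data.frame(
  N = sizes,
  # Median time converted to milliseconds, as on the Y-axis.
  median_ms = 1000 * vapply(
    sizes,
    function(N) median_seconds(function(n) sort(runif(n)), N),
    numeric(1)
  )
)
timings$log10_N <- log10(timings$N)
print(timings)
```

Running the same measurement once per version and drawing one line per version would give a picture of the same kind as the graphs discussed here.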


		# Example Two

		In the specific case of issue #5366, there was a significant slowdown in the performance of data.table's time-based rolling aggregation compared to pandas' rolling aggregation. The cause of this regression was traced to the addition of a call to the snprintf function in the assign.c file. To address this issue, a fix was implemented by creating the targetDesc function in assign.c and moving the snprintf call into it. This fix resolved the regression and improved the performance of the time-based rolling aggregation in data.table.


		In summary, the graph visually demonstrates how fixing the regression issue led to improved performance in time-based rolling operations. The Fixed line represents the desired outcome, showing faster processing times for larger sample sizes.


		# Conclusion

		In this blog post, we have used the atime package to compare the asymptotic time and memory usage of different versions of the data.table package. Specifically, we compared the "Before," "Regression," and "Fixed" versions, as well as different versions implementing the same computation.