memrecycle no snprintf overhead #5463

Merged
mattdowle merged 7 commits into master from memrecycle-no-snprintf on Oct 14, 2022

Member

@jangorecki jangorecki commented Sep 14, 2022

closes #5424, closes #5366, closes #5371

issue:  1.14.2 vs  master vs  branch
#5424:  0.939s vs  1.725s vs  0.958s
#5366: 14.786s vs 26.515s vs 14.621s
#5371:  0.058s vs  0.050s vs  0.025s

code:

library(data.table)
set.seed(123L)
n = 1e4L
dt = data.table(id = seq_len(n),
                val = rnorm(n))

system.time(
  dt[, .(vs = (sum(val))), by = .(id)]
)


library(data.table)
N <- 1e8
n <- 1e6
set.seed(1L)
dt <- data.table(g = sample(seq_len(n), N, TRUE),
                 x = runif(N),
                 key = "g")
dt_mod <- copy(dt)
system.time(
  dt_mod[, N := .N, by = g]
)


library(data.table)
set.seed(1)
n <- 1e6
d1 <- abs(rnorm(n, sd = 4))
d2 <- as.integer(cumsum(d1))
tm <- as.POSIXct("2020-01-01 09:30:00") + d2
nIds <- 3
tmCol <- rep(tm, nIds)
idCol <- rep(c("a", "b", "c"), n)
f1 <- function() {
    dt <- data.table(tm = tmCol, v = 1, id = idCol)
    dt[, tm1 := tm - 40]
    dt[, tm2 := tm]
    dt[, rowNum := .I]
    dt[dt, .(vs = sum(v)), on = .(id, rowNum <= rowNum, tm >= tm1, tm < tm2), by = .EACHI]
}
system.time(
  f1()
)

@codecov

codecov bot commented Sep 14, 2022

Codecov Report

Merging #5463 (5840919) into master (052f8da) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #5463   +/-   ##
=======================================
  Coverage   99.51%   99.51%           
=======================================
  Files          80       80           
  Lines       14771    14774    +3     
=======================================
+ Hits        14699    14702    +3     
  Misses         72       72           
Impacted Files Coverage Δ
src/assign.c 99.85% <100.00%> (+<0.01%) ⬆️

Contributor

@ColeMiller1 ColeMiller1 left a comment

Thanks, Jan. A simple yet effective approach to roll back the performance regression.

@ben-schwen
Member

Why do we not need targetDesc in the CHECK_RANGE macro?

data.table/src/assign.c

Lines 867 to 881 in ade6e57

#define CHECK_RANGE(STYPE, RFUN, COND, FMT, TO, FMTVAL) {{ \
  const STYPE *sd = (const STYPE *)RFUN(source); \
  for (int i=0; i<slen; ++i) { \
    const STYPE val = sd[i+soff]; \
    if (COND) { \
      const char *sType = sourceIsI64 ? "integer64" : type2char(TYPEOF(source)); \
      const char *tType = targetIsI64 ? "integer64" : type2char(TYPEOF(target)); \
      snprintf(memrecycle_message, MSGSIZE, \
        "%"FMT" (type '%s') at RHS position %d "TO" when assigning to type '%s' (%s)", \
        FMTVAL, sType, i+1, tType, targetDesc); \
      /* string returned so that rbindlist/dogroups can prefix it with which item of its list this refers to */ \
      break; \
    } \
  } \
} break; }

Or is it created in every possible branch beforehand?

@jangorecki
Member Author

The code 12 lines above is still in effect for the macro. A macro doesn't have its own scope; it just substitutes code into the existing scope, so the code 12 lines above is still used there.
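To illustrate the point about macro scope, here is a minimal, self-contained sketch. The names REPORT_IF_NEGATIVE, check_value, msg and desc are hypothetical and not taken from assign.c; the sketch only shows that a macro body can refer to variables declared at the place where the macro is expanded.

```c
#include <stdio.h>

#define MSGSIZE 128

/* Hypothetical macro: it uses `msg` and `desc` without declaring them, so it
   only compiles where the expansion site already has both names in scope. */
#define REPORT_IF_NEGATIVE(VAL) do { \
  if ((VAL) < 0) \
    snprintf(msg, MSGSIZE, "%d is negative (%s)", (VAL), desc); \
} while (0)

/* Returns a pointer to a static message buffer; an empty string means "no problem". */
const char *check_value(int val, const char *desc) {
  static char msg[MSGSIZE];
  msg[0] = '\0';
  REPORT_IF_NEGATIVE(val);  /* the expansion picks up `msg` and `desc` from this function */
  return msg;
}
```

Calling check_value(-3, "target vector") fills the buffer through the macro, while check_value(5, "x") leaves it empty. Moving the macro invocation to a function that lacks msg or desc would be a compile error, which is the sense in which the macro depends on the code above it.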

@MichaelChirico
Member

great stuff!

can we blow up the examples for the fixes to #5424 and #5371? I worry about benchmarks lasting <1s

src/assign.c Outdated
@@ -845,6 +851,8 @@ const char *memrecycle(const SEXP target, const SEXP where, const int start, con
warning(_("Coercing 'list' RHS to '%s' to match the type of %s."), type2char(TYPEOF(target)), targetDesc);
source = PROTECT(coerceVector(source, TYPEOF(target))); protecti++;
} else if ((TYPEOF(target)!=TYPEOF(source) || targetIsI64!=sourceIsI64) && !isNewList(target)) {
static char targetDesc[501];
snprintf(targetDesc, 500, colnum==0 ? _("target vector") : _("column %d named '%s'"), colnum, colname);
Member

shouldn't this one be declared inside the GetVerbose()>=3 branch?

Member Author

I think the macro below needs it as well.

Member

But the macro only needs it inside its if(COND), so isn't this snprintf() indeed wasteful? It could go inside this GetVerbose()>=3 branch and also appear inside the CHECK_RANGE macro, within its if(COND). Not seeing the snprintf inside CHECK_RANGE is why I wrote #5463 (review). Wouldn't it read better if targetDesc were populated right there in CHECK_RANGE, next to where it is used?

Member Author

That's correct. I have now moved it inside the if branch.

@MichaelChirico
Member

Hey Jan, IIUC the issue is we are declaring a string buffer inside memrecycle, but it's inefficient because memrecycle might get called many (100s?) of times, so the fix is to only declare the buffer just as it's needed. This makes a bit of code duplication because we can't just re-use the one declaration of the buffer, but increases efficiency since the buffer is not typically needed.

Is that right? It would be helpful to summarize your fix in the PR description, this will especially make it easier when referring back to the PR in the future.

src/assign.c Outdated
@@ -730,6 +728,8 @@ const char *memrecycle(const SEXP target, const SEXP where, const int start, con
for (int i=0; i<slen; ++i) {
const int val = sd[i+soff];
if ((val<1 && val!=NA_INTEGER) || val>nlevel) {
static char targetDesc[501]; // from 1.14.1 coerceAs reuses memrecycle for a target vector, PR#4491
Member

maybe wrap this into a helper function to save some duplication?

Member Author

Could be, but I am fine either way

@jangorecki
Member Author

jangorecki commented Sep 19, 2022

Yes, you are correct. Personally, I could even remove this declaration altogether by making the verbose message more obscure.
You are welcome to follow up with improvements here, but keeping the change minimal is also a plus.
I couldn't really reproduce the timing issue in #5371. As for the other two, the examples could be bigger, but this is good enough to confirm that the regression got fixed.

@mattdowle mattdowle added this to the 1.14.5 milestone Oct 12, 2022
Member

@mattdowle mattdowle left a comment

Great that you identified this, all the testing and benchmarks fit with the fix.

However, regarding the thought that it's the allocation?

> Is that right?
> Yes, you are correct.

No, that's wrong. The allocation is declared static. Please look up what that means. We do that quite a lot in data.table and it needs to be understood. Copying the declaration down and repeating it conveys to readers of the code that the static keyword isn't understood, so it needs to be put back at the top just once. The slowdown will all be in the snprintf call itself.
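A minimal sketch of that distinction, under stated assumptions: describe_eager, describe_lazy and the format_calls counter are hypothetical names used only for illustration, not code from assign.c. Both functions use a static buffer, whose storage exists for the whole program run rather than being set up on each call, so the only per-call work that differs between them is whether snprintf runs.

```c
#include <stdio.h>

/* Hypothetical counter so the example can show how often formatting runs. */
static int format_calls = 0;

/* Eager: formats on every call, even when the text is never used. */
const char *describe_eager(int colnum, const char *colname, int problem) {
  static char desc[501];  /* static storage duration: not re-created per call */
  snprintf(desc, sizeof desc, "column %d named '%s'", colnum, colname);
  format_calls++;
  return problem ? desc : NULL;
}

/* Lazy: same static buffer, but formatting happens only on the rare path. */
const char *describe_lazy(int colnum, const char *colname, int problem) {
  static char desc[501];
  if (!problem)
    return NULL;          /* common path: no snprintf at all */
  snprintf(desc, sizeof desc, "column %d named '%s'", colnum, colname);
  format_calls++;
  return desc;
}
```

If either function is called once per group with problem set to 0, the eager version pays for one snprintf per call while the lazy version pays for none. The static declaration itself is the same in both.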

> Personally, I could even remove this declaration altogether by making the verbose message more obscure.

Yes, remove the declaration and snprintf, agreed, but don't make the message more obscure. We could repeat the ternary in each of the messages, or wrap that ternary in a macro. With the ternary, I think I was trying to match previous messages. Leaving implementation details aside, do we agree that the column %d named '%s' part is better and more useful to provide to the user when we can? It would be a shame to lose that part.

@jangorecki
Member Author

jangorecki commented Oct 12, 2022

Yes, static wasn't understood.

As for removing snprintf

snprintf(targetDesc, 500, colnum==0 ? _("target vector") : _("column %d named '%s'"), colnum, colname);
error(_("Assigning factor numbers to %s. But %d is outside the level range [1,%d]"), targetDesc, val, nlevel);

If colnum were a string, we could set it to "", together with colname="", and then turn "target vector" into "target vector%s%s" (so the two %s do nothing here). But as colnum is not a string, we cannot really do that, and we are left with a variable number of values to substitute into the string. That leads to either a temporary targetDesc object, or branching at the top and duplicating the error() call.


I started removing targetDesc and the extra snprintf calls, but it doesn't look better IMO. The code is much less clear with those big ternary operators everywhere.

colnum==0 ? Rprintf(_("Zero-copy coerce when assigning '%s' to '%s' target vector.\n"),
                    sourceIsI64 ? "integer64" : type2char(TYPEOF(source)),
                    targetIsI64 ? "integer64" : type2char(TYPEOF(target))) :
            Rprintf(_("Zero-copy coerce when assigning '%s' to '%s' column %d named '%s'.\n"),
                    sourceIsI64 ? "integer64" : type2char(TYPEOF(source)),
                    targetIsI64 ? "integer64" : type2char(TYPEOF(target)),
                    colnum, colname);

colnum==0 ? snprintf(memrecycle_message, MSGSIZE, \
                     "%"FMT" (type '%s') at RHS position %d "TO" when assigning to type '%s' (target vector)", \
                     FMTVAL, sType, i+1, tType) : \
            snprintf(memrecycle_message, MSGSIZE, \
                     "%"FMT" (type '%s') at RHS position %d "TO" when assigning to type '%s' (column %d named '%s')", \
                     FMTVAL, sType, i+1, tType, colnum, colname); \

As the performance regression is fixed, maybe we can leave it as is? The declaration of static char targetDesc has been taken out of memrecycle.

Member

@mattdowle mattdowle left a comment

[ This comment is supposed to be attached to the targetDesc inside the CHECK_RANGE definition. GH won't give me "+" to comment on that line because it's in a block that hasn't changed it seems? ]

How does this targetDesc get populated now? If I'm right that it doesn't, does that mean we're missing a test or the test is insufficient?

@jangorecki
Member Author

snprintf calls are still there to populate it.

@mattdowle
Member

mattdowle commented Oct 13, 2022

Btw, the commit message "static declares once" would have been better as "declare static once". Commit messages are supposed to describe what the commit does, and if I look back at that one in the future, I won't know what it means. Perhaps in your mind you're describing a fact in the commit message, whereas I'm looking for a description of what the commit does. Or maybe it was just English.

@mattdowle
Member

mattdowle commented Oct 13, 2022

I see now how that targetDesc in CHECK_RANGE gets populated, thanks. (Aside: maybe the #define CHECK_RANGE should be indented so that it's clearer that it's part of that if() branch.) But I have now reopened the comment above and added to it further up: #5463 (comment). I'm adding this new comment to be sure you see the reopened one.

@mattdowle
Member

mattdowle commented Oct 14, 2022

Assuming I'm right this time that that snprintf should be moved inside those if()s, then to avoid confusion in the future and to avoid the need to repeat the snprintf and in the correct way too, wherever targetDesc is passed as an argument currently to error()/warning()/Rprintf()/snprintf(), it could be a function call targetDesc() instead. That function call would populate the targetDesc storage and return targetDesc. That way there can be no doubt that the snprintf() to make targetDesc is only computed when needed, and is also consistent across all usage.
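The helper suggested above could look roughly like the following sketch. It is not the code that was merged: the names target_desc and check_level are hypothetical, the _() translation wrapper and R's error() are left out so that the example compiles on its own, and an if/else replaces the ternary on the format string.

```c
#include <stdio.h>

/* Formats the description on demand into static storage and returns it.
   The returned pointer stays valid until the next call overwrites the buffer. */
const char *target_desc(int colnum, const char *colname) {
  static char buf[501];
  if (colnum == 0)
    snprintf(buf, sizeof buf, "target vector");
  else
    snprintf(buf, sizeof buf, "column %d named '%s'", colnum, colname);
  return buf;
}

/* Hypothetical caller: target_desc() is evaluated only when the check fails,
   so the common path never pays for the formatting. Returns 1 if val is in range. */
int check_level(int val, int nlevel, int colnum, const char *colname,
                char *msg, size_t msgsize) {
  if (val < 1 || val > nlevel) {
    snprintf(msg, msgsize,
             "Assigning factor numbers to %s. But %d is outside the level range [1,%d]",
             target_desc(colnum, colname), val, nlevel);
    return 0;
  }
  return 1;
}
```

Because the description is produced by a function call in the argument list of the message, it cannot be read before it has been populated, and every message that needs it builds it the same way.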

@mattdowle mattdowle added the dev label Oct 14, 2022
@jangorecki
Member Author

I pushed an improvement for now to address your comment about the if condition in CHECK_RANGE.
targetDesc() is a good idea, but I won't have time to touch that until Monday.

@mattdowle
Member

Thanks Jan. I did the targetDesc(). All good?

@mattdowle mattdowle merged commit 19b7866 into master Oct 14, 2022
@mattdowle mattdowle deleted the memrecycle-no-snprintf branch October 14, 2022 19:41
@tdhock tdhock restored the memrecycle-no-snprintf branch September 26, 2023 17:31
@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
Anirban166 added a commit to Anirban166/data.table that referenced this pull request Mar 19, 2024
Anirban166 added a commit to Anirban166/data.table that referenced this pull request Apr 6, 2024
GitHub Action + atime test to observe the performance regression introduced by PR Rdatatable#4491 and fixed by PR Rdatatable#5463