Description
This is based on an issue found by @bhive01 and discussed with him
and @clauswilke in wilkelab/gridtext#9.
Running the code included at the end in R versions up through 4.0.0
with the C stack set to 1M will segfault with a C stack overflow. The
same happens (for me on Ubuntu at least) with the default 8M stack if
you change the value of N to 1000.
Running this code as is (i.e. N == 500) and with an 8M stack will not
fail, but takes 14 minutes due to several multi-minute-long pauses in
the second loop.
The problem is that this code, via calls to R_PreserveObject
in
Rcpp, puts around 300K objects into the preserved object list. This is
a simple linked list, so has to be searched linearly to remove objects
pushed on early. That is the reason for the long pauses. The segfault
happens because the code for R_ReleaseObject
was using recursion to
work down the list. R-devel and R-patched now use iteration and no
longer segfault, but that doesn't help anyone running R 4.0.0 or
earlier.
The R Extensions Manual says about the
R_PreserveObject
/R_ReleaseObject
mechanism:
It is less efficient than the normal protection mechanism, and
should be used sparingly.
The linked list data structure, and even the recursive delete, are
reasonable for this usage but are not suited for handling 300K objects
that may need to be removed in random or first in/first-out order.
Assuming you need to keep alive R objects, a better way to do this is
to create an appropriate data structure that you control, and protect
only that data structure with R_PreserveObject
(and maybe release it
on unload with R_ReleseObject
). That way you can use a more
appropriate data structure, such as a hash table. Using a hash table
the code that currently takes 14 minutes would only take about 1.5
minutes. The improvement is even larger for N == 1000.
Even if you don't want to go to a hash table now it would be a good
idea to switch to using your own list so users of R 4.0.0 or earlier
don't run the risk of a segfault.
For using your own list you need something like this
static SEXP Rcpp_precious = NULL;
void R_init_Rcpp(DllInfo* dllinfo) {
Rcpp_precious = CONS(R_NilValue, R_NilValue);
R_PreserveObject(Rcpp_precious);
...
}
Then replace R_PreserveObject
and R_ReleaseObject
with
void Rcpp_PreserveObject(SEXP object)
{
SETCDR(Rcpp_precious, CONS(object, CDR(Rcpp_precious)));
}
void Rcpp_ReleaseObject(SEXP object)
{
SETCDR(Rcpp_precious, DeleteFromList(object, CDR(Rcpp_precious)));
}
where (this is now in R_devel/R_patched)
static SEXP DeleteFromList(SEXP object, SEXP list)
{
if (CAR(list) == object)
return CDR(list);
else {
SEXP last = list;
for (SEXP head = CDR(list); head != R_NilValue; head = CDR(head)) {
if (CAR(head) == object) {
SETCDR(last, CDR(head));
return list;
}
else last = head;
}
return list;
}
}
This maintains your own list, which is protected by being in the CDR
field of the protected Rcpp_precious
list cell. This will give you
the same reliability and performance across all R versions. [Code is
not tested, so there may be typos, but that is design I recommend].
If you want better performance for a use pattern as exhibited in the
example here you could look at using a hash table stored in a VECSXP
placed in one of the cells of Rcpp_precious
. In a quick and dirty
experiment a simple hash table did much better, but the code is of
course more complicated.
All that said, I have a feeling that it should be possible to do
better by using a combination of the protected fields in external
pointer objects and weak references. If you have a document describing
the design of your memory management I could take a look and see if I
can suggest something.
Here is the code to run, from @clauswilke in wilkelab/gridtext#9.
library(ggplot2)
library(ggtext)
library(grid)
plot_grob <- function(df) {
p <- ggplot(data.frame(x = 1, y = 1)) +
geom_point(aes(x, y)) +
labs(caption = "<br>Pink dots represent outliers and are removed from downstream analyses.<br>Error bars represent the standard error of the mean.<br>ANOVA model significance *p* < 0.05.<br>Treatments with the same letter are not significantly different at *α* = 0.05 according to Tukey's HSD. ") +
theme(plot.caption = element_textbox_simple())
ggplotGrob(p)
}
N <- 500
l <- list()
for (i in 1:N) {
cat(i, " ")
g <- plot_grob()
grid.newpage()
grid.draw(g)
l[[i]] <- g
}
l <- NULL
l <- list()
for (i in 1:N) {
cat(i, " ")
g <- plot_grob()
grid.newpage()
grid.draw(g)
l[[i]] <- g
}