[Bug] R CI flaky #2906

ryan-williams · 2024-08-16T17:14:26Z

Update: #2926 seems to have mitigated the failures significantly (though brute-force re-running R CI on that PR still failed 4 of 22 attempts: #5814, #5815).

R CI currently shows mostly successes, though run #5864 (from #2919, commit 9cd956f, which includes #2926) failed.

We can continue using this issue to track failures as we see them. My understanding is we don't expect #2926 to have actually fixed the issue, it has just made the failures less frequent.

@jp-dark also flagged recently that ASAN was inadvertently broken in #2773; fixing that (and running it as part of CI) may help root-cause the issue here. Related internal tracking: sc-34306, sc-53780.

Previous write-up

r-ci.yml has been failing on main since mid-June.

The first 4 failures (#5418, #5419, #5422, #5423; Jun 17-18) began with #2732
#5426 and #5427 then passed (Jun 19-20)
Since #5442 (Jun 28), all 81 runs have failed (as of 2024-08-26T09:01:06Z: #14)

Example output

The last output is typically 70 | ScalarMap, before a segfault:

✔ |         70 | ScalarMap                                                      

 *** caught segfault ***
address 0x1019, cause 'memory not mapped'
An irrecoverable exception occurred. R is aborting now ...
/home/runner/work/_temp/c0e3d832-8dfe-4[52](https://github.com/single-cell-data/TileDB-SOMA/actions/runs/9713375174/job/26810031898#step:14:53)3-ac62-7a65da1f7dad.sh: line 1:  9642 Segmentation fault      (core dumped) Rscript testthat.R
Error: Process completed with exit code 139.

The segfault messages vary; memory not mapped, bad_function_call, invalid permissions, address (nil), cause 'unknown'. ryan-williams/tdbs-r-ci-dbg contains some analysis of the logs of failing runs, including frequencies of various error messages.

Breakdown by error message

Updated at 2024-08-29T09:01:20Z (by #17)

"memory not mapped"

Seen 52x, most recently #5816 (at 2024-08-23T03:41):

 *** caught segfault ***
address <ADDRESS>, cause 'memory not mapped'
An irrecoverable exception occurred. R is aborting now ...
/home/runner/work/_temp/<UUID>.sh: line <LINE>:  <ID> Segmentation fault      (core dumped)
##[error]Process completed with exit code 139.

All runs

#5419, #5422, #5423, #5442, #5444, #5449, #5451, #5452, #5464, #5476, #5486, #5492, #5512, #5521, #5523, #5537, #5538, #5552, #5559, #5563, #5571, #5582, #5586, #5588, #5602, #5603, #5604, #5605, #5618, #5621, #5623, #5634, #5635, #5640, #5641, #5644, #5646, #5648, #5686, #5690, #5693, #5696, #5698, #5706, #5711, #5727, #5736, #5743, #5784, #5785, #5808, #5816

"bad_function_call"

Seen 12x, most recently #5820 (at 2024-08-23T16:04):

terminate called after throwing an instance of 'std::bad_function_call'
  what():  bad_function_call
/home/runner/work/_temp/<UUID>.sh: line <LINE>:  <ID> Aborted                 (core dumped)
##[error]Process completed with exit code 134.

All runs

#5467, #5527, #5560, #5615, #5637, #5654, #5673, #5707, #5738, #5764, #5786, #5820

"invalid permissions"

Seen 12x, most recently #5790 (at 2024-08-20T13:45):

 *** caught segfault ***
address <ADDRESS>, cause 'invalid permissions'
An irrecoverable exception occurred. R is aborting now ...
/home/runner/work/_temp/<UUID>.sh: line <LINE>:  <ID> Segmentation fault      (core dumped)
##[error]Process completed with exit code 139.

All runs

#5418, #5459, #5518, #5556, #5568, #5616, #5688, #5714, #5724, #5729, #5737, #5790

"address (nil), cause 'unknown'"

Seen 7x, most recently #5787 (at 2024-08-19T16:45):

 *** caught segfault ***
address (nil), cause 'unknown'
An irrecoverable exception occurred. R is aborting now ...
/home/runner/work/_temp/<UUID>.sh: line <LINE>:  <ID> Segmentation fault      (core dumped)
##[error]Process completed with exit code 139.

All runs

#5511, #5532, #5550, #5657, #5746, #5750, #5787

Occasional successful runs

A few runs have passed on PR branches (e.g. 1, 2). However, I tried re-running the first one, and it failed, so it seems like it's just non-deterministic, and usually fails. The dark/images branch did pass 3x in a row, 8/8-11.

Plan for debugging / fixing?

Several folks on the team have made some progress debugging it, but a fix remains elusive.

From discussions with @johnkerl, ongoing work to remove our dependency on TileDB-R may resolve it.

Change workflow to 24hr cron?

It's been causing "spurious" ❌'s on PRs (and main) for 2 months, which can mask other failures.

Disabling it altogether feels too aggressive; it's good to have some visibility into the state of the R tests.

Setting it to run against main every 24hrs might be a good compromise.

The text was updated successfully, but these errors were encountered:

johnkerl · 2024-08-16T17:43:46Z

I want to acknowledge significant work over time by @nguyenv @mojaveazure @eddelbuettel and myself here. There has been significant investigation in this direction. I would note that these tend to be CI-only failures, not occurring in sandbox development, and thus more difficult to debug. Non-trivial time investment has already been done; this is a well-thought-out proposal.

I also add that the end goal of CI is to find bugs before users do. Meanwhile the failures being tracked here are CI-only, not affecting end users, which not only makes them difficult to debug -- they function as a distraction from our true goals which are (a) delivering reliable software to our end users, and (b) completing the refactor work which we have reasonable confidence will resolve these CI-only failures.

The proposal to temporarily move to a 24-hour cadence is that -- temporary -- and we will revisit once the R/cppification work is complete.

eddelbuettel · 2024-08-17T12:04:29Z

There is a Heisenbug that has driven a few of us to madness. But when you e.g. do (in the apis/r directory)

cd tests
CI=true Rscripts testthat.R

then all 3500+ tests pass. No skip, no fail. But it is frail, there is a possible interaction with the testrunner software (the widely-used testthat package, but it does some things behind the scence so @mojaveazure and I have planned to eventually replace it with tinytest (which eg tiledb-r uses, and which has a simpler setup). So when developing we adhere to a testing process. But CI is has been throwing a consistent spanner for years as also seen in the fact that macOS runs fewer tests as does coverage.

ryan-williams · 2024-08-21T15:13:41Z

I ran the following on an Ubuntu EC2 instance (m6a.4xlarge, AMI ami-0b33ebbed151cf740), mimicking r-ci.yml:

git clone https://github.com/single-cell-data/TileDB-SOMA
cd TileDB-SOMA/apis/r
tools/r-ci.sh bootstrap

Rscript <(cat <<EOF
rversion <- paste(strsplit(as.character(getRversion()), split = '\\.')[[1L]][1:2], collapse = '.')
codename <-  system('. /etc/os-release; echo ${VERSION_CODENAME}', intern = TRUE)
repo <- "https://tiledb-inc.r-universe.dev"
(opt <- sprintf('options(repos = c("%s/bin/linux/%s/%s", "%s", getOption("repos")), timeout = 300L)', repo, codename, rversion, repo))
cat(opt, "\n", file = "~/.Rprofile", append = TRUE)
EOF
)

Rscript tools/install-tiledb-r.R
tools/r-ci.sh install_bioc SingleCellExperiment
Rscript -e "remotes::install_deps(dependencies = TRUE, upgrade = FALSE)"
R CMD build --no-build-vignettes --no-manual .
R CMD INSTALL $(ls -1tr *.tar.gz | tail -1)

cd tests
Rscript testthat.R  # non-CI mode: OK ✅
export CI=true; Rscript testthat.R  # ❌
export CI=true; Rscript testthat.R  # ❌
export CI=true; Rscript testthat.R  # ❌

The 3 failing test runs had identical output (modulo timings):

Failing test output

tiledbsoma:    1.13.99.2
tiledb-r:      0.29.0
tiledb core:   2.25.0
libtiledbsoma: 2.25.0
R:             R version 4.4.1 (2024-06-14)
OS:            Ubuntu 22.04.4 LTS
✔ | F W  S  OK | Context
✔ |         25 | Arrow-utils
✔ |        140 | Blockwise [5.8s]
✔ |         14 | ConfigList
✔ |        115 | CoordsStrider
✔ |         26 | EphemeralCollection
✔ |         14 | EphemeralExperiment
✔ |         17 | EphemeralMeasurement
✔ |          5 | example-datasets
✔ |         49 | Factory
✔ |         35 | IntIndexer
✔ |          9 | membership-caching
✔ |         11 | OrderedAndFactor
✔ |         26 | PlatformConfig
✔ |        277 | reopen [1.4s]
✔ |         70 | ScalarMap
✔ |        172 | SCEOutgest [14.4s]
✔ |        379 | SeuratIngest [24.2s]
✔ |         99 | SeuratOutgest-assay [5.0s]
✔ |        169 | SeuratOutgest-command [9.2s]
✔ |         49 | SeuratOutgest-graph [2.9s]
✔ |        134 | SeuratOutgest-object [10.1s]
✔ |        180 | SeuratOutgest-reduction [5.5s]
✔ |         29 | SingleCellExperimentIngest [11.5s]
✔ |         13 | SOMAArrayReader-Arrow
✔ |         80 | SOMAArrayReader-Iterated [7.2s]
✔ |         16 | SOMAAxisQuery
✔ |         22 | SOMACollection [3.3s]
✔ |        161 | SOMADataFrame [9.9s]
✔ |         52 | SOMADenseNDArray [1.5s]
✔ |         80 | SOMAExperiment-query-m-p [4.7s]
✔ |         33 | SOMAExperiment-query-matrix-outgest [2.6s]
✔ |         62 | SOMAExperiment-query [5.8s]
✔ |         12 | SOMAExperiment [1.3s]
✔ |         11 | SOMAMeasurement
✔ |          9 | SOMAOpen
✔ |        156 | SOMASparseNDArray [2.7s]
✔ |         57 | SOMATileDBContext
✔ |          2 | Stats
✔ |         26 | SummarizedExperimentIngest [3.3s]
✔ |         16 | TileDBArray
✔ |         52 | TileDBCreateOptions
✔ |         37 | TileDBGroup
✔ |         21 | TileDBURI
✔ |         50 | utils-matrixZeroBasedView
✔ |         24 | utils-uris
✔ |         13 | utils
✔ |          3 | version
✔ |        190 | write-soma-objects [5.7s]
✖ | 2      252 | write-soma-resume [33.8s]
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Error (test-write-soma-resume.R:493:3): Resume-mode Seurat
<<not available>/C++Error/error/condition>
Error: [TileDB::C++API] Error: Failed to create context
Backtrace:
     ▆
  1. ├─base::suppressWarnings(...) at test-write-soma-resume.R:493:3
  2. │ └─base::withCallingHandlers(...)
  3. └─SOMAExperimentAxisQuery$new(expc, "RNA")$to_seurat(...)
  4.   ├─base::withCallingHandlers(...)
  5.   ├─base::tryCatch(...)
  6.   │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
  7.   │   ├─base (local) tryCatchOne(...)
  8.   │   │ └─base (local) doTryCatch(return(expr), name, parentenv, handler)
  9.   │   └─base (local) tryCatchList(expr, names[-nh], parentenv, handlers[-nh])
 10.   │     └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
 11.   │       └─base (local) doTryCatch(return(expr), name, parentenv, handler)
 12.   └─tiledbsoma:::.load_seurat_command(uns, ms_names = private$.measurement_name)
 13.     └─xdf$read(column_names = "values")
 14.       └─tiledbsoma:::sr_setup(...)

Error (test-write-soma-resume.R:636:3): Resume-mode SingleCellExperiment
Error in `(function (cond)
.Internal(C_tryCatchHelper(addr, 1L, cond)))(structure(list(message = "[TileDB::C++API] Error: Failed to create context",
    call = NULL, cppstack = NULL), class = c("<not available>",
"C++Error", "error", "condition")))`: error in evaluating the argument 'x' in selecting a method for function 't': [TileDB::C++API] Error: Failed to create context
Backtrace:
    ▆
 1. ├─SOMAExperimentAxisQuery$new(expr, "RNA")$to_single_cell_experiment(...) at test-write-soma-resume.R:636:3
 2. │ └─base::lapply(...)
 3. │   └─tiledbsoma (local) FUN(X[[i]], ...)
 4. │     ├─Matrix::t(self$to_sparse_matrix(collection = "X", layer_name = layer))
 5. │     └─self$to_sparse_matrix(collection = "X", layer_name = layer)
 6. │       └─layer$read(coords = coords)
 7. │         └─tiledbsoma:::sr_setup(...)
 8. ├─base::stop(`<<navlbl>>`)
 9. └─base (local) `<fn>`(`<<navlbl>>`)
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

══ Results ═══════════════════════════════════════════════════════════════════════════════════════════════════════════
Duration: 176.8 s

── Failed tests ──────────────────────────────────────────────────────────────────────────────────────────────────────
Error (test-write-soma-resume.R:493:3): Resume-mode Seurat
<<not available>/C++Error/error/condition>
Error: [TileDB::C++API] Error: Failed to create context
Backtrace:
     ▆
  1. ├─base::suppressWarnings(...) at test-write-soma-resume.R:493:3
  2. │ └─base::withCallingHandlers(...)
  3. └─SOMAExperimentAxisQuery$new(expc, "RNA")$to_seurat(...)
  4.   ├─base::withCallingHandlers(...)
  5.   ├─base::tryCatch(...)
  6.   │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
  7.   │   ├─base (local) tryCatchOne(...)
  8.   │   │ └─base (local) doTryCatch(return(expr), name, parentenv, handler)
  9.   │   └─base (local) tryCatchList(expr, names[-nh], parentenv, handlers[-nh])
 10.   │     └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
 11.   │       └─base (local) doTryCatch(return(expr), name, parentenv, handler)
 12.   └─tiledbsoma:::.load_seurat_command(uns, ms_names = private$.measurement_name)
 13.     └─xdf$read(column_names = "values")
 14.       └─tiledbsoma:::sr_setup(...)

Error (test-write-soma-resume.R:636:3): Resume-mode SingleCellExperiment
Error in `(function (cond)
.Internal(C_tryCatchHelper(addr, 1L, cond)))(structure(list(message = "[TileDB::C++API] Error: Failed to create context",
    call = NULL, cppstack = NULL), class = c("<not available>",
"C++Error", "error", "condition")))`: error in evaluating the argument 'x' in selecting a method for function 't': [TileDB::C++API] Error: Failed to create context
Backtrace:
    ▆
 1. ├─SOMAExperimentAxisQuery$new(expr, "RNA")$to_single_cell_experiment(...) at test-write-soma-resume.R:636:3
 2. │ └─base::lapply(...)
 3. │   └─tiledbsoma (local) FUN(X[[i]], ...)
 4. │     ├─Matrix::t(self$to_sparse_matrix(collection = "X", layer_name = layer))
 5. │     └─self$to_sparse_matrix(collection = "X", layer_name = layer)
 6. │       └─layer$read(coords = coords)
 7. │         └─tiledbsoma:::sr_setup(...)
 8. ├─base::stop(`<<navlbl>>`)
 9. └─base (local) `<fn>`(`<<navlbl>>`)

[ FAIL 2 | WARN 0 | SKIP 0 | PASS 3494 ]
Error: Test failures
Execution halted

Is that familiar to anyone?

eddelbuettel · 2024-08-21T16:24:09Z

That's two out of 3496, they are in code undergoing current changes (timestamped read-write) so maybe we should just park these two. The Heisenbug is that things do fail earlier for some of us (SCEOutgest is a common one, sometimes later ones). One can run these test files individually too (that is tedious, and a fault of the overcomplicated testthat we use) and then they pass while they may fail when run with all others -- which is also maddening.

This may all be related to Context and I have a change to handle that differently. Or it could be another race condition.
Anyway -- also details below for a run on my end (Ubuntu 24.04, current) which always passes. (This is from my laptop, it also works with slightly new parts on my main server but I am in a middle of a branch there....)

edd@ahmad:~/git/tiledb-soma/apis/r/tests(main)$ CI=TRUE Rscript testthat.R
tiledbsoma:    1.13.99.2
tiledb-r:      0.29.0.4
tiledb core:   2.26.0
libtiledbsoma: 2.26.0
R:             R version 4.4.1 (2024-06-14)
OS:            Ubuntu 24.04 LTS
✔ | F W  S  OK | Context
✔ |         25 | Arrow-utils                                                               
✔ |        140 | Blockwise [4.3s]                                                          
✔ |         14 | ConfigList                                                                
✔ |        115 | CoordsStrider                                                             
✔ |         26 | EphemeralCollection                                                       
✔ |         14 | EphemeralExperiment                                                       
✔ |         17 | EphemeralMeasurement                                                      
✔ |          5 | example-datasets                                                          
✔ |         49 | Factory                                                                   
✔ |         35 | IntIndexer                                                                
✔ |          9 | membership-caching                                                        
✔ |         11 | OrderedAndFactor                                                          
✔ |         26 | PlatformConfig                                                            
✔ |        277 | reopen [1.0s]                                                             
✔ |         70 | ScalarMap                                                                 
✔ |        172 | SCEOutgest [10.5s]                                                        
✔ |        379 | SeuratIngest [16.3s]                                                      
✔ |         99 | SeuratOutgest-assay [3.2s]                                                
✔ |        169 | SeuratOutgest-command [5.9s]                                              
✔ |         49 | SeuratOutgest-graph [2.2s]                                                
✔ |        134 | SeuratOutgest-object [7.2s]                                               
✔ |        180 | SeuratOutgest-reduction [4.2s]                                            
✔ |         29 | SingleCellExperimentIngest [11.0s]                                        
✔ |         13 | SOMAArrayReader-Arrow                                                     
✔ |         80 | SOMAArrayReader-Iterated [6.0s]                                           
✔ |         16 | SOMAAxisQuery                                                             
✔ |         22 | SOMACollection [2.8s]                                                     
✔ |        161 | SOMADataFrame [7.2s]                                                      
✔ |         52 | SOMADenseNDArray [1.3s]                                                   
✔ |         80 | SOMAExperiment-query-m-p [4.1s]                                           
✔ |         33 | SOMAExperiment-query-matrix-outgest [1.9s]                                
✔ |         62 | SOMAExperiment-query [4.5s]                                               
✔ |         12 | SOMAExperiment [1.1s]                                                     
✔ |         11 | SOMAMeasurement
✔ |          9 | SOMAOpen                                                                  
✔ |        156 | SOMASparseNDArray [2.2s]                                                  
✔ |         57 | SOMATileDBContext                                                         
✔ |          2 | Stats                                                                     
✔ |         26 | SummarizedExperimentIngest [2.8s]                                         
✔ |         16 | TileDBArray                                                               
✔ |         52 | TileDBCreateOptions                                                       
✔ |         37 | TileDBGroup                                                               
✔ |         21 | TileDBURI                                                                 
✔ |         50 | utils-matrixZeroBasedView                                                 
✔ |         24 | utils-uris                                                                
✔ |         13 | utils                                                                     
✔ |          3 | version                                                                   
✔ |        190 | write-soma-objects [3.7s]                                                 
✔ |        315 | write-soma-resume [37.4s]                                                 
                                                                                           
══ Results ════════════════════════════════════════════════════════════════════════════════
Duration: 144.5 s

[ FAIL 0 | WARN 0 | SKIP 0 | PASS 3557 ]
edd@ahmad:~/git/tiledb-soma/apis/r/tests(main)$

ryan-williams · 2024-08-26T14:46:52Z

It's possible that #2926 reduces the frequency of these failures; two runs there I re-ran several times this weekend:

Run #5814 passed 12x in a row before failing on attempt #13
Run #5815 passed 6 of 9 attempts.

This is in contrast to main having failed the last 81 of 81 runs.

I don't think there's a known reason for #2926 to reduce the frequency of these failures, and @eddelbuettel mentioned separately that the R CI has gone through previous phases of failing with more or less frequency, just flagging: the failure frequency may be about to decrease (or we will find that failures occur preferentially on main?).

There's also discussion around disabling failing tests on #2922.

ryan-williams · 2024-08-29T15:05:46Z

I updated the issue description, but posting here as well for visibility:

Update: #2926 seems to have mitigated the failures significantly (though brute-force re-running R CI on that PR still failed 4 of 22 attempts: #5814, #5815).

R CI currently shows mostly successes, though run #5864 (from #2919, commit 9cd956f, which includes #2926) failed.

We can continue using this issue to track failures as we see them. My understanding is we don't expect #2926 to have actually fixed the issue, it has just made the failures less frequent.

@jp-dark also flagged recently that ASAN was inadvertently broken in #2773; fixing that (and running it as part of CI) may help root-cause the issue here. Related internal tracking: sc-34306, sc-53780.

ryan-williams mentioned this issue Aug 21, 2024

[R] Move failing R CI workflow to daily cron #2922

Closed

ryan-williams mentioned this issue Aug 23, 2024

shellcheck scripts/bld #2927

Merged

ryan-williams mentioned this issue Aug 27, 2024

Fix notebook name typo #2933

Closed

ryan-williams changed the title ~~[Bug] R CI failing~~ [Bug] R CI flaky Aug 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] R CI flaky #2906

[Bug] R CI flaky #2906

ryan-williams commented Aug 16, 2024 •

edited

Loading

Previous write-up

Example output

Breakdown by error message

"memory not mapped"

"bad_function_call"

"invalid permissions"

"address (nil), cause 'unknown'"

Occasional successful runs

Plan for debugging / fixing?

Change workflow to 24hr cron?

johnkerl commented Aug 16, 2024 •

edited

Loading

eddelbuettel commented Aug 17, 2024

ryan-williams commented Aug 21, 2024 •

edited

Loading

eddelbuettel commented Aug 21, 2024

ryan-williams commented Aug 26, 2024 •

edited

Loading

ryan-williams commented Aug 29, 2024

[Bug] R CI flaky #2906

[Bug] R CI flaky #2906

Comments

ryan-williams commented Aug 16, 2024 • edited Loading

Previous write-up

Example output

Breakdown by error message

"memory not mapped"

"bad_function_call"

"invalid permissions"

"address (nil), cause 'unknown'"

Occasional successful runs

Plan for debugging / fixing?

Change workflow to 24hr cron?

johnkerl commented Aug 16, 2024 • edited Loading

eddelbuettel commented Aug 17, 2024

ryan-williams commented Aug 21, 2024 • edited Loading

eddelbuettel commented Aug 21, 2024

ryan-williams commented Aug 26, 2024 • edited Loading

ryan-williams commented Aug 29, 2024

ryan-williams commented Aug 16, 2024 •

edited

Loading

johnkerl commented Aug 16, 2024 •

edited

Loading

ryan-williams commented Aug 21, 2024 •

edited

Loading

ryan-williams commented Aug 26, 2024 •

edited

Loading