Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] R CI flaky #2906

Open
ryan-williams opened this issue Aug 16, 2024 · 6 comments
Open

[Bug] R CI flaky #2906

ryan-williams opened this issue Aug 16, 2024 · 6 comments

Comments

@ryan-williams
Copy link
Member

ryan-williams commented Aug 16, 2024

Update: #2926 seems to have mitigated the failures significantly (though brute-force re-running R CI on that PR still failed 4 of 22 attempts: #5814, #5815).

R CI currently shows mostly successes, though run #5864 (from #2919, commit 9cd956f, which includes #2926) failed.

We can continue using this issue to track failures as we see them. My understanding is we don't expect #2926 to have actually fixed the issue, it has just made the failures less frequent.

@jp-dark also flagged recently that ASAN was inadvertently broken in #2773; fixing that (and running it as part of CI) may help root-cause the issue here. Related internal tracking: sc-34306, sc-53780.

Previous write-up

r-ci.yml has been failing on main since mid-June.

Example output

The last output is typically 70 | ScalarMap, before a segfault:

✔ |         70 | ScalarMap                                                      

 *** caught segfault ***
address 0x1019, cause 'memory not mapped'
An irrecoverable exception occurred. R is aborting now ...
/home/runner/work/_temp/c0e3d832-8dfe-4[52](https://github.com/single-cell-data/TileDB-SOMA/actions/runs/9713375174/job/26810031898#step:14:53)3-ac62-7a65da1f7dad.sh: line 1:  9642 Segmentation fault      (core dumped) Rscript testthat.R
Error: Process completed with exit code 139.

The segfault messages vary; memory not mapped, bad_function_call, invalid permissions, address (nil), cause 'unknown'. ryan-williams/tdbs-r-ci-dbg contains some analysis of the logs of failing runs, including frequencies of various error messages.

Breakdown by error message

Updated at 2024-08-29T09:01:20Z (by #17)

"memory not mapped"

Seen 52x, most recently #5816 (at 2024-08-23T03:41):

 *** caught segfault ***
address <ADDRESS>, cause 'memory not mapped'
An irrecoverable exception occurred. R is aborting now ...
/home/runner/work/_temp/<UUID>.sh: line <LINE>:  <ID> Segmentation fault      (core dumped)
##[error]Process completed with exit code 139.
All runs

#5419, #5422, #5423, #5442, #5444, #5449, #5451, #5452, #5464, #5476, #5486, #5492, #5512, #5521, #5523, #5537, #5538, #5552, #5559, #5563, #5571, #5582, #5586, #5588, #5602, #5603, #5604, #5605, #5618, #5621, #5623, #5634, #5635, #5640, #5641, #5644, #5646, #5648, #5686, #5690, #5693, #5696, #5698, #5706, #5711, #5727, #5736, #5743, #5784, #5785, #5808, #5816

"bad_function_call"

Seen 12x, most recently #5820 (at 2024-08-23T16:04):

terminate called after throwing an instance of 'std::bad_function_call'
  what():  bad_function_call
/home/runner/work/_temp/<UUID>.sh: line <LINE>:  <ID> Aborted                 (core dumped)
##[error]Process completed with exit code 134.
All runs

#5467, #5527, #5560, #5615, #5637, #5654, #5673, #5707, #5738, #5764, #5786, #5820

"invalid permissions"

Seen 12x, most recently #5790 (at 2024-08-20T13:45):

 *** caught segfault ***
address <ADDRESS>, cause 'invalid permissions'
An irrecoverable exception occurred. R is aborting now ...
/home/runner/work/_temp/<UUID>.sh: line <LINE>:  <ID> Segmentation fault      (core dumped)
##[error]Process completed with exit code 139.
All runs

#5418, #5459, #5518, #5556, #5568, #5616, #5688, #5714, #5724, #5729, #5737, #5790

"address (nil), cause 'unknown'"

Seen 7x, most recently #5787 (at 2024-08-19T16:45):

 *** caught segfault ***
address (nil), cause 'unknown'
An irrecoverable exception occurred. R is aborting now ...
/home/runner/work/_temp/<UUID>.sh: line <LINE>:  <ID> Segmentation fault      (core dumped)
##[error]Process completed with exit code 139.
All runs

#5511, #5532, #5550, #5657, #5746, #5750, #5787

Occasional successful runs

A few runs have passed on PR branches (e.g. 1, 2). However, I tried re-running the first one, and it failed, so it seems like it's just non-deterministic, and usually fails. The dark/images branch did pass 3x in a row, 8/8-11.

Plan for debugging / fixing?

Several folks on the team have made some progress debugging it, but a fix remains elusive.

From discussions with @johnkerl, ongoing work to remove our dependency on TileDB-R may resolve it.

Change workflow to 24hr cron?

It's been causing "spurious" ❌'s on PRs (and main) for 2 months, which can mask other failures.

Disabling it altogether feels too aggressive; it's good to have some visibility into the state of the R tests.

Setting it to run against main every 24hrs might be a good compromise.

@johnkerl
Copy link
Member

johnkerl commented Aug 16, 2024

I want to acknowledge significant work over time by @nguyenv @mojaveazure @eddelbuettel and myself here. There has been significant investigation in this direction. I would note that these tend to be CI-only failures, not occurring in sandbox development, and thus more difficult to debug. Non-trivial time investment has already been done; this is a well-thought-out proposal.

I also add that the end goal of CI is to find bugs before users do. Meanwhile the failures being tracked here are CI-only, not affecting end users, which not only makes them difficult to debug -- they function as a distraction from our true goals which are (a) delivering reliable software to our end users, and (b) completing the refactor work which we have reasonable confidence will resolve these CI-only failures.

The proposal to temporarily move to a 24-hour cadence is that -- temporary -- and we will revisit once the R/cppification work is complete.

@eddelbuettel
Copy link
Contributor

There is a Heisenbug that has driven a few of us to madness. But when you e.g. do (in the apis/r directory)

cd tests
CI=true Rscripts testthat.R

then all 3500+ tests pass. No skip, no fail. But it is frail, there is a possible interaction with the testrunner software (the widely-used testthat package, but it does some things behind the scence so @mojaveazure and I have planned to eventually replace it with tinytest (which eg tiledb-r uses, and which has a simpler setup). So when developing we adhere to a testing process. But CI is has been throwing a consistent spanner for years as also seen in the fact that macOS runs fewer tests as does coverage.

@ryan-williams
Copy link
Member Author

ryan-williams commented Aug 21, 2024

I ran the following on an Ubuntu EC2 instance (m6a.4xlarge, AMI ami-0b33ebbed151cf740), mimicking r-ci.yml:

git clone https://github.com/single-cell-data/TileDB-SOMA
cd TileDB-SOMA/apis/r
tools/r-ci.sh bootstrap

Rscript <(cat <<EOF
rversion <- paste(strsplit(as.character(getRversion()), split = '\\.')[[1L]][1:2], collapse = '.')
codename <-  system('. /etc/os-release; echo ${VERSION_CODENAME}', intern = TRUE)
repo <- "https://tiledb-inc.r-universe.dev"
(opt <- sprintf('options(repos = c("%s/bin/linux/%s/%s", "%s", getOption("repos")), timeout = 300L)', repo, codename, rversion, repo))
cat(opt, "\n", file = "~/.Rprofile", append = TRUE)
EOF
)

Rscript tools/install-tiledb-r.R
tools/r-ci.sh install_bioc SingleCellExperiment
Rscript -e "remotes::install_deps(dependencies = TRUE, upgrade = FALSE)"
R CMD build --no-build-vignettes --no-manual .
R CMD INSTALL $(ls -1tr *.tar.gz | tail -1)

cd tests
Rscript testthat.R  # non-CI mode: OK ✅
export CI=true; Rscript testthat.R  #
export CI=true; Rscript testthat.R  #
export CI=true; Rscript testthat.R  #

The 3 failing test runs had identical output (modulo timings):

Failing test output
tiledbsoma:    1.13.99.2
tiledb-r:      0.29.0
tiledb core:   2.25.0
libtiledbsoma: 2.25.0
R:             R version 4.4.1 (2024-06-14)
OS:            Ubuntu 22.04.4 LTS
✔ | F W  S  OK | Context
✔ |         25 | Arrow-utils
✔ |        140 | Blockwise [5.8s]
✔ |         14 | ConfigList
✔ |        115 | CoordsStrider
✔ |         26 | EphemeralCollection
✔ |         14 | EphemeralExperiment
✔ |         17 | EphemeralMeasurement
✔ |          5 | example-datasets
✔ |         49 | Factory
✔ |         35 | IntIndexer
✔ |          9 | membership-caching
✔ |         11 | OrderedAndFactor
✔ |         26 | PlatformConfig
✔ |        277 | reopen [1.4s]
✔ |         70 | ScalarMap
✔ |        172 | SCEOutgest [14.4s]
✔ |        379 | SeuratIngest [24.2s]
✔ |         99 | SeuratOutgest-assay [5.0s]
✔ |        169 | SeuratOutgest-command [9.2s]
✔ |         49 | SeuratOutgest-graph [2.9s]
✔ |        134 | SeuratOutgest-object [10.1s]
✔ |        180 | SeuratOutgest-reduction [5.5s]
✔ |         29 | SingleCellExperimentIngest [11.5s]
✔ |         13 | SOMAArrayReader-Arrow
✔ |         80 | SOMAArrayReader-Iterated [7.2s]
✔ |         16 | SOMAAxisQuery
✔ |         22 | SOMACollection [3.3s]
✔ |        161 | SOMADataFrame [9.9s]
✔ |         52 | SOMADenseNDArray [1.5s]
✔ |         80 | SOMAExperiment-query-m-p [4.7s]
✔ |         33 | SOMAExperiment-query-matrix-outgest [2.6s]
✔ |         62 | SOMAExperiment-query [5.8s]
✔ |         12 | SOMAExperiment [1.3s]
✔ |         11 | SOMAMeasurement
✔ |          9 | SOMAOpen
✔ |        156 | SOMASparseNDArray [2.7s]
✔ |         57 | SOMATileDBContext
✔ |          2 | Stats
✔ |         26 | SummarizedExperimentIngest [3.3s]
✔ |         16 | TileDBArray
✔ |         52 | TileDBCreateOptions
✔ |         37 | TileDBGroup
✔ |         21 | TileDBURI
✔ |         50 | utils-matrixZeroBasedView
✔ |         24 | utils-uris
✔ |         13 | utils
✔ |          3 | version
✔ |        190 | write-soma-objects [5.7s]
✖ | 2      252 | write-soma-resume [33.8s]
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Error (test-write-soma-resume.R:493:3): Resume-mode Seurat
<<not available>/C++Error/error/condition>
Error: [TileDB::C++API] Error: Failed to create context
Backtrace:
     ▆
  1. ├─base::suppressWarnings(...) at test-write-soma-resume.R:493:3
  2. │ └─base::withCallingHandlers(...)
  3. └─SOMAExperimentAxisQuery$new(expc, "RNA")$to_seurat(...)
  4.   ├─base::withCallingHandlers(...)
  5.   ├─base::tryCatch(...)
  6.   │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
  7.   │   ├─base (local) tryCatchOne(...)
  8.   │   │ └─base (local) doTryCatch(return(expr), name, parentenv, handler)
  9.   │   └─base (local) tryCatchList(expr, names[-nh], parentenv, handlers[-nh])
 10.   │     └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
 11.   │       └─base (local) doTryCatch(return(expr), name, parentenv, handler)
 12.   └─tiledbsoma:::.load_seurat_command(uns, ms_names = private$.measurement_name)
 13.     └─xdf$read(column_names = "values")
 14.       └─tiledbsoma:::sr_setup(...)

Error (test-write-soma-resume.R:636:3): Resume-mode SingleCellExperiment
Error in `(function (cond)
.Internal(C_tryCatchHelper(addr, 1L, cond)))(structure(list(message = "[TileDB::C++API] Error: Failed to create context",
    call = NULL, cppstack = NULL), class = c("<not available>",
"C++Error", "error", "condition")))`: error in evaluating the argument 'x' in selecting a method for function 't': [TileDB::C++API] Error: Failed to create context
Backtrace:
    ▆
 1. ├─SOMAExperimentAxisQuery$new(expr, "RNA")$to_single_cell_experiment(...) at test-write-soma-resume.R:636:3
 2. │ └─base::lapply(...)
 3. │   └─tiledbsoma (local) FUN(X[[i]], ...)
 4. │     ├─Matrix::t(self$to_sparse_matrix(collection = "X", layer_name = layer))
 5. │     └─self$to_sparse_matrix(collection = "X", layer_name = layer)
 6. │       └─layer$read(coords = coords)
 7. │         └─tiledbsoma:::sr_setup(...)
 8. ├─base::stop(`<<navlbl>>`)
 9. └─base (local) `<fn>`(`<<navlbl>>`)
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

══ Results ═══════════════════════════════════════════════════════════════════════════════════════════════════════════
Duration: 176.8 s

── Failed tests ──────────────────────────────────────────────────────────────────────────────────────────────────────
Error (test-write-soma-resume.R:493:3): Resume-mode Seurat
<<not available>/C++Error/error/condition>
Error: [TileDB::C++API] Error: Failed to create context
Backtrace:
     ▆
  1. ├─base::suppressWarnings(...) at test-write-soma-resume.R:493:3
  2. │ └─base::withCallingHandlers(...)
  3. └─SOMAExperimentAxisQuery$new(expc, "RNA")$to_seurat(...)
  4.   ├─base::withCallingHandlers(...)
  5.   ├─base::tryCatch(...)
  6.   │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
  7.   │   ├─base (local) tryCatchOne(...)
  8.   │   │ └─base (local) doTryCatch(return(expr), name, parentenv, handler)
  9.   │   └─base (local) tryCatchList(expr, names[-nh], parentenv, handlers[-nh])
 10.   │     └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
 11.   │       └─base (local) doTryCatch(return(expr), name, parentenv, handler)
 12.   └─tiledbsoma:::.load_seurat_command(uns, ms_names = private$.measurement_name)
 13.     └─xdf$read(column_names = "values")
 14.       └─tiledbsoma:::sr_setup(...)

Error (test-write-soma-resume.R:636:3): Resume-mode SingleCellExperiment
Error in `(function (cond)
.Internal(C_tryCatchHelper(addr, 1L, cond)))(structure(list(message = "[TileDB::C++API] Error: Failed to create context",
    call = NULL, cppstack = NULL), class = c("<not available>",
"C++Error", "error", "condition")))`: error in evaluating the argument 'x' in selecting a method for function 't': [TileDB::C++API] Error: Failed to create context
Backtrace:
    ▆
 1. ├─SOMAExperimentAxisQuery$new(expr, "RNA")$to_single_cell_experiment(...) at test-write-soma-resume.R:636:3
 2. │ └─base::lapply(...)
 3. │   └─tiledbsoma (local) FUN(X[[i]], ...)
 4. │     ├─Matrix::t(self$to_sparse_matrix(collection = "X", layer_name = layer))
 5. │     └─self$to_sparse_matrix(collection = "X", layer_name = layer)
 6. │       └─layer$read(coords = coords)
 7. │         └─tiledbsoma:::sr_setup(...)
 8. ├─base::stop(`<<navlbl>>`)
 9. └─base (local) `<fn>`(`<<navlbl>>`)

[ FAIL 2 | WARN 0 | SKIP 0 | PASS 3494 ]
Error: Test failures
Execution halted

Is that familiar to anyone?

@eddelbuettel
Copy link
Contributor

That's two out of 3496, they are in code undergoing current changes (timestamped read-write) so maybe we should just park these two. The Heisenbug is that things do fail earlier for some of us (SCEOutgest is a common one, sometimes later ones). One can run these test files individually too (that is tedious, and a fault of the overcomplicated testthat we use) and then they pass while they may fail when run with all others -- which is also maddening.

This may all be related to Context and I have a change to handle that differently. Or it could be another race condition.
Anyway -- also details below for a run on my end (Ubuntu 24.04, current) which always passes. (This is from my laptop, it also works with slightly new parts on my main server but I am in a middle of a branch there....)

edd@ahmad:~/git/tiledb-soma/apis/r/tests(main)$ CI=TRUE Rscript testthat.R
tiledbsoma:    1.13.99.2
tiledb-r:      0.29.0.4
tiledb core:   2.26.0
libtiledbsoma: 2.26.0
R:             R version 4.4.1 (2024-06-14)
OS:            Ubuntu 24.04 LTS
✔ | F W  S  OK | Context
✔ |         25 | Arrow-utils                                                               
✔ |        140 | Blockwise [4.3s]                                                          
✔ |         14 | ConfigList                                                                
✔ |        115 | CoordsStrider                                                             
✔ |         26 | EphemeralCollection                                                       
✔ |         14 | EphemeralExperiment                                                       
✔ |         17 | EphemeralMeasurement                                                      
✔ |          5 | example-datasets                                                          
✔ |         49 | Factory                                                                   
✔ |         35 | IntIndexer                                                                
✔ |          9 | membership-caching                                                        
✔ |         11 | OrderedAndFactor                                                          
✔ |         26 | PlatformConfig                                                            
✔ |        277 | reopen [1.0s]                                                             
✔ |         70 | ScalarMap                                                                 
✔ |        172 | SCEOutgest [10.5s]                                                        
✔ |        379 | SeuratIngest [16.3s]                                                      
✔ |         99 | SeuratOutgest-assay [3.2s]                                                
✔ |        169 | SeuratOutgest-command [5.9s]                                              
✔ |         49 | SeuratOutgest-graph [2.2s]                                                
✔ |        134 | SeuratOutgest-object [7.2s]                                               
✔ |        180 | SeuratOutgest-reduction [4.2s]                                            
✔ |         29 | SingleCellExperimentIngest [11.0s]                                        
✔ |         13 | SOMAArrayReader-Arrow                                                     
✔ |         80 | SOMAArrayReader-Iterated [6.0s]                                           
✔ |         16 | SOMAAxisQuery                                                             
✔ |         22 | SOMACollection [2.8s]                                                     
✔ |        161 | SOMADataFrame [7.2s]                                                      
✔ |         52 | SOMADenseNDArray [1.3s]                                                   
✔ |         80 | SOMAExperiment-query-m-p [4.1s]                                           
✔ |         33 | SOMAExperiment-query-matrix-outgest [1.9s]                                
✔ |         62 | SOMAExperiment-query [4.5s]                                               
✔ |         12 | SOMAExperiment [1.1s]                                                     
✔ |         11 | SOMAMeasurement
✔ |          9 | SOMAOpen                                                                  
✔ |        156 | SOMASparseNDArray [2.2s]                                                  
✔ |         57 | SOMATileDBContext                                                         
✔ |          2 | Stats                                                                     
✔ |         26 | SummarizedExperimentIngest [2.8s]                                         
✔ |         16 | TileDBArray                                                               
✔ |         52 | TileDBCreateOptions                                                       
✔ |         37 | TileDBGroup                                                               
✔ |         21 | TileDBURI                                                                 
✔ |         50 | utils-matrixZeroBasedView                                                 
✔ |         24 | utils-uris                                                                
✔ |         13 | utils                                                                     
✔ |          3 | version                                                                   
✔ |        190 | write-soma-objects [3.7s]                                                 
✔ |        315 | write-soma-resume [37.4s]                                                 
                                                                                           
══ Results ════════════════════════════════════════════════════════════════════════════════
Duration: 144.5 s

[ FAIL 0 | WARN 0 | SKIP 0 | PASS 3557 ]
edd@ahmad:~/git/tiledb-soma/apis/r/tests(main)$ 

@ryan-williams
Copy link
Member Author

ryan-williams commented Aug 26, 2024

It's possible that #2926 reduces the frequency of these failures; two runs there I re-ran several times this weekend:

This is in contrast to main having failed the last 81 of 81 runs.

I don't think there's a known reason for #2926 to reduce the frequency of these failures, and @eddelbuettel mentioned separately that the R CI has gone through previous phases of failing with more or less frequency, just flagging: the failure frequency may be about to decrease (or we will find that failures occur preferentially on main?).

There's also discussion around disabling failing tests on #2922.

@ryan-williams
Copy link
Member Author

I updated the issue description, but posting here as well for visibility:

Update: #2926 seems to have mitigated the failures significantly (though brute-force re-running R CI on that PR still failed 4 of 22 attempts: #5814, #5815).

R CI currently shows mostly successes, though run #5864 (from #2919, commit 9cd956f, which includes #2926) failed.

We can continue using this issue to track failures as we see them. My understanding is we don't expect #2926 to have actually fixed the issue, it has just made the failures less frequent.

@jp-dark also flagged recently that ASAN was inadvertently broken in #2773; fixing that (and running it as part of CI) may help root-cause the issue here. Related internal tracking: sc-34306, sc-53780.

@ryan-williams ryan-williams changed the title [Bug] R CI failing [Bug] R CI flaky Aug 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants