Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python] Use 31-bit-friendly default shape for ingest #1440

Merged
merged 9 commits into from
Jul 7, 2023

Conversation

johnkerl
Copy link
Member

@johnkerl johnkerl commented Jun 2, 2023

Issue and/or context: As discussed this week and last week. The room-for-growth is good. Two billion room-for-growth in each dimension is a quite roomy default, and works better for R interop.

Tracking issue: #1445

Changes: The default domain in Python is room for plenty of growth after creation. Here we retain that -- still plenty of room for growth -- just 32-bit friendly for better ability for R code to read Python-created SOMA data.

Notes for Reviewer: This is a non-breaking change at the API level.

@johnkerl johnkerl force-pushed the kerl/32-bit-friendly branch from 019ba8a to f32e95e Compare June 2, 2023 16:57
@codecov-commenter
Copy link

codecov-commenter commented Jun 2, 2023

Codecov Report

Patch coverage has no change and project coverage change: +26.87 🎉

Comparison is base (5b781f1) 64.44% compared to head (9e3ab57) 91.32%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1440       +/-   ##
===========================================
+ Coverage   64.44%   91.32%   +26.87%     
===========================================
  Files         102       30       -72     
  Lines        8336     2547     -5789     
===========================================
- Hits         5372     2326     -3046     
+ Misses       2964      221     -2743     
Flag Coverage Δ
python 91.32% <ø> (+0.54%) ⬆️
r ?

Flags with carried forward coverage won't be shown. Click here to find out more.

see 86 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@bkmartinjr bkmartinjr requested a review from pablo-gar June 2, 2023 17:01
Copy link
Member

@bkmartinjr bkmartinjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One bug to be resolved. The other question: in some cases, this is a breaking change, so do we want to ensure it is only released when we have the remainder of the issues resolved?

@johnkerl johnkerl requested a review from bkmartinjr June 2, 2023 18:18
@johnkerl
Copy link
Member Author

johnkerl commented Jun 2, 2023

Other related issues to be proposed by @mojaveazure , as we discussed today.

I'm happy to wait on merging this PR until then, if folks prefer.

Copy link
Contributor

@pablo-gar pablo-gar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnkerl
Copy link
Member Author

johnkerl commented Jun 6, 2023

@pablo-gar two points about passing user-specified shape all the way from top level (see also #1445):

(1) I want to do that on a separate PR -- not this one

(2) You point to a diff for passing the desired shape to a single array. But tiledbsoma.io does not deal with single arrays -- it deals with all the arrays in an experiment, and we would need to offer specification for different shapes for all of them. Here for example is pbmc3k_processed:

pbmc3k_processed/obs
pbmc3k_processed/ms/RNA/var
pbmc3k_processed/ms/RNA/X/data
pbmc3k_processed/ms/RNA/obsm/X_umap
pbmc3k_processed/ms/RNA/obsm/X_tsne
pbmc3k_processed/ms/RNA/obsm/X_pca
pbmc3k_processed/ms/RNA/obsm/X_draw_graph_fr
pbmc3k_processed/ms/RNA/obsp/distances
pbmc3k_processed/ms/RNA/obsp/connectivities
pbmc3k_processed/ms/RNA/varm/PCs
pbmc3k_processed/ms/raw/X/data
pbmc3k_processed/ms/raw/var

There are 12 different arrays here, with varying shapes.

What parameterization do you propose?

tiledbsoma.io.from_anndata(
  'pbmc3k_processed',
  'pbmc3k_processed.h5ad',
  'RNA',
  shapes={
    'obs': (2638,)
    'ms/RNA/var': (1838,),
    'ms/RNA/X/data': (2638, 1838),
    'ms/RNA/obsm/X_umap': (2638, 2),
    'ms/RNA/obsm/X_tsne': (2638, 2),
    'ms/RNA/obsm/X_pca': (2638, 50),
    'ms/RNA/obsm/X_draw_graph_fr': (2638, 2),
    'ms/RNA/obsp/distances': (2638, 2638),
    'ms/RNA/obsp/connectivities': (2638, 2638),
    'ms/RNA/varm/PCs': (1838, 1838),
    'ms/raw/X/data': (2638, 13714),
    'ms/raw/var': (13714,),
  }
)

Something like the above? Or something else?

@johnkerl johnkerl requested a review from pablo-gar June 6, 2023 14:22
@pablo-gar
Copy link
Contributor

@johnkerl Thanks! Important details on the many different arrays. As I think about it, there two main use cases we want to fulfill with this:

  • Future addition of data to the tiledbsoma object -- here it makes sense to have the shapes as the maximum 32-bit capacity.
  • Read-only tiledbsoma object. For this there are two options:
    • tiledbsoma object from one h5ad -- here it would make sense to have defaults as the original matrix shapes
    • tiledbsoma object from multiple h5ads (like we do in the Census) -- here it would make sense to be able to specify individual shapes for each matrix, like you have in your example.

Not sure I have the right answer since non-defaults add ux complexity.

@johnkerl
Copy link
Member Author

johnkerl commented Jun 6, 2023

Interop CI is failing with Error: Process completed with exit code 143 which, Google suggests to me, may be indicative of our running out of memory/CPU/time resources within the CI run ...

@johnkerl
Copy link
Member Author

johnkerl commented Jun 6, 2023

Interop CI is failing with Error: Process completed with exit code 143 which, Google suggests to me, may be indicative of our running out of memory/CPU/time resources within the CI run ...

It passes on my laptop with this branch checked out, so, it needs to be a CI-resource-config problem to be solved ... 🤔

@johnkerl
Copy link
Member Author

johnkerl commented Jun 6, 2023

Not sure I have the right answer since non-defaults add ux complexity.

@pablo-gar my position is that the above

tiledbsoma.io.from_anndata(
  'pbmc3k_processed',
  'pbmc3k_processed.h5ad',
  'RNA',
  shapes={
    'obs': (2638,)
    'ms/RNA/var': (1838,),
    'ms/RNA/X/data': (2638, 1838),
    'ms/RNA/obsm/X_umap': (2638, 2),
    'ms/RNA/obsm/X_tsne': (2638, 2),
    'ms/RNA/obsm/X_pca': (2638, 50),
    'ms/RNA/obsm/X_draw_graph_fr': (2638, 2),
    'ms/RNA/obsp/distances': (2638, 2638),
    'ms/RNA/obsp/connectivities': (2638, 2638),
    'ms/RNA/varm/PCs': (1838, 1838),
    'ms/raw/X/data': (2638, 13714),
    'ms/raw/var': (13714,),
  }
)

is prohibitively complex to require people to type. Just not OK.

Another option is we require them to do

tiledbsoma.io.from_anndata(
  'pbmc3k_processed',
  'pbmc3k_processed.h5ad',
  'RNA',
  nobs=2638, nvar, 1838, nrawvar = 13714,
)

or some such -- let the num-PCAs axes etc be tight.

The best option, I believe, is:

@johnkerl
Copy link
Member Author

johnkerl commented Jun 9, 2023

Summary

Re the CI fail:

Interop CI is failing with Error: Process completed with exit code 143 which, Google suggests to me, may be indicative of our running out of memory/CPU/time resources within the CI run ...

This is due to increased runtime and a combination of several factors:

  • On this PR from Python's from_h5ad/from_anndata we replace 63-bit shape with 31-bit shape
  • Both cause problems with R reading that, but at different points
  • Muffle-warning logic in tiledbsoma-r hides this
  • With 31-bit shape, it runs longer (in particular, too long for CI) when the Python-written shapes are 31-bit

Details

Output for the pre case:

ℹ Loading tiledbsoma
Warning: argument must be coercible to non-negative integer
Warning: argument must be coercible to non-negative integer
Warning: argument must be coercible to non-negative integer
Warning: argument must be coercible to non-negative integer
SEAQ SPARSE_MATRIX PRE
Warning: Array dimensions must not exceed '.Machine$integer.max'
SEAQ SPARSE_MATRIX PRE
Warning: Array dimensions must not exceed '.Machine$integer.max'
An object of class Seurat
1838 features across 2638 samples within 1 assay
Active assay: RNA (1838 features, 0 variable features)
Warning messages:
1: In as.integer.integer64(embed$shape()[2L]) :
  NAs produced by integer overflow
2: In as.integer.integer64(embed$shape()[2L]) :
  NAs produced by integer overflow
3: In as.integer.integer64(embed$shape()[2L]) :
  NAs produced by integer overflow
4: In as.integer.integer64(embed$shape()[2L]) :
  NAs produced by integer overflow

Output for the post case:

ℹ Loading tiledbsoma
Warning: vector memory exhausted (limit reached?)
Warning: vector memory exhausted (limit reached?)
Warning: vector memory exhausted (limit reached?)
Warning: vector memory exhausted (limit reached?)
SEAQ SPARSE_MATRIX PRE
SEAQ SPARSE_MATRIX POST
SEAQ MAT IDX IDX PRE
Warning: Cholmod error 'problem too large' at file ../Core/cholmod_memory.c, line 135
SEAQ SPARSE_MATRIX PRE
SEAQ SPARSE_MATRIX POST
SEAQ MAT IDX IDX PRE
Warning: Cholmod error 'problem too large' at file ../Core/cholmod_memory.c, line 135
An object of class Seurat
1838 features across 2638 samples within 1 assay
Active assay: RNA (1838 features, 0 variable features)

From the above program output we see that in the pre case:

From the above program output we see that in the post case:

Action items

  • Due to R idiosyncracies (tiledbsoma-py has zero problems reading or writing 63-bit shapes, or 31-bit shapes) we are encountering either failures in R or extreme slowness in R trying to write even 31-bit shapes written by Python
  • I don't like the muffleWarning at all -- this makes previously failing cases appear to succeed
    • I'd like to get rid of our use of muffleWarning
    • Failing that, I think we should use options(warn=2) in our unit-test cases -- as we've just seen, we had incomplete I/O even before this PR, and the R/Python interop tests weren't catching them

cc @mojaveazure @ebezzi @pablo-gar @aaronwolen

@eddelbuettel
Copy link
Contributor

Can you isolate where the cholmod error comes from? That is from inside package Matrix and indicative that we masy still do something wrong.

@johnkerl
Copy link
Member Author

johnkerl commented Jun 9, 2023

Can you isolate where the cholmod error comes from? That is from inside package Matrix and indicative that we masy still do something wrong.

@eddelbuettel it's at this line:
https://github.com/single-cell-data/TileDB-SOMA/blob/kerl/r-alloc-trace/apis/r/R/SOMAExperimentAxisQuery.R#L729
where mat is of type dgCMatrix

@eddelbuettel
Copy link
Contributor

Good. If you put a line with browser() there we can evaluate idx. Is it outside the matrix dims? For "normal" requests it behaves:

> set.seed(123); N <- 100; M <- as(rsparsematrix(N, N, 0.2), "dgCMatrix"); M[N,N]
[1] 0
> set.seed(123); N <- 100; M <- as(rsparsematrix(N, N, 0.2), "dgCMatrix"); M[N+1,N+1]
Error in intI(i, n = d[1L], dn[[1L]], give.dn = FALSE) : 
  index larger than maximal 100
> 
> for (i in 4:7) { set.seed(123); N <- 10^i; M <- as(rsparsematrix(N, N, 1000/(N*N)), "dgCMatrix"); print(M[N,N]) }
[1] 0
[1] 0
[1] 0
[1] 0
> 

@johnkerl
Copy link
Member Author

johnkerl commented Jun 9, 2023

@eddelbuettel https://gist.github.com/johnkerl/567bdfc1a6c750be1b7fec8792605eb3

I believe the 31-bit shape, not the idx itself, is to blame ...

Browse[1]> mat
2147483646 x 2147483646 sparse Matrix of class "dgCMatrix"
Error in subCsp_cols(x, j, drop = drop) :
  Cholmod error 'problem too large' at file ../Core/cholmod_memory.c, line 135

Browse[1]> idx
   [1]    1    2    3    4    5    6    7    8    9   10   11   12   13   14
  [15]   15   16   17   18   19   20   21   22   23   24   25   26   27   28
...
[2619] 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632
[2633] 2633 2634 2635 2636 2637 2638
Browse[1]>

@johnkerl johnkerl changed the title [python] Use 32-bit-friendly default shape for ingest [python] Use 31-bit-friendly default shape for ingest Jun 21, 2023
@johnkerl johnkerl force-pushed the kerl/32-bit-friendly branch from 18ac2ec to 41e7ee6 Compare June 21, 2023 20:08
@johnkerl johnkerl force-pushed the kerl/32-bit-friendly branch from 41e7ee6 to 6876784 Compare June 27, 2023 15:59
@johnkerl
Copy link
Member Author

johnkerl commented Jun 27, 2023

I just rebased on now-merged #1504

@johnkerl johnkerl requested a review from aaronwolen June 27, 2023 16:00
@johnkerl johnkerl marked this pull request as draft June 27, 2023 16:31
@johnkerl johnkerl force-pushed the kerl/32-bit-friendly branch 2 times, most recently from 81d6470 to cba7647 Compare June 29, 2023 23:20
@johnkerl
Copy link
Member Author

I just rebased on now-merged #1504

@aaronwolen @mojaveazure

  • Before that rebase, as narrated in thorough detail above
    • The test pytest apis/system/tests/test_write_anndata_read_seurat.py was passing (with warnings) on my laptop with very long runtime
    • It was 143'ing in CI (out of resources) -- which I (mistakenly) attributed to the high runtime
  • After that rebase:
    • The test pytest apis/system/tests/test_write_anndata_read_seurat.py is passing (with warnings) on my laptop with very good runtime (just a few seconds)
    • But still 143'ing in CI -- which I am now hypothesizing is due to RAM needs within the CI container

Not sure where to go next from here.

I'm not willing to merge a PR which turns interop CI (or any CI for that matter) from green to red.

@johnkerl johnkerl marked this pull request as ready for review June 30, 2023 17:23
johnkerl added 9 commits July 6, 2023 23:47

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
@johnkerl johnkerl force-pushed the kerl/32-bit-friendly branch from 612b137 to 9e3ab57 Compare July 7, 2023 03:50
@johnkerl johnkerl dismissed stale reviews from pablo-gar and bkmartinjr July 7, 2023 13:32

Files were regenerated. Extra non-default-shape arguments are a great idea, and outside the scope of this PR.

@johnkerl
Copy link
Member Author

johnkerl commented Jul 7, 2023

@aaronwolen @pablo-gar after rebase on #1504 #1508 #1521 this is now a simple, green-CI PR

Copy link
Member

@aaronwolen aaronwolen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@johnkerl johnkerl merged commit ee266fb into main Jul 7, 2023
@johnkerl johnkerl deleted the kerl/32-bit-friendly branch July 7, 2023 13:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants