Flat file -> DataMatrix? #21

Open
kescobo opened this issue Aug 8, 2024 · 5 comments

@kescobo
Member

kescobo commented Aug 8, 2024

Is there a description somewhere of how to create a DataMatrix from some other data type? The tutorial doesn't make this clear; it provides data that's already in the correct format.

I have spatial-transcriptomics data that looks like this:

42239698×10 DataFrame
      Row │ fov     cell_ID  cell         x_local_px  y_local_px  x_global_px  y_global_px  z      target            CellComp
          │ UInt16  UInt16   String15     UInt16      UInt16      Float32      Float32      Int16  String31          String15
──────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        1 │      1        0  c_1_1_0            4256        1262      7762.71      80367.5      0  Bmpr1a            None
        2 │      1        0  c_1_1_0            4256        1304      7762.79      80325.8      7  Tcl1              None
        3 │      1        0  c_1_1_0            4256        2269      7762.8       79360.9      6  Twist1            None

The table is ~42 million rows.

I can get counts / cell with

cell_counts = combine(groupby(tx, "cell"), "target" =>
                      (t-> [(; target, count = count(==(target), t)) for target in t]) =>
                      ["target", "count"]
)

Though it takes a long time. Just wondering if there's an obvious way to get to the sparse matrix / DataMatrix format?

@rasmushenningsson
Collaborator

If I understand things correctly you actually want

cell_counts = combine(groupby(tx,["cell","target"]), nrow=>"count")

because the version above will give you duplicate lines.
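As a small illustration of the grouped count above, here is a sketch on a made-up table. The name `tx_toy` and its values are invented for the example; only the `combine`/`groupby`/`nrow` call mirrors the thread.

```julia
using DataFrames

# Toy stand-in for `tx`: one row per detected transcript (values are made up).
tx_toy = DataFrame(
    cell   = ["c_1_1_0", "c_1_1_0", "c_1_1_0", "c_1_1_1", "c_1_1_1"],
    target = ["Bmpr1a",  "Bmpr1a",  "Tcl1",    "Tcl1",    "Twist1"],
)

# Grouping on both columns yields exactly one row per (cell, target) pair,
# and `nrow` stores the size of each group in the new "count" column.
toy_counts = combine(groupby(tx_toy, ["cell", "target"]), nrow => "count")
```

Because each (cell, target) pair forms its own group, no pair can appear more than once in the result.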

But we can also construct a DataMatrix directly from tx by:

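using SparseArrays  # provides sparse

cells = unique(tx.cell)
targets = unique(tx.target)

# indexin returns Union{Nothing,Int} elements; identity.() narrows the element type to Int
cell_ind = identity.(indexin(tx.cell, cells))
target_ind = identity.(indexin(tx.target, targets))

# one entry of 1 per transcript, with targets as rows and cells as columns
X = sparse(target_ind, cell_ind, 1)

counts = DataMatrix(X, DataFrame(id=targets, name=targets), DataFrame(cell_id=cells))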

which lets sparse handle the duplicate rows for us (entries with the same indices are summed).

It's a bit manual to do it this way; it would be nice if this were possible without the user having to explicitly work with indices.
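To see how `sparse` sums repeated index pairs, here is a minimal sketch on made-up data. The vectors `cell_col` and `target_col` are hypothetical stand-ins for `tx.cell` and `tx.target`.

```julia
using SparseArrays

# Hypothetical toy data: one entry per transcript (values are made up).
cell_col   = ["c1", "c1", "c1", "c2", "c2"]
target_col = ["Bmpr1a", "Bmpr1a", "Tcl1", "Tcl1", "Twist1"]

cells   = unique(cell_col)     # distinct cells, in order of first appearance
targets = unique(target_col)   # distinct targets, in order of first appearance

# Position of every entry within the unique lists, narrowed to plain Int.
cell_ind   = identity.(indexin(cell_col, cells))
target_ind = identity.(indexin(target_col, targets))

# Repeated (target, cell) index pairs are combined with `+` by default,
# so each stored value is the number of transcripts for that pair.
X = sparse(target_ind, cell_ind, 1)
```

Here "Bmpr1a" occurs twice in cell "c1", so `X[1, 1]` holds 2, while the pairs that occur once hold 1.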

@rasmushenningsson
Collaborator

rasmushenningsson commented Aug 12, 2024

Do you think it would be worthwhile to add a utility function for this?
Something like:

accumulate_data_matrix(tx; obs_cols="cell", var_cols="target", obs_annot_cols=["fov", "cell_ID"])

or

accumulate_data_matrix(tx; values="counts", obs_cols="cell", var_cols="target", obs_annot_cols=["fov", "cell_ID"])

if you have a column with counts.

(A better name would be nice though. 🙂)
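One possible shape for such a helper is sketched below. Everything in it is an assumption made for illustration: the name `accumulate_counts`, the keywords `obs_col`, `var_col` and `value_col`, and the choice to return the sparse matrix plus two annotation tables rather than a DataMatrix. It also leaves out the `obs_annot_cols` part of the proposal.

```julia
using DataFrames, SparseArrays

# Hypothetical helper, not part of any released API. It returns the pieces a
# DataMatrix could be built from instead of constructing one itself.
function accumulate_counts(tx; obs_col, var_col, value_col=nothing)
    obs  = unique(tx[!, obs_col])
    vars = unique(tx[!, var_col])
    obs_ind = identity.(indexin(tx[!, obs_col], obs))
    var_ind = identity.(indexin(tx[!, var_col], vars))
    # Without a value column, every row counts as one observation.
    v = value_col === nothing ? ones(Int, nrow(tx)) : tx[!, value_col]
    # Values that share a (variable, observation) index pair are summed.
    X = sparse(var_ind, obs_ind, v, length(vars), length(obs))
    return X, DataFrame(id=vars), DataFrame(id=obs)
end

# Usage on a made-up table with one row per transcript.
tx_toy = DataFrame(
    cell   = ["c1", "c1", "c1", "c2", "c2"],
    target = ["Bmpr1a", "Bmpr1a", "Tcl1", "Tcl1", "Twist1"],
)
X, var_annot, obs_annot = accumulate_counts(tx_toy; obs_col="cell", var_col="target")
```

Passing `value_col` would cover the second variant, where the table already has a column of counts to be summed.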

@kescobo
Member Author

kescobo commented Aug 16, 2024

> cell_counts = combine(groupby(tx,["cell","target"]), nrow=>"count")

Ooh, that is much cleaner 😅

> Do you think it would be worthwhile to add a utility function for this?

Hmm - yeah, that looks great. I agree that there could be a nicer name. Is there a way to abuse multiple dispatch on the DataMatrix constructor? It's not everyone's cup of tea, but I like that there are something like 10 different ways to build a DataFrame, for example. Worst case, there could be something like CountTable(tx, args...) and then DataMatrix(ct::CountTable, args...).

@kescobo
Member Author

kescobo commented Aug 21, 2024

In case you missed it on Slack #appreciation - this approach worked great!

(Tried to upload the gif here, but it's too big)

I managed to get SVD and UMAP in about 10 sec, where it crashed my buddy's iMac when he tried to do it in Seurat 👍

@rasmushenningsson
Collaborator

That's great to hear. 😄

Yeah, I'm not too fond of (ab)using the DataMatrix constructor for this. (I do agree in the DataFrames case, but I feel this is more specialized, so the user needs to signal intent in some way.) I'll try to think of something.
