Flat file -> DataMatrix? #21

kescobo · 2024-08-08T16:31:51Z

Is there a description somewhere of how to create a DataMatrix from some other data type? The tutorial doesn't make this clear; it provides data that's already in the correct format.

I have spatial-transcriptomics data that looks like this:

42239698×10 DataFrame
      Row │ fov     cell_ID  cell         x_local_px  y_local_px  x_global_px  y_global_px  z      target            CellComp
          │ UInt16  UInt16   String15     UInt16      UInt16      Float32      Float32      Int16  String31          String15
──────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        1 │      1        0  c_1_1_0            4256        1262      7762.71      80367.5      0  Bmpr1a            None
        2 │      1        0  c_1_1_0            4256        1304      7762.79      80325.8      7  Tcl1              None
        3 │      1        0  c_1_1_0            4256        2269      7762.8       79360.9      6  Twist1            None

The table is ~42 million rows.

I can get counts per cell with

cell_counts = combine(groupby(tx, "cell"), "target" =>
                      (t-> [(; target, count = count(==(target), t)) for target in t]) =>
                      ["target", "count"]
)

Though it takes a long time. Just wondering if there's an obvious way to get to the sparse matrix / DataMatrix format?


rasmushenningsson · 2024-08-12T10:56:47Z

If I understand things correctly, you actually want

cell_counts = combine(groupby(tx,["cell","target"]), nrow=>"count")

because the version above will give you duplicate lines.

But we can also construct a DataMatrix directly from tx by:

cells = unique(tx.cell)
targets = unique(tx.target)

cell_ind = identity.(indexin(tx.cell, cells))
target_ind = identity.(indexin(tx.target, targets))

X = sparse(target_ind, cell_ind, 1)

counts = DataMatrix(X, DataFrame(id=targets, name=targets), DataFrame(cell_id=cells))

which lets sparse handle the duplicate rows for us: repeated (target, cell) index pairs are summed, so each entry ends up as a count.
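To see the duplicate handling in isolation, here is a minimal sketch using only SparseArrays. The `cell` and `target` vectors are made-up stand-ins for `tx.cell` and `tx.target`, not real data, and the DataMatrix step is left out so the snippet runs without any extra packages.

```julia
using SparseArrays

# Toy stand-ins for tx.cell and tx.target (hypothetical values, one entry per transcript)
cell   = ["c_1", "c_1", "c_1", "c_2", "c_2"]
target = ["Tcl1", "Tcl1", "Twist1", "Tcl1", "Bmpr1a"]

# Unique labels, in order of first appearance
cells   = unique(cell)
targets = unique(target)

# Position of each row's label in the unique lists;
# identity.() narrows the element type from Union{Nothing,Int} to Int
cell_ind   = identity.(indexin(cell, cells))
target_ind = identity.(indexin(target, targets))

# Each row contributes a 1; repeated (target, cell) pairs are added together
X = sparse(target_ind, cell_ind, 1)
```

Here the pair (Tcl1, c_1) occurs twice, so that entry of `X` becomes 2, while pairs that never occur stay as structural zeros.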

It's a bit manual to do it this way; it would be nice if this were possible without the user having to explicitly work with indices.

rasmushenningsson · 2024-08-12T10:58:00Z

Do you think it would be worthwhile to add a utility function for this?
Something like:

accumulate_data_matrix(tx; obs_cols="cell", var_cols="target", obs_annot_cols=["fov", "cell_ID"])

or

accumulate_data_matrix(tx; values="counts", obs_cols="cell", var_cols="target", obs_annot_cols=["fov", "cell_ID"])

if you have a column with counts.

(A better name would be nice though. 🙂)
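For illustration only, a rough sketch of what the core of such a helper might do. The name `accumulate_counts`, its signature, and its return value are all hypothetical and not part of any existing package; it works on plain vectors rather than a DataFrame and returns the sparse matrix plus the row and column labels, leaving the DataMatrix construction and annotation columns aside.

```julia
using SparseArrays

# Hypothetical helper, NOT an existing API: accumulates one value per input row
# into a (variables × observations) sparse matrix.
function accumulate_counts(obs::AbstractVector, var::AbstractVector;
                           values::AbstractVector{<:Number} = ones(Int, length(obs)))
    length(obs) == length(var) == length(values) ||
        throw(ArgumentError("obs, var and values must have the same length"))

    obs_ids = unique(obs)
    var_ids = unique(var)

    # Dict lookups map each label to its column/row index
    obs_lookup = Dict(id => i for (i, id) in enumerate(obs_ids))
    var_lookup = Dict(id => i for (i, id) in enumerate(var_ids))

    I = [var_lookup[v] for v in var]   # row index per input row
    J = [obs_lookup[o] for o in obs]   # column index per input row

    # Duplicate (I, J) pairs are summed by sparse
    X = sparse(I, J, values, length(var_ids), length(obs_ids))
    return X, var_ids, obs_ids
end

# One transcript per row: entries are counts
X, var_ids, obs_ids = accumulate_counts(["c_1", "c_1", "c_2"], ["Tcl1", "Tcl1", "Twist1"])

# Pre-aggregated input: pass the existing counts via `values`
Y, _, _ = accumulate_counts(["c_1", "c_2"], ["Tcl1", "Tcl1"]; values = [3, 5])
```

The `values` keyword mirrors the second variant above: without it every row counts as 1, and with it the supplied numbers are summed instead.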

kescobo · 2024-08-16T13:23:14Z

> cell_counts = combine(groupby(tx,["cell","target"]), nrow=>"count")

Ooh, that is much cleaner 😅

> Do you think it would be worthwhile to add a utility function for this?

Hmm - yeah, that looks great. Agree that there could be a nicer name - is there a way to abuse multiple dispatch on the DataMatrix constructor? It's not everyone's cup of tea, but I like that there are like 10 different ways to build a DataFrame, for example. Worst case there could be something like CountTable(tx, args...) and then DataMatrix(ct::CountTable, args...)

kescobo · 2024-08-21T01:43:30Z

In case you missed it on Slack #appreciation - this approach worked great!

(Tried to upload the gif here, but it's too big)

I managed to get SVD and UMAP in about 10 sec, where it crashed my buddy's iMac when he tried to do it in Seurat 👍

rasmushenningsson · 2024-08-27T12:49:11Z

That's great to hear. 😄

Yeah, I'm not too fond of (ab)using the DataMatrix constructor for this. (But I do agree in the DataFrames case; I just feel this is more specialized, so the user needs to signal intent in some way.) I'll try to think of something.
