Dataset details of TCGA-BRCA #5

naivete5656 · 2024-11-06T08:19:17Z

Hi, thank you for sharing the implementation of this excellent work!

I have three questions about TCGA-BRCA.

1. How did you determine which cases to use for training? There are 1,062 diagnostic slides, but only 1,041 cases were used to train your model. Could you explain how you selected these images?
2. Could you let me know why you used only diagnostic slides? The TCGA dataset also includes tissue slides, which appear to be linked to the same sample IDs as the RNA data. Since tissue slides visually resemble diagnostic slides, I wonder whether they could also be suitable for training.
3. How did you obtain the OncoTreeCode? I reviewed the shared CSV, which contains an OncoTreeCode for each case. According to TCGA-BRCA, there are nine disease types, but these types don’t seem to align with the OncoTreeCodes in your CSV. I assumed that Disease_type or diagnoses.0.primary_diagnosis might be the key field for identifying the OncoTreeCode, but these labels don’t appear to match those in your CSV.
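
To make the first two questions concrete, here is a minimal sketch of how slides can be grouped by case and by slide type, assuming the standard TCGA slide barcode layout (`DX` for diagnostic slides, `TS`/`BS` for tissue slides). The filenames and the helper names `parse_slide` and `cases_by_kind` are hypothetical and only for illustration; this is not the repository's selection procedure. It shows that one case can have several diagnostic slides, which is one possible reason slide and case counts differ, though I can't confirm that it explains the 1,062 vs. 1,041 gap.

```python
import re
from collections import defaultdict

# Assumed barcode layout:
#   TCGA-<site>-<participant>-<sample+vial>-<portion>-<slide kind><index>
SLIDE_RE = re.compile(
    r"^(TCGA-[A-Z0-9]{2}-[A-Z0-9]{4})"  # case (patient) ID
    r"-(\d{2}[A-Z])"                    # sample type + vial, e.g. 01A or 01Z
    r"-(\d{2})"                         # portion
    r"-(DX|TS|BS|MS)([A-Z0-9]+)"        # slide kind + index, e.g. DX1, TS1
)


def parse_slide(name):
    """Return case ID, sample code, and slide kind, or None if the name does not match."""
    m = SLIDE_RE.match(name)
    if m is None:
        return None
    case_id, sample, _portion, kind, _index = m.groups()
    return {"case_id": case_id, "sample": sample, "kind": kind}


def cases_by_kind(names):
    """Map each slide kind (DX, TS, ...) to the set of case IDs that have such a slide."""
    out = defaultdict(set)
    for name in names:
        info = parse_slide(name)
        if info is not None:
            out[info["kind"]].add(info["case_id"])
    return dict(out)


# Hypothetical filenames, for illustration only.
example_names = [
    "TCGA-AA-0001-01Z-00-DX1.svs",
    "TCGA-AA-0001-01Z-00-DX2.svs",  # second diagnostic slide, same case
    "TCGA-AA-0001-01A-01-TS1.svs",  # tissue slide, same case
    "TCGA-BB-0002-01Z-00-DX1.svs",
]
grouped = cases_by_kind(example_names)
print({kind: len(cases) for kind, cases in grouped.items()})
```

In this toy input, three diagnostic slides map to two cases, so counting slides and counting cases give different numbers.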

Best.
