mixer.generate and model.fit error #9

Open
Yooooopick opened this issue Jun 25, 2023 · 4 comments

Comments

@Yooooopick

Hello,
Thank you for your hard work on Kassandra. It's a nice and useful tool for estimating cell fractions from bulk RNA-seq data. After running git clone https://github.com/BostonGene/Kassandra/ and then the "Model Training.ipynb" notebook with the example data in the "/data" directory, I get the following error:

expr,values = mixer.generate('General_cells')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/Kassandra/core/mixer.py", line 133, in generate
    **self.generate_pure_cell_expressions(genes, self.num_av, [modeled_cell])}
  File "/root/Kassandra/core/mixer.py", line 189, in generate_pure_cell_expressions
    cells_index = self.change_subtype_proportions(cell=cell,
  File "/root/Kassandra/core/mixer.py", line 288, in change_subtype_proportions
    subtype_proportions = {cell: dict(self.proportions.loc[specified_subtypes])}
  File "/root/anaconda3/envs/kassandra/lib/python3.8/site-packages/pandas/core/indexing.py", line 1091, in __getitem__
    check_dict_or_set_indexers(key)
  File "/root/anaconda3/envs/kassandra/lib/python3.8/site-packages/pandas/core/indexing.py", line 2618, in check_dict_or_set_indexers
    raise TypeError(
TypeError: Passing a set as an indexer is not supported. Use a list instead.

and then,
>>> model.fit(mixer)
============== L1 models ==============
Generating mixes for B_cells model
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/Kassandra/core/model.py", line 78, in fit
    expr, values = mixer.generate(cell, genes=self.cell_types[cell].genes, random_seed=i+1)
  File "/root/Kassandra/core/mixer.py", line 132, in generate
    average_cells = {**self.generate_pure_cell_expressions(genes, 1, cells_to_mix),
  File "/root/Kassandra/core/mixer.py", line 189, in generate_pure_cell_expressions
    cells_index = self.change_subtype_proportions(cell=cell,
  File "/root/Kassandra/core/mixer.py", line 288, in change_subtype_proportions
    subtype_proportions = {cell: dict(self.proportions.loc[specified_subtypes])}
  File "/root/anaconda3/envs/kassandra/lib/python3.8/site-packages/pandas/core/indexing.py", line 1091, in __getitem__
    check_dict_or_set_indexers(key)
  File "/root/anaconda3/envs/kassandra/lib/python3.8/site-packages/pandas/core/indexing.py", line 2618, in check_dict_or_set_indexers
    raise TypeError(
TypeError: Passing a set as an indexer is not supported. Use a list instead.

Do you know what the problem might be?
Thank you!
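
For readers hitting the same message: the final TypeError in both tracebacks comes from pandas itself, which rejects a set passed to .loc. The sketch below is independent of Kassandra and only illustrates that behaviour. The names proportions and specified_subtypes are borrowed from the traceback, while the values are made up for illustration. It does not show what mixer.py actually contains beyond the line quoted in the traceback.

```python
import pandas as pd

# Toy data (hypothetical values); only the variable names come from the traceback.
proportions = pd.Series({"Monocytes": 0.6, "Macrophages": 0.4, "B_cells": 0.1})
specified_subtypes = {"Monocytes", "Macrophages"}  # a set, as in the failing call

try:
    # Recent pandas versions raise TypeError here; older ones only warn.
    proportions.loc[specified_subtypes]
    set_indexer_ok = True
except TypeError as err:
    set_indexer_ok = False
    print(f"set indexer rejected: {err}")

# Converting the set to a list (sorted here for a stable order) is accepted
# by both older and newer pandas versions.
subset = proportions.loc[sorted(specified_subtypes)]
result = dict(subset)
print(result)
```

Printing pd.__version__ in the failing environment is a quick way to see which pandas version is in use.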

@shpakb
Collaborator

shpakb commented Jun 27, 2023

Hi @Yooooopick,

Some cell type terms are missing from the "Cell_type" column of the cells annotation data frame. Here is some code to check which ones you are missing:

missing_cts = [x for x in cell_types.get_all_subtypes('General_cells') if x not in cells_annot['Cell_type'].unique()]
missing_cts

There shouldn't be any problems if you just run "Model Training.ipynb" as is. I just checked it.
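
To make the check above concrete without the Kassandra classes, here is a minimal self-contained sketch. The parent_of dictionary and the local get_all_subtypes function are hypothetical stand-ins for the real cell type hierarchy and for cell_types.get_all_subtypes; the toy annotation table is likewise invented.

```python
import pandas as pd

# Hypothetical stand-in for the cell type hierarchy: child -> parent.
parent_of = {
    "Immune_general": "General_cells",
    "Monocytic_cells": "Immune_general",
    "Monocytes": "Monocytic_cells",
    "Macrophages": "Monocytic_cells",
    "B_cells": "Immune_general",
}


def get_all_subtypes(root):
    """Return every descendant of `root` in the toy hierarchy."""
    found = []
    stack = [root]
    while stack:
        current = stack.pop()
        children = [child for child, parent in parent_of.items() if parent == current]
        found.extend(children)
        stack.extend(children)
    return found


# Toy annotation: only leaf cell types are annotated.
cells_annot = pd.DataFrame(
    {"Cell_type": ["Monocytes", "Macrophages", "B_cells"]},
    index=["s1", "s2", "s3"],
)

annotated = set(cells_annot["Cell_type"].unique())
missing_cts = sorted(ct for ct in get_all_subtypes("General_cells") if ct not in annotated)
print(missing_cts)
```

In this toy setup the intermediate types are the ones reported as missing, because only leaf types carry samples.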

@Yooooopick
Author

Thank you for your kind reply.
I ran the code, and the result is shown below:
> ['Immune_general', 'Monocytic_cells']
I think 'Immune_general' and 'Monocytic_cells' are higher-level annotation terms (parents of, for example, monocytes and macrophages), so they cannot actually appear in the training data.
With that in mind, I edited the "/config/cell_types.yaml" file, removed the 'Immune_general' and 'Monocytic_cells' entries, and changed the parent_type to "General_cells", even though the cell proportions and so on will then not be accurate. The same error appeared again.

In fact, I ran the "Model Training.ipynb" notebook with the example data in the "/data" directory right after cloning the repository, loading the data as shown below, and the error is still there.
cancer_sample_annot = pd.read_csv('data/cancer_samples_annot.tsv.tar.gz', sep='\t', index_col=0)
cancer_expr = pd.read_csv('data/cancer_expr.tsv.tar.gz', sep='\t', index_col=0)
cells_sample_annot = pd.read_csv('data/cells_samples_annot.tsv.tar.gz', sep='\t', index_col=0)
cells_expr = pd.read_csv('data/cells_expr.tsv.tar.gz', sep='\t', index_col=0)

I would appreciate any solution you can recommend.

@shpakb
Collaborator

shpakb commented Jun 28, 2023

Here is some code to patch annotation for missing cell types:

# adding missing cell types
cell_types = CellTypes.load('configs/full_blood_model.yaml')
missing_cts = [x for x in cell_types.get_all_subtypes('General_cells') if x not in cells_annot['Cell_type'].unique()]

for ct in missing_cts:
    subtypes = cell_types.get_direct_subtypes(ct)
    # copy the slices so the edits below do not modify (or warn about) the original frames
    annot = cells_annot.loc[cells_annot['Cell_type'].isin(subtypes)].copy()
    expr = cells_expr[annot.index].copy()
    # relabel the copied samples as the missing parent type
    annot['Cell_type'] = ct
    annot.index = annot.index + f'_{ct}'
    annot['Dataset'] = annot.index
    expr.columns = expr.columns + f'_{ct}'
    # append the relabelled copies to the full tables
    cells_expr = pd.concat([cells_expr, expr], axis=1)
    cells_annot = pd.concat([cells_annot, annot])

This duplicates the annotation and expression data for all direct subtypes of "Monocytic_cells" (Monocytes, Macrophages) and "Immune_general" (T, B, NK, mono, etc.). You can then proceed with training using the original config.
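
The effect of the patch can be seen on a toy example. This is a sketch under stated assumptions: the direct_subtypes dictionary is a hypothetical replacement for cell_types.get_direct_subtypes, and the sample names, datasets, and expression values are invented.

```python
import pandas as pd

# Hypothetical toy hierarchy: parent -> direct subtypes.
direct_subtypes = {"Monocytic_cells": ["Monocytes", "Macrophages"]}

# Toy annotation (samples as rows) and expression (samples as columns).
cells_annot = pd.DataFrame(
    {"Cell_type": ["Monocytes", "Macrophages", "B_cells"],
     "Dataset": ["d1", "d2", "d3"]},
    index=["s1", "s2", "s3"],
)
cells_expr = pd.DataFrame(
    {"s1": [1.0, 2.0], "s2": [3.0, 4.0], "s3": [5.0, 6.0]},
    index=["GENE_A", "GENE_B"],
)

missing_cts = ["Monocytic_cells"]

for ct in missing_cts:
    subtypes = direct_subtypes[ct]
    # Copy the samples of the direct subtypes ...
    annot = cells_annot.loc[cells_annot["Cell_type"].isin(subtypes)].copy()
    expr = cells_expr[annot.index].copy()
    # ... relabel them as the parent type with a suffixed sample name ...
    annot["Cell_type"] = ct
    annot.index = annot.index + f"_{ct}"
    annot["Dataset"] = annot.index
    expr.columns = expr.columns + f"_{ct}"
    # ... and append the copies to the original tables.
    cells_expr = pd.concat([cells_expr, expr], axis=1)
    cells_annot = pd.concat([cells_annot, annot])

print(cells_annot)
print(cells_expr.columns.tolist())
```

After the loop, the toy tables contain two extra samples, s1_Monocytic_cells and s2_Monocytic_cells, whose expression values are copies of s1 and s2.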

@jsangalang

Hello, I still believe there is an error in the training dataset provided on the website. I tried the patch you posted, but the "Dendritic_cells" cell type listed in cell_types.yaml is still not found in the training dataset.
Commenting out Dendritic_cells in cell_types.yaml worked. Please address this issue in your dataset annotation.

model_column = 'Tumor_model_annot'
samples = data_annot.loc[data_annot[model_column] == 'cancer_cells'].index
cancer_expr = data_expr[samples]
# copy so that adding a column does not modify (or warn about) data_annot
cancer_annot = data_annot.loc[samples].copy()
cancer_annot['Tumor_type'] = cancer_annot['Dataset']
cancer_annot = cancer_annot[['Tumor_type', 'Dataset']]

samples = data_annot.loc[~data_annot[model_column].isna()].index
cells_expr = data_expr[samples]

cells_annot = data_annot.loc[samples]
cells_annot = cells_annot[[model_column, 'Dataset']]
cells_annot.columns = ['Cell_type', 'Dataset']
cells_annot = pd.concat([lab_annot, cells_annot])
cells_annot.loc[cells_annot['Dataset'].isna(), 'Dataset'] = cells_annot.loc[cells_annot['Dataset'].isna()].index
cells_expr = pd.concat([lab_expr, cells_expr], axis=1)

# to make sure that there are no repeated samples
samples = sorted(set(cells_annot.index).intersection(cells_expr.columns))
cells_expr = cells_expr[samples]
cells_annot = cells_annot.loc[samples]

print(cells_expr.shape, cells_annot.shape)
print(cancer_expr.shape, cancer_annot.shape)

##############################

# Load cell types model

cell_types = CellTypes.load('configs/cell_types.yaml')
missing_cts = [x for x in cell_types.get_all_subtypes('General_cells') if x not in cells_annot['Cell_type'].unique()]
missing_cts

for ct in missing_cts:
    subtypes = cell_types.get_direct_subtypes(ct)
    # copy the slices so the edits below do not modify (or warn about) the original frames
    annot = cells_annot.loc[cells_annot['Cell_type'].isin(subtypes)].copy()
    expr = cells_expr[annot.index].copy()
    # relabel the copied samples as the missing parent type
    annot['Cell_type'] = ct
    annot.index = annot.index + f'_{ct}'
    annot['Dataset'] = annot.index
    expr.columns = expr.columns + f'_{ct}'
    # append the relabelled copies to the full tables
    cells_expr = pd.concat([cells_expr, expr], axis=1)
    cells_annot = pd.concat([cells_annot, annot])

# to make sure that there are no repeated samples
samples = sorted(set(cells_annot.index).intersection(cells_expr.columns))
cells_expr = cells_expr[samples]
cells_annot = cells_annot.loc[samples]
print(cells_expr.shape, cells_annot.shape)
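
The alignment step used twice in the code above (keeping only sample labels shared by the annotation and the expression table) can be tried in isolation. This is a minimal sketch with invented sample names and values; the is_unique checks are an optional extra, added here because selecting by label keeps every row or column that carries a duplicated label.

```python
import pandas as pd

# Toy tables (hypothetical): s3 has expression but no annotation, s4 the reverse.
cells_annot = pd.DataFrame(
    {"Cell_type": ["B_cells", "Monocytes", "NK_cells"]},
    index=["s2", "s1", "s4"],
)
cells_expr = pd.DataFrame(
    {"s1": [1.0], "s2": [2.0], "s3": [3.0]},
    index=["GENE_A"],
)

# Keep only the sample labels present in both tables, in a stable order.
samples = sorted(set(cells_annot.index).intersection(cells_expr.columns))
cells_expr = cells_expr[samples]
cells_annot = cells_annot.loc[samples]

# Optional sanity checks: same samples, same order, no duplicated labels.
aligned = list(cells_expr.columns) == list(cells_annot.index)
unique = cells_annot.index.is_unique and cells_expr.columns.is_unique
print(samples, aligned, unique)
```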
