Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ValueError: Unable to determine gene collection #25

Open
tingxie2020 opened this issue Aug 18, 2022 · 4 comments
Open

ValueError: Unable to determine gene collection #25

tingxie2020 opened this issue Aug 18, 2022 · 4 comments

Comments

@tingxie2020
Copy link

I receive the error message,

ValueError: Unable to determine gene collection. Please make sure the input dataset specifies either HUGO gene symbols or Entrez gene ID's.

W1_1.var.head()
gene_ids feature_types highly_variable means dispersions dispersions_norm n_cells mt rb n_cells_by_counts mean_counts pct_dropout_by_counts total_counts
Mrpl15 ENSMUSG00000033845 Gene Expression False 0.523846 1.410226 -0.317858 364 False False 364 0.125276 76.815287 196.682587
Lypla1 ENSMUSG00000025903 Gene Expression False 0.496954 1.360324 -0.604688 356 False False 356 0.117622 77.324841 184.666626
Tcea1 ENSMUSG00000033813 Gene Expression False 1.178549 1.267944 -0.668772 814 False False 814 0.410102 48.152866 643.860168
Atp6v1h ENSMUSG00000033793 Gene Expression False 0.555859 1.411481 -0.310645 389 False False 389 0.138000 75.222930 216.659286
Rb1cc1 ENSMUSG00000025907 Gene Expression False 1.298682 1.432109 0.011839 838 False False 838 0.449731 46.624204 706.077637

the gene_ids should be Entrez gene ids? I change the column name gene_ids to Entrez gene IDs or Gene stable ID or Entrez gene ids, but didn't work.
or what code should I use to map the gene ids to Ensembl BioMart (http://useast.ensembl.org/biomart) ?

Thanks
Ting

@tingxie2020
Copy link
Author

I used code:
annot = sc.queries.biomart_annotations(
"mmusculus",
["ensembl_gene_id", "Entrez_gene_ID","start_position", "end_position", "chromosome_name"],
).set_index("ensembl_gene_id")
annot

but got error:
KeyError Traceback (most recent call last)
File ~/ENTER/lib/python3.9/site-packages/pybiomart/dataset.py:243, in Dataset.query(self, attributes, filters, only_unique, use_attr_names)
242 try:
--> 243 attr = self.attributes[name]
244 self._add_attr_node(dataset, attr)

KeyError: 'Entrez_gene_ID'

During handling of the above exception, another exception occurred:

BiomartException Traceback (most recent call last)
Input In [65], in <cell line: 1>()
----> 1 annot = sc.queries.biomart_annotations(
2 "mmusculus",
3 ["ensembl_gene_id", "Entrez_gene_ID","start_position", "end_position", "chromosome_name"],
4 ).set_index("ensembl_gene_id")
5 annot

File ~/ENTER/lib/python3.9/site-packages/scanpy/queries/_queries.py:108, in biomart_annotations(org, attrs, host, use_cache)
74 @_doc_params(doc_org=_doc_org, doc_host=_doc_host, doc_use_cache=_doc_use_cache)
75 def biomart_annotations(
76 org: str,
(...)
80 use_cache: bool = False,
81 ) -> pd.DataFrame:
82 """
83 Retrieve gene annotations from ensembl biomart.
84
(...)
106 >>> adata.var[annot.columns] = annot
107 """
--> 108 return simple_query(org=org, attrs=attrs, host=host, use_cache=use_cache)

File ~/ENTER/lib/python3.9/site-packages/scanpy/queries/_queries.py:70, in simple_query(org, attrs, filters, host, use_cache)
66 server = Server(host, use_cache=use_cache)
67 dataset = server.marts["ENSEMBL_MART_ENSEMBL"].datasets[
68 "{}_gene_ensembl".format(org)
69 ]
---> 70 res = dataset.query(attributes=attrs, filters=filters, use_attr_names=True)
71 return res

File ~/ENTER/lib/python3.9/site-packages/pybiomart/dataset.py:246, in Dataset.query(self, attributes, filters, only_unique, use_attr_names)
244 self._add_attr_node(dataset, attr)
245 except KeyError:
--> 246 raise BiomartException(
247 'Unknown attribute {}, check dataset attributes '
248 'for a list of valid attributes.'.format(name))
250 if filters is not None:
251 # Add filter elements.
252 for name, value in filters.items():

BiomartException: Unknown attribute Entrez_gene_ID, check dataset attributes for a list of valid attributes.

I don't know how to check mmusculus dataset attributes for the entrez_gene_ids. Could anyone help?

This is for CellO cell type annotation which requires Entrez gene ids or HUGO gene symbols, could anyone help?

Thanks
Ting

@tingxie2020
Copy link
Author

I have used R biomaRt to find out the mmusculus dataset attributes. It is "entrezgene_id". So I got annot dataframe which contains entrez gene ids.

But now I encountered another issue when I used:

cello_data.var[annot.columns] = annot

for map the my anndata (cello_data2) gene ids to entrez gene ids and add the entrez gene ids. I got the error:

ValueError Traceback (most recent call last)
Input In [84], in <cell line: 1>()
----> 1 cello_data2.var[annot.columns] = annot

File ~/ENTER/lib/python3.9/site-packages/pandas/core/frame.py:3643, in DataFrame.setitem(self, key, value)
3641 self._setitem_frame(key, value)
3642 elif isinstance(key, (Series, np.ndarray, list, Index)):
-> 3643 self._setitem_array(key, value)
3644 elif isinstance(value, DataFrame):
3645 self._set_item_frame_value(key, value)

File ~/ENTER/lib/python3.9/site-packages/pandas/core/frame.py:3687, in DataFrame._setitem_array(self, key, value)
3685 check_key_length(self.columns, key, value)
3686 for k1, k2 in zip(key, value.columns):
-> 3687 self[k1] = value[k2]
3689 elif not is_list_like(value):
3690 for col in key:

File ~/ENTER/lib/python3.9/site-packages/pandas/core/frame.py:3655, in DataFrame.setitem(self, key, value)
3652 self._setitem_array([key], value)
3653 else:
3654 # set column
-> 3655 self._set_item(key, value)

File ~/ENTER/lib/python3.9/site-packages/pandas/core/frame.py:3832, in DataFrame._set_item(self, key, value)
3822 def _set_item(self, key, value) -> None:
3823 """
3824 Add series to DataFrame in specified column.
3825
(...)
3830 ensure homogeneity.
3831 """
-> 3832 value = self._sanitize_column(value)
3834 if (
3835 key in self.columns
3836 and value.ndim == 1
3837 and not is_extension_array_dtype(value)
3838 ):
3839 # broadcast across multiple columns if necessary
3840 if not self.columns.is_unique or isinstance(self.columns, MultiIndex):

File ~/ENTER/lib/python3.9/site-packages/pandas/core/frame.py:4532, in DataFrame._sanitize_column(self, value)
4530 # We should never get here with DataFrame value
4531 if isinstance(value, Series):
-> 4532 return _reindex_for_setitem(value, self.index)
4534 if is_list_like(value):
4535 com.require_length_match(value, self.index)

File ~/ENTER/lib/python3.9/site-packages/pandas/core/frame.py:10999, in _reindex_for_setitem(value, index)
10995 except ValueError as err:
10996 # raised in MultiIndex.from_tuples, see test_insert_error_msmgs
10997 if not value.index.is_unique:
10998 # duplicate axis

10999 raise err
11001 raise TypeError(
11002 "incompatible index of inserted column with frame index"
11003 ) from err
11004 return reindexed_value

File ~/ENTER/lib/python3.9/site-packages/pandas/core/frame.py:10994, in _reindex_for_setitem(value, index)
10992 # GH#4107
10993 try:

10994 reindexed_value = value.reindex(index)._values
10995 except ValueError as err:
10996 # raised in MultiIndex.from_tuples, see test_insert_error_msmgs
10997 if not value.index.is_unique:
10998 # duplicate axis

File ~/ENTER/lib/python3.9/site-packages/pandas/core/series.py:4672, in Series.reindex(self, *args, **kwargs)
4668 raise TypeError(
4669 "'index' passed as both positional and keyword argument"
4670 )
4671 kwargs.update({"index": index})
-> 4672 return super().reindex(**kwargs)

File ~/ENTER/lib/python3.9/site-packages/pandas/core/generic.py:4966, in NDFrame.reindex(self, *args, **kwargs)
4963 return self._reindex_multi(axes, copy, fill_value)
4965 # perform the reindex on the axes
-> 4966 return self._reindex_axes(
4967 axes, level, limit, tolerance, method, fill_value, copy
4968 ).finalize(self, method="reindex")

File ~/ENTER/lib/python3.9/site-packages/pandas/core/generic.py:4986, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
4981 new_index, indexer = ax.reindex(
4982 labels, level=level, limit=limit, tolerance=tolerance, method=method
4983 )
4985 axis = self._get_axis_number(a)
-> 4986 obj = obj._reindex_with_indexers(
4987 {axis: [new_index, indexer]},
4988 fill_value=fill_value,
4989 copy=copy,
4990 allow_dups=False,
4991 )
4992 # If we've made a copy once, no need to make another one
4993 copy = False

File ~/ENTER/lib/python3.9/site-packages/pandas/core/generic.py:5032, in NDFrame._reindex_with_indexers(self, reindexers, fill_value, copy, allow_dups)
5029 indexer = ensure_platform_int(indexer)
5031 # TODO: speed up on homogeneous DataFrame objects (see _reindex_multi)
-> 5032 new_data = new_data.reindex_indexer(
5033 index,
5034 indexer,
5035 axis=baxis,
5036 fill_value=fill_value,
5037 allow_dups=allow_dups,
5038 copy=copy,
5039 )
5040 # If we've made a copy once, no need to make another one
5041 copy = False

File ~/ENTER/lib/python3.9/site-packages/pandas/core/internals/managers.py:679, in BaseBlockManager.reindex_indexer(self, new_axis, indexer, axis, fill_value, allow_dups, copy, consolidate, only_slice, use_na_proxy)
677 # some axes don't allow reindexing with dups
678 if not allow_dups:
--> 679 self.axes[axis]._validate_can_reindex(indexer)
681 if axis >= self.ndim:
682 raise IndexError("Requested axis not found in manager")

File ~/ENTER/lib/python3.9/site-packages/pandas/core/indexes/base.py:4107, in Index._validate_can_reindex(self, indexer)
4105 # trying to reindex on an axis with duplicates
4106 if not self._index_as_unique and len(indexer):
-> 4107 raise ValueError("cannot reindex on an axis with duplicate labels")

ValueError: cannot reindex on an axis with duplicate labels

so, how to solve this?

@tingxie2020
Copy link
Author

When I used the code:
annot = sc.queries.biomart_annotations(
"mmusculus",
["hgnc_symbol", "entrezgene_id"],
)
annot

it extract the hgnc_symbol and entrezgene_id, then I set the index:
annot2=annot
annot2.set_index("hgnc_symbol")

this time it let me run the code successfully.
cello_data3.var[annot2.columns] = annot2

It didn't shown the duplicate index error.
then I rename the entrezgene_id with Entrez gene ID

but when I run the CellO code:
cello_resource_loc = "/opt/test_cello"
model_prefix = "cello_data3" # <-- The trained model will be stored in a file called GSM3516666_LX682_NORMAL.model.dill

cello.scanpy_cello(
cello_data3,
'leiden',
cello_resource_loc,
out_prefix=model_prefix
)
It still shows the error:
ValueError: Unable to determine gene collection. Please make sure the input dataset specifies either HUGO gene symbols or Entrez gene ID's.

I don't know what to do now. Any suggestions?

@tingxie2020
Copy link
Author

by the way, right now the cello_data3 head looks like:
cello_data3.var.head()

n_cells	highly_variable	means	dispersions	dispersions_norm	highly_variable_nbatches	highly_variable_intersection	hgnc_symbol	Entrez gene ID

Xkr4 4 False 0.000200 -0.470874 -1.075329 0 False NaN NaN
Gm19938 12 False 0.000936 0.114140 -0.173123 0 False NaN NaN
Rp1 31 False 0.004254 0.368517 0.485255 1 False NaN NaN
Sox17 8 False 0.000919 0.401739 0.376563 0 False NaN NaN
Mrpl15 2373 False 0.209986 0.241276 0.035849 0 False NaN NaN

somehow some genes don't have entrezgene ids.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant