Biomart().query returns request status together with results #182

Closed
TheoDps opened this issue Dec 19, 2022 · 3 comments

TheoDps commented Dec 19, 2022

First, thank you very much for this very useful package!

The DataFrame returned by Biomart().query contains the request's success status as its last row:

>>> import gseapy
>>> from gseapy import Biomart
>>> import sys
>>> print(sys.version)
3.8.8 (default, Apr 13 2021, 19:58:26)
[GCC 7.3.0]
>>> print(gseapy.__version__)
1.0.2
>>> bm = Biomart()
>>> queries = {
...     "ensembl_gene_id": ["ENSG00000125285", "ENSG00000182968"]
... }
>>> results = bm.query(
...     dataset="hsapiens_gene_ensembl",
...     attributes=["ensembl_gene_id", "external_gene_name", "entrezgene_id"],
...     filters=queries,
... )
>>> print(results)
   ensembl_gene_id external_gene_name  entrezgene_id
0  ENSG00000125285              SOX21          11166
1  ENSG00000182968               SOX1           6656
2        [success]                NaN           <NA>
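
Until a fix lands, the extra row can be dropped on the caller's side. Below is a minimal sketch, assuming the marker always sits in the first column of the last row, as in the output above. The helper name drop_status_row is hypothetical and not part of gseapy, and the frame is rebuilt by hand so the snippet runs without network access.

```python
import pandas as pd

# Hand-built copy of the frame printed above (no BioMart request is made).
results = pd.DataFrame(
    {
        "ensembl_gene_id": ["ENSG00000125285", "ENSG00000182968", "[success]"],
        "external_gene_name": ["SOX21", "SOX1", None],
        "entrezgene_id": pd.array([11166, 6656, None], dtype="Int32"),
    }
)


def drop_status_row(df: pd.DataFrame) -> pd.DataFrame:
    """Return df without a trailing '[success]' marker row.

    Hypothetical helper: it only inspects the first column of the last row
    and leaves the frame untouched when no marker is found.
    """
    if len(df) and str(df.iloc[-1, 0]).strip() == "[success]":
        return df.iloc[:-1].reset_index(drop=True)
    return df


cleaned = drop_status_row(results)
print(cleaned)
```

Calling the helper on an already clean frame returns it unchanged, so it should be safe to apply unconditionally.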

Side effect

Besides the mild annoyance of the extra row, this behaviour makes the query fail when the first column is one that gets converted to a numeric type, such as the Entrez ID. This happens, for instance, with the example from Biomart()'s docstring:

>>> from gseapy import Biomart
>>> bm = Biomart(verbose=False, host="ensembl.org")
>>> ## view validated marts
>>> marts = bm.get_marts()
>>> ## view validated dataset
>>> datasets = bm.get_datasets(mart='ENSEMBL_MART_ENSEMBL')
>>> ## view validated attributes
>>> attrs = bm.get_attributes(dataset='hsapiens_gene_ensembl')
>>> ## view validated filters
>>> filters = bm.get_filters(dataset='hsapiens_gene_ensembl')
>>> ## query results
>>> queries = ['ENSG00000125285','ENSG00000182968'] # a python list
>>> results = bm.query(dataset='hsapiens_gene_ensembl',
...                     attributes=['entrezgene_id', 'go_id'],
...                     filters={'ensembl_gene_id': queries}
...                     )
WARNING:root:host ensembl.org is not reachable, will try ensembl.org 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 13
     11 ## query results
     12 queries = ['ENSG00000125285','ENSG00000182968'] # a python list
---> 13 results = bm.query(dataset='hsapiens_gene_ensembl',
     14                     attributes=['entrezgene_id', 'go_id'],
     15                     filters={'ensembl_gene_id': queries}
     16                     )

File ~/.local/share/jupyter/3.4.2/lib/lib/python3.8/site-packages/gseapy/biomart.py:246, in Biomart.query(self, dataset, attributes, filters, filename)
    244     return df
    245 if "entrezgene_id" in df.columns:
--> 246     df["entrezgene_id"] = df["entrezgene_id"].astype(pd.Int32Dtype())
    248 self.results = df
    249 # save file to cache path.

File ~/.local/share/jupyter/3.4.2/lib/lib/python3.8/site-packages/pandas/core/generic.py:6240, in NDFrame.astype(self, dtype, copy, errors)
   6233     results = [
   6234         self.iloc[:, i].astype(dtype, copy=copy)
   6235         for i in range(len(self.columns))
   6236     ]
   6238 else:
   6239     # else, only a single dtype is given
-> 6240     new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   6241     return self._constructor(new_data).__finalize__(self, method="astype")
   6243 # GH 33113: handle empty frame or series

File ~/.local/share/jupyter/3.4.2/lib/lib/python3.8/site-packages/pandas/core/internals/managers.py:450, in BaseBlockManager.astype(self, dtype, copy, errors)
    449 def astype(self: T, dtype, copy: bool = False, errors: str = "raise") -> T:
--> 450     return self.apply("astype", dtype=dtype, copy=copy, errors=errors)

File ~/.local/share/jupyter/3.4.2/lib/lib/python3.8/site-packages/pandas/core/internals/managers.py:352, in BaseBlockManager.apply(self, f, align_keys, ignore_failures, **kwargs)
    350         applied = b.apply(f, **kwargs)
    351     else:
--> 352         applied = getattr(b, f)(**kwargs)
    353 except (TypeError, NotImplementedError):
    354     if not ignore_failures:

File ~/.local/share/jupyter/3.4.2/lib/lib/python3.8/site-packages/pandas/core/internals/blocks.py:526, in Block.astype(self, dtype, copy, errors)
    508 """
    509 Coerce to the new dtype.
    510 
   (...)
    522 Block
    523 """
    524 values = self.values
--> 526 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
    528 new_values = maybe_coerce_values(new_values)
    529 newb = self.make_block(new_values)

File ~/.local/share/jupyter/3.4.2/lib/lib/python3.8/site-packages/pandas/core/dtypes/astype.py:299, in astype_array_safe(values, dtype, copy, errors)
    296     return values.copy()
    298 try:
--> 299     new_values = astype_array(values, dtype, copy=copy)
    300 except (ValueError, TypeError):
    301     # e.g. astype_nansafe can fail on object-dtype of strings
    302     #  trying to convert to float
    303     if errors == "ignore":

File ~/.local/share/jupyter/3.4.2/lib/lib/python3.8/site-packages/pandas/core/dtypes/astype.py:230, in astype_array(values, dtype, copy)
    227     values = values.astype(dtype, copy=copy)
    229 else:
--> 230     values = astype_nansafe(values, dtype, copy=copy)
    232 # in pandas we don't store numpy str dtypes, so convert to object
    233 if isinstance(dtype, np.dtype) and issubclass(values.dtype.type, str):

File ~/.local/share/jupyter/3.4.2/lib/lib/python3.8/site-packages/pandas/core/dtypes/astype.py:95, in astype_nansafe(arr, dtype, copy, skipna)
     93 # dispatch on extension dtype if needed
     94 if isinstance(dtype, ExtensionDtype):
---> 95     return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, copy=copy)
     97 elif not isinstance(dtype, np.dtype):  # pragma: no cover
     98     raise ValueError("dtype must be np.dtype or ExtensionDtype")

File ~/.local/share/jupyter/3.4.2/lib/lib/python3.8/site-packages/pandas/core/arrays/masked.py:132, in BaseMaskedArray._from_sequence(cls, scalars, dtype, copy)
    128 @classmethod
    129 def _from_sequence(
    130     cls: type[BaseMaskedArrayT], scalars, *, dtype=None, copy: bool = False
    131 ) -> BaseMaskedArrayT:
--> 132     values, mask = cls._coerce_to_array(scalars, dtype=dtype, copy=copy)
    133     return cls(values, mask)

File ~/.local/share/jupyter/3.4.2/lib/lib/python3.8/site-packages/pandas/core/arrays/numeric.py:258, in NumericArray._coerce_to_array(cls, value, dtype, copy)
    256 default_dtype = dtype_cls._default_np_dtype
    257 mask = None
--> 258 values, mask, _, _ = _coerce_to_data_and_mask(
    259     value, mask, dtype, copy, dtype_cls, default_dtype
    260 )
    261 return values, mask

File ~/.local/share/jupyter/3.4.2/lib/lib/python3.8/site-packages/pandas/core/arrays/numeric.py:212, in _coerce_to_data_and_mask(values, mask, dtype, copy, dtype_cls, default_dtype)
    208     values[mask] = cls._internal_fill_value
    209 if inferred_type in ("string", "unicode"):
    210     # casts from str are always safe since they raise
    211     # a ValueError if the str cannot be parsed into a float
--> 212     values = values.astype(dtype, copy=copy)
    213 else:
    214     values = dtype_cls._safe_cast(values, dtype, copy=False)

ValueError: invalid literal for int() with base 10: '[success]'
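
The failing cast can be reproduced offline. The sketch below assumes the entrezgene_id column arrives as strings with the '[success]' marker appended, which is what the error message suggests; the Series is hand-built for illustration and no gseapy code is called.

```python
import pandas as pd

# Illustrative stand-in for the 'entrezgene_id' column when it is the first
# requested attribute: the status marker ends up among the IDs.
raw = pd.Series(["11166", "6656", "[success]"], dtype=object)

# Same kind of cast as in biomart.py line 246 of the traceback.
try:
    raw.astype(pd.Int32Dtype())
    cast_failed = False
except (ValueError, TypeError) as exc:
    cast_failed = True
    print(f"cast failed: {exc}")

# Filtering the marker out first lets the nullable-integer cast go through.
ids = raw[raw != "[success]"].astype(pd.Int32Dtype())
print(ids.tolist())
```

This also hints at why the first example in this issue did not crash: there, ensembl_gene_id was the first attribute, so the marker landed in a string column and entrezgene_id only received a missing value.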

edit: rephrasing for clarity

Author

TheoDps commented Dec 19, 2022

The extra [success] line seems to be the expected behaviour of BioMart's REST API when completionStamp = "1" is set: https://www.ensembl.org/info/data/biomart/biomart_restful.html#completionstamp.

It may be worth testing for response.text.rstrip().endswith("[success]") and removing that line, instead of testing for str(response.text).startswith("Query ERROR"), in query_simple?
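
A rough sketch of that idea, working on the raw response text only. The function name strip_completion_stamp and the sample body are made up for illustration; this is not gseapy's actual code, and treating a missing stamp as an error is an assumption about how the stamp is meant to be used.

```python
def strip_completion_stamp(text: str) -> str:
    """Remove a trailing '[success]' line from a BioMart response body.

    Hypothetical helper. Raises ValueError when the stamp is absent, on the
    assumption that a missing stamp means an incomplete or failed response.
    """
    lines = text.rstrip("\n").split("\n")
    if lines[-1].strip() != "[success]":
        raise ValueError("response is missing the [success] completion stamp")
    body = lines[:-1]
    # Restore the final newline only when some data rows remain.
    return "\n".join(body) + ("\n" if body else "")


# Illustrative tab-separated body shaped like the query result in this issue.
sample = (
    "ENSG00000125285\tSOX21\t11166\n"
    "ENSG00000182968\tSOX1\t6656\n"
    "[success]\n"
)
clean = strip_completion_stamp(sample)
print(clean, end="")
```

Stripping the stamp before the text is parsed into a DataFrame would keep the marker out of every column, so no later dtype conversion would see it.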

zqfang pushed a commit that referenced this issue Dec 19, 2022
Owner

zqfang commented Dec 19, 2022

Thanks for the bug report. Sorry I missed this bug. Fixing it now.

I'll release a new version soon.

zqfang pushed a commit that referenced this issue Dec 19, 2022
Owner

zqfang commented Dec 20, 2022

Fixed in the new release v1.0.3. Closing now.

zqfang closed this as completed Dec 20, 2022