`Biomart().query` returns request status together with results #182

TheoDps · 2022-12-19T10:56:32Z

First, thank you very much for this very useful package!

The dataframe returned by biomart query contains the request's success status on its last line:

>>> import gseapy
>>> from gseapy import Biomart
>>> import sys
>>> print(sys.version)
3.8.8 (default, Apr 13 2021, 19:58:26) 
[GCC 7.3.0]
>>> print(gseapy.__version__)
1.0.2
>>> bm = Biomart()
>>> queries = {
...     "ensembl_gene_id": ["ENSG00000125285", "ENSG00000182968"]
... }
>>> results = bm.query(
...     dataset="hsapiens_gene_ensembl",
...     attributes=["ensembl_gene_id", "external_gene_name", "entrezgene_id"],
...     filters=queries,
... )
>>> print(results)
   ensembl_gene_id external_gene_name  entrezgene_id
0  ENSG00000125285              SOX21          11166
1  ENSG00000182968               SOX1           6656
2        [success]                NaN           <NA>
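
Until this is fixed in the package, the stray row can be dropped on the caller's side when the query itself succeeds. The snippet below is only a sketch of that workaround: `drop_completion_stamp` is a hypothetical helper, not part of gseapy, and it assumes the marker always appears as the literal string `[success]` in the first column of the last row, as in the output above.

```python
import pandas as pd


def drop_completion_stamp(df, stamp="[success]"):
    """Return df without a trailing completion-stamp row (hypothetical helper)."""
    # Only the last row is inspected, and only its first column,
    # because that is where the marker shows up in the output above.
    if not df.empty and str(df.iloc[-1, 0]) == stamp:
        return df.iloc[:-1].copy()
    return df


# Stand-in for the frame returned by bm.query(...) in gseapy 1.0.2.
results = pd.DataFrame(
    {
        "ensembl_gene_id": ["ENSG00000125285", "ENSG00000182968", "[success]"],
        "external_gene_name": ["SOX21", "SOX1", None],
    }
)
clean = drop_completion_stamp(results)
```

This does not help when the query raises inside the library, as in the side effect described next.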

Side effect

Beyond the mild annoyance of the extra line, this behaviour makes the query fail when the first column is one that gets converted to a numeric type, such as the Entrez ID: the [success] marker ends up in that column and cannot be parsed as an integer. For instance, the example from Biomart()'s docstring fails:

>>> from gseapy import Biomart
>>> bm = Biomart(verbose=False, host="ensembl.org")
>>> ## view validated marts
>>> marts = bm.get_marts()
>>> ## view validated dataset
>>> datasets = bm.get_datasets(mart='ENSEMBL_MART_ENSEMBL')
>>> ## view validated attributes
>>> attrs = bm.get_attributes(dataset='hsapiens_gene_ensembl')
>>> ## view validated filters
>>> filters = bm.get_filters(dataset='hsapiens_gene_ensembl')
>>> ## query results
>>> queries = ['ENSG00000125285','ENSG00000182968'] # a python list
>>> results = bm.query(dataset='hsapiens_gene_ensembl',
                    attributes=['entrezgene_id', 'go_id'],
                    filters={'ensembl_gene_id': queries}
                    )
WARNING:root:host ensembl.org is not reachable, will try ensembl.org 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 13
     11 ## query results
     12 queries = ['ENSG00000125285','ENSG00000182968'] # a python list
---> 13 results = bm.query(dataset='hsapiens_gene_ensembl',
     14                     attributes=['entrezgene_id', 'go_id'],
     15                     filters={'ensembl_gene_id': queries}
     16                     )

File ~/.local/share/jupyter/3.4.2/lib/lib/python3.8/site-packages/gseapy/biomart.py:246, in Biomart.query(self, dataset, attributes, filters, filename)
    244     return df
    245 if "entrezgene_id" in df.columns:
--> 246     df["entrezgene_id"] = df["entrezgene_id"].astype(pd.Int32Dtype())
    248 self.results = df
    249 # save file to cache path.

File ~/.local/share/jupyter/3.4.2/lib/lib/python3.8/site-packages/pandas/core/generic.py:6240, in NDFrame.astype(self, dtype, copy, errors)
   6233     results = [
   6234         self.iloc[:, i].astype(dtype, copy=copy)
   6235         for i in range(len(self.columns))
   6236     ]
   6238 else:
   6239     # else, only a single dtype is given
-> 6240     new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   6241     return self._constructor(new_data).__finalize__(self, method="astype")
   6243 # GH 33113: handle empty frame or series

File ~/.local/share/jupyter/3.4.2/lib/lib/python3.8/site-packages/pandas/core/internals/managers.py:450, in BaseBlockManager.astype(self, dtype, copy, errors)
    449 def astype(self: T, dtype, copy: bool = False, errors: str = "raise") -> T:
--> 450     return self.apply("astype", dtype=dtype, copy=copy, errors=errors)

File ~/.local/share/jupyter/3.4.2/lib/lib/python3.8/site-packages/pandas/core/internals/managers.py:352, in BaseBlockManager.apply(self, f, align_keys, ignore_failures, **kwargs)
    350         applied = b.apply(f, **kwargs)
    351     else:
--> 352         applied = getattr(b, f)(**kwargs)
    353 except (TypeError, NotImplementedError):
    354     if not ignore_failures:

File ~/.local/share/jupyter/3.4.2/lib/lib/python3.8/site-packages/pandas/core/internals/blocks.py:526, in Block.astype(self, dtype, copy, errors)
    508 """
    509 Coerce to the new dtype.
    510 
   (...)
    522 Block
    523 """
    524 values = self.values
--> 526 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
    528 new_values = maybe_coerce_values(new_values)
    529 newb = self.make_block(new_values)

File ~/.local/share/jupyter/3.4.2/lib/lib/python3.8/site-packages/pandas/core/dtypes/astype.py:299, in astype_array_safe(values, dtype, copy, errors)
    296     return values.copy()
    298 try:
--> 299     new_values = astype_array(values, dtype, copy=copy)
    300 except (ValueError, TypeError):
    301     # e.g. astype_nansafe can fail on object-dtype of strings
    302     #  trying to convert to float
    303     if errors == "ignore":

File ~/.local/share/jupyter/3.4.2/lib/lib/python3.8/site-packages/pandas/core/dtypes/astype.py:230, in astype_array(values, dtype, copy)
    227     values = values.astype(dtype, copy=copy)
    229 else:
--> 230     values = astype_nansafe(values, dtype, copy=copy)
    232 # in pandas we don't store numpy str dtypes, so convert to object
    233 if isinstance(dtype, np.dtype) and issubclass(values.dtype.type, str):

File ~/.local/share/jupyter/3.4.2/lib/lib/python3.8/site-packages/pandas/core/dtypes/astype.py:95, in astype_nansafe(arr, dtype, copy, skipna)
     93 # dispatch on extension dtype if needed
     94 if isinstance(dtype, ExtensionDtype):
---> 95     return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, copy=copy)
     97 elif not isinstance(dtype, np.dtype):  # pragma: no cover
     98     raise ValueError("dtype must be np.dtype or ExtensionDtype")

File ~/.local/share/jupyter/3.4.2/lib/lib/python3.8/site-packages/pandas/core/arrays/masked.py:132, in BaseMaskedArray._from_sequence(cls, scalars, dtype, copy)
    128 @classmethod
    129 def _from_sequence(
    130     cls: type[BaseMaskedArrayT], scalars, *, dtype=None, copy: bool = False
    131 ) -> BaseMaskedArrayT:
--> 132     values, mask = cls._coerce_to_array(scalars, dtype=dtype, copy=copy)
    133     return cls(values, mask)

File ~/.local/share/jupyter/3.4.2/lib/lib/python3.8/site-packages/pandas/core/arrays/numeric.py:258, in NumericArray._coerce_to_array(cls, value, dtype, copy)
    256 default_dtype = dtype_cls._default_np_dtype
    257 mask = None
--> 258 values, mask, _, _ = _coerce_to_data_and_mask(
    259     value, mask, dtype, copy, dtype_cls, default_dtype
    260 )
    261 return values, mask

File ~/.local/share/jupyter/3.4.2/lib/lib/python3.8/site-packages/pandas/core/arrays/numeric.py:212, in _coerce_to_data_and_mask(values, mask, dtype, copy, dtype_cls, default_dtype)
    208     values[mask] = cls._internal_fill_value
    209 if inferred_type in ("string", "unicode"):
    210     # casts from str are always safe since they raise
    211     # a ValueError if the str cannot be parsed into a float
--> 212     values = values.astype(dtype, copy=copy)
    213 else:
    214     values = dtype_cls._safe_cast(values, dtype, copy=False)

ValueError: invalid literal for int() with base 10: '[success]'

edit: rephrasing for clarity


TheoDps · 2022-12-19T11:25:23Z

The extra [success] line seems to be the expected behaviour of BioMart's REST API when completionStamp = "1" is set: https://www.ensembl.org/info/data/biomart/biomart_restful.html#completionstamp.

It may be worth checking in query_simple whether response.text ends with "[success]" and stripping that marker, instead of testing str(response.text).startswith("Query ERROR")?
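
A rough sketch of that idea, not the actual gseapy code: `parse_biomart_text` and `COMPLETION_STAMP` are hypothetical names, and the sketch assumes a tab-separated response without a header row that ends with the `[success]` marker when the query completed.

```python
import io

import pandas as pd

COMPLETION_STAMP = "[success]"  # marker appended when completionStamp = "1"


def parse_biomart_text(text, columns):
    """Parse a BioMart TSV response, stripping the completion stamp.

    Hypothetical helper for illustration; not part of gseapy.
    """
    body = text.rstrip()
    # Treat a missing stamp as an incomplete or failed query.
    if not body.endswith(COMPLETION_STAMP):
        raise ValueError("BioMart response is incomplete or reported an error")
    body = body[: -len(COMPLETION_STAMP)]
    if not body.strip():
        # The query completed but matched nothing.
        return pd.DataFrame(columns=columns)
    df = pd.read_csv(
        io.StringIO(body), sep="\t", header=None, names=columns, dtype=str
    )
    if "entrezgene_id" in df.columns:
        # Safe now that the marker row is gone; empty fields become <NA>.
        df["entrezgene_id"] = pd.to_numeric(df["entrezgene_id"]).astype(
            pd.Int32Dtype()
        )
    return df


columns = ["ensembl_gene_id", "external_gene_name", "entrezgene_id"]
text = (
    "ENSG00000125285\tSOX21\t11166\n"
    "ENSG00000182968\tSOX1\t6656\n"
    "[success]\n"
)
df = parse_biomart_text(text, columns)
```

Checking for the stamp before parsing also covers truncated responses, which a startswith("Query ERROR") test alone would let through.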

zqfang · 2022-12-19T21:12:33Z

Thanks for the bug report. Sorry I missed this bug. I'm fixing it now.

I'll release a new version soon.

zqfang · 2022-12-20T17:57:58Z

Fixed in the new release, v1.0.3. Closing now.

zqfang pushed a commit that referenced this issue Dec 19, 2022

#182, don't return completionstamp

be3969a

zqfang pushed a commit that referenced this issue Dec 19, 2022

#182. better output

3ae736f

zqfang closed this as completed Dec 20, 2022
