
HDF5 file not generated when exposure feature module is used on PDB files from Propedia database #463

Closed
DanLep97 opened this issue Aug 4, 2023 · 5 comments · Fixed by #465
Assignees
Labels
bug Something isn't working

Comments

@DanLep97
Collaborator

DanLep97 commented Aug 4, 2023

Describe the bug
When building the HDF5 file of the graph database using the exposure feature module on PDB files from the Propedia database (and from the ProtCID database as well), I get the following error:

"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/root/deeprankcore/deeprank-core/deeprankcore/query.py", line 245, in _process_one_query
    graph.write_to_hdf5(output_path)
  File "/root/deeprankcore/deeprank-core/deeprankcore/utils/graph.py", line 218, in write_to_hdf5
    node_features_group.create_dataset(
  File "/usr/local/lib/python3.9/site-packages/h5py/_hl/group.py", line 183, in create_dataset
    dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
  File "/usr/local/lib/python3.9/site-packages/h5py/_hl/dataset.py", line 86, in make_new_dset
    tid = h5t.py_create(dtype, logical=1)
  File "h5py/h5t.pyx", line 1664, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1688, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1748, in h5py.h5t.py_create
TypeError: Object dtype dtype('O') has no native HDF5 equivalent
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/netcache/data/dlepikhov/propedia_ssl/script/build_propedia.py", line 44, in <module>
    h5_p = queries.process(
  File "/root/deeprankcore/deeprank-core/deeprankcore/query.py", line 329, in process
    pool.map(pool_function, self.queries)
  File "/usr/local/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/local/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
TypeError: Object dtype dtype('O') has no native HDF5 equivalent

This happens only when trying to calculate the exposure feature module. At first I thought it was because the H atoms were a problem, but removing them from the PDB files didn't help.

Environment:

  • OS system: ubuntu
  • Version:
  • Branch commit ID:
  • Inputs:

To Reproduce
Steps/commands/screenshots to reproduce the behaviour:

Run the following script:

import sys
import os
sys.path.append(os.path.abspath("."))
from deeprankcore.features import torsion_angle, components, contact, exposure
from deeprankcore.query import QueryCollection, ProteinProteinInterfaceResidueQuery
from deeprankcore.dataset import GraphDataset
import pickle
import argparse
import glob

arg_parser = argparse.ArgumentParser(description="""
    Script used to build the features using deeprankcore package.
""")
arg_parser.add_argument("--h5out",
    help="Path where the HDF5 features will be saved."
)
arg_parser.add_argument("--pdb",
    help="glob string to look for pdb files used to generate features."
)
arg_parser.add_argument("--nworkers",
    help="""
    Providing this argument will set a specific number of cpus used to process the query.
    By default, all cpus are used.
    """,
    default=None,
    type=int
)
a = arg_parser.parse_args()

pdb_paths = glob.glob(a.pdb)

queries = QueryCollection()

chain_ids = [p.split("/")[-1].replace(".pdb", "").split("_")[-2:] for p in pdb_paths]
print(f"Number of cases: {len(pdb_paths)}")

for i, p in enumerate(pdb_paths):
    queries.add(ProteinProteinInterfaceResidueQuery(
        pdb_path = p,
        chain_id1 = chain_ids[i][0],
        chain_id2 = chain_ids[i][1],
    ))

h5_p = queries.process(
    a.h5out,
    cpu_count = a.nworkers,
    feature_modules = [
        components,
        torsion_angle,
        contact,
        exposure
    ]
)

Expected Results
Normally I get a single concatenated HDF5 file.


@DanLep97 DanLep97 added the bug Something isn't working label Aug 4, 2023
@gcroci2
Collaborator

gcroci2 commented Aug 4, 2023

I see the torsion_angle module, which is not on the main branch. Which branch are you working on? Are you sure the error is not related to the new module?

"using exposure component from the propedia database (and protCID database as well)" — what do you mean here? Our exposure module uses HSExposureCA from Biopython. Maybe you're referring to the source of your PDB files?

The error seems related to the type of one of the arrays: it is a numpy array of dtype 'O' (object), which h5py does not recognize. It could mean that one of your features mixes types (e.g. np.int8 and np.float32).
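As a quick illustration (not deeprankcore code, just a minimal numpy sketch): a single non-numeric entry is enough to force numpy into dtype('O'), which h5py then refuses to store:

```python
import numpy as np

# A homogeneous numeric array gets a native dtype, storable in HDF5
clean = np.array([1.0, 2.0, 3.0])
print(clean.dtype)  # float64

# One None (or any non-numeric object) forces the whole array to dtype('O');
# h5py's create_dataset raises TypeError on such arrays
mixed = np.array([1.0, None, 3.0])
print(mixed.dtype)  # object
```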

@DanLep97
Collaborator Author

DanLep97 commented Aug 4, 2023

I am indeed referring to the PDB source. I'm on branch #448.

I just tried without the torsion angles on the up-to-date main branch, and I'm getting the same error. It happens with the PDB source from Propedia, the whole set of 15K peptide-protein complexes.

Thanks for the help.

@gcroci2 gcroci2 changed the title Bug: HDF5 file not generated after exposure feature component is added to the feature list HDF5 file not generated when exposure feature module is used on PDB files from Propedia database Aug 4, 2023
@gcroci2
Collaborator

gcroci2 commented Aug 4, 2023

For some reason, the exposure features generated from such PDB files contain mixed value types and are consequently treated as objects by numpy.

What happens if you enforce node.features[Nfeat.HSE] to be a numpy array? (lines 74-77 in deeprankcore.features.exposure)

if hse_key in hse:
    node.features[Nfeat.HSE] = np.array(hse[hse_key])
else:
    node.features[Nfeat.HSE] = np.array((0, 0, 0))

You could also try casting node.features[Nfeat.RESDEPTH] to a numpy float, to make sure that feature's type is fine as well.
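A minimal sketch of the suggested cast, with an illustrative hse dict standing in for the HSExposureCA result (in the real module, hse and hse_key come from deeprankcore.features.exposure):

```python
import numpy as np

# Illustrative stand-in for biopython's HSExposureCA output
hse = {("A", 1): (24, 36, 0.4)}
hse_key = ("A", 1)

# Forcing float64 guarantees an HDF5-storable dtype whatever the input types
if hse_key in hse:
    feat = np.array(hse[hse_key], dtype=np.float64)
else:
    feat = np.array((0.0, 0.0, 0.0), dtype=np.float64)

print(feat.dtype)  # float64
```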

@cbaakman
Collaborator

cbaakman commented Aug 7, 2023

>>> class A:
...     pass
... 
>>> a = A()
>>> import numpy
>>> arr = numpy.array(a)
>>> arr.dtype
dtype('O')

h5py cannot store arrays of dtype 'O', so preprocessing would fail.

The solution would be to find out what type of object is being created and see whether it can be converted.
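One way to hunt for the offending feature (a hypothetical helper, not part of deeprankcore) is to scan a node's feature dict for object-dtype values:

```python
import numpy as np

def find_object_features(features: dict) -> list:
    """Return the names of features whose numpy dtype is 'O' (not HDF5-storable)."""
    return [name for name, value in features.items()
            if np.asarray(value).dtype == object]

# Example: one well-formed feature and one that coerces to dtype('O')
features = {"hse": np.array((0.0, 0.0, 0.0)), "res_depth": np.array(None)}
print(find_object_features(features))  # ['res_depth']
```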

@DanLep97
Collaborator Author

DanLep97 commented Aug 7, 2023

The dtype of np.array((0, 0, 0)) is int64. But it looks like I'm getting a dtype('O'), as you demonstrated @cbaakman, because of some errors (printed in French) that I get when calculating the exposure:
[screenshot: error messages from the exposure calculation]

This error was fixed by forcing the HSE feature to be a numpy float64, as suggested by @gcroci2.

Adding #465

@gcroci2 gcroci2 linked a pull request Aug 8, 2023 that will close this issue
@gcroci2 gcroci2 moved this to Done in Development Jul 12, 2024