chain id parameter for structure.io.pdbx.get_sequence #600
'get_sequence' in the module 'biotite.structure.io.pdb' to return a dictionary mapping chain_id to the sequence
I added a few further small comments here. If we always return a dictionary, I think the function would be cleaner if it iterated directly over strand IDs, sequence strings, and sequence types instead of splitting the work into two for-loops.
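As a rough illustration of the single-loop idea, here is a minimal sketch in plain Python. It does not use the actual biotite code: `build_sequence_dict` and `toy_convert` are hypothetical names, the input lists stand in for the `entity_poly` columns, and plain tuples stand in for the real sequence objects. It assumes `pdbx_strand_id` holds comma-separated chain IDs.

```python
def toy_convert(seq_string, seq_type):
    # Hypothetical stand-in for the real string-to-Sequence conversion.
    kind = "protein" if seq_type.startswith("polypeptide") else "nucleotide"
    # Remove line breaks and other whitespace from the one-letter code.
    return (kind, "".join(seq_string.split()))


def build_sequence_dict(strand_id_column, seq_column, type_column):
    """Map each chain ID to its sequence in a single pass over all entities."""
    sequence_dict = {}
    for strand_ids, seq_string, seq_type in zip(
        strand_id_column, seq_column, type_column
    ):
        sequence = toy_convert(seq_string, seq_type)
        # One entity may be present as several chains, e.g. "A,B".
        for strand_id in strand_ids.split(","):
            sequence_dict[strand_id.strip()] = sequence
    return sequence_dict


result = build_sequence_dict(
    ["A,B", "T"],
    ["MKV\nLA", "ACGT"],
    ["polypeptide(L)", "polydeoxyribonucleotide"],
)
```

Because the three columns are zipped together, each chain ID is always paired with the sequence of its own entity and no index arithmetic is needed.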
sequences : list of Sequence or dict
    If `chain_ids` is False, returns a list of protein and nucleotide
    sequences for each entity.
    If `chain_ids` is True, returns a dictionary where each key is a
    chain ID and each value is the corresponding sequence.
I think we should note here that the chain IDs correspond to atom_site.auth_asym_id.
for entity, strand_ids in enumerate(strand_ids):
    for strand_id in strand_ids:
        strand_ids_to_seq_dict[strand_id] = sequences[entity-1]
This line enumerates, so the first tuple value is not the entity ID but the index, which always starts at zero.
- for entity, strand_ids in enumerate(strand_ids):
-     for strand_id in strand_ids:
-         strand_ids_to_seq_dict[strand_id] = sequences[entity-1]
+ for i, strand_ids in enumerate(strand_ids):
+     for strand_id in strand_ids:
+         strand_ids_to_seq_dict[strand_id] = sequences[i]
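To make the effect of the off-by-one concrete, here is a small self-contained sketch with made-up placeholder data. The function names `map_with_offset` and `map_with_index` are hypothetical and only mirror the two loop variants discussed in this comment.

```python
# Placeholder data: one sequence per entity, one list of chain IDs per entity.
sequences = ["SEQ_ENTITY_1", "SEQ_ENTITY_2", "SEQ_ENTITY_3"]
strand_ids_per_entity = [["A", "B"], ["C"], ["T"]]


def map_with_offset(sequences, strand_ids_per_entity):
    # Variant under review: treats the enumerate() index as a 1-based entity ID.
    mapping = {}
    for entity, strand_ids in enumerate(strand_ids_per_entity):
        for strand_id in strand_ids:
            # For entity == 0 this reads sequences[-1], i.e. the LAST sequence.
            mapping[strand_id] = sequences[entity - 1]
    return mapping


def map_with_index(sequences, strand_ids_per_entity):
    # Suggested variant: uses the zero-based index directly.
    mapping = {}
    for i, strand_ids in enumerate(strand_ids_per_entity):
        for strand_id in strand_ids:
            mapping[strand_id] = sequences[i]
    return mapping


buggy = map_with_offset(sequences, strand_ids_per_entity)
fixed = map_with_index(sequences, strand_ids_per_entity)
```

Since Python accepts negative indices, the offset variant raises no error. It silently shifts every chain to the sequence of the previous entity and wraps the first entity around to the last one, which is why the bug is easy to miss when a test structure has only one entity.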
Thanks for the PR. I would be in favor of dropping the Note that this would require a few small adjustments of that function call in the tests and documentation (searching for
I agree with @padix-key. I think we should drop the
with entity_poly.pdbx_strand_id as keys. Updated test_pdx.py test_get_sequence to reflect returning a dict instead of a list
I sent an additional commit with the suggested changes as I understand them. Note that there may be cases where entity_poly.pdbx_strand_id is not equivalent to atom_site.label_asym_id. The structure used in the test function (PDB: 5UGO) is an example: the strand ID is "T" while the asym_id is "A". This may be an edge case, and I am unsure whether it will cause issues, but it's worth mentioning.
Thanks for the changes. There are still remaining
The strand ID is not matching
sequence_dict = {
    strand_id: sequence
    for sequence, strand_ids in zip(sequences, strand_ids)
    for strand_id in strand_ids
}
In case some converted sequence is None in the list comprehension above, sequences would have a different length than strand_ids.
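A short sketch of the concern, using placeholder data and assuming the earlier list comprehension drops entries whose conversion returned None (the actual filtering code is not shown in this hunk, so this is an assumption). One possible way around it is to zip first and filter afterwards.

```python
# Placeholder data: the second entity could not be converted.
converted = ["MKVLA", None, "ACGT"]
strand_ids = [["A", "B"], ["C"], ["T"]]

# Filtering BEFORE zipping shortens the list, so later entities shift left.
filtered = [seq for seq in converted if seq is not None]
misaligned = {
    strand_id: sequence
    for sequence, ids in zip(filtered, strand_ids)
    for strand_id in ids
}

# Zipping first keeps each sequence paired with its own chain IDs;
# entities without a sequence are skipped afterwards.
aligned = {
    strand_id: sequence
    for sequence, ids in zip(converted, strand_ids)
    if sequence is not None
    for strand_id in ids
}
```

In the misaligned variant, chain "C" receives the sequence that belongs to chain "T", and "T" is dropped because `zip` stops at the shorter input. No exception is raised in either case.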
Please ignore those commits. I am still learning git and did not think those commits would be sent to this pull request.
No problem. You may also convert this PR to a Draft PR to further indicate that this is work in progress.
The last commit is the one I would like to merge if possible.
Looks good to me, however, one small adjustment is necessary: With the merge of #552
Added a new parameter 'chain_id' to the function 'get_sequence' in the module 'biotite.structure.io.pdbx' to return a dictionary mapping chain_id to the sequence