
Multiple output links with the same link label #5278

Closed
giovannipizzi opened this issue Dec 16, 2021 · 11 comments
Labels
priority/critical-blocking must be resolved before next release type/bug

Comments

@giovannipizzi
Member

While testing the current develop branch (commit ff1318b) I encountered this very unexpected thing: at least one of my CalcJobs got multiple nodes attached with the same link name. See below:

In [4]: n
Out[4]: <CalcJobNode: uuid: c198fa99-b7cd-4572-a9bc-65f1f88215a3 (pk: 51023) (aiida.calculations:quantumespresso.pw)>

In [5]: n.get_outgoing().all()
Out[5]: 
[LinkTriple(node=<RemoteData: uuid: e7a9dc04-fadb-4e68-bb09-2bcac0ed5c29 (pk: 51195)>, link_type=<LinkType.CREATE: 'create'>, link_label='remote_folder'),
 LinkTriple(node=<FolderData: uuid: 4997eae4-144f-43a5-a74d-f3166ab74ac3 (pk: 51576)>, link_type=<LinkType.CREATE: 'create'>, link_label='retrieved'),
 LinkTriple(node=<BandsData: uuid: 3b18bbb2-f8bb-410f-bd70-617cd40d05f2 (pk: 51607)>, link_type=<LinkType.CREATE: 'create'>, link_label='output_band'),
 LinkTriple(node=<TrajectoryData: uuid: 1f9af593-ffb9-45d0-97cc-b93462013242 (pk: 51609)>, link_type=<LinkType.CREATE: 'create'>, link_label='output_trajectory'),
 LinkTriple(node=<Dict: uuid: a4c63884-caff-4e09-ae05-0312c31fa30c (pk: 51611)>, link_type=<LinkType.CREATE: 'create'>, link_label='output_parameters'),
 LinkTriple(node=<BandsData: uuid: ddc69ce5-c8cf-4370-bf06-82f92d1b7aff (pk: 51606)>, link_type=<LinkType.CREATE: 'create'>, link_label='output_band'),
 LinkTriple(node=<TrajectoryData: uuid: 648d994c-99b7-4d6c-9966-edf2da8f0aea (pk: 51608)>, link_type=<LinkType.CREATE: 'create'>, link_label='output_trajectory'),
 LinkTriple(node=<Dict: uuid: 24029a98-dbd3-4c3b-9329-37d9a2fea061 (pk: 51610)>, link_type=<LinkType.CREATE: 'create'>, link_label='output_parameters')]

This, besides breaking a number of assumptions in AiiDA, also makes certain commands raise. For example, n.outputs.output_trajectory raises KeyError: "duplicate label 'output_band' in namespace ''".

Any idea what has changed that could cause this behaviour? I'm adding this to the v2.0 milestone as I think this is a critical bug.

I'm not sure how I can provide more debug information on what happened (I can provide the AiiDA daemon logs, though).
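As an aside, duplicated labels like the ones above can be spotted programmatically. The sketch below is not part of aiida-core; the helper operates on any objects exposing a `link_label` attribute, like the LinkTriple instances returned by `node.get_outgoing().all()` shown above, and the `FakeTriple` stand-in just mirrors that output for a self-contained demo.

```python
# Minimal sketch (not aiida-core code): find duplicate CREATE link labels
# among the outputs of a node, given LinkTriple-like objects.
from collections import Counter, namedtuple

def find_duplicate_labels(link_triples):
    """Return the link labels that appear more than once, sorted."""
    counts = Counter(triple.link_label for triple in link_triples)
    return sorted(label for label, count in counts.items() if count > 1)

# Stand-in for real LinkTriples, mirroring the output pasted in the issue.
FakeTriple = namedtuple('FakeTriple', ['link_label'])
triples = [FakeTriple(label) for label in [
    'remote_folder', 'retrieved',
    'output_band', 'output_trajectory', 'output_parameters',
    'output_band', 'output_trajectory', 'output_parameters',
]]
print(find_duplicate_labels(triples))
# → ['output_band', 'output_parameters', 'output_trajectory']
```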

@giovannipizzi giovannipizzi added type/bug priority/critical-blocking must be resolved before next release labels Dec 16, 2021
@giovannipizzi giovannipizzi added this to the v2.0.0 milestone Dec 16, 2021
@giovannipizzi
Member Author

Here is the log. Not sure if it is relevant, but I was using the new aiida-hyperqueue plugin.
aiida-default.log

Pinging @sphuber @chrisjsewell @ramirezfranciscof in case they have an idea of what might be going on.
(Note: I have an RMQ version that is quite recent.)

@chrisjsewell
Member

aiida.common.exceptions.FeatureNotAvailable: Cannot get detailed job info

At a quick guess, it is not handling this exception "correctly". I note that the only place this exception is raised in core is in _get_joblist_command, not for detailed job info.
Perhaps you could try adding a dummy command for retrieving detailed job info in aiida-hyperqueue, and see if you still encounter the issue.
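The suggestion could look roughly like the following. This is a hypothetical sketch only: the base class, method name, and exception location are assumptions standing in for the actual aiida-core scheduler plugin API, purely to illustrate the "dummy command" idea.

```python
# Hypothetical sketch: class and method names are assumptions, not the real
# aiida-core API. The idea is that the scheduler plugin could override the
# detailed-job-info hook with a harmless no-op command instead of letting a
# base-class implementation raise FeatureNotAvailable.
class FeatureNotAvailable(Exception):
    """Stand-in for aiida.common.exceptions.FeatureNotAvailable."""

class BaseScheduler:
    def get_detailed_job_info_command(self, job_id):  # name assumed
        raise FeatureNotAvailable('Cannot get detailed job info')

class HyperQueueScheduler(BaseScheduler):
    def get_detailed_job_info_command(self, job_id):
        # Dummy command that always succeeds, just to rule out this
        # exception path as the trigger for the duplicated links.
        return f'echo "no detailed job info for {job_id}"'

print(HyperQueueScheduler().get_detailed_job_info_command('42'))
# → echo "no detailed job info for 42"
```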

@chrisjsewell
Member

The order of creation (at least by pk) is a bit weird though; each output node is duplicated before the next one is added:

 LinkTriple(node=<BandsData: uuid: ddc69ce5-c8cf-4370-bf06-82f92d1b7aff (pk: 51606)>, link_type=<LinkType.CREATE: 'create'>, link_label='output_band'),
 LinkTriple(node=<BandsData: uuid: 3b18bbb2-f8bb-410f-bd70-617cd40d05f2 (pk: 51607)>, link_type=<LinkType.CREATE: 'create'>, link_label='output_band'),
 LinkTriple(node=<TrajectoryData: uuid: 648d994c-99b7-4d6c-9966-edf2da8f0aea (pk: 51608)>, link_type=<LinkType.CREATE: 'create'>, link_label='output_trajectory'),
 LinkTriple(node=<TrajectoryData: uuid: 1f9af593-ffb9-45d0-97cc-b93462013242 (pk: 51609)>, link_type=<LinkType.CREATE: 'create'>, link_label='output_trajectory'),
 LinkTriple(node=<Dict: uuid: 24029a98-dbd3-4c3b-9329-37d9a2fea061 (pk: 51610)>, link_type=<LinkType.CREATE: 'create'>, link_label='output_parameters'),
 LinkTriple(node=<Dict: uuid: a4c63884-caff-4e09-ae05-0312c31fa30c (pk: 51611)>, link_type=<LinkType.CREATE: 'create'>, link_label='output_parameters')]

@chrisjsewell
Member

Another thing of note:

# A calculation can have both a 'return' and a 'create' link to
# a single data output node, which would violate the unique constraint
# defined below, since the difference in link type is not considered.
# The distinction between the type of a 'create' and a 'return' link is not
# implemented at the moment, so the unique constraint is disabled.
__table_args__ = (
    # I cannot add twice the same link
    # I want unique labels among all inputs of a node
    # UniqueConstraint('output_id', 'label'),
)

This talks about not having a unique constraint on just output_id and label, but why not have a unique constraint on output_id, label and type, @sphuber?
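The effect of such a three-column constraint can be sketched with plain sqlite3, using a simplified stand-in schema (table and column names here are illustrative, not the actual AiiDA link table): a unique constraint over (output_id, label, type) rejects a second 'create' link with the same label to the same node, while still allowing the 'create' plus 'return' combination the comment above needs to permit.

```python
import sqlite3

# Minimal sketch with a simplified schema (not the real AiiDA link table).
conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE db_dblink (
        input_id INTEGER, output_id INTEGER, label TEXT, type TEXT,
        UNIQUE (output_id, label, type)
    )
""")
conn.execute("INSERT INTO db_dblink VALUES (1, 2, 'output_band', 'create')")
# Same label and target, different link type: allowed, as the comment in
# the model requires for combined 'create' and 'return' links.
conn.execute("INSERT INTO db_dblink VALUES (3, 2, 'output_band', 'return')")
try:
    # A second 'create' link with the same label to the same node: rejected.
    conn.execute("INSERT INTO db_dblink VALUES (4, 2, 'output_band', 'create')")
except sqlite3.IntegrityError as exc:
    print('rejected:', exc)
```

This is only a feasibility sketch; in aiida-core the constraint would live in the SQLAlchemy (and, at the time, Django) model definitions instead.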

@sphuber
Contributor

sphuber commented Dec 16, 2021

I think that should be possible, but not entirely sure to be honest. Note for example that we cannot put a unique constraint on input_id, label and type, because a WorkflowNode can have multiple CALL links with the same label.

@sphuber
Contributor

sphuber commented Dec 16, 2021

At a quick guess, it is not handling this exception "correctly". I note the only place this exception is raised in core is in _get_joblist_command, not for detailed job info.

How do you think this call would lead to the same outputs getting attached multiple times? The FeatureNotAvailable is thrown while the calcjob is in the retrieving transport task; the output nodes won't be attached until that is done and the parser is called and returns the output nodes.

Given that the creation of the nodes seems interleaved (as you already noticed), it seems more likely that two workers somehow managed to start working on the same node and both called the parser at almost exactly the same time. This means they both got to add the output nodes and then they both tried to seal the node. I see a lot of ModificationNotAllowed errors, and together with the ChannelClosed errors due to @giovannipizzi's recent version of RabbitMQ rescheduling jobs after a timeout, it is quite likely that multiple workers got the same job. Pretty bad news if that is the case.

That being said, while that potentially explains why the code that adds the output nodes is called twice, it should still have triggered the link validation. But maybe here as well, since the two were so close to one another in execution, while the one was storing the node, the second was adding the output, and the link validation would not have complained yet because the transaction of the first had not finished. It would require really precise timing, but it certainly seems feasible I think.
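The race described above is the classic check-then-act pattern. The sketch below is illustrative only (not AiiDA code): two "workers" both validate that a link label is free before either has inserted it, so both inserts succeed and the duplicate slips through. A barrier makes the unlucky timing deterministic here; in reality it would require the precise overlap described above.

```python
# Illustrative check-then-act race (not AiiDA code): application-level link
# validation passes in both workers because neither has inserted yet.
import threading

links = []                     # stands in for the stored links of one node
barrier = threading.Barrier(2)
lock = threading.Lock()

def worker(label):
    # Validation step: the label is not taken yet...
    assert label not in links
    barrier.wait()             # ...but both workers validate before inserting
    with lock:
        links.append(label)    # both inserts succeed: duplicate link

threads = [threading.Thread(target=worker, args=('output_band',)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(links)
# → ['output_band', 'output_band']
```

A database-level unique constraint, as discussed above, would catch exactly this case, because the second insert fails regardless of when the application-level check ran.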

@giovannipizzi
Member Author

Thanks both for the quick answers. I also think/fear that the issue is two workers working on the same thing and retrieving/parsing the results at the same time.
I would say that this is most probably the cause: do you have an idea of how to check this? E.g., is there some way to check the RMQ logs?

As I mentioned in #5105, we probably need to work on dropping RMQ soon and redesigning that whole part, but this will definitely take time, and maybe we should release 2.0 first anyway. Maybe it's better to add in 2.0 a strong check on the RMQ version at every load_profile call: if the version is not compatible, we stop and tell the user to use a compatible version. This would avoid weird behaviour without delaying the 2.0 timeline too much.
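The proposed check could be as simple as the following sketch. Everything here is an assumption for illustration: the function names are invented, and the version threshold reflects the fact that recent RabbitMQ releases introduced a default consumer timeout that causes unacknowledged long-running tasks to be requeued; the exact cutoff used in aiida-core would need to be decided.

```python
# Hedged sketch of the proposed load_profile-time gate (names and cutoff
# are illustrative assumptions, not aiida-core code).

def parse_version(version):
    """'3.8.14' -> (3, 8, 14), for simple tuple comparison."""
    return tuple(int(part) for part in version.split('.'))

# Assumed cutoff for illustration: versions from here on apply a default
# consumer timeout that leads to jobs being redelivered to other workers.
MAX_SUPPORTED = (3, 8, 15)

def check_rmq_version(reported):
    """Raise if the broker version is known to reschedule long tasks."""
    if parse_version(reported) >= MAX_SUPPORTED:
        raise RuntimeError(
            f'RabbitMQ {reported} is not supported: tasks may be requeued '
            'after a timeout, letting multiple workers pick up the same job.'
        )

check_rmq_version('3.8.14')      # passes silently
try:
    check_rmq_version('3.9.1')
except RuntimeError as exc:
    print('blocked:', exc)
```

An alternative to hard-failing would be to warn and point users at the broker configuration (e.g. disabling or raising the consumer timeout), which is roughly the direction the documentation later took.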

@sphuber
Contributor

sphuber commented Dec 17, 2021

I would definitely not try to hold up v2.0 for this. It is not something we are introducing with v2.0 anyway; it has existed ever since v1.0. But I also agree that this is a critical problem that we need to tackle a.s.a.p.

@sphuber sphuber modified the milestones: v2.0.0, Post 2.0 Jan 17, 2022
@sphuber sphuber removed this from the v2.3.0 milestone Dec 6, 2022
@khsrali
Contributor

khsrali commented Oct 15, 2024

@edan-bainglass this seems to be a storage backend or data type issue.
Can you please have a look at this? Chances are this is already solved and could be closed?

P.S. I picked your name from here

@edan-bainglass
Member

edan-bainglass commented Oct 15, 2024

Seems more like a message broker issue, no? I don't immediately see anyone assigned to broker issues. Maybe @sphuber can comment. Otherwise, maybe @mbercx can have a look.

@sphuber
Contributor

sphuber commented Oct 16, 2024

Yes, I think this was most likely due to multiple workers working on the same node. If the RMQ version was too recent, as Gio mentioned (and the server configuration was not fixed), this is quite likely and there is nothing we can do. Anyway, I am closing this for now.

@sphuber sphuber closed this as completed Oct 16, 2024