node: 0: error: ...ross-inline.h:106: Maximum zero-offset tie chain reached (100), increase #define in ross-types.h #231

Open
lzk23 opened this issue Oct 15, 2022 · 3 comments

Comments

lzk23 commented Oct 15, 2022

Hello,
I am testing running multiple jobs with contiguous allocation, as in Exercise 3 in (https://github.com/codes-org/codes/wiki/quick-start-interconnects). However, this error occurs: node: 0: error: /home/codes-dev/build-ross/include/ross-inline.h:106: Maximum zero-offset tie chain reached (100), increase #define in ross-types.h. I tried increasing the value of MAX_TIE_CHAIN in ross-types.h. However, as this value increases, the simulation consumes much more memory and runs extremely slowly.
In this case, I increased MAX_TIE_CHAIN from 100 to 20000, and the error disappeared. However, the simulation then required more than 300G of memory, which caused the program to crash.
How can I fix this problem? Thanks a lot.

Member

nmcglo commented Oct 15, 2022

So this is one downside to the ROSS unbiased tiebreaker. The unbiased tiebreaker feature of ROSS will fairly and consistently choose an ordering of events that are tied temporally with other events.
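
As a rough illustration of the idea (not ROSS's actual implementation), events that share a timestamp can be ordered by a secondary key assigned once per event. The `sketch_event` struct, its `tie_value` field, and the function names below are hypothetical and exist only for this sketch.

```c
#include <stdlib.h>

/* Hypothetical event record; ROSS's real event structure is far richer. */
typedef struct {
    double ts;        /* simulation timestamp */
    double tie_value; /* value assigned once, when the event is created */
    int id;           /* label used only to inspect the resulting order */
} sketch_event;

/* Order by timestamp first; break timestamp ties using tie_value. */
int sketch_event_cmp(const void *a, const void *b)
{
    const sketch_event *x = a;
    const sketch_event *y = b;
    if (x->ts != y->ts)
        return (x->ts < y->ts) ? -1 : 1;
    if (x->tie_value != y->tie_value)
        return (x->tie_value < y->tie_value) ? -1 : 1;
    return 0;
}

/* Sort a batch of pending events into processing order. */
void sketch_order_events(sketch_event *events, size_t n)
{
    qsort(events, n, sizeof(events[0]), sketch_event_cmp);
}
```

Because the order depends only on the values stored in each event, and not on the order in which events happened to arrive, the same set of tied events is processed in the same order on every run.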

Things get complicated, however, when zero-offset events are also present. To clarify: zero-offset events are events that are created with zero tw_stime delay from the event that created them. Since zero-offset events naturally tie, temporally, with their causal event (and also with any events that tie with it), ordering those events consistently and fairly requires an array of tie-breaking values (automatically generated by ROSS) whose cardinality equals the number of zero-offset "generations".

Ex: if an event A creates another event A' with zero offset, and A' creates another zero-offset event A'', and so on up to A''''', you'd need a tie-breaking value array of size 6 to break ties fairly without violating causality. That size is the max tie chain length, and because the array is encoded into messages transmitted across PEs, it has to be statically allocated in each event. Thus the longer that chain needs to be, the heavier the impact on memory will be. Setting that value to 20,000 means that each event has an array of 20,000 64-bit floats encoded into it. That's a very heavy structure.
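
The arithmetic behind the example above can be sketched as follows. This is a minimal illustration, not code from ROSS: the function names are hypothetical, and it assumes that a zero-offset child extends its parent's chain by one generation while any positive offset starts a fresh chain.

```c
#include <stddef.h>

/* Illustrative only: chain length of a new event given its parent's chain length.
   Assumption: a zero-offset child adds one generation to the chain,
   while a positive offset starts a new chain of length 1. */
size_t sketch_child_chain_len(size_t parent_len, double offset)
{
    return (offset == 0.0) ? parent_len + 1 : 1;
}

/* Chain length after `generations` consecutive zero-offset descendants of a root event. */
size_t sketch_chain_after(size_t generations)
{
    size_t len = 1; /* the root event A */
    for (size_t i = 0; i < generations; i++)
        len = sketch_child_chain_len(len, 0.0);
    return len;
}

/* Bytes used per event by a statically sized array of 64-bit tie values. */
size_t sketch_tie_array_bytes(size_t max_tie_chain)
{
    return max_tie_chain * sizeof(double);
}
```

With 8-byte doubles, `sketch_chain_after(5)` gives 6 (A through A'''''), `sketch_tie_array_bytes(100)` gives 800 bytes per event for the limit of 100 reported in the error message, and `sketch_tie_array_bytes(20000)` gives 160,000 bytes per event, which is why memory use grows so quickly.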

Solutions:

  1. Disable the tiebreaker in ROSS during cmake configuration. USE_RAND_TIEBREAKER is the flag name, I believe. You'll have to rebuild ROSS and CODES after this. This may, however, make your simulation non-deterministic if there is a significant number of tied events (particularly if they tie at the same time on the same LP).

  2. Determine where all of the zero-offset events are coming from and add some positive offset to them. Even a tiny amount will make things significantly easier on the tiebreaking feature.
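
For the second option, one possible approach is to route every event offset through a small helper that enforces a positive floor. This is only a sketch: `SKETCH_MIN_OFFSET` and `sketch_positive_offset` are hypothetical names, and the floor value of 1e-6 is an assumption that would need to be chosen relative to the model's time unit.

```c
/* Hypothetical floor for event offsets. Pick a value that is small relative to
   the model's time unit but large enough to survive floating-point rounding
   when added to large timestamps. */
#define SKETCH_MIN_OFFSET 1e-6

/* Return the requested offset, raised to SKETCH_MIN_OFFSET if it is below that floor. */
double sketch_positive_offset(double requested)
{
    return (requested < SKETCH_MIN_OFFSET) ? SKETCH_MIN_OFFSET : requested;
}

/* In a model, the result would be passed wherever the event offset is supplied,
   for example as the offset argument of ROSS's tw_event_new (call shape assumed):
   tw_event_new(dest_gid, sketch_positive_offset(delay), lp);                  */
```

A floor like this changes event timing slightly, so it is worth checking that the added delay is negligible compared with the latencies the model is meant to capture.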

If you want some more context on this tie breaking feature, here's a paper I wrote on it:

https://nmcglo.com/public-files/papers/2021_wsc_tiebreaker.pdf

Author

lzk23 commented Oct 17, 2022

Thanks for your reply. Actually, I don't know the principles behind CODES and ROSS.
I tried adding the flag USE_RAND_TIEBREAKER (-DUSE_RAND_TIEBREAKER=on, is this right?) during cmake configuration for ROSS. However, the problem still exists. As for the second solution, I really don't know how to determine where the zero-offset events are.

Member

nmcglo commented Dec 2, 2022

Apologies for the delay in response; I've been starting a new job and traveling for much of November.

The quick solution is actually to set -DUSE_RAND_TIEBREAKER=off when configuring ROSS (then rebuild ROSS and CODES). This disables the deterministic tiebreaker feature of ROSS, which reverts ROSS's handling of event processing order to the state it was in a year or so ago. For the most part, it is "good enough". The tiebreaker's purpose is to guarantee deterministic ordering of event processing when there are simultaneous events in the simulation. Without the tiebreaker, there is a mild probability of non-deterministic output, and the tiebreaking of simultaneous events is not 'unbiased': some ruleset will break ties in a way that doesn't assign an equal probability to every ordering of these simultaneous events.

It should not make a significant difference semantically unless you're trying to perform very formal and strict statistical analysis on the output of many runs.
