Struggling to set RoundRobin parameter with SST
#3721
Comments
Thanks for the report. Can you please run with the environment variable "SstVerbose" set to a numeric value of 2 or more? That should let us know if SST is seeing the parameter. The output will be something like this: eisen@Endor build % export SstVerbose=2
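When the writer is launched from a script rather than an interactive shell, the same variable can be set per process. The sketch below is only an illustration: the helper name `env_with_sst_verbose` and the `./writer` command in the usage note are hypothetical, and the only thing taken from the comment above is the variable name `SstVerbose` and its numeric value.

```python
import os
from typing import Dict, Optional


def env_with_sst_verbose(level: int, base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of an environment with SstVerbose set to `level`.

    `base` defaults to the current process environment. The caller's
    environment is never modified; a new dict is returned instead.
    """
    env = dict(os.environ if base is None else base)
    env["SstVerbose"] = str(level)  # environment values must be strings
    return env
```

A launcher could then pass the result to `subprocess.run(["./writer"], env=env_with_sst_verbose(2))`, where `./writer` stands in for whatever executable opens the SST stream.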
Thanks for the tip, here is the output:
Which seems to indicate that, in fact, RoundRobin is set on the backend.
Interesting... I just checked to see that our CI test that covers RoundRobin distribution is still working, and it seems to be. You might kick that SstVerbose parameter up to '4', which should get you more detailed information about timestep distribution. Probably only necessary to do that on the writer side. Here's what a portion of the output looks like for our CI test, you can see the Round Robin distribution info and where each step was sent: Writer 0 (0x15af1d620): Sending TimestepMetadata for timestep 5 (ref count 1), one to each reader
Ok, I increased the writer verbosity as you suggested. It produced the following output. I notice that I only have
Follow up question: When using RoundRobin with adios, do all connected readers need to I am trying to work around the issue above by running the round robin on the reader side, deciding which reader should read from the writer. But it seems that it still wants all readers to read all timesteps. I realize this is likely the wrong workaround, but I am unsure how else to achieve RoundRobin with our current setup. Perhaps our setup is unique or incorrect (although we are following the explicit instructions from #3675 (reply in thread)). Here, we have recreated our exact configuration in an MWE for you to test, in case you'd like to see how we are trying to use Adios2:
Ah, we may have a conceptual disconnect. It looks like you just have a single MPI reader application connected to the writer. That reader has multiple ranks, but since ADIOS is designed for communication between MPI applications, it assumes that all the writer/reader ranks in an application act cooperatively. None of SST's distribution modes come into play because there is only one reader application and it gets all the timesteps. Each of the reader's ranks might select different parts of the incoming arrays, but they will all come from the same set of data that the writer ranks created for that timestep. The RoundRobin distribution mode was designed to scatter created timesteps to multiple reader applications. There is a test in ADIOS that does this and you can try it by first running the writer like this: Note that I didn't run with MPI above, so we only have a single rank for the writer and each of the two readers. They could each be MPI applications.
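To make the distinction concrete, here is a toy model in plain Python. It is not ADIOS2 code: the function name `distribute_steps` and the assumption that round-robin assignment starts with reader application 0 are choices made only for this sketch. It maps each reader application (not each rank) to the timesteps it would receive under the two modes discussed in this thread.

```python
from typing import Dict, List


def distribute_steps(n_steps: int, n_reader_apps: int, mode: str) -> Dict[int, List[int]]:
    """Toy model: which timesteps does each reader *application* receive?

    Ranks inside one reader application are not modeled, because they all
    share the timesteps delivered to that application.
    """
    if n_reader_apps < 1:
        raise ValueError("need at least one reader application")
    received: Dict[int, List[int]] = {app: [] for app in range(n_reader_apps)}
    for step in range(n_steps):
        if mode == "AllToAll":
            # Every reader application sees every timestep.
            for app in received:
                received[app].append(step)
        elif mode == "RoundRobin":
            # Each timestep goes to exactly one reader application.
            # Starting at application 0 is an assumption of this sketch.
            received[step % n_reader_apps].append(step)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return received
```

With a single reader application, `distribute_steps(4, 1, "RoundRobin")` and `distribute_steps(4, 1, "AllToAll")` give the same result, which matches the behavior reported in the issue. A difference only appears once `n_reader_apps` is 2 or more.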
Sorry, I hadn't had time to go through your demo, and may not yet today. But generally, if you pass an MPI communicator in to ADIOS initialization, then a bunch of things in ADIOS are collective operations. Every rank has to do Open(), BeginStep, EndStep, etc. However, you might get to where you want to be by NOT passing the MPI communicator in to ADIOS. Then each rank will operate completely independently, as if it were its own separate 1-rank application. That may be good or bad depending upon exactly what you're trying to do. (I.e., if you want everything to run sort of in lock-step, this isn't the way.)
Ok, thanks for conveying the internal philosophy of the Adios2 round robin distribution method. Unfortunately, our application depends on all readers sitting on the same MPI application. I will try your suggestion of not passing the MPI communicator to adios, thanks for the tip!
Just an update, I have managed to get Now it is "working" in our toy example. But I am wondering, what are the ramifications on the adios backend? You say "if you want everything to run in lock step then it isn't the way." Maybe I misunderstand, but we are still using our own MPI communicator on our side, so we have full control over the lock-step nature of our reading (in case we want/don't want that). So are you referring to something intrinsic to the adios backend? For example, without the communicator, is there some undefined behavior possible in the step distribution on adios' side? In all cases, thanks a lot for your assistance. I think your previous tips enable us to move away from the toy example and try integration into our software.
WRT what I meant by that comment, I'd go back to ADIOS' origins. It was designed to pass information between timestep-oriented simulation and analysis jobs where the prominent data structures were global arrays decomposed across the writer ranks, with different portions of them consumed by each reader rank. In that context, ADIOS makes sure that the reader ranks are all working on the same timestep at the same time, etc. You just have a somewhat novel use case, so ADIOS isn't in that role. I don't think there should be any undefined behavior (at least WRT MPI). Hopefully the more defined behavior is also appropriate for your situation. Reader-side ADIOS BeginStep() without a timeout will block until it gets data, which may hold up one of your ranks until it is its turn to get data sent to it (which might in turn hold up your whole application, because your own collective MPI operations might wait for that rank to run again). There is a timeout parameter to BeginStep that you can use to help manage that, but with RoundRobin, data sent to a particular reader is his to consume and won't be available to any other reader. So one reader that didn't do BeginStep for a while could have a queue while another might have run through all his data. Maybe that's not a problem because it just doesn't matter or your outside-of-adios synchronization keeps that sort of thing in check. If it was a problem, you might also consider the
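The queueing behavior described in this comment can be pictured with a small stand-alone model. This is a toy in plain Python built on `queue.Queue`, not ADIOS2 code: `ToyReader`, `begin_step`, and `scatter_round_robin` are hypothetical names for this sketch, and `begin_step` only imitates the idea of a blocking call with a timeout.

```python
import queue
from typing import List, Optional


class ToyReader:
    """Toy stand-in for one independent reader with its own private step queue."""

    def __init__(self) -> None:
        self.pending: "queue.Queue[int]" = queue.Queue()

    def begin_step(self, timeout_s: float) -> Optional[int]:
        """Return the next queued step, or None if nothing arrives within timeout_s."""
        try:
            return self.pending.get(timeout=timeout_s)
        except queue.Empty:
            return None


def scatter_round_robin(readers: List[ToyReader], n_steps: int) -> None:
    """Queue each step for exactly one reader; no other reader can consume it."""
    for step in range(n_steps):
        readers[step % len(readers)].pending.put(step)
```

If four steps are scattered over two such readers and only the first one calls `begin_step`, it receives steps 0 and 2 and then times out with `None`, while steps 1 and 3 stay queued for the second reader until that reader asks for them.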
Hello,
I am trying to set the parameter RoundRobin in my SST writer, but it appears that the default AllToAll is always used no matter how I try to set the parameter.
Extra context: We started investigating the use of this library in another Adios discussion here.
To Reproduce
We have set up our minimal environment for you. In summary, we have N number of clients, each one is a writer. We have M number of server processes, each one is a reader. We are using SST for the engine, and we successfully run AllToAll communications.
However, when I try to set the StepDistributionMode to RoundRobin, nothing changes. All M servers receive all steps.
We tried to set the parameter using a variety of methods:
But neither of these methods changes the behavior of the writer.
Here you can clone the minimal working repository at https://gitlab.inria.fr/mschoule/adios2-melissa-simple-demo
And to test it you can run:
or
The M server timesteps collected are saved to time_step_<rank>.json in the top directory. As you will see, the same output is produced for both, meaning all M server processes got all steps from all simulations.
Expected behavior
RoundRobin should follow the documented description from the Adios documentation:
Desktop (please complete the following information):