Include DLIO benchmark #124
base: master
Conversation
# Conflicts:
#   dlio/utils.c
Since this is using code from the DLIO benchmark, make sure to follow the original DLIO's license.
With the provided sample, on Perlmutter I am getting some warnings:
Are those expected?
Yes, the warnings appear because the test works with a small amount of data. This is done because the same JSON configuration file is used in the GitHub workflow during testing. If a large number of generation and training files is specified in the configuration file, the warnings disappear, but the GitHub workflow jobs may start taking a long time, which I believe is undesirable.
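As an illustration of the trade-off described above, the dataset size could be scaled through the JSON configuration along these lines. Note that the key names below are hypothetical placeholders, not the pattern's confirmed schema:

```json
{
  "generation": {
    "num-files-train": 8,
    "num-samples-per-file": 4
  },
  "train": {
    "epochs": 1
  }
}
```

Small values keep the CI workflow fast but trigger the warnings; larger values silence the warnings at the cost of long-running CI jobs.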
@arcturus5340, could you also please include an explanation in the DLIO pattern page of why this is needed (i.e., why we added a pattern that mimics DLIO instead of running DLIO directly)? If I recall, we have it in the paper, so maybe let's also add it here to make it clear that this is not the full DLIO but rather only its I/O pattern. Could we also add some sentences in the documentation about how this pattern should be updated if changes are made in DLIO, as a kind of overview of how one might update this pattern when things change on the DLIO side?
@jeanbez, I have made changes to the documentation; are they sufficient? Is there anything else I need to do?
We're doing some testing on the other systems as well; as soon as those are okay we should be able to merge. Thanks for updating the documentation with that additional information.
Hi, I'm getting segmentation faults when testing on two nodes with the default configuration. I have experienced no errors when running on a single node. The gdb backtrace says that this line is the culprit: Line 424 in 33281ad
Here is the full trace:
It does not fail every single time. In addition, there are two benchmarks in the default
It is only used when the total-training-steps parameter is set (which is not set in
This also seems hard to believe, since you mentioned that the bug occurs in both benchmarks. The first one is only responsible for generating data and does not use
@arcturus5340 Disregard my statement about it failing in both benchmarks. I reran 50 times and it only failed in the training benchmark. I think I misread the log output... apologies. Here are my MPI details. MPICH version is 8.1.28. The NVIDIA HPC SDK version used is 23.3:
It failed in 7 of the 50 runs. In the stderr:
In the stdout:
@TheAssembler1 can you send us the output of the
In this case, we cannot rule out that the problem may be indirectly related to
@arcturus5340 I reran 50 times with read threads set to zero and had zero failures.
Include I/O access patterns in common AI applications.
Documentation can be found in the pattern's directory (/dlio/README.md) or in the corresponding readthedocs section.