[REVIEW]: Distributed Parallelization of xPU Stencil Computations in Julia #137
Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @mloubout, @georgebisbas, it looks like you're currently assigned to review this paper 🎉. Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post. ⭐ Important ⭐ If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/JuliaCon/proceedings-review) repository. As a reviewer, you're probably currently watching this repository, which means that with GitHub's default behaviour you will receive notifications (emails) for all reviews 😿. To fix this, do the following two things:
For a list of things I can do to help you, just type:
For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:
👋 @mloubout, please update us on how your review is going (this is an automated reminder).
👋 @georgebisbas, please update us on how your review is going (this is an automated reminder).
Review: The abstract is written clearly and conveys the message efficiently. The motivation and design principles are laid out clearly. The results show good performance and validate the author's claims. Some minor comments:
very minor:
With these addressed, this should be accepted for publication.
It seems like the repository https://github.com/omlins/ImplicitGlobalGrid.jl is a fork and has no Issues tab.
Yes, this is expected, because it is a fork created for the single purpose of the publication of this paper. The original repository is the following: https://github.com/eth-cscs/ImplicitGlobalGrid.jl
Thanks @omlins. Where should we open an Issue for the paper according to the instructions? At the original repo?
I do not think the instructions specify this for such a configuration. Thus, I would suggest opening issues concerning the manuscript on the fork (https://github.com/omlins/ImplicitGlobalGrid.jl) and potential issues concerning the package on the original repository (https://github.com/eth-cscs/ImplicitGlobalGrid.jl).
Now I can see an Issues tab.
Yes, I added it upon your question.
ImplicitGlobalGrid.jl offers an approach to automating distributed-memory parallelism for stencil computations in Julia, leveraging a high-level symbolic interface and presenting highly competitive performance. This work has a high overlap with #138. As discussed in the ParallelStencil.jl review, some changes should be applied to publish these as separate works. I understand that this work extends the optimization suite available in ParallelStencil.jl by allowing the use of automated distributed-memory parallelism (DMP). Since both packages will be published independently, mentioning that the DMP is built upon ParallelStencil.jl would not hurt.

I suggest avoiding the term xPU in the title. With the recent rise of "exotic" accelerators like Google's TPU, Graphcore's IPU, or expected QPUs, it would probably not hurt just to use: "Distributed Parallelization of Stencil Computations in Julia". This is clarified at the end of the Introduction; however, I still advocate avoiding this term. It may also not help when someone looks for your work in the future.

The package's functionality is confirmed, and the installation, testing, and usage documentation are present and well-defined. The paper is well-written, clearly presenting the motivation and example usage of the package. It may be good to state explicitly that users get automated DMP by only slightly editing the source code of a solver; i.e., the text could value the symbolic interface a bit more. I am interested in how communication-computation overlap is performed. The diffusion example presented is well-written.

Minor comments (not aiming to be addressed, but possibly interesting points for future work and further discussion):
Performance evaluation:
Good to see that you are using a non-standard stencil kernel in Figure 3. Please add more info on this stencil kernel's computation and memory requirements. I also second the comments posted by @mloubout. Assuming the above concerns are addressed, this should be accepted for publication.
@mloubout: Thank you very much for the thorough review. We are working on addressing the reviews. In the following you can find replies to two of the issues that you raised. We will respond to the remaining issues alongside the improvements of the manuscript.
The reason for the scaling up to 2197 GPUs or nodes is that 13^3 (=2197) nodes is the biggest cubic node topology that can be submitted in the normal queue of Piz Daint (up to 2400 nodes, see: https://user.cscs.ch/access/running/piz_daint/#slurm-batch-queues). The number of 2197 GPUs may appear surprising when it is not explained, but there is nothing negative about it, as the reported parallel efficiency confirms. Thus, we would like to add an explanation while keeping the plot as is.
Thank you for pointing this out; we will try to emphasize the performance difference.
@georgebisbas: Thank you very much for the thorough review. We are working on addressing the reviews. In the following you can find replies to many of the issues that you raised. We will respond to the remaining issues alongside the improvements of the manuscript.
Thank you for this statement; it makes clear that we need to emphasize in the document that ImplicitGlobalGrid.jl is not built upon ParallelStencil.jl. ImplicitGlobalGrid.jl does not in any way assume or require that codes using it also use ParallelStencil.jl (and vice versa). In the documentation, there is a multi-GPU example without ParallelStencil.jl (https://github.com/eth-cscs/ImplicitGlobalGrid.jl?tab=readme-ov-file#50-lines-multi-gpu-example). In addition to pointing this out, we will also refer to the example in the documentation. The three main reasons why we chose an example using ParallelStencil.jl are 1) to illustrate interoperability with ParallelStencil.jl, which is a natural choice in Julia for the task at hand, 2) to achieve high performance on a single node, and 3) to illustrate the following:
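To make the independence from ParallelStencil.jl concrete, here is a minimal sketch along the lines of the linked 50-lines multi-GPU example; the CUDA compute kernel is elided and the grid sizes are illustrative:

```julia
using CUDA, ImplicitGlobalGrid

function diffusion3D()
    nx, ny, nz = 64, 64, 64                    # local (per-GPU) grid size (illustrative)
    me, dims   = init_global_grid(nx, ny, nz)  # set up the implicit global grid (MPI Cartesian topology)
    T  = CUDA.zeros(Float64, nx, ny, nz)       # local field including boundary/halo cells
    T2 = copy(T)
    for it = 1:100
        # ... launch a plain CUDA.jl kernel here that computes the inner points of T2 from T ...
        update_halo!(T2)                       # exchange halos with the neighboring processes
        T, T2 = T2, T
    end
    finalize_global_grid()
    return
end

diffusion3D()
```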
The previous point showed that we had failed to point out that ImplicitGlobalGrid.jl is not built upon ParallelStencil.jl, which unfortunately comes up again in this comment: the symbolic interface is not part of ImplicitGlobalGrid.jl, but of ParallelStencil.jl, and plays no role in the distributed parallelization (with the exception of the
As noted in the previous two points, ImplicitGlobalGrid.jl does not itself perform communication-computation overlap, but it performs the communication in a way that enables other packages to do it easily and optimally: "all data transfers are performed on non-blocking high-priority streams or queues" (page 2, paragraph 1). Other packages, e.g. ParallelStencil.jl, can leverage this to perform communication-computation overlap: "ParallelStencil.jl, e.g., can do so with a simple macro call (Fig. 1, line 36)" (page 2, paragraph 1). Thus, what ImplicitGlobalGrid.jl does concerning communication-computation overlap is already described there.
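For illustration, a sketch of how such an overlap looks on the caller's side with ParallelStencil.jl's @hide_communication; the boundary-width tuple and the kernel name are illustrative placeholders:

```julia
# Inside the time loop of a solver using both packages: the computation of the
# boundary region and the halo exchange are overlapped with the computation of
# the inner points.
@hide_communication (16, 8, 4) begin
    @parallel diffusion3D_step!(T2, T, Ci, lam, dt, dx, dy, dz)
    update_halo!(T2)
end
```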
Thank you for the suggestion. We can understand your point. However, if we removed the 'xPU', then many people would assume that this work applies only to CPUs when reading the title. The 'xPU' is important for us to emphasize the backend portability, and we believe it is normal that the 'x' of 'xPU' refers only to a subset of what exists. Furthermore, ImplicitGlobalGrid.jl was designed to be easily extendable with new backends, whether for other GPU kinds or other kinds of xPUs (and we will certainly add new backends in the future for any hardware that is of interest to us). We will try to emphasize this more.
Regarding the scaling up to 2197 GPUs or nodes, please see our reply to @mloubout above: 13^3 (=2197) nodes is the biggest cubic node topology that can be submitted in the normal queue of Piz Daint. We would like to add an explanation while keeping the plot as is.
Given that CPU-only supercomputers are becoming rare and that the challenge is greater when GPUs are involved, we considered it sufficient to focus on GPUs in this short paper. We see the interest of the CPU functionality mostly in code validation and possibly prototyping.
We do not have any use cases where we have a strictly fixed problem size. Thus, we considered presenting weak scaling more important.
The single-node performance is very high thanks to the implementation using ParallelStencil.jl.
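For reference, this is roughly what a compute kernel looks like with ParallelStencil.jl (a sketch along the lines of the diffusion example in the ParallelStencil.jl documentation; the math-close notation compiles to an optimized GPU or CPU kernel):

```julia
using CUDA, ParallelStencil, ParallelStencil.FiniteDifferences3D
@init_parallel_stencil(CUDA, Float64, 3)

# One explicit step of 3-D linear heat diffusion on the inner points:
@parallel function diffusion3D_step!(T2, T, Ci, lam, dt, dx, dy, dz)
    @inn(T2) = @inn(T) + dt * (lam * @inn(Ci) *
                               (@d2_xi(T) / dx^2 + @d2_yi(T) / dy^2 + @d2_zi(T) / dz^2))
    return
end
```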
Are there any performance numbers, such as Gpts/s or GCells/s, or any roofline model, that we could look at?
@editorialbot generate pdf
The error might be due to missing DOIs. @lucaferranti, is my understanding correct? @omlins, could you add the missing DOIs and try to generate the PDF again? Thank you!
@fcdimitr: please excuse the delayed reply. We aim to finalize this extended abstract hopefully within this week and the next. I have updated the bibliography file; I think the PDF should now generate.
Thank you for pointing this out. In our approach, a field will only have halos in a given dimension if the corresponding overlap between the local fields is at least two cells wide; no halos are created if the overlap is only one cell wide (redundant computation is done instead), which avoids the need to deal with asymmetrical halos. We will see how to add this to the document.
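A sketch of what this looks like at initialization, assuming the overlaps keyword argument of init_global_grid (the concrete values are illustrative):

```julia
# Overlap of 2 cells in x and y: halos are created in these dimensions.
# Overlap of 1 cell in z: no halo is created; the one-cell-wide boundary
# region is covered by redundant computation instead.
me, dims = init_global_grid(nx, ny, nz; overlaps=(2, 2, 1))
```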
Thank you for your comment. We will be happy to do a comparison with related work such as the one you mentioned, with other key work in the vast fields of distributed-memory parallelization and architecture-agnostic programming, and, more generally speaking, with important work that shares our objective of contributing to solving the challenge of the three "P"s: (scalable) Performance, (performance) Portability, and Productivity. However, given the large amount of work our contribution is related to, we aim to do so at a later point in a full paper submission. We did not have the space to do so in this very short extended abstract format of "at most two pages of content including references" (see the author's guide: https://juliacon.github.io/proceedings-guide/author/), which "lays out in a concise fashion the methodology and use cases of the work presented at the conference", as opposed to a full paper submission, which, "compared to an extended abstract, (...) compares the work to other approaches taken in the field and gives some additional insights on the conference contribution".
@editorialbot generate pdf
Hi @omlins! I'm just checking in; what is the status of this? Thank you!
We aim to address the remaining issues in the manuscript in the next couple of days. Please excuse the delay.
@editorialbot generate pdf
@editorialbot generate pdf
@mloubout, @georgebisbas: Dear reviewers, we have improved the manuscript, addressing all the issues raised.
Hi!
I highly appreciate that, but why not present a 1024^3 problem of heat diffusion, just up to the point where the local decomposed problem becomes small enough to hit a strong-scaling plateau?
It would be good to add a pointer to the ParallelStencil.jl paper in the manuscript.
@editorialbot generate pdf
@georgebisbas: Thank you very much for the additional comments. In the following you can find the replies:
We have extended the caption of Figure 2 with more information, similar to Figure 3.
We have added the pointer to the ParallelStencil.jl paper in the caption of Figure 2.
We have added contribution guidelines to the readme.
Thank you for your suggestion. It could certainly be interesting to dive into a strong-scaling analysis. However, we believe that this does not fit into this very short extended abstract of two pages. Thus, we would like to follow your original suggestion and leave this for future work.
@fcdimitr please wait until I update the meta files and info to solve the non-showing DOI and special character issues before taking further action. 🙏
@editorialbot generate pdf
@fcdimitr The DOI and author name related issues are solved. Thanks for waiting. Since there seem to be no remaining issues, we can proceed.
Thank you @luraess! What is the Zenodo archival DOI? 👋 @mloubout and @georgebisbas, could you please mark all checkboxes in your list as done? Thank you!
@omlins will share the DOI.
Done, thanks.
@fcdimitr: here you can find the DOI for the paper: https://doi.org/10.5281/zenodo.14056962
Submitting author: @omlins (Samuel Omlin)
Repository: https://github.com/omlins/ImplicitGlobalGrid.jl
Branch with paper.md (empty if default branch):
Version:
Editor: @fcdimitr
Reviewers: @mloubout, @georgebisbas
Archive: Pending
Status
Status badge code:
Reviewers and authors:
Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)
Reviewer instructions & questions
@mloubout & @georgebisbas, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:
The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @fcdimitr know.
✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨
Review checklist for @mloubout
Conflict of interest
Code of Conduct
General checks
Functionality
Documentation
Paper format
Does the paper.tex file include a list of authors with their affiliations?
Content
Review checklist for @georgebisbas
Conflict of interest
Code of Conduct
General checks
Functionality
Documentation
Paper format
Does the paper.tex file include a list of authors with their affiliations?
Content