Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update the memory selection layout when there is a layout mismatch #4346

Merged
merged 2 commits into from
Oct 7, 2024

Conversation

anagainaru
Copy link
Contributor

Since we set the memory space after we set all the variable shapes and selections, the memory selection needs to be updated together with the variable shape.

        const adios2::Dims shape{static_cast<size_t>(Nx * mpiSize), Ny};
        const adios2::Dims start{static_cast<size_t>(Nx * mpiRank), 0};
        const adios2::Dims count{Nx, Ny};
        auto var = io.DefineVariable<float>("r32", shape, start, count);

        const adios2::Dims memoryStart = {ghostCells, ghostCells};
        const adios2::Dims memoryCount = {totalNx, totalNy};
        var.SetMemorySelection({memoryStart, memoryCount});

        var.SetMemorySpace(adios2::MemorySpace::GPU);

The SetMemorySpace call was updating the shape, start and count but not the memory selection.

@franzpoeschel this is why your code was writing the wrong data for your 2D array. Could you please double check?

@anagainaru anagainaru requested a review from pnorbert September 13, 2024 14:37
@franzpoeschel
Copy link
Contributor

Hello Ana, thank you for your help.

Unfortunately, the reproducer linked in this comment still produces this file for me, even with this PR:

> bpls MemSelection.bp/ -d                                                                                                                                  
  double   cpu_pointer_with_mem_selection     {12, 12}                                                                                                      
    ( 0, 0)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    ( 0, 6)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    ( 1, 0)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    ( 1, 6)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    ( 2, 0)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    ( 2, 6)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    ( 3, 0)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    ( 3, 6)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    ( 4, 0)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    ( 4, 6)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    ( 5, 0)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    ( 5, 6)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    ( 6, 0)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    ( 6, 6)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    ( 7, 0)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    ( 7, 6)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    ( 8, 0)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    ( 8, 6)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    ( 9, 0)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    ( 9, 6)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    (10, 0)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    (10, 6)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    (11, 0)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    (11, 6)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
                                                                                                                                                            
  double   cpu_pointer_without_mem_selection  {12, 12}                                                                                                      
    ( 0, 0)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    ( 0, 6)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170                                                                
    ( 1, 0)    0 0 0 0 0 0                                                                                                                                  
    ( 1, 6)    0 0 0 0 0 0
    ( 2, 0)    0 0 0 0 0 0
    ( 2, 6)    0 0 0 0 0 0
    ( 3, 0)    0 0 0 0 0 0
    ( 3, 6)    0 0 0 0 0 0
    ( 4, 0)    0 0 0 0 0 0
    ( 4, 6)    0 0 0 0 0 0
    ( 5, 0)    0 0 0 0 5.85364e+170 5.85364e+170
    ( 5, 6)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170
    ( 6, 0)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 0 0
    ( 6, 6)    0 0 0 0 0 0
    ( 7, 0)    0 0 0 0 0 0
    ( 7, 6)    0 0 0 0 0 0
    ( 8, 0)    0 0 0 0 0 0
    ( 8, 6)    0 0 0 0 0 0
    ( 9, 0)    0 0 0 0 0 0
    ( 9, 6)    0 0 0 0 0 0
    (10, 0)    0 0 0 0 0 0
    (10, 6)    0 0 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170
    (11, 0)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170
    (11, 6)    5.85364e+170 5.85364e+170 0 0 0 0

  double   gpu_pointer_with_mem_selection     {12, 12}
    ( 0, 0)    0 0 0 0 0 0
    ( 0, 6)    0 0 0 0 0 0
    ( 1, 0)    0 0 0 0 0 0
    ( 1, 6)    0 0 0 0 0 0
    ( 2, 0)    0 0 0 0 0 0
    ( 2, 6)    0 0 0 0 0 0
    ( 3, 0)    0 0 0 0 0 0
    ( 3, 6)    0 0 0 0 0 0
    ( 4, 0)    0 0 0 0 0 0
    ( 4, 6)    0 0 0 0 0 0
    ( 5, 0)    0 0 0 0 0 0
    ( 5, 6)    0 0 0 0 0 0
    ( 6, 0)    0 0 0 0 0 0
    ( 6, 6)    0 0 0 0 0 0
    ( 7, 0)    0 0 0 0 0 0
    ( 7, 6)    0 0 0 0 0 0
    ( 8, 0)    0 0 0 0 0 0
    ( 8, 6)    0 0 0 0 0 0
    ( 9, 0)    0 0 0 0 0 0
    ( 9, 6)    0 0 0 0 0 0
    (10, 0)    0 0 0 0 0 0
    (10, 6)    0 0 0 0 0 0
    (11, 0)    0 0 0 0 0 0
    (11, 6)    0 0 0 0 0 0

  double   gpu_pointer_without_mem_selection  {12, 12}
    ( 0, 0)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170
    ( 0, 6)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170
    ( 1, 0)    0 0 0 0 0 0
    ( 1, 6)    0 0 0 0 0 0
    ( 2, 0)    0 0 0 0 0 0
    ( 2, 6)    0 0 0 0 0 0
    ( 3, 0)    0 0 0 0 0 0
    ( 3, 6)    0 0 0 0 0 0
    ( 4, 0)    0 0 0 0 0 0
    ( 4, 6)    0 0 0 0 0 0
    ( 5, 0)    0 0 0 0 5.85364e+170 5.85364e+170
    ( 5, 6)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170
    ( 6, 0)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 0 0
    ( 6, 6)    0 0 0 0 0 0
    ( 7, 0)    0 0 0 0 0 0
    ( 7, 6)    0 0 0 0 0 0
    ( 8, 0)    0 0 0 0 0 0
    ( 8, 6)    0 0 0 0 0 0
    ( 9, 0)    0 0 0 0 0 0
    ( 9, 6)    0 0 0 0 0 0
    (10, 0)    0 0 0 0 0 0
    (10, 6)    0 0 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170
    (11, 0)    5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170 5.85364e+170
    (11, 6)    5.85364e+170 5.85364e+170 0 0 0 0

I will try this on other systems next week, since you report that you cannot reproduce this.

@anagainaru anagainaru force-pushed the memSelLayout branch 5 times, most recently from 6f5c639 to f517038 Compare September 13, 2024 17:20
@anagainaru
Copy link
Contributor Author

anagainaru commented Sep 13, 2024

Ok, so two things I notice when I look at your code:

  1. You are using the same memory selection for both CPU and GPU but the layout between the two is different so the data is not accessed the same for both. If you change the selection for the GPU to:
    var.SetMemorySelection({{0, 0}, {pitch / sizeof(value_type), square_size}}); do you still see zeros?

  2. You are using cudaMallocPitch which maybe doesn't allocate continuous memory, which may explain why I cannot reproduce what you are seeing. This I am not sure about, I will check and come back to you.

I really think your issue now is a layout mismatch. CudaMemCpy doesn't reorganize data so you'll have the same data but we access it differently depending on if it's on the CPU and GPU. I can talk later today on teams if you want

@franzpoeschel
Copy link
Contributor

franzpoeschel commented Sep 14, 2024

2. You are using cudaMallocPitch which maybe doesn't allocate continuous memory, which may explain why I cannot reproduce what you are seeing. This I am not sure about, I will check and come back to you.

Correct, this is the reason why I want to use a memory selection here in the first place.
cudaMallocPitch allocates a 2D array and pads each line up to some platform-dependent power of 2, and I want to be rid of the padding in the output.

I output four variables in order to verify that this runs as I expect:

  1. gpu_pointer_without_mem_selection outputs the memory pointer and ignores that there is a padding. The output shows this: The actual 12 data items per line are padded up by zeroes to a physical line length of 64.
  2. In order to verify that I'm applying the memory selection correctly, I download the data to CPU with a simple cudaMemcpy(), ignoring the memory layout, meaning that also the padding is downloaded.
    I verify this by writing the variable cpu_pointer_without_mem_selection: It has the exact same content as gpu_pointer_without_mem_selection, so the memory layouts on CPU and GPU are identical.
  3. I then apply the memory selection on CPU and write this to cpu_pointer_with_mem_selection. This gives me the expected results.
  4. Trying the same on the GPU in gpu_pointer_with_mem_selection gives me only zeroes.
  1. You are using the same memory selection for both CPU and GPU but the layout between the two is different so the data is not accessed the same for both. If you change the selection for the GPU to:
    var.SetMemorySelection({{0, 0}, {pitch / sizeof(value_type), square_size}}); do you still see zeros?

I'll try that next week. Since you mention that you access CPU/GPU data differently, even when it has the same layout as in this case, this seems like I need to know some special tricks to output padded GPU data with memory selections?
Thank you for helping me to get this running!

@anagainaru anagainaru merged commit 24c3860 into ornladios:master Oct 7, 2024
37 checks passed
@anagainaru anagainaru deleted the memSelLayout branch October 7, 2024 15:14
@anagainaru anagainaru added this to the v2.10.2 milestone Oct 7, 2024
vicentebolea pushed a commit to vicentebolea/ADIOS2 that referenced this pull request Oct 23, 2024
Update the memory selection layout when there is a layout mismatch

(cherry picked from commit 24c3860)
vicentebolea pushed a commit to vicentebolea/ADIOS2 that referenced this pull request Oct 23, 2024
Update the memory selection layout when there is a layout mismatch

(cherry picked from commit 24c3860)
vicentebolea pushed a commit to vicentebolea/ADIOS2 that referenced this pull request Oct 23, 2024
Update the memory selection layout when there is a layout mismatch

(cherry picked from commit 24c3860)
vicentebolea pushed a commit to vicentebolea/ADIOS2 that referenced this pull request Oct 23, 2024
Update the memory selection layout when there is a layout mismatch

(cherry picked from commit 24c3860)
vicentebolea pushed a commit to vicentebolea/ADIOS2 that referenced this pull request Oct 24, 2024
Update the memory selection layout when there is a layout mismatch

(cherry picked from commit 24c3860)
vicentebolea pushed a commit to vicentebolea/ADIOS2 that referenced this pull request Oct 24, 2024
Update the memory selection layout when there is a layout mismatch

(cherry picked from commit 24c3860)
vicentebolea pushed a commit to vicentebolea/ADIOS2 that referenced this pull request Oct 24, 2024
Update the memory selection layout when there is a layout mismatch

(cherry picked from commit 24c3860)
vicentebolea added a commit that referenced this pull request Oct 26, 2024
* release_210: (33 commits)
  Bump version to v2.10.2
  Merge pull request #4379 from vicentebolea/increace-cmake-version-dep-blosc
  Fix missing int64_t definition (#4365)
  Merge pull request #4346 from anagainaru/memSelLayout
  Merge pull request #4343 from anagainaru/gpuSpan
  Compress the lossless part of MGARD data with Zstd. It was uncompressed until now. (#4339)
  Fix an issue when reading blocks in streaming mode, when some producers did not write any block to output. (#4332)
  Fix in python API:  (#4331)
  Merge pull request #4303 from anagainaru/build-frontier
  Merge pull request #4293 from anagainaru/check_parser_lexer
  Derived variables of type StoreData should not record the expression string (#4247)
  Update ReadMe.md (#4243)
  Switch from ROCM_VERSION_MAJOR to HIP_VERSION_MAJOR (#4222)
  Merge pull request #4207 from anagainaru/hide-symbols
  Patch for when both xrootd and derived variables are activated (#4200)
  Add defines for BP3, BP4 and BP5 (#4191)
  Overload NdCopy instead of adding a new argument for the 2.10 release
  Revert "Backport #3759 to Release 2.10 branch. (cherry-picked from master branch #4340) (#4341)"
  Revert "Merge pull request #4189 from pnorbert/campaign-s3-support"
  ci,python: enable python win2022 serial build
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants