Const Assignment Map Fusion: If two maps assign the same value for every element in a subset of the underlying array (and the subset is not dependent on the array in any way), then we can often fuse the two maps (not always possible) #1685

Draft
wants to merge 29 commits into
base: main
from

Conversation

pratyai
Collaborator

@pratyai pratyai commented Oct 15, 2024

Scheduling many map kernels that each do very little work can incur a large overhead, so we would sometimes like to fuse such maps where possible. Constant assignment maps are one such case. If two maps assign the same value to every element in a subset of the underlying array (and the subset does not depend on the array in any way), then:

  • If the two maps' subsets are identical, we can simply fuse the two maps by moving the body of one map into the other (with appropriate wiring), since the order of the assignments does not matter here; the assignments can even be deduplicated in some cases.
  • If the two maps' subsets are not identical, we can still occasionally fuse them using a grid-strided loop pattern (which essentially emulates a conditional to ensure that only the appropriate elements are assigned).
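
As a rough illustration of the two cases above, here is a minimal pure-Python sketch (not DaCe code, and not the transformation's actual implementation). The function names are hypothetical, and the reading of "grid-strided loop" as a loop whose stride equals the fused iteration range, so that each body runs at most once per iteration, is an assumption about what the PR means.

```python
def boundary_unfused(rows, cols, value=1):
    """Four separate 'maps', one per edge of the 2D domain."""
    a = [[0] * cols for _ in range(rows)]
    for j in range(cols):          # top row
        a[0][j] = value
    for j in range(cols):          # bottom row
        a[rows - 1][j] = value
    for i in range(rows):          # left column
        a[i][0] = value
    for i in range(rows):          # right column
        a[i][cols - 1] = value
    return a


def boundary_fused_per_range(rows, cols, value=1):
    """Identical ranges only: maps over the same range share one loop."""
    a = [[0] * cols for _ in range(rows)]
    for j in range(cols):          # top + bottom rows share range(cols)
        a[0][j] = value
        a[rows - 1][j] = value
    for i in range(rows):          # left + right columns share range(rows)
        a[i][0] = value
        a[i][cols - 1] = value
    return a


def boundary_fused_gsl(rows, cols, value=1):
    """Different ranges: one loop over the larger range, with strided
    inner loops acting as guards (assumed interpretation)."""
    a = [[0] * cols for _ in range(rows)]
    span = max(rows, cols)
    for t in range(span):
        # Stride == span, so this runs once if t < cols, else not at all.
        for j in range(t, cols, span):
            a[0][j] = value
            a[rows - 1][j] = value
        # Likewise: runs once if t < rows, else not at all.
        for i in range(t, rows, span):
            a[i][0] = value
            a[i][cols - 1] = value
    return a
```

In this sketch the per-range variant leaves two loops and the strided variant leaves one, which lines up with the two `map_fusion_wrapper` nodes versus one in the 2D instrumentation reports quoted in the comments.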

Motivating Example

Consider the following graphs, all representing a computation that assigns 1 to the boundary of a 2D domain. The first row shows the graphs scheduled for CPU, the second for GPU.

| Device | Original | w/o GSL | with GSL |
|--------|----------|---------|----------|
| CPU | 2d-orig | 2d-no-gsl | 2d-with-gsl |
| GPU | 2d-orig-GPU | 2d-no-gsl-GPU | 2d-with-gsl-GPU |

Performance

We have profiled a 2D and a 3D boundary initialization, each on CPU and GPU (both on the Davinci cluster).
Benchmark scripts and reports can be found at https://github.com/pratyai/dace/tree/bench-const-assignment-fusion
I will quote the performance summaries in further comments.

Comment on GPU performance

The GPU transformation adds operations that copy the entire array to and from GPU memory, resulting in O(n^d) main memory <=> GPU data movement, whereas the assignment itself only touches O(n^{d-1}) elements. However, this is only because the benchmark does nothing but the assignment. In real computations, we would likely need to move the entire array anyway.

Because of this, it is probably better to just focus on the combined performance of the map kernels here.
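
To make the O(n^d) versus O(n^{d-1}) gap concrete, here is a small sketch that counts elements for a cubic domain of side n in d dimensions. The helper names and the example sizes are hypothetical and are not the sizes used in the benchmark.

```python
def total_elements(n, d):
    """Elements copied when the whole n^d array is moved."""
    return n ** d


def boundary_elements(n, d):
    """Elements on the boundary: everything except the (n-2)^d interior."""
    interior = max(n - 2, 0) ** d
    return n ** d - interior


for n, d in [(1024, 2), (256, 3)]:
    moved = total_elements(n, d)
    touched = boundary_elements(n, d)
    print(f"n={n}, d={d}: moved={moved}, assigned={touched}, ratio={moved / touched:.1f}")
```

The ratio grows roughly linearly with n, which is why the whole-SDFG GPU timings are dominated by the copy rather than by the assignment kernels.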

@pratyai
Collaborator Author

pratyai commented Oct 15, 2024

CPU Results (comment might be updated)

benchmark_const_assignment_fusion_test_assign_top_row_0 0.13079101336188614 ms
===2D boundary init: original op===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              0.108          0.128          0.131          0.366          
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_top_row_47)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.007          0.020          0.021          0.045          
---------------------------------------------------------------------------
|-State (1)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_bottom_row_53)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.005          0.006          0.006          0.010          
---------------------------------------------------------------------------
|-State (2)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_left_col_59)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.004          0.005          0.005          0.024          
---------------------------------------------------------------------------
|-State (3)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_right_col_65)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.004          0.006          0.005          0.100          
---------------------------------------------------------------------------


benchmark_const_assignment_fusion_test_assign_top_row_0 0.11151551734656096 ms
===2D boundary init: fused op w/o. grid-strided loop===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              0.090          0.111          0.112          0.920          
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (4, Map map_fusion_wrapper)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.014          0.021          0.018          0.151          
---------------------------------------------------------------------------
| |-Node (9, Map map_fusion_wrapper)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.004          0.006          0.005          0.054          
---------------------------------------------------------------------------


benchmark_const_assignment_fusion_test_assign_top_row_0 0.10572001338005066 ms
===2D boundary init: fused op with grid-strided loop===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              0.084          0.103          0.106          0.393          
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (8, Map map_fusion_wrapper)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.009          0.022          0.022          0.024          
---------------------------------------------------------------------------


benchmark_const_assignment_fusion_test_assign_bounary_3d 0.23336600861512125 ms
===3D boundary init: original op===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              0.214          0.233          0.233          0.497          
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_top_face_90)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.020          0.027          0.027          0.054          
---------------------------------------------------------------------------
|-State (1)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_bottom_face_9)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.013          0.015          0.014          0.038          
---------------------------------------------------------------------------
|-State (2)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_front_face_10)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.012          0.013          0.013          0.035          
---------------------------------------------------------------------------
|-State (3)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_back_face_108)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.010          0.011          0.011          0.038          
---------------------------------------------------------------------------
|-State (4)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_left_face_114)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.029          0.031          0.031          0.051          
---------------------------------------------------------------------------
|-State (5)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_right_face_12)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.024          0.026          0.025          0.051          
---------------------------------------------------------------------------


benchmark_const_assignment_fusion_test_assign_bounary_3d 0.19121551304124296 ms
===3D boundary init: fused op w/o. grid-strided loop===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              0.173          0.192          0.191          0.427          
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (4, Map map_fusion_wrapper)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.026          0.033          0.033          0.056          
---------------------------------------------------------------------------
| |-Node (9, Map map_fusion_wrapper)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.018          0.020          0.020          0.082          
---------------------------------------------------------------------------
| |-Node (14, Map map_fusion_wrapper)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.037          0.043          0.042          0.066          
---------------------------------------------------------------------------


benchmark_const_assignment_fusion_test_assign_bounary_3d 0.18826551968231797 ms
===3D boundary init: fused op with grid-strided loop===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              0.170          0.188          0.188          0.469          
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (14, Map map_fusion_wrapper)                                                            
| | |Thread 12176263715017013721:                                                            
| | |          0.091          0.099          0.096          0.200          
---------------------------------------------------------------------------

@pratyai
Collaborator Author

pratyai commented Oct 15, 2024

GPU Results (comment might be updated)

benchmark_const_assignment_fusion_test_assign_top_row_0 167.06380699179135 ms
===2D boundary init: original op===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              166.200        167.907        167.064        172.466        
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_top_row_47)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.002          0.004          0.004          0.009          
---------------------------------------------------------------------------
|-State (1)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_bottom_row_53)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.004          0.005          0.004          0.018          
---------------------------------------------------------------------------
|-State (2)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_left_col_59)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.004          0.006          0.006          0.014          
---------------------------------------------------------------------------
|-State (3)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_right_col_65)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.005          0.006          0.006          0.014          
---------------------------------------------------------------------------


benchmark_const_assignment_fusion_test_assign_top_row_0 170.15381151577458 ms
===2D boundary init: fused op w/o. grid-strided loop===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              166.765        169.892        170.154        172.003        
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (6, Map map_fusion_wrapper)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.003          0.004          0.004          0.008          
---------------------------------------------------------------------------
| |-Node (11, Map map_fusion_wrapper)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.008          0.009          0.008          0.022          
---------------------------------------------------------------------------


benchmark_const_assignment_fusion_test_assign_top_row_0 165.11968348640949 ms
===2D boundary init: fused op with grid-strided loop===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              164.683        165.159        165.120        167.484        
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (10, Map map_fusion_wrapper)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.007          0.008          0.008          0.013          
---------------------------------------------------------------------------


benchmark_const_assignment_fusion_test_assign_bounary_3d 50.67180350306444 ms
===3D boundary init: original op===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              49.457         50.629         50.672         51.666         
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_top_face_90)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.008          0.008          0.008          0.013          
---------------------------------------------------------------------------
|-State (1)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_bottom_face_9)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.008          0.009          0.010          0.023          
---------------------------------------------------------------------------
|-State (2)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_front_face_10)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.008          0.008          0.008          0.024          
---------------------------------------------------------------------------
|-State (3)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_back_face_108)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.006          0.008          0.008          0.027          
---------------------------------------------------------------------------
|-State (4)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_left_face_114)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.042          0.044          0.044          0.053          
---------------------------------------------------------------------------
|-State (5)                                                                
| |-Node (0, Map benchmark_const_assignment_fusion_test_assign_right_face_12)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.044          0.045          0.045          0.055          
---------------------------------------------------------------------------


benchmark_const_assignment_fusion_test_assign_bounary_3d 50.77531602000818 ms
===3D boundary init: fused op w/o. grid-strided loop===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              50.484         50.773         50.775         51.235         
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (6, Map map_fusion_wrapper)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.008          0.008          0.008          0.012          
---------------------------------------------------------------------------
| |-Node (11, Map map_fusion_wrapper)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.006          0.008          0.008          0.020          
---------------------------------------------------------------------------
| |-Node (16, Map map_fusion_wrapper)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.050          0.052          0.052          0.065          
---------------------------------------------------------------------------


benchmark_const_assignment_fusion_test_assign_bounary_3d 49.49491852312349 ms
===3D boundary init: fused op with grid-strided loop===
Instrumentation report
SDFG Hash: 
---------------------------------------------------------------------------
Element        Runtime (ms)   
               Min            Mean           Median         Max            
---------------------------------------------------------------------------
SDFG (0)                                                                   
|:                                                                         
|              49.321         49.533         49.495         50.064         
---------------------------------------------------------------------------
|-State (0)                                                                
| |-Node (16, Map map_fusion_wrapper)                                                            
| | |Thread 3118465057716905996:                                                            
| | |          0.055          0.057          0.057          0.063          
---------------------------------------------------------------------------
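
Since the whole-SDFG GPU timings are dominated by the array copies, one way to read the reports above is to add up the median kernel runtimes per variant. The sketch below does only that, with the median values copied from the GPU reports in this comment. Note that a sum of per-kernel medians is only a rough proxy, not the median of the combined kernel time.

```python
# Median kernel runtimes in ms, copied from the GPU instrumentation reports above.
gpu_kernel_medians_ms = {
    "2D": {
        "original": [0.004, 0.004, 0.006, 0.006],
        "fused w/o GSL": [0.004, 0.008],
        "fused with GSL": [0.008],
    },
    "3D": {
        "original": [0.008, 0.010, 0.008, 0.008, 0.044, 0.045],
        "fused w/o GSL": [0.008, 0.008, 0.052],
        "fused with GSL": [0.057],
    },
}


def combined_kernel_ms(medians_by_variant):
    """Sum the per-kernel medians for each variant (rounded to 3 decimals)."""
    return {
        variant: round(sum(values), 3)
        for variant, values in medians_by_variant.items()
    }


summary = {
    case: combined_kernel_ms(variants)
    for case, variants in gpu_kernel_medians_ms.items()
}

for case, totals in summary.items():
    for variant, total in totals.items():
        print(f"{case} {variant}: {total:.3f} ms")
```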

@pratyai pratyai added the no-ci Do not run any CI or actions for this PR label Oct 24, 2024
@pratyai
Collaborator Author

pratyai commented Oct 28, 2024

@ThrudPrimrose

The PR contains all the working ideas I had about the constant assignment map fusion:

  • Fusing two maps if they both "blindly" assign constants (when multiple arrays are involved, every range of a given array must always get the same constant, so that the order does not matter).
  • Fusing maps whose constant assignments are hidden behind a branch that only depends on loop parameters (so that we can fuse grid-strided loops too).
  • Fusing two states too, if map fusion becomes possible afterwards (as described above).
  • Optionally, fusing maps whose ranges are not exactly the same, using a grid-strided loop.
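
The "same constant" requirement in the first point can be illustrated with a small pure-Python sketch (hypothetical helper names, not part of the PR). It applies a set of constant-assignment maps in every possible order and checks whether the final array is always the same.

```python
from itertools import permutations


def apply_maps(size, maps, order):
    """Apply constant-assignment maps (subset, value) in the given order."""
    a = [0] * size
    for k in order:
        subset, value = maps[k]
        for i in subset:
            a[i] = value
    return a


def order_independent(size, maps):
    """True if every execution order of the maps yields the same array."""
    results = {
        tuple(apply_maps(size, maps, order))
        for order in permutations(range(len(maps)))
    }
    return len(results) == 1


# Overlapping subsets, same constant: any order gives the same result.
same_constant = [(range(0, 6), 1), (range(4, 10), 1)]
# Overlapping subsets, different constants: the last writer wins on the overlap.
different_constants = [(range(0, 6), 1), (range(4, 10), 2)]

print(order_independent(10, same_constant))        # order does not matter
print(order_independent(10, different_constants))  # order matters
```

Maps with different constants on disjoint subsets would also be order independent in this sketch; requiring the same constant per array is the simpler, more conservative condition described above.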

If this is too many features at once, I can move the non-essential pieces to future PRs. Please let me know if you'd like to see more tests, a different implementation or organization, etc.

@pratyai pratyai changed the title Const assignment map fusion Const Assignment Map Fusion: If two maps assign the same value for every element in a subset of the underlying array (and the subset is not dependent on the array in any way), then we can often fuse the two maps (not always possible) Oct 29, 2024