Commit 51a2667

Merge pull request #21 from accel-sim/dev

Updating release-accelwattch with latest dev

2 parents fdc1a6a + a10de9e

23 files changed: +1212 additions, -388 deletions

CHANGES

Lines changed: 12 additions & 0 deletions
@@ -1,4 +1,16 @@
 LOG:
+Version 4.1.0 versus 4.0.0
+-Features:
+1- Supporting L1 write-allocate with sub-sector writing policy as in Volta+ hardware, and changing the Volta+ cards config to make L1 write-allocate with write-through
+2- Making the L1 adaptive cache policy to be configurable
+3- Adding Ampere RTX 3060 config files
+-Bugs:
+1- Fixing L1 bank hash function bug
+2- Fixing L1 read hit counters in gpgpu-sim to match nvprof, to achieve more accurate L1 correlation with the HW
+3- Fixing bugs in lazy write handling, thanks to Gwendolyn Voskuilen from Sandia labs for this fix
+4- Fixing the backend pipeline for sub_core model
+5- Fixing Memory stomp bug at the shader_config
+6- Some code refactoring:
 Version 4.0.0 (development branch) versus 3.2.3
 -Front-End:
 1- Support .nc cache modifier and __ldg function to access the read-only L1D cache
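
The "configurable adaptive cache policy" feature corresponds to the new knobs that appear in the Volta config diffs further down. A minimal sketch of the resulting block, with values copied from the updated QV100/TITANV configs below; the comments are a best-effort reading added here for illustration and are not part of the shipped files:

-gpgpu_adaptive_cache_config 1
# shared-memory carveout sizes (KB) the adaptive policy may pick per kernel
-gpgpu_shmem_option 0,8,16,32,64,96
# total unified L1D + shared-memory capacity per SM, in KB
-gpgpu_unified_l1d_size 128
# presumably the percentage of L1 reserved for writes under the new
# write-allocate scheme
-gpgpu_l1_cache_write_ratio 25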

README.md

Lines changed: 7 additions & 2 deletions
@@ -11,22 +11,26 @@ This version of GPGPU-Sim has been tested with a subset of CUDA version 4.2,
 Please see the copyright notice in the file COPYRIGHT distributed with this
 release in the same directory as this file.
 
+GPGPU-Sim 4.0 is compatible with Accel-Sim simulation framework. With the support
+of Accel-Sim, GPGPU-Sim 4.0 can run NVIDIA SASS traces (trace-based simulation)
+generated by NVIDIA's dynamic binary instrumentation tool (NVBit). For more information
+about Accel-Sim, see [https://accel-sim.github.io/](https://accel-sim.github.io/)
+
 If you use GPGPU-Sim 4.0 in your research, please cite:
 
 Mahmoud Khairy, Zhesheng Shen, Tor M. Aamodt, Timothy G Rogers.
 Accel-Sim: An Extensible Simulation Framework for Validated GPU Modeling.
 In proceedings of the 47th IEEE/ACM International Symposium on Computer Architecture (ISCA),
 May 29 - June 3, 2020.
 
-If you use CuDNN or PyTorch support, checkpointing or our new debugging tool for functional
+If you use CuDNN or PyTorch support (execution-driven simulation), checkpointing or our new debugging tool for functional
 simulation errors in GPGPU-Sim for your research, please cite:
 
 Jonathan Lew, Deval Shah, Suchita Pati, Shaylin Cattell, Mengchi Zhang, Amruth Sandhupatla,
 Christopher Ng, Negar Goli, Matthew D. Sinclair, Timothy G. Rogers, Tor M. Aamodt
 Analyzing Machine Learning Workloads Using a Detailed GPU Simulator, arXiv:1811.08933,
 https://arxiv.org/abs/1811.08933
 
-
 If you use the Tensor Core model in GPGPU-Sim or GPGPU-Sim's CUTLASS Library
 for your research please cite:
 
@@ -261,6 +265,7 @@ To clean the docs run
 The documentation resides at doc/doxygen/html.
 
 To run Pytorch applications with the simulator, install the modified Pytorch library as well by following instructions [here](https://github.com/gpgpu-sim/pytorch-gpgpu-sim).
+
 ## Step 3: Run
 
 Before we run, we need to make sure the application's executable file is dynamically linked to CUDA runtime library. This can be done during compilation of your program by introducing the nvcc flag "--cudart shared" in makefile (quotes should be excluded).
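
As a concrete sketch of that last point, a makefile might add the flag like this (target and file names here are hypothetical; only the --cudart shared flag comes from the README):

# link against the shared CUDA runtime so GPGPU-Sim's libcudart.so
# can be substituted at run time via LD_LIBRARY_PATH
NVCC_FLAGS += --cudart shared

app: app.cu
	nvcc $(NVCC_FLAGS) -o app app.cu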

configs/tested-cfgs/SM75_RTX2060/gpgpusim.config

Lines changed: 1 addition & 1 deletion
@@ -76,7 +76,7 @@
 
 # warp scheduling
 -gpgpu_num_sched_per_core 4
--gpgpu_scheduler gto
+-gpgpu_scheduler lrr
 # a warp scheduler issue mode
 -gpgpu_max_insn_issue_per_warp 1
 -gpgpu_dual_issue_diff_exec_units 1

configs/tested-cfgs/SM75_RTX2060_S/gpgpusim.config

Lines changed: 1 addition & 1 deletion
@@ -103,7 +103,7 @@
 # Turing has four schedulers per core
 -gpgpu_num_sched_per_core 4
 # Greedy then oldest scheduler
--gpgpu_scheduler gto
+-gpgpu_scheduler lrr
 ## In Turing, a warp scheduler can issue 1 inst per cycle
 -gpgpu_max_insn_issue_per_warp 1
 -gpgpu_dual_issue_diff_exec_units 1
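
For context on the swap made in the two Turing configs above (and repeated in the Volta configs below), a sketch of the option with the two values involved; the descriptions paraphrase the scheduler names used in GPGPU-Sim:

# gto: greedy-then-oldest - keep issuing from the same warp until it
#      stalls, then fall back to the oldest ready warp
# lrr: loose round-robin across all ready warps
-gpgpu_scheduler lrr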

configs/tested-cfgs/SM7_QV100/gpgpusim.config

Lines changed: 13 additions & 9 deletions
@@ -125,12 +125,12 @@
 -gpgpu_shmem_num_banks 32
 -gpgpu_shmem_limited_broadcast 0
 -gpgpu_shmem_warp_parts 1
--gpgpu_coalesce_arch 60
+-gpgpu_coalesce_arch 70
 
 # Volta has four schedulers per core
 -gpgpu_num_sched_per_core 4
 # Greedy then oldest scheduler
--gpgpu_scheduler gto
+-gpgpu_scheduler lrr
 ## In Volta, a warp scheduler can issue 1 inst per cycle
 -gpgpu_max_insn_issue_per_warp 1
 -gpgpu_dual_issue_diff_exec_units 1
@@ -144,17 +144,21 @@
 # For more info, see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-7-x
 # disable this mode in case of multi kernels/apps execution
 -gpgpu_adaptive_cache_config 1
-# Volta unified cache has four banks
+-gpgpu_shmem_option 0,8,16,32,64,96
+-gpgpu_unified_l1d_size 128
+# L1 cache configuration
 -gpgpu_l1_banks 4
--gpgpu_cache:dl1 S:1:128:256,L:L:s:N:L,A:256:8,16:0,32
+-gpgpu_cache:dl1 S:4:128:64,L:T:m:L:L,A:512:8,16:0,32
+-gpgpu_l1_cache_write_ratio 25
+-gpgpu_l1_latency 20
+-gpgpu_gmem_skip_L1D 0
+-gpgpu_flush_l1_cache 1
+-gpgpu_n_cluster_ejection_buffer_size 32
+# shared memory configuration
 -gpgpu_shmem_size 98304
 -gpgpu_shmem_sizeDefault 98304
 -gpgpu_shmem_per_block 65536
--gpgpu_gmem_skip_L1D 0
--gpgpu_n_cluster_ejection_buffer_size 32
--gpgpu_l1_latency 20
 -gpgpu_smem_latency 20
--gpgpu_flush_l1_cache 1
 
 # 32 sets, each 128 bytes 24-way for each memory sub partition (96 KB per memory sub partition). This gives us 6MB L2 cache
 -gpgpu_cache:dl2 S:32:128:24,L:B:m:L:P,A:192:4,32:0,32
@@ -229,4 +233,4 @@
 # tracing functionality
 #-trace_enabled 1
 #-trace_components WARP_SCHEDULER,SCOREBOARD
-#-trace_sampling_core 0
+#-trace_sampling_core 0
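
Decoding the new dl1 string, assuming the field layout of GPGPU-Sim's cache-config grammar (<sector?>:<sets>:<line size>:<assoc>, <replacement>:<write policy>:<alloc>:<write-alloc>:<set index>, <mshr type>:<entries>:<max merge>, <miss-queue>:<result-fifo>, <data-port width>); this is a best-effort reading, not an authoritative one:

# S:4:128:64   sectored cache, 4 sets x 64 ways x 128 B lines = 32 KB baseline
# L:T:m:L:L    LRU replacement, write-through, allocate-on-miss,
#              lazy-fetch-on-read write-allocate, linear set indexing
# A:512:8      associative MSHRs, 512 entries, up to 8 merged requests each
# 16:0,32      16-entry miss queue, no result FIFO, 32 B data port
-gpgpu_cache:dl1 S:4:128:64,L:T:m:L:L,A:512:8,16:0,32

The write-through plus lazy-fetch-on-read pairing matches the CHANGES entry about L1 write-allocate with sub-sector writing on Volta+ hardware.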

configs/tested-cfgs/SM7_TITANV/gpgpusim.config

Lines changed: 11 additions & 7 deletions
@@ -100,7 +100,7 @@
 # Volta has four schedulers per core
 -gpgpu_num_sched_per_core 4
 # Greedy then oldest scheduler
--gpgpu_scheduler gto
+-gpgpu_scheduler lrr
 ## In Volta, a warp scheduler can issue 1 inst per cycle
 -gpgpu_max_insn_issue_per_warp 1
 -gpgpu_dual_issue_diff_exec_units 1
@@ -114,17 +114,21 @@
 # For more info, see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-7-x
 # disable this mode in case of multi kernels/apps execution
 -gpgpu_adaptive_cache_config 1
-# Volta unified cache has four banks
+-gpgpu_shmem_option 0,8,16,32,64,96
+-gpgpu_unified_l1d_size 128
+# L1 cache configuration
 -gpgpu_l1_banks 4
--gpgpu_cache:dl1 S:1:128:256,L:L:s:N:L,A:256:8,16:0,32
+-gpgpu_cache:dl1 S:4:128:64,L:T:m:L:L,A:512:8,16:0,32
+-gpgpu_l1_cache_write_ratio 25
+-gpgpu_gmem_skip_L1D 0
+-gpgpu_l1_latency 20
+-gpgpu_flush_l1_cache 1
+-gpgpu_n_cluster_ejection_buffer_size 32
+# shared memory configuration
 -gpgpu_shmem_size 98304
 -gpgpu_shmem_sizeDefault 98304
 -gpgpu_shmem_per_block 65536
--gpgpu_gmem_skip_L1D 0
--gpgpu_n_cluster_ejection_buffer_size 32
--gpgpu_l1_latency 20
 -gpgpu_smem_latency 20
--gpgpu_flush_l1_cache 1
 
 # 32 sets, each 128 bytes 24-way for each memory sub partition (96 KB per memory sub partition). This gives us 4.5MB L2 cache
 -gpgpu_cache:dl2 S:32:128:24,L:B:m:L:P,A:192:4,32:0,32
Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
+//21*1 fly with 32 flits per packet under gpgpusim injection mode
+use_map = 0;
+flit_size = 40;
+
+// currently we do not use this, see subnets below
+network_count = 2;
+
+// Topology
+topology = fly;
+k = 78;
+n = 1;
+
+// Routing
+
+routing_function = dest_tag;
+
+
+// Flow control
+
+num_vcs = 1;
+vc_buf_size = 256;
+input_buffer_size = 256;
+ejection_buffer_size = 256;
+boundary_buffer_size = 256;
+
+wait_for_tail_credit = 0;
+
+// Router architecture
+
+vc_allocator = islip; //separable_input_first;
+sw_allocator = islip; //separable_input_first;
+alloc_iters = 1;
+
+credit_delay = 0;
+routing_delay = 0;
+vc_alloc_delay = 1;
+sw_alloc_delay = 1;
+
+input_speedup = 1;
+output_speedup = 1;
+internal_speedup = 2.0;
+
+// Traffic, GPGPU-Sim does not use this
+
+traffic = uniform;
+packet_size ={{1,2,3,4},{10,20}};
+packet_size_rate={{1,1,1,1},{2,1}};
+
+// Simulation - Don't change
+
+sim_type = gpgpusim;
+//sim_type = latency;
+injection_rate = 0.1;
+
+subnets = 2;
+
+// Always use read and write no matter following line
+//use_read_write = 1;
+
+
+read_request_subnet = 0;
+read_reply_subnet = 1;
+write_request_subnet = 0;
+write_reply_subnet = 1;
+
+read_request_begin_vc = 0;
+read_request_end_vc = 0;
+write_request_begin_vc = 0;
+write_request_end_vc = 0;
+read_reply_begin_vc = 0;
+read_reply_end_vc = 0;
+write_reply_begin_vc = 0;
+write_reply_end_vc = 0;
+
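
This is a BookSim-style interconnect file of the kind a gpgpusim.config selects when the detailed network model is enabled. A minimal sketch of the hookup, assuming the standard GPGPU-Sim options and a hypothetical file name:

# use the BookSim network model instead of the ideal interconnect
-network_mode 1
# point it at the fly-topology file above (file name is illustrative)
-inter_config_file config_fly_islip.icnt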
