-
I have been thinking about this. I suppose the LD/ST units in the diagrams are scalar (one per thread), but I still don't understand how the simulator handles four units (one per sub-core) with a single one. I also have a related question. I may have misread the code, but L1D_cache_fill_port_util and L1D_cache_data_port_util appear to track all the L1D caches at once. The accounting does not overlap accesses to different banks inside the same L1D: every access adds to m_data_port_occupied_cycles, which is a global variable, and that variable is decreased by only 1 each cycle. Likewise, m_cache_port_available_cycles is incremented by only 1. Is this metric correct? I think it does not take parallel accesses to different banks into account and may report a wrong port utilization. Or is there only one port for the whole L1D even though it has several banks?
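To make the concern concrete, here is a minimal C++ sketch of the two accounting schemes being compared. It is not simulator code: `SerializedPort`, `BankedPort`, and every member name are hypothetical, and "utilization" in the banked model is assumed to mean the fraction of cycles in which at least one bank is busy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of the accounting described above: a single counter
// shared by every access, drained by one cycle per tick.
struct SerializedPort {
    unsigned occupied = 0;   // port cycles still pending
    unsigned busy = 0;       // ticks sampled as busy
    unsigned available = 0;  // ticks sampled in total

    void access(unsigned cycles) { occupied += cycles; }

    void tick() {
        ++available;
        if (occupied > 0) {
            ++busy;
            --occupied;
        }
    }

    double utilization() const {
        return available ? static_cast<double>(busy) / available : 0.0;
    }
};

// Hypothetical alternative: one counter per bank, so accesses that hit
// different banks drain in parallel.
struct BankedPort {
    std::vector<unsigned> occupied;  // pending cycles per bank
    unsigned busy = 0;               // ticks with at least one busy bank
    unsigned available = 0;          // ticks sampled in total

    explicit BankedPort(std::size_t num_banks) : occupied(num_banks, 0) {}

    void access(std::size_t bank, unsigned cycles) {
        assert(bank < occupied.size());
        occupied[bank] += cycles;
    }

    void tick() {
        ++available;
        bool any_busy = false;
        for (unsigned &pending : occupied) {
            if (pending > 0) {
                any_busy = true;
                --pending;
            }
        }
        if (any_busy) ++busy;
    }

    double utilization() const {
        return available ? static_cast<double>(busy) / available : 0.0;
    }
};
```

With two 2-cycle accesses to different banks followed by four ticks, the serialized model counts four busy cycles while the banked model counts two, which is the gap the question is about.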
-
Hi Rodrigo, the block diagrams show scalar ld/st units, whereas the simulator models warp instructions (32 wide). Rather than having 4 ld/st units per SM, we model a single ld/st unit per SM but partition it so that each sub-core can only access 1/4 of the unit. This is similar to how we partition all of the other execution resources in the SM for each sub-core. That partitioning happens when the warp instruction is issued to the ld/st unit. You can read more about how the sub-core partitioning is structured here.
This is not correct. Each SM has a single pipeline register set, but each set is configured to hold four warp instructions per SM (a warp instruction is 32 scalar lanes wide).
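The partitioning described above can be pictured with a small C++ sketch. This is an illustration under stated assumptions, not the simulator's implementation: `PartitionedRegisterSet`, `WarpInst`, `issue`, and `drain_one` are hypothetical names, and the lowest-index-first drain order is an arbitrary choice made only to keep the example short.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kSubCores = 4;  // assumed: four sub-cores per SM

// Stand-in for a 32-lane warp instruction (hypothetical type).
struct WarpInst {
    unsigned warp_id;
};

// One register set per SM with four slots; sub-core i may only use slot i.
class PartitionedRegisterSet {
public:
    bool has_free(std::size_t sub_core) const {
        assert(sub_core < kSubCores);
        return !valid_[sub_core];
    }

    // Issue succeeds only if this sub-core's own slot is empty.
    bool issue(std::size_t sub_core, WarpInst inst) {
        if (!has_free(sub_core)) return false;
        slots_[sub_core] = inst;
        valid_[sub_core] = true;
        return true;
    }

    // The single shared unit takes one pending instruction per call.
    // Scanning from slot 0 upward is an assumption of this sketch.
    bool drain_one(WarpInst &out) {
        for (std::size_t i = 0; i < kSubCores; ++i) {
            if (valid_[i]) {
                out = slots_[i];
                valid_[i] = false;
                return true;
            }
        }
        return false;
    }

private:
    std::array<WarpInst, kSubCores> slots_{};
    std::array<bool, kSubCores> valid_{};
};
```

In this sketch a second issue from sub-core 0 is rejected while its slot is occupied, yet sub-core 1 can still issue in the same cycle, so one stalled sub-core does not block the others from reaching the shared unit.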
-
I have some questions about the LDST unit in the simulator and how accurately it models NVIDIA architectures.
In the SM block diagrams provided by NVIDIA, there are several LDST units per sub-core.
Ampere SM:
Turing SM:
Volta SM:
Pascal SM:
However, in the simulator, there is a single unit per SM.
My question is how the simulator emulates the behavior of Volta's 32 LDST units with a single one, and whether that model is correct and balanced.
I ask because I suppose Volta can resolve coalescing or bank conflicts in each LDST unit and has 32 dispatch registers for the load/store units. The simulator, on the other hand, has only one dispatch register for the LDST unit, which means only one memory instruction can be dispatched to the LDST unit each cycle. Also, if the instruction in the dispatch register needs many cycles to resolve coalescing, it will stall other dispatches.
I do not know whether I have misunderstood something, whether this is a real problem, or whether someone is already working on bringing the sub-core model to the LDST unit.