LDST Unit model #239

rodhuega · 2023-07-04T13:33:36Z

rodhuega
Jul 4, 2023

I have some questions about the LDST unit in the simulator and how accurately it models NVIDIA architectures.

In the SM block diagrams published by NVIDIA, there are several LDST units per sub-core.

[Images not reproduced: NVIDIA SM block diagrams for Ampere, Turing, Volta, and Pascal.]

However, in the simulator there is a single LDST unit per SM.
My question is how the simulator emulates the behavior of Volta's 32 LDST units with a single one, and whether that model is correct and balanced.

I ask because I suppose Volta can resolve coalescing and bank conflicts in each LDST unit and has 32 dispatch regs for the load/store unit. On the other hand, the simulator has only one dispatch reg for the ldst unit. This means only one memory instruction can be dispatched to the ldst unit each cycle. Also, if the instruction in the dispatch reg needs many cycles to resolve coalescing, it stalls all other dispatches.

I do not know whether I have misunderstood something, whether this is a real problem, or whether someone is already working on bringing the sub-core model to the ldst unit.

rodhuega · 2023-07-06T15:33:23Z

rodhuega
Jul 6, 2023
Author

I have been thinking about it, and I suppose the LDST units in the diagrams are scalar (per thread). Even so, I don't understand how the simulator handles 4 units (1 per sub-core) with a single one.

Moreover, I have another related question.
Here it says that the L1 is not banked and can serve two coalesced memory requests per cycle, which I take to mean the L1D has 2 ports. The Accel-Sim paper says that the L1D (using Volta as an example) has 4 banks, but I don't see where it says how many ports there are. Is it one port per bank, so 4 ports per L1D?

I may have misread the code, but L1D_cache_fill_port_util and L1D_cache_data_port_util appear to track all the L1D caches together. The accounting does not overlap accesses to different banks inside the same L1D: every access adds to m_data_port_occupied_cycles, a global variable, which is decreased by just 1 each cycle. Likewise, m_cache_port_available_cycles is incremented by just 1. I wonder whether this metric is right, because I think it does not take into account parallel accesses to different banks and so may report a wrong port utilization.

Or is there only one port for the whole L1D even though there are several banks?
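To make the accounting concern concrete, here is a toy Python sketch. It is not simulator code: the function names, the 4-bank layout, and the 2-cycle occupancy per access are assumptions chosen for illustration. It compares a single shared occupancy counter that drains by one per cycle with a per-bank accounting in which accesses to different banks overlap.

```python
def shared_counter_busy_cycles(accesses, cycles_per_access):
    """One global counter: every access adds its occupancy, one unit drains per cycle.

    accesses is a list of (issue_cycle, bank) tuples; the bank is ignored here.
    """
    pending = sorted(accesses)
    occupied = 0
    busy = 0
    cycle = 0
    i = 0
    while i < len(pending) or occupied > 0:
        while i < len(pending) and pending[i][0] <= cycle:
            occupied += cycles_per_access
            i += 1
        if occupied > 0:
            busy += 1
            occupied -= 1
        cycle += 1
    return busy


def per_bank_busy_cycles(accesses, cycles_per_access, num_banks):
    """Per-bank counters: accesses to different banks drain in parallel."""
    pending = sorted(accesses)
    occupied = [0] * num_banks
    busy = 0
    cycle = 0
    i = 0
    while i < len(pending) or any(occupied):
        while i < len(pending) and pending[i][0] <= cycle:
            occupied[pending[i][1] % num_banks] += cycles_per_access
            i += 1
        if any(occupied):
            busy += 1  # the L1D is busy if at least one bank is busy
            occupied = [max(0, c - 1) for c in occupied]
        cycle += 1
    return busy


# Four accesses issued in the same cycle, one to each of four banks.
spread = [(0, 0), (0, 1), (0, 2), (0, 3)]
print(shared_counter_busy_cycles(spread, 2))   # 8
print(per_bank_busy_cycles(spread, 2, 4))      # 2
```

When all four accesses hit the same bank, both functions report 8 busy cycles, so the two views only diverge when accesses are spread across banks.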


barnes88 · 2023-07-10T14:37:34Z

barnes88
Jul 10, 2023
Collaborator

Hi Rodrigo,

In the block diagram they show scalar ld/st units, whereas in the simulator we model warp instructions (32 wide). Rather than having 4 ld/st units per SM, we model a single ld/st unit per SM but partition it so that each sub-core can only access 1/4 of the unit. This is similar to how we partition all of the other execution resources in the SM for each sub-core. That partitioning happens when the warp instruction is being issued to the ld/st unit. You can read more about how the sub-core partitioning is structured here.

"the simulator has only one dispatch reg for the ldst unit. This means only one memory instruction can be dispatched to the ldst unit each cycle."

This is not correct. Each SM has a single pipeline register set, but each set is configured to hold four warp instructions per SM (a warp instruction is 32 scalar lanes wide).
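As a rough illustration of that partitioning, the sketch below models a register set with one slot per sub-core, where a sub-core can only write its own slot. The class and method names are hypothetical and are not the simulator's API.

```python
class PartitionedRegisterSet:
    """Toy pipeline register set: one warp-instruction slot per sub-core."""

    def __init__(self, num_sub_cores=4):
        self.slots = [None] * num_sub_cores

    def has_free(self, sub_core):
        # A sub-core may only look at its own slot.
        return self.slots[sub_core] is None

    def push(self, sub_core, warp_inst):
        """Place warp_inst in the sub-core's slot; return False if it is occupied."""
        if not self.has_free(sub_core):
            return False
        self.slots[sub_core] = warp_inst
        return True

    def pop_ready(self):
        """Remove and return the first occupied slot as (sub_core, inst), or None."""
        for sub_core, inst in enumerate(self.slots):
            if inst is not None:
                self.slots[sub_core] = None
                return sub_core, inst
        return None


regs = PartitionedRegisterSet()
# All four sub-cores can issue a memory instruction in the same cycle.
print([regs.push(sc, f"ld{sc}") for sc in range(4)])  # [True, True, True, True]
# A second instruction from sub-core 0 has to wait for its slot to drain.
print(regs.push(0, "ld_extra"))                        # False
```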


rodhuega Jul 11, 2023
Author

Thanks for replying to me.

First of all, I agree with you about the scalar representation in the NVIDIA images. I figured it out later.

On the main point, the ldst_unit per sub-core:
The first latches you point to are known as ID_OC_(Whatever_type_instruction)*1. They are used after an instruction is issued and before it is allocated to a collector unit. The second latches you point to are OC_EX_MEM. They are used when the instruction is ready in operand collection and has been dispatched but not yet sent to an execution unit. I agree that there are 4 of each, one per sub-core.

However, the problem I am trying to describe comes a bit later. Imagine there are multiple memory instructions (at least one per sub-core) in OC_EX_MEM. The execute function is where these instructions try to move from those latches to the execution unit latches. Every unit except the ldst_unit is instantiated as many times as the config specifies, which matches the number of sub-cores and of ID_OC_ and OC_EX_ latches. The ldst_unit, however, is not configurable: there is a single one per SM. So in the execute() loop there is only one iteration that dispatches from OC_EX_MEM to the ldst_unit's m_dispatch_reg and checks whether it is empty. Four registers funnel into one, and as a result only one sub-core can progress. The problem gets worse in ldst_unit::cycle, because m_dispatch_reg holds the instruction until all the coalescing, bank conflicts and other problems are resolved and the register is cleared. So sub-core 3 can stall sub-core 1: sub-core 1 cannot dispatch any memory instruction to the ldst_unit while the unit is held by sub-core 3's instruction.

This is why I'm wondering how a single ldst_unit for the whole SM can behave as four and is not bottlenecking. I don't know if there is another trick on another side to get the correlation and have a fair design of the memory pipeline. This is why I'm asking.
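The stall scenario described above can be sketched with a toy model. This is not simulator code: the function name, the round-robin pick among sub-cores, and the occupancy numbers are assumptions for illustration. Each sub-core has a queue of memory instructions, each occupying a dispatch register for some number of cycles, and the sketch compares one shared dispatch register against one per sub-core.

```python
from collections import deque


def cycles_to_drain(per_sub_core_occupancy, shared_unit):
    """Cycles needed to drain all queues.

    per_sub_core_occupancy: one list per sub-core with the cycles (>= 1) that
    each instruction holds the dispatch register.
    shared_unit: True for a single dispatch register, False for one per sub-core.
    """
    queues = [deque(q) for q in per_sub_core_occupancy]
    n = len(queues)
    num_units = 1 if shared_unit else n
    remaining = [0] * num_units  # cycles left for the instruction each unit holds
    rr = 0                       # round-robin pointer used by the shared unit
    cycle = 0
    while any(queues) or any(remaining):
        for u in range(num_units):
            if remaining[u] == 0:
                # A shared unit scans all sub-cores; a private unit only its own.
                order = [(rr + k) % n for k in range(n)] if shared_unit else [u]
                for sc in order:
                    if queues[sc]:
                        remaining[u] = queues[sc].popleft()
                        rr = (sc + 1) % n
                        break
            if remaining[u] > 0:
                remaining[u] -= 1
        cycle += 1
    return cycle


# Sub-cores 0-2 each have a 1-cycle load; sub-core 3 has a poorly coalesced
# load that holds the register for 8 cycles.
work = [[1], [1], [1], [8]]
print(cycles_to_drain(work, shared_unit=True))   # 11
print(cycles_to_drain(work, shared_unit=False))  # 8
```

The sketch only captures dispatch-register occupancy. It says nothing about how much data the unit returns per cycle, which is a separate knob.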

*1: By the way, I don't really understand why specialized latches per execution pipeline are needed when they all feed the same kind of collector unit. I suppose this is inherited and kept for backward compatibility with previous architectures, where there was an option of having specialized collector units for each execution pipeline instead of general ones. But I don't think this is important or has an impact on performance metrics.

barnes88 Aug 14, 2023
Collaborator

"This is why I'm wondering how a single ldst_unit for the whole SM can behave as four and is not bottlenecking. I don't know if there is another trick on another side to get the correlation and have a fair design of the memory pipeline. This is why I'm asking."

Yes, we allow 128 Bytes of data to be returned instead of 32 Bytes. So even though the unit can only satisfy one sub-core's request at a time, it has the correct bandwidth. You can verify the correlation results using the L1 Bandwidth Microbenchmarks or view some of the past correlation data.
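The arithmetic behind this can be sketched as follows. The sketch assumes the 128-byte and 32-byte figures are per-cycle return widths and that a warp has 32 threads each loading 4 bytes. The function names are made up for illustration.

```python
import math

SUB_CORES = 4
WARP_SIZE = 32


def aggregate_bytes(units, bytes_per_unit_per_cycle, cycles):
    """Total bytes returned by `units` identical units over `cycles` cycles."""
    return units * bytes_per_unit_per_cycle * cycles


def cycles_for_warp_load(bytes_per_thread, return_width_bytes):
    """Cycles to return one fully coalesced warp load at the given width."""
    return math.ceil(WARP_SIZE * bytes_per_thread / return_width_bytes)


# Four 32-byte units and one 128-byte unit deliver the same peak bandwidth.
print(aggregate_bytes(SUB_CORES, 32, cycles=1000))  # 128000
print(aggregate_bytes(1, 128, cycles=1000))         # 128000

# A coalesced 4-byte-per-thread warp load is 128 bytes in total.
print(cycles_for_warp_load(4, 128))  # 1
print(cycles_for_warp_load(4, 32))   # 4
```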
