i#6949: Enable core-sharded by default for simulators
Adds a new interface trace_analysis_tool::preferred_shard_type() to
the drmemtrace framework to allow tools to request core-sharded
operation.

The cache simulator, TLB simulator, and schedule_stats tools override
the new interface to request core-sharded mode.
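
As an illustration, a tool opts in by overriding the new virtual method.
The sketch below uses a hypothetical `my_tool_t` class; the override body
mirrors the simulator_t change in this commit:
```
class my_tool_t : public analysis_tool_t {
public:
    shard_type_t
    preferred_shard_type() override
    {
        // Ask the framework for a dynamic schedule onto virtual cores
        // rather than the default per-software-thread sharding.
        return SHARD_BY_CORE;
    }
    // ... remaining overrides (process_memref(), print_results(), etc.) ...
};
```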

Unfortunately, it is not easy to detect core-sharded-on-disk traces in
the launcher, so the user must now pass `-no_core_sharded` when using
such traces with core-sharded-preferring tools to avoid the trace
being re-scheduled yet again.  Documentation for this is added, and
re-scheduling an already core-sharded trace is now a fatal error since
it is almost certainly user error.
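
For example, when analyzing a core-sharded-on-disk trace with the cache
simulator (the trace directory path here is just a placeholder):
```
  $ bin64/drrun -t drcachesim -indir path/to/core-sharded-trace-dir -no_core_sharded -tool cache_simulator
```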

In the launcher, if all tools prefer core-sharded, and the user did
not specify -no_core_sharded, core-sharded (or core-serial) mode is
enabled, with a -verbose 1+ message.
```
  $ bin64/drrun -stderr_mask 0 -t drcachesim -indir ../src/clients/drcachesim/tests/drmemtrace.threadsig.x64.tracedir/ -verbose 1 -tool schedule_stats:cache_simulator
  Enabling -core_serial as all tools prefer it
  <...>
  Schedule stats tool results:
  Total counts:
             4 cores
             8 threads: 1257600, 1257602, 1257599, 1257603, 1257598, 1257604, 1257596, 1257601
        638938 instructions
  <...>
  Core #0 schedule: AEA_A_
  Core #1 schedule: BH_
  Core #2 schedule: CG
  Core #3 schedule: DF_
  <...>
  Cache simulation results:
  Core #0 (traced CPU(s): #0)
    L1I0 (size=32768, assoc=8, block=64, LRU) stats:
      Hits:                          123,659
  <...>
```

If at least one tool prefers core-sharded but others do not, a
-verbose 1+ message suggests running with an explicit -core_sharded.
```
  $ bin64/drrun -stderr_mask 0 -t drcachesim -indir ../src/clients/drcachesim/tests/drmemtrace.threadsig.x64.tracedir/ -verbose 1 -tool cache_simulator:basic_counts
  Some tool(s) prefer core-sharded: consider re-running with -core_sharded or -core_serial enabled for best results.
```

Reduces the scheduler queue diagnostics by 5x as they seem too
frequent in short runs.

Updates the documentation to mention the new defaults.

Updates numerous drcachesim test output templates.

Keeps a couple of tests using thread-sharded by passing -no_core_serial.

Fixes #6949
derekbruening committed Oct 15, 2024
1 parent a6af579 commit 2c5ea79
Showing 22 changed files with 168 additions and 67 deletions.
6 changes: 6 additions & 0 deletions api/docs/release.dox
@@ -265,6 +265,12 @@ Further non-compatibility-affecting changes include:
- Added -trace_instr_intervals_file option to the drmemtrace trace analysis tools
framework. The file must be in CSV format containing a <start,duration> tracing
interval per line where start and duration are expressed in number of instructions.
- Added trace_analysis_tool::preferred_shard_type() to the drmemtrace framework to
allow switching to core-sharded by default if all tools prefer that mode.
- For the drmemtrace framework, if only core-sharded-preferring tools are enabled
(these include cache and TLB simulators and the schedule_stats tool), -core_sharded or
-core_serial is automatically turned on for offline analysis to enable more
representative simulated software thread scheduling onto virtual cores.

**************************************************
<hr>
10 changes: 10 additions & 0 deletions clients/drcachesim/analysis_tool.h
@@ -156,6 +156,16 @@ template <typename RecordType> class analysis_tool_tmpl_t {
{
return "";
}
/**
* Identifies the preferred shard type for this analysis. If every tool requests
* #SHARD_BY_CORE, the framework may decide to use that mode even if the user
* left the default thread-sharded mode in place.
*/
virtual shard_type_t
preferred_shard_type()
{
return SHARD_BY_THREAD;
}
/** Returns whether the tool was created successfully. */
virtual bool
operator!()
45 changes: 45 additions & 0 deletions clients/drcachesim/analyzer_multi.cpp
@@ -472,8 +472,53 @@ analyzer_multi_tmpl_t<RecordType, ReaderType>::analyzer_multi_tmpl_t()
return;
}

bool offline = !op_indir.get_value().empty() || !op_infile.get_value().empty();
// TODO i#7040: Add core-sharded support for online tools.
if (offline && !op_core_sharded.specified() && !op_core_serial.specified() &&
!op_cpu_scheduling.get_value()) {
bool switch_to_core_sharded = true;
bool one_prefers_core_sharded = false;
for (int i = 0; i < this->num_tools_; ++i) {
if (this->tools_[i]->preferred_shard_type() == SHARD_BY_CORE) {
one_prefers_core_sharded = true;
} else {
switch_to_core_sharded = false;
break;
}
if (this->parallel_ && !this->tools_[i]->parallel_shard_supported()) {
this->parallel_ = false;
}
}
if (switch_to_core_sharded) {
// XXX i#6949: Ideally we could detect a core-sharded-on-disk input
// here and avoid this but that's not simple so we rely on the user
// to pass -no_core_sharded for such inputs.
if (this->parallel_) {
if (op_verbose.get_value() > 0)
fprintf(stderr, "Enabling -core_sharded as all tools prefer it\n");
op_core_sharded.set_value(true);
} else {
if (op_verbose.get_value() > 0)
fprintf(stderr, "Enabling -core_serial as all tools prefer it\n");
op_core_serial.set_value(true);
}
} else if (one_prefers_core_sharded) {
if (op_verbose.get_value() > 0) {
fprintf(stderr,
"Some tool(s) prefer core-sharded: consider re-running with "
"-core_sharded or -core_serial enabled for best results.\n");
}
}
}

typename sched_type_t::scheduler_options_t sched_ops;
if (op_core_sharded.get_value() || op_core_serial.get_value()) {
if (!offline) {
// TODO i#7040: Add core-sharded support for online tools.
this->success_ = false;
this->error_string_ = "Core-sharded is not yet supported for online analysis";
return;
}
if (op_core_serial.get_value()) {
this->parallel_ = false;
}
20 changes: 16 additions & 4 deletions clients/drcachesim/common/options.cpp
@@ -299,13 +299,19 @@ droption_t<std::string> op_v2p_file(
droption_t<bool> op_cpu_scheduling(
DROPTION_SCOPE_CLIENT, "cpu_scheduling", false,
"Map threads to cores matching recorded cpu execution",
"By default, the simulator schedules threads to simulated cores in a static "
"By default for online analysis, the simulator schedules threads to simulated cores "
"in a static "
"round-robin fashion. This option causes the scheduler to instead use the recorded "
"cpu that each thread executed on (at a granularity of the trace buffer size) "
"for scheduling, mapping traced cpu's to cores and running each segment of each "
"thread on the core that owns the recorded cpu for that segment. "
"This option is not supported with -core_serial; use "
"-cpu_schedule_file with -core_serial instead.");
"-cpu_schedule_file with -core_serial instead. For offline analysis, the "
"recommendation is to not recreate the as-traced schedule (as it is not accurate due "
"to overhead) and instead use a dynamic schedule via -core_serial. If only "
"core-sharded-preferring tools are enabled (e.g., " CPU_CACHE ", " TLB
", " SCHEDULE_STATS
"), -core_serial is automatically turned on for offline analysis.");

droption_t<bytesize_t> op_max_trace_size(
DROPTION_SCOPE_CLIENT, "max_trace_size", 0,
@@ -894,15 +900,21 @@ droption_t<bool> op_core_sharded(
"software threads. This option instead schedules those threads onto virtual cores "
"and analyzes each core in parallel. Thus, each shard consists of pieces from "
"many software threads. How the scheduling is performed is controlled by a set "
"of options with the prefix \"sched_\" along with -cores.");
"of options with the prefix \"sched_\" along with -cores. If only "
"core-sharded-preferring tools are enabled (" CPU_CACHE ", " TLB ", " SCHEDULE_STATS
") and they all support parallel operation, -core_sharded is automatically "
"turned on for offline analysis.");

droption_t<bool> op_core_serial(
DROPTION_SCOPE_ALL, "core_serial", false, "Analyze per-core in serial.",
"In this mode, scheduling is performed just like for -core_sharded. "
"However, the resulting schedule is acted upon by a single analysis thread"
"which walks the N cores in lockstep in round robin fashion. "
"How the scheduling is performed is controlled by a set "
"of options with the prefix \"sched_\" along with -cores.");
"of options with the prefix \"sched_\" along with -cores. If only "
"core-sharded-preferring tools are enabled (" CPU_CACHE ", " TLB ", " SCHEDULE_STATS
") and not all of them support parallel operation, -core_serial is automatically "
"turned on for offline analysis.");

droption_t<int64_t>
// We pick 10 million to match 2 instructions per nanosecond with a 5ms quantum.
18 changes: 13 additions & 5 deletions clients/drcachesim/docs/drcachesim.dox.in
@@ -1292,17 +1292,21 @@ Neither simulator has a simple way to know which core any particular thread
executed on for each of its instructions. The tracer records which core a
thread is on each time it writes out a full trace buffer, giving an
approximation of the actual scheduling: but this is not representative
due to overhead (see \ref sec_drcachesim_as_traced). By default, these cache and TLB
simulators ignore that
due to overhead (see \ref sec_drcachesim_as_traced). For online analysis, by default,
these cache and TLB simulators ignore that
information and schedule threads to simulated cores in a static round-robin
fashion with load balancing to fill in gaps with new threads after threads
exit. The option "-cpu_scheduling" (see \ref sec_drcachesim_ops) can be
used to instead map each physical cpu to a simulated core and use the
recorded cpu that each segment of thread execution occurred on to schedule
execution following the "as traced" schedule, but as just noted this is not
representative. Instead, we recommend using offline traces and dynamic
re-scheduling as explained in \ref sec_drcachesim_sched_dynamic using the
`-core_serial` parameter. Here is an example:
re-scheduling in core-sharded mode as explained in \ref sec_drcachesim_sched_dynamic
using the
`-core_serial` parameter. In offline mode, `-core_serial` is the default for
these simulators unless additional tools that do not prefer core-sharded operation
are run at the same time. It is best to explicitly request `-core_serial` as
in this example:

\code
$ bin64/drrun -t drmemtrace -offline -- ~/test/pi_estimator 8 20
@@ -1471,6 +1475,8 @@ stored traces in core-sharded format: essentially switching to hardware-thread-o
traces. This is done using the \ref sec_tool_record_filter tool in `-core_sharded` mode.
The #dynamorio::drmemtrace::TRACE_MARKER_TYPE_CPU_ID markers are not modified by the
dynamic scheduler, and should be ignored in a newly created core-sharded trace.
When analyzing core-sharded-on-disk traces, be sure to pass `-no_core_sharded` when
using core-sharded-preferring tools to avoid the trace being re-scheduled yet again.

Traces also include markers indicating disruptions in user mode control
flow such as signal handler entry and exit.
@@ -1510,7 +1516,9 @@ the framework controls the iteration), to request the next trace
record for each output on its own. This scheduling is also available to any analysis tool
when the input traces are sharded by core (see the `-core_sharded` and `-core_serial`
and various `-sched_*` option documentation under \ref sec_drcachesim_ops as well as
core-sharded notes when \ref sec_drcachesim_newtool).
core-sharded notes when \ref sec_drcachesim_newtool), and in fact is the
default when all tools prefer core-sharded operation via
#dynamorio::drmemtrace::analysis_tool_t::preferred_shard_type().

********************
\section sec_drcachesim_as_traced As-Traced Schedule Limitations
2 changes: 1 addition & 1 deletion clients/drcachesim/scheduler/scheduler.cpp
@@ -3246,7 +3246,7 @@ scheduler_tmpl_t<RecordType, ReaderType>::pick_next_input(output_ordinal_t outpu
VDO(this, 1, {
static int global_heartbeat;
// We are ok with races as the cadence is approximate.
if (++global_heartbeat % 10000 == 0) {
if (++global_heartbeat % 50000 == 0) {
print_queue_stats();
}
});
4 changes: 2 additions & 2 deletions clients/drcachesim/simulator/cache_simulator.cpp
@@ -632,8 +632,8 @@ cache_simulator_t::print_results()
std::cerr << "Cache simulation results:\n";
// Print core and associated L1 cache stats first.
for (unsigned int i = 0; i < knobs_.num_cores; i++) {
print_core(i);
if (shard_type_ == SHARD_BY_CORE || thread_ever_counts_[i] > 0) {
bool non_empty = print_core(i);
if (non_empty) {
if (l1_icaches_[i] != l1_dcaches_[i]) {
std::cerr << " " << l1_icaches_[i]->get_name() << " ("
<< l1_icaches_[i]->get_description() << ") stats:" << std::endl;
6 changes: 4 additions & 2 deletions clients/drcachesim/simulator/simulator.cpp
@@ -311,18 +311,19 @@ simulator_t::handle_thread_exit(memref_tid_t tid)
thread2core_.erase(tid);
}

void
bool
simulator_t::print_core(int core) const
{
if (!knob_cpu_scheduling_ && shard_type_ == SHARD_BY_THREAD) {
std::cerr << "Core #" << core << " (" << thread_ever_counts_[core]
<< " thread(s))" << std::endl;
return thread_ever_counts_[core] > 0;
} else {
std::cerr << "Core #" << core;
if (shard_type_ == SHARD_BY_THREAD && cpu_counts_[core] == 0) {
// We keep the "(s)" mainly to simplify test templates.
std::cerr << " (0 traced CPU(s))" << std::endl;
return;
return false;
}
std::cerr << " (";
if (shard_type_ == SHARD_BY_THREAD) // Always 1:1 for SHARD_BY_CORE.
@@ -338,6 +339,7 @@ simulator_t::print_core(int core) const
}
}
std::cerr << ")" << std::endl;
return need_comma;
}
}

10 changes: 9 additions & 1 deletion clients/drcachesim/simulator/simulator.h
@@ -69,6 +69,13 @@ class simulator_t : public analysis_tool_t {
std::string
initialize_shard_type(shard_type_t shard_type) override;

shard_type_t
preferred_shard_type() override
{
// We prefer a dynamic schedule with more realistic thread interleavings.
return SHARD_BY_CORE;
}

bool
process_memref(const memref_t &memref) override;

@@ -83,7 +90,8 @@ class simulator_t : public analysis_tool_t {
double warmup_fraction, uint64_t sim_refs, bool cpu_scheduling,
bool use_physical, unsigned int verbose);

void
// Returns whether the core was ever non-empty.
bool
print_core(int core) const;

int
8 changes: 4 additions & 4 deletions clients/drcachesim/tests/offline-burst_client.templatex
@@ -23,7 +23,7 @@ DynamoRIO statistics:
.*
all done
Cache simulation results:
Core #0 \(1 thread\(s\)\)
Core #0 \(traced CPU\(s\): #0\)
L1I0 .* stats:
Hits: *[0-9,\.]*
Misses: *[0-9,\.]*
@@ -36,9 +36,9 @@ Core #0 \(1 thread\(s\)\)
Compulsory misses: *[0-9,\.]*
Invalidations: *0
.* Miss rate: [0-3][,\.]..%
Core #1 \(0 thread\(s\)\)
Core #2 \(0 thread\(s\)\)
Core #3 \(0 thread\(s\)\)
Core #1 \(traced CPU\(s\): \)
Core #2 \(traced CPU\(s\): \)
Core #3 \(traced CPU\(s\): \)
LL .* stats:
Hits: *[0-9,\.]*
Misses: *[0-9,\.]*
8 changes: 4 additions & 4 deletions clients/drcachesim/tests/offline-burst_maps.templatex
@@ -11,7 +11,7 @@ pre-DR start
pre-DR detach
all done
Cache simulation results:
Core #0 \(1 thread\(s\)\)
Core #0 \(traced CPU\(s\): #0\)
L1I0 .* stats:
Hits: *[0-9,\.]*
Misses: *[0-9,\.]*
@@ -24,9 +24,9 @@ Core #0 \(1 thread\(s\)\)
Compulsory misses: *[0-9,\.]*
Invalidations: *0
.* Miss rate: [0-3][,\.]..%
Core #1 \(0 thread\(s\)\)
Core #2 \(0 thread\(s\)\)
Core #3 \(0 thread\(s\)\)
Core #1 \(traced CPU\(s\): \)
Core #2 \(traced CPU\(s\): \)
Core #3 \(traced CPU\(s\): \)
LL .* stats:
Hits: *[0-9,\.]*
Misses: *[0-9,\.]*
8 changes: 4 additions & 4 deletions clients/drcachesim/tests/offline-burst_noreach.templatex
@@ -11,7 +11,7 @@ pre-DR start
pre-DR detach
all done
Cache simulation results:
Core #0 \(1 thread\(s\)\)
Core #0 \(traced CPU\(s\): #0\)
L1I0 .* stats:
Hits: *[0-9,\.]*
Misses: *[0-9,\.]*
@@ -24,9 +24,9 @@ Core #0 \(1 thread\(s\)\)
Compulsory misses: *[0-9,\.]*
Invalidations: *0
.* Miss rate: [0-3][,\.]..%
Core #1 \(0 thread\(s\)\)
Core #2 \(0 thread\(s\)\)
Core #3 \(0 thread\(s\)\)
Core #1 \(traced CPU\(s\): \)
Core #2 \(traced CPU\(s\): \)
Core #3 \(traced CPU\(s\): \)
LL .* stats:
Hits: *[0-9,\.]*
Misses: *[0-9,\.]*
8 changes: 4 additions & 4 deletions clients/drcachesim/tests/offline-burst_replace.templatex
@@ -19,7 +19,7 @@ close file .*
close file .*
all done
Cache simulation results:
Core #0 \(1 thread\(s\)\)
Core #0 \(traced CPU\(s\): #0\)
L1I0 .* stats:
Hits: *[0-9,\.]*
Misses: *[0-9,\.]*
@@ -32,9 +32,9 @@ Core #0 \(1 thread\(s\)\)
Compulsory misses: *[0-9,\.]*
Invalidations: *0
.* Miss rate: [0-3][,\.]..%
Core #1 \(0 thread\(s\)\)
Core #2 \(0 thread\(s\)\)
Core #3 \(0 thread\(s\)\)
Core #1 \(traced CPU\(s\): \)
Core #2 \(traced CPU\(s\): \)
Core #3 \(traced CPU\(s\): \)
LL .* stats:
Hits: *[0-9,\.]*
Misses: *[0-9,\.]*
8 changes: 4 additions & 4 deletions clients/drcachesim/tests/offline-burst_static.templatex
@@ -20,7 +20,7 @@ DynamoRIO statistics:
.*
all done
Cache simulation results:
Core #0 \(1 thread\(s\)\)
Core #0 \(traced CPU\(s\): #0\)
L1I0 .* stats:
Hits: *[0-9,\.]*
Misses: *[0-9,\.]*
@@ -33,9 +33,9 @@ Core #0 \(1 thread\(s\)\)
Compulsory misses: *[0-9,\.]*
Invalidations: *0
.* Miss rate: [0-3][,\.]..%
Core #1 \(0 thread\(s\)\)
Core #2 \(0 thread\(s\)\)
Core #3 \(0 thread\(s\)\)
Core #1 \(traced CPU\(s\): \)
Core #2 \(traced CPU\(s\): \)
Core #3 \(traced CPU\(s\): \)
LL .* stats:
Hits: *[0-9,\.]*
Misses: *[0-9,\.]*
@@ -1,6 +1,6 @@
Hello, world!
Cache simulation results:
Core #0 \(1 thread\(s\)\)
Core #0 \(traced CPU\(s\): #0\)
L1I0 .* stats:
Hits: *[0-9,\.]*
Misses: *[0-9,\.]*
@@ -12,9 +12,9 @@ Core #0 \(1 thread\(s\)\)
Misses: 0
Compulsory misses: *[0-9,\.]*
Invalidations: 0
Core #1 \(0 thread\(s\)\)
Core #2 \(0 thread\(s\)\)
Core #3 \(0 thread\(s\)\)
Core #1 \(traced CPU\(s\): \)
Core #2 \(traced CPU\(s\): \)
Core #3 \(traced CPU\(s\): \)
LL .* stats:
Hits: *[0-9,\.]*
Misses: *[0-9,\.]*