Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add scoped NVTX ranges for improved profiling #827

Merged
merged 15 commits into from
Jun 28, 2023

Conversation

esseivaju
Copy link
Contributor

Use the NVTX library instead of cuda profiler. For now this implements the bare minimum functionalities to be useful. We'd probably want to use registered strings and possibly allow to specify a payload / category?

@esseivaju esseivaju added core Software engineering infrastructure minor Minor internal changes or fixes labels Jun 26, 2023
@esseivaju esseivaju requested review from sethrj and amandalund June 26, 2023 16:49
src/corecel/sys/ScopedDeviceProfiling.hh Outdated Show resolved Hide resolved
src/corecel/sys/ScopedDeviceProfiling.cuda.cc Outdated Show resolved Hide resolved
src/corecel/sys/ScopedDeviceProfiling.cuda.cc Outdated Show resolved Hide resolved
src/corecel/sys/ScopedDeviceProfiling.cuda.cc Outdated Show resolved Hide resolved
src/corecel/sys/ScopedDeviceProfiling.cuda.cc Outdated Show resolved Hide resolved
Comment on lines 69 to 83
auto message_handle = message_registry().find(message);
// recheck that nobody inserted the same message as we had to release the
// shared lock not strictly necessary as insert will do nothing if the key
// already exists
if (message_handle == message_registry().end())
{
auto handle
= nvtxDomainRegisterStringA(domain_handle(), message.c_str());
message_registry().insert({message, handle});
return handle;
}
else
{
return message_handle->second;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The solution is to just insert with a nullptr for the handle, and if insertion succededed to register it and save in the returned iterator.

Suggested change
auto message_handle = message_registry().find(message);
// recheck that nobody inserted the same message as we had to release the
// shared lock not strictly necessary as insert will do nothing if the key
// already exists
if (message_handle == message_registry().end())
{
auto handle
= nvtxDomainRegisterStringA(domain_handle(), message.c_str());
message_registry().insert({message, handle});
return handle;
}
else
{
return message_handle->second;
}
auto [iter, inserted] = message_registry().insert({message, nullptr});
if (inserted)
{
iter->second
= nvtxDomainRegisterStringA(domain_handle(), message.c_str());
}
CELER_ENSURE(iter->second);
return iter->second;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sethrj Actually, we can't really CELER_ENSURE that iter->second is non-null. If we're not running the application through a tool that can handle nvtx, i.e. ncu with --nvtx opthon, then all API calls will return 0 (ref).

Also we can't really make any assumption on what values are returned. I tried having multiple nested ranges and printing each value returned by nvtxDomainRegisterStringA. The first call returns 0, the second call returns 1 so they're not really pointers as you'd expect (relevant doc).

TLDR; just don't call CELER_ENSURE and trust the implementation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh interesting. I guess "handle" doesn't mean "pointer" in this case. (Maybe replace nullptr with {}?) That's really good to know. Could you add a note in the implementation that the API calls may fail if the profiler is disabled?

@sethrj sethrj added enhancement New feature or request and removed minor Minor internal changes or fixes labels Jun 27, 2023
@sethrj sethrj changed the title Basic implementation of ScopedDeviceProfiling using nvtx as backend Add scoped NVTX ranges for improved profiling Jun 27, 2023
src/corecel/CMakeLists.txt Outdated Show resolved Hide resolved
// Deactivate profiling
~ScopedProfiling();
// RAII semantics
void* operator new(std::size_t count) = delete;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never seen this pattern before... is it necessary? Do we need to add it to all our RAII classes? For comparison, scoped_lock deletes the assignment operator but doesn't delete new.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not technically needed but per the name of the class, dynamically allocating a new instance is not "scoped profiling", so does it make sense to allow it? The C++ API of nvtx does that for their implementation of the scoped_range.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think delete and such should be to prevent accidental misuses, since it's not possible for us to prevent all intentional misuse.

Besides, if wanted to (for example) put it in a unique_ptr or shared_ptr to change/share the scope at runtime, I think that's a reasonable use case. A dumb user could new it and never release it, and then it just becomes an open range? Which seems like the expected behavior.

Copy link
Contributor Author

@esseivaju esseivaju Jun 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I agree, but for this class and the stack semantics of nested ranges I'm not sure it would be legal to outlive the scope. Suppose we have Scope1 active from a caller function and we instantiate Scope2 and have it managed by a unique_ptr to outlive Scope1. Then, the Scope1 dtor would, I assume, end the range from the nested scope.

Also those range are thread_local so passing the ownership to another thread shouldn't be done which could easily happen with smart pointers or heap allocation.

src/corecel/sys/ScopedProfiling.hh Show resolved Hide resolved
src/corecel/sys/ScopedProfiling.nvtx.cc Outdated Show resolved Hide resolved
src/corecel/sys/ScopedProfiling.hh Show resolved Hide resolved
Comment on lines 69 to 83
auto message_handle = message_registry().find(message);
// recheck that nobody inserted the same message as we had to release the
// shared lock not strictly necessary as insert will do nothing if the key
// already exists
if (message_handle == message_registry().end())
{
auto handle
= nvtxDomainRegisterStringA(domain_handle(), message.c_str());
message_registry().insert({message, handle});
return handle;
}
else
{
return message_handle->second;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh interesting. I guess "handle" doesn't mean "pointer" in this case. (Maybe replace nullptr with {}?) That's really good to know. Could you add a note in the implementation that the API calls may fail if the profiler is disabled?

struct ScopedProfilingInput
{
std::string name; //!< Name of the range
uint32_t color{0xFF00FF00}; //!< ARGB
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requires #include <cstdint> and since it's unsigned:

Suggested change
uint32_t color{0xFF00FF00}; //!< ARGB
uint32_t color{0xFF00FF00u}; //!< ARGB

Comment on lines 59 to 65
// Can't outlive the scope it was created in to correctly match each push /
// pop. Must strictly follows RAII semantics (automatic storage duration)
void* operator new(std::size_t) = delete;
ScopedProfiling(ScopedProfiling const&) = delete;
ScopedProfiling& operator=(ScopedProfiling const&) = delete;
ScopedProfiling(ScopedProfiling&&) = delete;
ScopedProfiling& operator=(ScopedProfiling&&) = delete;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the operator new specialization, and the && instances are automatically deleted if & is deleted so you can omit those.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see allowing operator new even though using it is asking for trouble.

Regarding the move operations, should we be explicit and follow C.21?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. It looks like ScopedMem, Stream, and ScopedSignalHandler all need that as well, and a few "scoped" classes are entirely missing the deleted copy/assign:

  • src/celeritas/ext/ScopedGeantExceptionHandler.hh
  • src/celeritas/ext/ScopedGeantLogger.hh
  • src/celeritas/ext/ScopedRootErrorHandler.hh
  • src/corecel/io/ScopedStreamFormat.hh
  • src/corecel/io/ScopedStreamRedirect.hh
  • src/corecel/io/ScopedTimeAndRedirect.hh
  • src/corecel/io/ScopedTimeLog.hh
  • src/corecel/sys/ScopedMpiInit.hh

Should those updates be a separate PR too? If you haven't pushed the restoration of ..&&) = delete I think we should do them all in a separate best-practices merge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, we can do that in a separate PR.

Copy link
Member

@sethrj sethrj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work Julien. Do you want to add these to the individual actions/kernel launches in a separate PR?

@esseivaju
Copy link
Contributor Author

Yeah, I can do that in a separate PR though there is one thing I don't like; If nvtx isn't enabled, you still end up paying a small cost calling message_handle_for and locking mutexes so sprinkling the codebase with ScopedDevice has a cost...

@sethrj
Copy link
Member

sethrj commented Jun 27, 2023

Maybe to avoid the cost, we use the same trick as ScopedSignalHandler and use a global environment variable to disable it?

@esseivaju
Copy link
Contributor Author

yeah, that's sounds like a good solution. I'd say in this case we want to explicitly enable profiling with CELER_ENABLE_PROFILING

@sethrj sethrj merged commit 42ceb90 into celeritas-project:develop Jun 28, 2023
@esseivaju esseivaju deleted the nvtx-range branch October 26, 2023 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Software engineering infrastructure enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants