Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add scoped NVTX ranges for improved profiling #827

Merged
merged 15 commits into from
Jun 28, 2023
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
76 changes: 65 additions & 11 deletions src/corecel/sys/ScopedDeviceProfiling.cuda.cc
Original file line number Diff line number Diff line change
Expand Up @@ -5,8 +5,13 @@
//---------------------------------------------------------------------------//
//! \file corecel/sys/ScopedDeviceProfiling.cuda.cc
//---------------------------------------------------------------------------//

#include "ScopedDeviceProfiling.hh"

#include <mutex>
#include <shared_mutex>
#include <unordered_map>

#include "celeritas_config.h"
#include "corecel/device_runtime_api.h"

Expand All @@ -22,20 +27,75 @@ namespace celeritas
{
namespace
{

//---------------------------------------------------------------------------//
/*!
* Global registry for strings used by NVTX.
* This Implemented as a free function instead of a static in
* ScopedDeviceProfiling to hide the NVTX dependency from user of the
* interface.
*/
std::unordered_map<std::string, nvtxStringHandle_t>& message_registry()
{
static std::unordered_map<std::string, nvtxStringHandle_t> registry;
return registry;
}

nvtxDomainHandle_t domain_handle()
{
static nvtxDomainHandle_t domain = nvtxDomainCreateA("celeritas");
return domain;
}

//---------------------------------------------------------------------------//
/*!
* Retrieve the handle for a given message, insert it if it doesn't already
* exists
*/
nvtxStringHandle_t message_handle_for(std::string const& message)
{
static std::shared_mutex mutex;

{
std::shared_lock lock(mutex);
auto message_handle = message_registry().find(message);
if (message_handle != message_registry().end())
{
return message_handle->second;
}
}
// we did not find the handle, insert it
std::unique_lock lock(mutex);
auto message_handle = message_registry().find(message);
// recheck that nobody inserted the same message as we had to release the
// shared lock not strictly necessary as insert will do nothing if the key
// already exists
if (message_handle == message_registry().end())
{
auto handle
= nvtxDomainRegisterStringA(domain_handle(), message.c_str());
message_registry().insert({message, handle});
return handle;
}
else
{
return message_handle->second;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The solution is to just insert with a nullptr for the handle, and if insertion succededed to register it and save in the returned iterator.

Suggested change
auto message_handle = message_registry().find(message);
// recheck that nobody inserted the same message as we had to release the
// shared lock not strictly necessary as insert will do nothing if the key
// already exists
if (message_handle == message_registry().end())
{
auto handle
= nvtxDomainRegisterStringA(domain_handle(), message.c_str());
message_registry().insert({message, handle});
return handle;
}
else
{
return message_handle->second;
}
auto [iter, inserted] = message_registry().insert({message, nullptr});
if (inserted)
{
iter->second
= nvtxDomainRegisterStringA(domain_handle(), message.c_str());
}
CELER_ENSURE(iter->second);
return iter->second;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sethrj Actually, we can't really CELER_ENSURE that iter->second is non-null. If we're not running the application through a tool that can handle nvtx, i.e. ncu with --nvtx opthon, then all API calls will return 0 (ref).

Also we can't really make any assumption on what values are returned. I tried having multiple nested ranges and printing each value returned by nvtxDomainRegisterStringA. The first call returns 0, the second call returns 1 so they're not really pointers as you'd expect (relevant doc).

TLDR; just don't call CELER_ENSURE and trust the implementation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh interesting. I guess "handle" doesn't mean "pointer" in this case. (Maybe replace nullptr with {}?) That's really good to know. Could you add a note in the implementation that the API calls may fail if the profiler is disabled?

}

//---------------------------------------------------------------------------//
/*!
* Create EventAttribute with a specific name
*/
nvtxEventAttributes_t make_attributes(std::string const& name)
{
nvtxEventAttributes_t attributes;
attributes.version = NVTX_VERSION;
attributes.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
attributes.colorType = NVTX_COLOR_ARGB;
attributes.color = 0xff00ff00;
attributes.messageType = NVTX_MESSAGE_TYPE_ASCII;
attributes.message.ascii = name.c_str();
attributes.messageType = NVTX_MESSAGE_TYPE_REGISTERED;
attributes.message.registered = message_handle_for(name);
attributes.payloadType = NVTX_PAYLOAD_TYPE_INT32;
attributes.payload.iValue = 0;
attributes.category = 0;
Expand All @@ -48,11 +108,8 @@ nvtxEventAttributes_t make_attributes(std::string const& name)
*/
ScopedDeviceProfiling::ScopedDeviceProfiling(std::string const& name)
{
if (celeritas::device())
{
nvtxEventAttributes_t attributes_ = make_attributes(name);
nvtxDomainRangePushEx(domain_handle(), &attributes_);
}
nvtxEventAttributes_t attributes_ = make_attributes(name);
nvtxDomainRangePushEx(domain_handle(), &attributes_);
}

//---------------------------------------------------------------------------//
Expand All @@ -61,10 +118,7 @@ ScopedDeviceProfiling::ScopedDeviceProfiling(std::string const& name)
*/
ScopedDeviceProfiling::~ScopedDeviceProfiling()
{
if (celeritas::device())
{
nvtxDomainRangePop(domain_handle());
}
nvtxDomainRangePop(domain_handle());
}

//---------------------------------------------------------------------------//
Expand Down