
[Vulkan][Refactor] Move ownership of per-CPU-thread objects to VulkanDeviceAPI #8196

Merged: 6 commits merged into apache:main from Lunderberg:vulkan_refactor_part2 on Jun 11, 2021

Conversation

@Lunderberg (Contributor)

Continuing the refactoring from #8188. The goal of this PR is to establish clear ownership of Vulkan resources. Previously, resources were owned by either VulkanDeviceAPI or VulkanThreadEntry, depending on whether a resource was per-CPU-thread. This PR introduces a ThreadMap<T> class that behaves similarly to thread_local but can be used for non-static data members, then uses ThreadMap<T> to move ownership of all Vulkan resources under the VulkanDeviceAPI (a rough sketch of the ThreadMap idea follows the list below). Individual steps are in separate commits for readability.

  • Moved ownership of per-thread active device id to VulkanDeviceAPI

    • Previously, the active device was owned by VulkanThreadEntry, so lookups to multiple global variables were required. Now, everything is accessed through the VulkanDeviceAPI.

    • Removed VulkanThreadEntry, as all functionality has been moved to either VulkanDevice or VulkanDeviceAPI.

  • Moved ownership of per-thread workspace pool to VulkanDeviceAPI

    • Previously, the WorkspacePool was owned by VulkanThreadEntry and required a lookup from VulkanDeviceAPI::AllocWorkspace. As a result, non-global VulkanDeviceAPI instances would interact with each other.
  • Move ownership of per-thread uniform buffer to VulkanDevice

    • Previously, VulkanUniformBuffer was owned by VulkanThreadEntry, so any use required looking up both the thread entry and the device. Now, thread-specific lookup is handled in the VulkanDevice class.
  • Move the VulkanStagingBuffer to be owned by the VulkanDevice

    • Previously, it was owned by VulkanThreadEntry, so any use required looking up both the thread entry and the device. Now, thread-specific lookup is handled in the VulkanDevice class.
  • Pulled VulkanBuffer allocation/deallocation into constructor/destructor.

    • VulkanBuffer owns the VkBuffer and VkDeviceMemory that it allocates, and deallocates on destruction.

    • VulkanHostVisibleBuffer owns a VulkanBuffer, and additionally calls vkUnmapMemory on destruction.

  • Moved VulkanStream ownership from VulkanThreadEntry to VulkanDevice

    • Implemented ThreadMap, a container for per-thread objects. Unlike dmlc::ThreadLocalStore, ThreadMap is intended for use as a non-static thread-specific lookup.

    • Added ThreadMap as a member to VulkanDevice, updated all uses.
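
For illustration, here is a minimal sketch of what such a ThreadMap-style container might look like. The class name ThreadMapSketch, the GetOrCreate method, and the shared-mutex locking are assumptions made for the sketch, not the exact API added in this PR.

```cpp
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <unordered_map>

// Minimal sketch of the ThreadMap idea: each calling thread gets its own
// lazily-constructed object, but the container itself is an ordinary data
// member, so ownership follows whatever object holds the ThreadMap.
template <typename T>
class ThreadMapSketch {
 public:
  // Return the calling thread's object, constructing it on first use.
  template <typename... Args>
  T& GetOrCreate(Args&&... args) {
    const std::thread::id tid = std::this_thread::get_id();
    {
      // Fast path: the object already exists, so a shared (non-exclusive)
      // lock suffices and concurrent readers do not block each other.
      std::shared_lock<std::shared_mutex> lock(mutex_);
      auto it = map_.find(tid);
      if (it != map_.end()) {
        return *it->second;
      }
    }
    // Slow path: first access from this thread, take the exclusive lock.
    std::unique_lock<std::shared_mutex> lock(mutex_);
    std::unique_ptr<T>& ptr = map_[tid];
    if (!ptr) {
      ptr = std::make_unique<T>(std::forward<Args>(args)...);
    }
    return *ptr;
  }

 private:
  std::shared_mutex mutex_;
  std::unordered_map<std::thread::id, std::unique_ptr<T>> map_;
};

// Hypothetical usage: a VulkanDevice-like class can then own its per-thread
// streams directly, instead of reaching into a global thread_local table.
//   class VulkanDevice {
//     ...
//     ThreadMapSketch<VulkanStream> streams_;
//   };
```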

@Lunderberg (Contributor Author)

Potential reviewers: @masahi @tmoreau89

@tqchen (Member) commented Jun 4, 2021

Thanks @Lunderberg, it would still be useful to think about the overhead. While thread_local is a different way of organizing things, a direct TLS lookup has lower overhead than a lock-based cache lookup. Assuming we only have the singleton case, it might still be desirable to use TLS when possible (I understand that might restrict us to singleton use cases).

Another minor thing: we usually keep non-public-facing APIs in src/ so they are not visible to downstream dependencies. This helps reduce possible future API compatibility problems.

@Lunderberg (Contributor Author)

Thank you @tqchen, and that makes sense regarding the potential overhead issues. I had been assuming that the main performance cost would be the std::unordered_map lookup, and I'd expect that lookup to be roughly equivalent in either design: the local map requires a lookup by std::thread::id, while the TLS approach requires a lookup by device_id. Having two threads block each other should be rare, since the lock is acquired non-exclusively whenever the thread's object already exists, but you are right that I hadn't considered the cost of acquiring that shared lock in the first place.
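
For comparison, here is a minimal sketch of the direct-TLS pattern being weighed against the shared-lock map. It is illustrative only; DummyStream and the per-device vector are stand-ins, not TVM's actual VulkanThreadEntry.

```cpp
#include <memory>
#include <vector>

// Stand-in for a per-thread, per-device resource such as a stream.
struct DummyStream {};

// Direct thread_local lookup: no lock is needed, and the per-device lookup
// is just an index into a table that the calling thread owns.
inline DummyStream& StreamForDevice(int device_id) {
  thread_local std::vector<std::unique_ptr<DummyStream>> streams;
  if (device_id >= static_cast<int>(streams.size())) {
    streams.resize(device_id + 1);
  }
  if (!streams[device_id]) {
    streams[device_id] = std::make_unique<DummyStream>();
  }
  return *streams[device_id];
}
```

The trade-off is that the thread_local table is static storage: the per-thread objects are not owned by any particular VulkanDeviceAPI instance, which is the ownership ambiguity this PR is trying to remove.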

I'll see if I can measure the performance difference, and will also see whether I can refactor to maintain the ownership model of everything deriving from the VulkanDeviceAPI, while keeping the minimal overhead of TLS.

Regarding the non-public APIs, that makes sense. The class isn't Vulkan-specific, so I didn't want it all the way down in src/runtime/vulkan, but I hadn't considered the impact of adding to the public API. I will move thread_map.h (or whatever thread_local variant replaces it) into src/runtime instead.

@tqchen (Member) commented Jun 4, 2021

That sounds good. Thanks @Lunderberg, I agree the hit is likely negligible considering the launch overhead.

@Lunderberg (Contributor Author)

Running some performance tests, it looks like the refactor has very little impact on the overall runtime. The plot below shows the Q1/median/Q3 runtimes for different low-level tasks that need to access the thread-specific resources. The only significant difference is in copying data to the device, which is slightly slower for very small buffer copies.

[Plot: Q1/median/Q3 runtimes, before vs. after the refactor, for low-level tasks that access thread-specific resources]

Benchmarking details: Used pytest-benchmark, mostly with default settings. Number of iterations chosen based on runtime of the first iteration such that each data point is collected in ~1 second. Repeated initialization was allowed to use up to 10 seconds per data point.

@Lunderberg (Contributor Author)

On second thought, it looks like that difference was because pytest-benchmark does not include warmup runs by default. That run just happened to be the first that required initialization, which messed up some of the timing. With warmup runs, the performance is matched before/after the refactor.

[Plot: Q1/median/Q3 runtimes with warmup runs enabled; performance matched before/after the refactor]

@masahi (Member) commented Jun 10, 2021

@Lunderberg @tqchen Is this good to go?

@Lunderberg (Contributor Author)

The performance tests are matched pre/post change for cases that access the thread-specific resources, so I think it's good to go.

@tqchen Any follow-up concerns?

masahi merged commit f906fa8 into apache:main on Jun 11, 2021.
@masahi (Member) commented Jun 11, 2021

Thanks @Lunderberg, surprisingly after this change, the segfault on my APU that was happening on process exit is somehow gone!

@Lunderberg (Contributor Author)

Ooh, that's a nice side-effect to get! I guess that confirms that the segfault was caused by out-of-order destruction, though I'm still not quite sure how that happened, given that we checked the order of construction.

Lunderberg deleted the vulkan_refactor_part2 branch on June 11, 2021 at 14:50.
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request on Jun 17, 2021.
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request on Jun 17, 2021.