Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delete methods for sequences inside managers #113

Closed
wants to merge 4 commits into from
Closed

Delete methods for sequences inside managers #113

wants to merge 4 commits into from

Conversation

unexploredtest
Copy link
Contributor

@unexploredtest unexploredtest commented Jan 16, 2021

This solves #36 partially, the only thing that remains is creating a method that gives the ability delete a given anonymous sequence.

It's my first time making a PR for a C++ project, so it might not be good. I have provided comments for each commit explaining what each does.

@unexploredtest
Copy link
Contributor Author

unexploredtest commented Jan 16, 2021

Why is this PR trying to completely delete and renew single_include/kompute/Kompute.hpp? Just a few lines were added, right?

@axsaucedo
Copy link
Member

Thank you for the contribution @aliPMPAINT - I think I'll need to dive into it a bit further, as we need to confirm all the subsequent resources are being freed (so there are no memory leaks). It's also important to make sure to consider that some asynchronous sequences may be running, so these should either be awaited or an issue should be raised.

In regards to your question, this is most likely due to the line ending, as the output file would have different line endings if built on windows vs linux (I think it would be good to standardise this with post proxessing / regex).

@axsaucedo axsaucedo self-requested a review January 16, 2021 14:37
@axsaucedo axsaucedo added the enhancement New feature or request label Jan 16, 2021
@axsaucedo
Copy link
Member

Ok just had a quick look, it seems there is a failing test:

[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TestSequence
[ RUN      ] TestSequence.SequenceDestructorViaManager
[2021-01-16 14:41:45.218] [debug] [Manager.cpp:144] Kompute Manager creating instance
[2021-01-16 14:41:45.218] [debug] [Manager.cpp:169] Kompute Manager adding debug validation layers
[2021-01-16 14:41:45.881] [debug] [Manager.cpp:202] Kompute Manager Instance Created
[2021-01-16 14:41:45.882] [debug] [Manager.cpp:206] Kompute Manager adding debug callbacks
[2021-01-16 14:41:45.883] [debug] [Manager.cpp:229] Kompute Manager creating Device
[2021-01-16 14:41:45.884] [info] [Manager.cpp:255] Using physical device index 0 found GeForce GTX 1650
[2021-01-16 14:41:45.967] [debug] [Manager.cpp:312] Kompute Manager device created
[2021-01-16 14:41:45.968] [debug] [Manager.cpp:326] Kompute Manager compute queue obtained
[2021-01-16 14:41:45.969] [debug] [Manager.cpp:100] Kompute Manager creating Sequence object
[2021-01-16 14:41:45.969] [debug] [Manager.cpp:119] Kompute Manager createManagedSequence with sequenceName: newSequence and queueIndex: 0
[2021-01-16 14:41:45.970] [debug] [Sequence.cpp:17] Kompute Sequence Constructor with existing device & queue
[2021-01-16 14:41:45.970] [debug] [Sequence.cpp:259] Kompute Sequence creating command pool
[2021-01-16 14:41:45.970] [debug] [Sequence.cpp:275] Kompute Sequence Command Pool Created
[2021-01-16 14:41:45.971] [debug] [Sequence.cpp:281] Kompute Sequence creating command buffer
[2021-01-16 14:41:45.971] [debug] [Sequence.cpp:297] Kompute Sequence Command Buffer Created
[2021-01-16 14:41:45.971] [debug] [Manager.cpp:53] Kompute Manager Destructor started
[2021-01-16 14:41:45.972] [debug] [Manager.cpp:63] Kompute Manager explicitly running destructor for managed sequences
[2021-01-16 14:41:45.972] [info] [Manager.cpp:68] Destroying device
[2021-01-16 14:41:45.991] [debug] [Manager.cpp:71] Kompute Manager Destroyed Device
[2021-01-16 14:41:46.049] [debug] [Manager.cpp:93] Kompute Manager Destroyed Instance
C:\Users\axsau\Programming\vulkan\vulkan-compute\test\TestSequence.cpp(40): error: Value of: sq->isInit()
  Actual: true
Expected: false
[2021-01-16 14:41:46.050] [debug] [Sequence.cpp:28] Kompute Sequence Destructor started
[2021-01-16 14:41:46.051] [info] [Sequence.cpp:221] Freeing CommandBuffer
[2021-01-16 14:41:46.052] [debug] [Sequence.cpp:230] Kompute Sequence Freed CommandBuffer
[2021-01-16 14:41:46.052] [info] [Sequence.cpp:234] Destroying CommandPool
unknown file: error: SEH exception with code 0xc0000005 thrown in the test body.
[  FAILED  ] TestSequence.SequenceDestructorViaManager (845 ms)
[----------] 1 test from TestSequence (845 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (848 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] TestSequence.SequenceDestructorViaManager

 1 FAILED TEST

@unexploredtest
Copy link
Contributor Author

unexploredtest commented Jan 16, 2021

I can guess what made the test fail, I have changed the manager's destructor(the first commit), and as the sequence was created outside of the manager's scope, it didn't destruct it thoroughly.

The reason I did so was because before,

        for (const std::pair<std::string, std::shared_ptr<Sequence>>& sqPair :
             this->mManagedSequences) {
            sqPair.second->freeMemoryDestroyGPUResources();
        }

and this->mManagedSequences.clear(); both tried to free the sequences, and thus they were being deleted twice, like this example(the one provided on colab):

===========================================
[2020-11-18 08:32:22.196] [info] [Sequence.cpp:221] Freeing CommandBuffer
[2020-11-18 08:32:22.196] [info] [Sequence.cpp:234] Destroying CommandPool
[2020-11-18 08:32:22.196] [info] [Sequence.cpp:248] Kompute Sequence clearing operations buffer
[2020-11-18 08:32:22.196] [info] [Sequence.cpp:221] Freeing CommandBuffer
[2020-11-18 08:32:22.196] [info] [Sequence.cpp:234] Destroying CommandPool
[2020-11-18 08:32:22.196] [info] [Sequence.cpp:248] Kompute Sequence clearing operations buffer
[2020-11-18 08:32:22.196] [info] [Sequence.cpp:221] Freeing CommandBuffer
[2020-11-18 08:32:22.196] [info] [Sequence.cpp:234] Destroying CommandPool
[2020-11-18 08:32:22.196] [info] [Sequence.cpp:248] Kompute Sequence clearing operations buffer
[2020-11-18 08:32:22.196] [info] [Sequence.cpp:221] Freeing CommandBuffer
[2020-11-18 08:32:22.196] [info] [Sequence.cpp:234] Destroying CommandPool
[2020-11-18 08:32:22.196] [info] [Sequence.cpp:248] Kompute Sequence clearing operations buffer
[2020-11-18 08:32:22.196] [info] [Sequence.cpp:221] Freeing CommandBuffer
[2020-11-18 08:32:22.196] [info] [Sequence.cpp:234] Destroying CommandPool
[2020-11-18 08:32:22.196] [info] [Sequence.cpp:248] Kompute Sequence clearing operations buffer
[2020-11-18 08:32:22.196] [info] [Sequence.cpp:32] Kompute Sequence destructor called but sequence is not initialized so no need to removing GPU resources.
[2020-11-18 08:32:22.196] [info] [Sequence.cpp:32] Kompute Sequence destructor called but sequence is not initialized so no need to removing GPU resources.
[2020-11-18 08:32:22.196] [info] [Sequence.cpp:32] Kompute Sequence destructor called but sequence is not initialized so no need to removing GPU resources.
[2020-11-18 08:32:22.196] [info] [Sequence.cpp:32] Kompute Sequence destructor called but sequence is not initialized so no need to removing GPU resources.
[2020-11-18 08:32:22.196] [info] [Sequence.cpp:32] Kompute Sequence destructor called but sequence is not initialized so no need to removing GPU resources.
[2020-11-18 08:32:22.196] [info] [Manager.cpp:72] Destroying device

If we observe([2020-11-18 08:32:22.196] [info] [Sequence.cpp:32] Kompute Sequence destructor called but sequence is not initialized so no need to removing GPU resources.), sequences were trying to be freed twice, the first commit solves this issue, but I didn't anticipate the fact that being on different scopes can cause problems.

@axsaucedo
Copy link
Member

Ok I see, I think it would still be important to ensure the destructor is called correctly in the manager to avoid lingering resources, as ultimately the memory in the GPU is managed hierarchically

@unexploredtest
Copy link
Contributor Author

Yeah sure, and sorry for my silly mistake

@axsaucedo
Copy link
Member

No worries, the memory management hierary is not too complex once you get your head around it, but it has certainly gotten me in some long debugging sessions, using memory profilers does help understand what objects / code components may be causing issues, more detail via #15

@alexander-g
Copy link
Contributor

@axsaucedo Bump, this is quite important.
If you are unsure about asynchronous operations, it would already help to make those methods private and to only clean up anonymous sequences in synchronous ops, i.e. in evalOpDefault directly after the sequence is evaluated.

@axsaucedo
Copy link
Member

@alexander-g hmm that is actually a pretty good idea... ok I agree this is quite important, thank you @aliPMPAINT for the initial work, I will pull the branch, and do some deeper testing, as well as expose via python interface to make sure we can merge. Thanks both for driving forward this areas.

@axsaucedo
Copy link
Member

@alexander-g currently looking at this, I now remember why we don't delete the function right after the sequence is executed. The reason why this is by design is because the sequence actually acquires the GPU memory ownership of the Tensors when running the OpCreateTensors. More specifically, it is currently in the memory hierarchy for the OpBase to have a choice whether to free the Tensors it owns or not - here is the line that shows how OpCreateTensor owns the tensors it uses by setting the last value to true
https://github.com/EthicalML/vulkan-kompute/blob/df5477a2d76b920232811e8513579240467ad673/src/OpTensorCreate.cpp#L18

Here you can see the line that specifies the ownership of the tensors themselves:
https://github.com/EthicalML/vulkan-kompute/blob/df5477a2d76b920232811e8513579240467ad673/src/include/kompute/operations/OpBase.hpp#L66

This is actually tied to the discussion I provided in your #130 PR (#130 (comment)) where I basically mention that there may be a better way to think about the OpTensorCreate operation, and the hierarchy of the Tensor memory management overall. This is something that would require a deeper dive, primarily as I need to check if it can make sense for the memory ownership to no longer be Manager -> Sequence -> Op -> Tensor, and update it to remove Tensor as a dependency of the operation. This could require a less trivial refactor, as it's not clear what that hierarchy would be otherwise.

This is teh reason why anonymous Sequences are actually kept. This could actually be change such that anonymous sequences are destroyed after execution, which is what we had at the very beginning but as you can imagine this led to tensor memory being destroyed right after a OpTensorCreate in an anonymous function. But this woudl require specific awareness of users, which would have to always run the OpTensorCreate in a non-anonymous function, and then the rest of the operations in an anonymous function.

Does this make sense?

@axsaucedo
Copy link
Member

For completeness it may be worth sharing another piece of insight to provide the full picture. Namely that when I was initially desigining the tensors, i had an idea where instead of having two tensors - a Staging and a Device tensor - all the logic would be contained as a single Tensor. Namely this single tensor would contain both relevant memories for the staging and host tensors. The reason why this was not pursued at the end is because the idea was to provide further access and granularity into how the memory is made available to the user - this way the user can actually decide to build their own OpTensor operations, which may own different tensors, sometimes perhaps destroying the staging tensor completely to avoid having the memory overhead.

With this in mind, it could be revisited, but there is that disadvantage that for any device tensor, there would be always the overhead of an extra memory component that would be obscured by the user. The advantage of this is that in this case the staging tensor wouldnt' be created directly by the OpCreate operation, which means that the tensors could be owned by the top level manager. For all of these there are various tradeoffs that could be reassessed.

For now, the simplest way to approach this is to enable a flag that alllows for deletion of anonymous sequences as soon as they are executed, which can be used by more advanced users that would know that using the OpCreateTensor would involve memory management, and hence should be executed in a non-anonymous function that should be managed manually (and then deleted explicitly with the deleteNamedSequence.

@axsaucedo
Copy link
Member

I will update #36 with the discussion from this issue as well

@axsaucedo
Copy link
Member

Closing in favour of #113

@axsaucedo axsaucedo closed this Feb 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants