-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor SparsePageSource, delete cache files after use #5321
Conversation
@@ -300,77 +327,118 @@ class SparsePageSource : public DataSource<T> { | |||
writer.PushWrite(std::move(page)); | |||
} | |||
std::unique_ptr<dmlc::Stream> fo( | |||
dmlc::Stream::Create(cinfo.name_info.c_str(), "w")); | |||
dmlc::Stream::Create(cache_info_.name_info.c_str(), "w")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a check to make sure this file doesn't exist? There is a FileExists
function in tests/cpp/helpers.h
we can use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ThreadedIterator
is not thread safe, we should be very careful about stateful operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added fatal errors if cache exists or if multiple threads try to use the prefetcher. Please take a look at my test for threads. It's not technically 100% guaranteed to trigger the exception, we can remove the test if we get false negatives in CI.
This refactoring separates out many of the components used in the external memory dmatrix. In particular the prefetching and management of pages is extracted into
ExternalMemoryPrefetcher<T>
and different external memory page types all have their own constructors now.Part of the reason for doing this is to manage the lifetime of the cache files and remove them when the DMatrix gets destructed. Discussion about this in #4237. Leaving behind cache files leads to many unpredictable and unwanted behaviours, for example if I train the first time using 10% of my data, then train a second time using 100% but do not change the cache prefix, the 10% data will get loaded and used, completely ignoring the training input from the second run.