OOM errors in FastTree #6175
Comments
I am going to have to look more into this, as FastTree doesn't require all the data in memory; it should be streaming the data in as needed. I wonder, though, if AutoML is trying to cache things. If it's not, then there is investigation work to be done here.
@michaelgsharp Our code converts the data with .ToDataFrame(-1) after reading the IDV file. I believe that loads everything into RAM. However, these errors happen later, except maybe the last one. I don't recall now why the conversion to DataFrame was done; it was probably for some data manipulation, and I need to review whether it is still needed. The difference between runs could be that the code selects a random sampling key from 3 options and a random seed for MLContext, but I have not verified this yet. That could explain why I now have 2 computers successfully training: a 64 GB laptop and a 128 GB desktop. One of these columns is used as the sampling key; the other columns are dropped from the training data.
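For context, a minimal sketch of the two loading paths discussed here (the file name is hypothetical): LoadFromBinary returns a lazy IDataView that trainers can stream, while ToDataFrame(-1) materializes every row in managed memory, which is where the RAM pressure on a ~100 GB file would come from.

```csharp
using Microsoft.Data.Analysis;
using Microsoft.ML;

var mlContext = new MLContext(seed: 0);

// Lazy view over the IDV file; rows are read on demand as the trainer streams them.
IDataView data = mlContext.Data.LoadFromBinary("train.idv"); // hypothetical path

// Materializes the whole dataset into a DataFrame in RAM (-1 = no row limit).
DataFrame frame = data.ToDataFrame(-1);
```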
I think FastTree needs lots of virtual memory in Model Builder too (e.g. dotnet/machinelearning-modelbuilder#1875). If I had less than 500 GB of virtual memory, the ServiceHub in Model Builder crashed sooner or later on some datasets. If RAM requirements could be eased, it might help Model Builder too; there I could at least work around it by increasing virtual memory.
So in Model Builder, without using the DataFrame, you still had the OOM issue? Do you remember what the parameters were for that model? Or for the model you are having issues with now? It's possible that AutoML is being too aggressive with its trees. It's also possible something isn't working right in FastTree itself.
Sorry, it was a bit confusing to mention Model Builder here. I just mentioned it because it consumes huge amounts of virtual memory. In the case of Model Builder, ServiceHub crashes when it runs out of memory and I can no longer extract the training parameters (the zip file still exists in the temp folder, but I do not know how to get the training parameters from it). I think this is also the first time I have had OOM issues with the C# interface of Microsoft.ML.FastTree (except the data loading issue in the other topic). It seems the issue is somehow related to the sampling key and/or the MLContext seed. I have since been able to run the AutoML experiment without a sampling key, using the same training data and an additional validation file.
Just an update from AutoML 2.0 (daily build, which does not yet include the DiskTranspose merge from today; I will try that once I understand how to use it). I just realized I am using a SamplingKey again, as in my previous comment, and will update soon on whether it works without it. The dataset is read from a binary IDV file, and I am not doing .ToDataFrame(-1) this time. The pipeline is: Unknown=>ReplaceMissingValues=>ConvertType=>OneHotEncoding=>Concatenate=>FastTreeBinary
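A rough sketch of what that pipeline chain might look like in code; the column names and the step labelled "Unknown" are placeholders, since the actual schema is not given in this thread.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext(seed: 0);
IDataView trainData = mlContext.Data.LoadFromBinary("train.idv"); // hypothetical path

// Placeholder column names; only the order of transforms is taken from the comment above.
var pipeline = mlContext.Transforms.ReplaceMissingValues("NumericFeature")
    .Append(mlContext.Transforms.Conversion.ConvertType("IntAsSingle", "IntFeature", DataKind.Single))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("CategoryEncoded", "CategoryFeature"))
    .Append(mlContext.Transforms.Concatenate("Features", "NumericFeature", "IntAsSingle", "CategoryEncoded"))
    .Append(mlContext.BinaryClassification.Trainers.FastTree(labelColumnName: "Label"));

var model = pipeline.Fit(trainData);
```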
Using a sampling key did not have any impact. The model it tries to train first seems to be small: 4 trees x 4 leaves. I just wonder why these settings appear in four parameters (e6, e7, e8) of the TrialSettings object; I would expect the FastTree parameters to be in only one place. The number of rows is high; the binary IDV file is ~150 GB.
It would appear PR #6316 has fixed the OOM errors for FastTree. Thanks @LittleLittleCloud for the good work! This is very helpful. Questions for later (documentation etc.):
**Code to enable DiskTranspose**
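As a starting point, a minimal sketch of what enabling it might look like through the trainer options, assuming the DiskTranspose flag on FastTreeBinaryTrainer.Options is the switch that the PR wires up (how to pass this through AutoML is a separate question):

```csharp
using Microsoft.ML;
using Microsoft.ML.Trainers.FastTree;

var mlContext = new MLContext(seed: 0);

// Sketch only: spill the internal feature transpose to disk instead of keeping it in RAM.
var trainer = mlContext.BinaryClassification.Trainers.FastTree(
    new FastTreeBinaryTrainer.Options
    {
        LabelColumnName = "Label",
        FeatureColumnName = "Features",
        DiskTranspose = true
    });
```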
**Commentary** I am now experimenting with a bigger dataset than before, 0.5 TB to 2 TB in total from multiple CSV files. It is probably a bit too big for AutoML, because one trial may take 1-2 days. Adding more CPU cores will not help much because the most time-consuming parts do not consume much CPU. Currently, I feel like I would want to limit the dataset size to at most 50-150 GB. So far I can only guess, but I think making the dataset much bigger might not make a big difference for the current non-NN algorithms. Anyway, more is more :)
I just got an OOM using a setup similar to the above, after around 150 minutes of training on a 26 GB dataset.
@tcaivano @torronen PR #6714 might be helpful for training on super-large datasets. The gist is that more data might not result in a better metric. So maybe another solution for mitigating the OOM error is to start with a small portion of the dataset and slowly increase its scale until it hits the turning point or runs out of memory?
@LittleLittleCloud Nice, thanks. Do you know if #6710 is also relevant? I agree that subsampling is the better approach most of the time. I just wonder if there should be some tools for making sure the subsample is representative, but that might depend on the case and require analysis outside the scope of Microsoft.ML. It is just that finding the "OOM error point" is not a nice experience: creating, moving, converting and starting experiments with big datasets is time-consuming, and then finally hitting OOM is very discouraging. So I think fixing the root cause should still be considered, even if it might not be the optimal approach (if it is not already covered in the PRs). I like your suggestion. Would it even be feasible to make a "subsampling tuner"? It might start with 1% of the full data, find optimal parameters, then move to 5%, 10%, 30%, and finally 100%. At each subsample size it could start the search around the optimal values from the previous step (a rough sketch of this loop follows below). Approach 1: a fixed time budget for each step, which would ensure there is time left for the 100% sample size.
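A rough sketch of that loop, assuming ShuffleRows/TakeRows are used to draw each subsample and using a placeholder for the actual AutoML search at each step (the row count and file name are made up):

```csharp
using Microsoft.ML;

var mlContext = new MLContext(seed: 0);
IDataView fullData = mlContext.Data.LoadFromBinary("train.idv"); // hypothetical path
long totalRows = 100_000_000;                                    // assumed row count of the full set

double[] fractions = { 0.01, 0.05, 0.10, 0.30, 1.00 };
foreach (double fraction in fractions)
{
    // Draw a random subsample of the requested size for this step.
    IDataView shuffled = mlContext.Data.ShuffleRows(fullData, seed: 0);
    IDataView subsample = mlContext.Data.TakeRows(shuffled, (long)(totalRows * fraction));

    // Placeholder for the real work: run the search on this subsample with a fixed
    // time budget, warm-started from the best parameters of the previous step.
    RunSearch(subsample);
}

static void RunSearch(IDataView data)
{
    // Hypothetical hook for the AutoML experiment / hyperparameter search.
}
```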
I certainly agree with the analysis that we should probably be doing sampling most of the time; in fact, the research that I'm doing is an examination of that very claim. Regardless, it would be really useful for this to be functional in the future. |
@tcaivano That's very cool. If you have public results, and if you can, please share! I have a feeling it may also improve results in some cases, such as sensor data: the weather may stay the same for extended periods, and feeding in 100x "sunny day data" may (probably) lead to overfitting. My assumption is that the data should include reasonable amounts of data from each kind of weather, but at the same time the edge cases should not be lost, for example typhoon/hurricane data, or the transition from rain to sunny weather. But how do you ensure the subsample includes such data?
System Information (please complete the following information):
Describe the bug
Out-of-memory errors in FastTree. There is still virtual (paging) memory available, but RAM is full. Maybe there is something that could be done to use virtual memory more effectively? Strangely, one of the 128 GB Ryzen machines is running FastTree on this dataset, while 4 similar machines failed with various OOM errors, so I am able to run the training, just a bit slower.
Dataset: 112 GB IDV file, 369 GB CSV file
RAM: 128 GB
It is one file, with a sampling key.
The working machine has about 145 GB of virtual memory, system managed. The others have 500-1000 GB fixed-size paging files configured in Windows advanced settings.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
FastTree might be able to use the paging file.
Or, maybe, we could optionally stop data loading near the OOM point and then train on whatever was loaded.
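Nothing in this thread suggests ML.NET does this today; as a very rough sketch of the idea, one could bound the number of rows taken from the file by an assumed per-row cost and the memory reported by the GC (both figures here are made up):

```csharp
using System;
using Microsoft.ML;

var mlContext = new MLContext(seed: 0);
IDataView fullData = mlContext.Data.LoadFromBinary("train.idv"); // hypothetical path

// Assumptions purely for illustration: average bytes per materialized row and
// the share of available memory we are willing to spend on cached training data.
const long assumedBytesPerRow = 4_000;
long budget = (long)(GC.GetGCMemoryInfo().TotalAvailableMemoryBytes * 0.5);
long maxRows = budget / assumedBytesPerRow;

// Keep only as many rows as fit in the budget, then cache them so multiple
// passes over the data do not re-read the file.
IDataView bounded = mlContext.Data.TakeRows(fullData, maxRows);
IDataView cached = mlContext.Data.Cache(bounded);
```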
Additional context
1 models were returned after 3516.55 seconds
The error below translates to something like "Too few virtual address resources to complete the operation"; sorry, I do not have the error message in English.