
OOM errors in FastTree #6175

Closed
torronen opened this issue Apr 27, 2022 · 13 comments
Labels
AutoML.NET (Automating various steps of the machine learning process), P2 (Priority of the issue for triage purpose: needs to be fixed at some point)
Milestone
ML.NET Future

Comments

@torronen
Contributor

torronen commented Apr 27, 2022

System Information (please complete the following information):

  • OS & Version: Windows 11
  • ML.NET Version: ML.NET v1.5.5
  • .NET Version: NET6.0

Describe the bug
Out-of-memory errors on FastTree. There is still virtual (paging) memory available, but RAM is full. Maybe there is something that could be done to use virtual memory more effectively? Strangely, one of the 128 GB Ryzen machines is running FastTree on this dataset, while 4 similar ones failed with various OOM errors, so I am able to run the training, just a bit slower.

Dataset: 112 GB IDV file, 369 GB CSV file
RAM: 128 GB
It is one file, with a sampling key.

The working machine has about 145 GB of system-managed virtual memory. The others have 500-1000 GB fixed-size paging files set in Windows advanced settings.

To Reproduce
Steps to reproduce the behavior:

  1. Create big dataset
  2. Create IDV file (see the sketch below the list)
  3. Run AutoML experiments with IDV file
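For step 2, a minimal sketch of producing an IDV file from a CSV; the ModelInput row class and the file paths are illustrative assumptions, not from this report:

      // Load the CSV lazily, then persist it in ML.NET's binary IDV format.
      MLContext ctx = new MLContext();
      IDataView data = ctx.Data.LoadFromTextFile<ModelInput>(
          "data.csv", separatorChar: ',', hasHeader: true);

      using (FileStream stream = File.Create("data.idv"))
          ctx.Data.SaveAsBinary(data, stream);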

Expected behavior
FastTree might be able to use the paging file. Or, maybe, data loading could optionally stop near the OOM point and training could proceed with whatever was loaded.
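For reference, the FastTree trainer options already expose a disk-based transpose that trades memory for disk I/O (the stack trace below shows the diskTranspose parameter flowing through DataConverter.Create). A hedged sketch of setting it directly; the column names are assumptions:

      // Ask FastTree to transpose the data on disk rather than in memory.
      var options = new FastTreeBinaryTrainer.Options
      {
          DiskTranspose = true,        // spill the feature transpose to disk
          LabelColumnName = "Label",
          FeatureColumnName = "Features"
      };
      var trainer = mlContext.BinaryClassification.Trainers.FastTree(options);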

Additional context

Exception during AutoML iteration: System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Collections.Generic.List`1.set_Capacity(Int32 value)
   at System.Collections.Generic.List`1.AddWithResize(T item)
   at Microsoft.ML.Trainers.FastTree.DataConverter.ValuesList.Add(Int32 index, Double value) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.FastTree\FastTree.cs:line 2387
   at Microsoft.ML.Trainers.FastTree.DataConverter.MemImpl.MakeBoundariesAndCheckLabels(Int64& missingInstances, Int64& totalInstances) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.FastTree\FastTree.cs:line 1865
   at Microsoft.ML.Trainers.FastTree.DataConverter.MemImpl..ctor(RoleMappedData data, IHost host, Double[][] binUpperBounds, Single maxLabel, Boolean dummy, Boolean noFlocks, PredictionKind kind, Int32[] categoricalFeatureIndices, Boolean categoricalSplit) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.FastTree\FastTree.cs:line 1765
   at Microsoft.ML.Trainers.FastTree.DataConverter.MemImpl..ctor(RoleMappedData data, IHost host, Int32 maxBins, Single maxLabel, Boolean noFlocks, Int32 minDocsPerLeaf, PredictionKind kind, IParallelTraining parallelTraining, Int32[] categoricalFeatureIndices, Boolean categoricalSplit) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.FastTree\FastTree.cs:line 1773
   at Microsoft.ML.Trainers.FastTree.DataConverter.Create(RoleMappedData data, IHost host, Int32 maxBins, Single maxLabel, Boolean diskTranspose, Boolean noFlocks, Int32 minDocsPerLeaf, PredictionKind kind, IParallelTraining parallelTraining, Int32[] categoricalFeatureIndices, Boolean categoricalSplit) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.FastTree\FastTree.cs:line 956
   at Microsoft.ML.Trainers.FastTree.ExamplesToFastTreeBins.FindBinsAndReturnDataset(RoleMappedData data, PredictionKind kind, IParallelTraining parallelTraining, Int32[] categoricalFeaturIndices, Boolean categoricalSplit) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.FastTree\FastTree.cs:line 2740
   at Microsoft.ML.Trainers.FastTree.FastTreeTrainerBase`3.ConvertData(RoleMappedData trainData) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.FastTree\FastTree.cs:line 194
   at Microsoft.ML.Trainers.FastTree.FastTreeBinaryTrainer.TrainModelCore(TrainContext context) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.FastTree\FastTreeClassification.cs:line 198
   at Microsoft.ML.Trainers.TrainerEstimatorBase`2.TrainTransformer(IDataView trainSet, IDataView validationSet, IPredictor initPredictor) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.Data\Training\TrainerEstimatorBase.cs:line 157
   at Microsoft.ML.Trainers.TrainerEstimatorBase`2.Fit(IDataView input) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.Data\Training\TrainerEstimatorBase.cs:line 77
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.Data\DataLoadSave\EstimatorChain.cs:line 68
   at Microsoft.ML.AutoML.RunnerUtil.TrainAndScorePipeline[TMetrics](MLContext context, SuggestedPipeline pipeline, IDataView trainData, IDataView validData, String groupId, String labelColumn, IMetricsAgent`1 metricsAgent, ITransformer preprocessorTransform, FileInfo modelFileInfo, DataViewSchema modelInputSchema, IChannel logger) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.AutoML\Experiment\Runners\RunnerUtil.cs:line 29

1 models were returned after 3516.55 seconds

System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Threading.Thread.StartInternal(ThreadHandle t, Int32 stackSize, Int32 priority, Char* pThreadName)
   at System.Threading.Thread.StartCore()
   at Microsoft.ML.Internal.Utilities.Utils.ImmediateBackgroundThreadPool.<QueueAsync>g__Enqueue|5_1(ValueTuple`3 item) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.Core\Utilities\ThreadUtils.cs:line 121
   at Microsoft.ML.Data.DataViewUtils.Splitter.ConsolidateCore(IChannelProvider provider, DataViewRowCursor[] inputs, Object[]& ourPools, IChannel ch) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.Data\Data\DataViewUtils.cs:line 376
   at Microsoft.ML.Data.DataViewUtils.Splitter.Consolidate(IChannelProvider provider, DataViewRowCursor[] inputs, Int32 batchSize, Object[]& ourPools) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.Data\Data\DataViewUtils.cs:line 328
   at Microsoft.ML.Data.DataViewUtils.ConsolidateGeneric(IChannelProvider provider, DataViewRowCursor[] inputs, Int32 batchSize) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.Data\Data\DataViewUtils.cs:line 260
   at Microsoft.ML.Data.DataViewUtils.TryCreateConsolidatingCursor(DataViewRowCursor& curs, IDataView view, IEnumerable`1 columnsNeeded, IHost host, Random rand) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.Data\Data\DataViewUtils.cs:line 116
   at Microsoft.ML.Data.TransformBase.GetRowCursor(IEnumerable`1 columnsNeeded, Random rand) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.Data\Transforms\TransformBase.cs:line 85
   at Microsoft.ML.Transforms.ColumnSelectingTransformer.SelectColumnsDataTransform.GetRowCursor(IEnumerable`1 columnsNeeded, Random rand) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.Data\Transforms\ColumnSelecting.cs:line 689
   at Microsoft.ML.AutoML.DatasetDimensionsUtil.CountRows(IDataView data, UInt64 maxRows) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.AutoML\DatasetDimensions\DatasetDimensionsUtil.cs:line 69
   at Microsoft.ML.AutoML.UserInputValidationUtil.ValidateTrainData(IDataView trainData, ColumnInformation columnInformation) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.AutoML\Utils\UserInputValidationUtil.cs:line 71
   at Microsoft.ML.AutoML.UserInputValidationUtil.ValidateExperimentExecuteArgs(IDataView trainData, ColumnInformation columnInformation, IDataView validationData, TaskKind task) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.AutoML\Utils\UserInputValidationUtil.cs:line 31
   at Microsoft.ML.AutoML.ExperimentBase`2.ExecuteTrainValidate(IDataView trainData, ColumnInformation columnInfo, IDataView validationData, IEstimator`1 preFeaturizer, IProgress`1 progressHandler) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.AutoML\API\ExperimentBase.cs:line 280
   at Microsoft.ML.AutoML.ExperimentBase`2.Execute(IDataView trainData, IDataView validationData, ColumnInformation columnInformation, IEstimator`1 preFeaturizer, IProgress`1 progressHandler) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.AutoML\API\ExperimentBase.cs:line 196
   at Kwork.AI.AutoML.Experiment.RunAutoMLExperiment(MLContext mlContext, ColumnInferenceResults columnInference, String logFile, BinaryClassificationTrainer trainer, Int32 timeToTrain) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src-AutoML-Runner\Kwork.AI.AutoML.Runner\Experiment.cs:line 1632
   at Kwork.AI.AutoML.Experiment.RunExperiment(String dataset, BinaryClassificationTrainer trainer, Int32 timeToTrain, String experimentLogPath, String allSummaryLogPath, String preselectedLabel, Nullable`1 optimizationMetric) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src-AutoML-Runner\Kwork.AI.AutoML.Runner\Experiment.cs:line 1088
Exception of type 'System.OutOfMemoryException' was thrown.

The one below is roughly "Too few virtual address resources to complete the operation"; the message in the stack trace is in Finnish, sorry I do not have the error in English.

System.InvalidOperationException: Exception thrown in reading
 ---> System.IO.IOException: Liian vähän virtuaalisia osoiteresursseja toiminnon suorittamiseen loppuun. : 'C:\temp\data.csv.idv'
   at System.IO.Strategies.OSFileStreamStrategy.Read(Byte[] buffer, Int32 offset, Int32 count)
   at System.IO.Strategies.BufferedFileStreamStrategy.ReadSpan(Span`1 destination, ArraySegment`1 arraySegment)
   at System.IO.FileStream.Read(Byte[] buffer, Int32 offset, Int32 count)
   at Microsoft.ML.Internal.Utilities.Utils.ReadBlock(Stream s, Byte[] buff, Int32 offset, Int32 length) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.Core\Utilities\Stream.cs:line 883
   at Microsoft.ML.Data.IO.BinaryLoader.Cursor.ReadPipe`1.PrepAndSendCompressedBlock(Int64 blockIndex, Int64 blockSequence, Int32 rowCount) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.Data\DataLoadSave\Binary\BinaryLoader.cs:line 1821
   at Microsoft.ML.Data.IO.BinaryLoader.Cursor.ReaderWorker() in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.Data\DataLoadSave\Binary\BinaryLoader.cs:line 1426
   --- End of inner exception stack trace ---
   at Microsoft.ML.Internal.Utilities.ExceptionMarshaller.ThrowIfSet(IExceptionContext ectx) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.Core\Utilities\ThreadUtils.cs:line 240
   at Microsoft.ML.Data.IO.BinaryLoader.Cursor.MoveNextCore() in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.Data\DataLoadSave\Binary\BinaryLoader.cs:line 2001
   at Microsoft.ML.Data.RootCursorBase.MoveNext() in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.Core\Data\RootCursorBase.cs:line 72
   at Microsoft.ML.Transforms.NormalizingTransformer.Train(IHostEnvironment env, IDataView data, ColumnOptionsBase[] columns) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.Data\Transforms\Normalizer.cs:line 570
   at Microsoft.ML.Transforms.NormalizingEstimator.Fit(IDataView input) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.Data\Transforms\Normalizer.cs:line 332
   at Microsoft.ML.DataOperationsCatalog.CreateSplitColumn(IHostEnvironment env, IDataView& data, String samplingKeyColumn, Nullable`1 seed, Boolean fallbackInEnvSeed) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.Data\DataLoadSave\DataOperationsCatalog.cs:line 584
   at Microsoft.ML.DataOperationsCatalog.TrainTestSplit(IDataView data, Double testFraction, String samplingKeyColumnName, Nullable`1 seed) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src\Microsoft.ML.Data\DataLoadSave\DataOperationsCatalog.cs:line 417
   at Kwork.AI.AutoML.Experiment.RunAutoMLExperiment(MLContext mlContext, ColumnInferenceResults columnInference, String logFile, BinaryClassificationTrainer trainer, Int32 timeToTrain) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src-AutoML-Runner\Kwork.AI.AutoML.Runner\Experiment.cs:line 1629
   at Kwork.AI.AutoML.Experiment.RunExperiment(String dataset, BinaryClassificationTrainer trainer, Int32 timeToTrain, String experimentLogPath, String allSummaryLogPath, String preselectedLabel, Nullable`1 optimizationMetric) in Q:\git-kwork-microsoft-ml\Microsoft.ML\src-AutoML-Runner\Kwork.AI.AutoML.Runner\Experiment.cs:line 1088
@ghost added the untriaged (New issue has not been triaged) label on Apr 27, 2022
@michaelgsharp
Member

I am going to have to look more into this, as FastTree doesn't require all the data in memory; it should be streaming the data in as needed. I wonder, though, if AutoML is trying to cache things. If it's not, then there is investigation work to be done here.

@michaelgsharp added the AutoML.NET (Automating various steps of the machine learning process) label on Apr 27, 2022
@torronen
Contributor Author

torronen commented Apr 27, 2022

@michaelgsharp Our code converts the data with .ToDataFrame(-1) after reading the IDV file. I believe that puts everything in RAM. However, these errors happen later, except maybe the last one.

I don't recall now why the conversion to DataFrame was done. It was probably for some data manipulation. I need to review if it is still needed.
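For context, a minimal sketch of the difference (the path is taken from the stack trace above): ToDataFrame(-1) materializes every row into memory as a Microsoft.Data.Analysis DataFrame, while the IDataView by itself streams from disk:

      // Materializes the whole IDV into RAM as a DataFrame:
      IDataView view = mlContext.Data.LoadFromBinary(@"C:\temp\data.csv.idv");
      DataFrame df = view.ToDataFrame(-1); // -1 = no row limit

      // Streaming alternative: hand the lazy IDataView straight to training/AutoML.
      IDataView lazy = mlContext.Data.LoadFromBinary(@"C:\temp\data.csv.idv");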

The difference between runs could be that the code selects a random sampling key from 3 options and a random seed for MLContext, but I have not yet verified this. That could explain why I now have 2 computers successfully training: a 64 GB laptop and a 128 GB desktop.

One of these is used as the sampling key; the other two columns are dropped from the training data (see the sketch below).
Sampling key 1: hour as int Unix timestamp (1000s of keys)
Sampling key 2: week as string "51-2021" (100s of keys)
Sampling key 3: month as string "03-2022" (10s of keys)
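A minimal sketch of a split using one of these keys (the column name "Week" is an illustrative stand-in):

      // Rows sharing the same sampling-key value land on the same side of the
      // split, preventing leakage between train and test for time-grouped data.
      DataOperationsCatalog.TrainTestData split = mlContext.Data.TrainTestSplit(
          data, testFraction: 0.2, samplingKeyColumnName: "Week");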

@torronen
Contributor Author

I think FastTree needs lots of virtual memory in Model Builder too (e.g. dotnet/machinelearning-modelbuilder#1875). With less than 500 GB of virtual memory, the ServiceHub process in Model Builder crashed sooner or later on some datasets. If RAM requirements could be eased, it might help Model Builder, too. In Model Builder I could just increase virtual memory.

@michaelgsharp
Member

So in Model Builder, without using the DataFrame, you still had the OOM issue? Do you remember what the parameters were for that model, or for the model you are having issues with now? It's possible that AutoML is being too aggressive with its trees. It's also possible something isn't working right in FastTree itself.

@torronen
Contributor Author

torronen commented May 2, 2022

Sorry, it was a bit confusing to mention Model Builder here. I only mentioned it because it consumes huge amounts of virtual memory. In the case of Model Builder, ServiceHub crashes when running out of memory and I can no longer extract the training parameters (the zip file still exists in the temp folder, but I do not know how to get the training parameters from it).

I think this is also the first time I have had OOM issues with the C# interface of Microsoft.ML.FastTree (except the data-loading issue in the other topic).

It seems the issue is somehow related to the sampling key and/or the MLContext seed. I have since been able to run the AutoML experiment without a sampling key, using the same training data and an additional validation file.

@michaelgsharp added the P2 (Priority of the issue for triage purpose: needs to be fixed at some point) label on Jun 13, 2022
@michaelgsharp added this to the ML.NET Future milestone on Jun 13, 2022
@ghost removed the untriaged (New issue has not been triaged) label on Jun 13, 2022
@torronen
Contributor Author

torronen commented May 11, 2023

Just an update from AutoML 2.0 (daily build; it does not yet include the DiskTranspose merge from today, which I will try once I understand how to use it). I just realized I am using a sampling key again, as in my previous comment; I will update soon on whether it works without it.

The dataset is read from a binary IDV file. I am not calling .ToDataFrame(-1) this time.

Unknown=>ReplaceMissingValues=>ConvertType=>OneHotEncoding=>Concatenate=>FastTreeBinary
Out of memory.

System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Collections.Generic.List`1.set_Capacity(Int32 value)
   at System.Collections.Generic.List`1.AddWithResize(T item)
   at Microsoft.ML.Trainers.FastTree.DataConverter.ValuesList.Add(Int32 index, Double value)
   at Microsoft.ML.Trainers.FastTree.DataConverter.MemImpl.MakeBoundariesAndCheckLabels(Int64& missingInstances, Int64& totalInstances)
   at Microsoft.ML.Trainers.FastTree.DataConverter.MemImpl..ctor(RoleMappedData data, IHost host, Double[][] binUpperBounds, Single maxLabel, Boolean dummy, Boolean noFlocks, PredictionKind kind, Int32[] categoricalFeatureIndices, Boolean categoricalSplit)
   at Microsoft.ML.Trainers.FastTree.DataConverter.MemImpl..ctor(RoleMappedData data, IHost host, Int32 maxBins, Single maxLabel, Boolean noFlocks, Int32 minDocsPerLeaf, PredictionKind kind, IParallelTraining parallelTraining, Int32[] categoricalFeatureIndices, Boolean categoricalSplit)
   at Microsoft.ML.Trainers.FastTree.DataConverter.Create(RoleMappedData data, IHost host, Int32 maxBins, Single maxLabel, Boolean diskTranspose, Boolean noFlocks, Int32 minDocsPerLeaf, PredictionKind kind, IParallelTraining parallelTraining, Int32[] categoricalFeatureIndices, Boolean categoricalSplit)
   at Microsoft.ML.Trainers.FastTree.ExamplesToFastTreeBins.FindBinsAndReturnDataset(RoleMappedData data, PredictionKind kind, IParallelTraining parallelTraining, Int32[] categoricalFeaturIndices, Boolean categoricalSplit)
   at Microsoft.ML.Trainers.FastTree.FastTreeTrainerBase`3.ConvertData(RoleMappedData trainData)
   at Microsoft.ML.Trainers.FastTree.FastTreeBinaryTrainer.TrainModelCore(TrainContext context)
   at Microsoft.ML.Trainers.TrainerEstimatorBase`2.TrainTransformer(IDataView trainSet, IDataView validationSet, IPredictor initPredictor)
   at Microsoft.ML.Trainers.TrainerEstimatorBase`2.Fit(IDataView input)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
   at Microsoft.ML.AutoML.SweepablePipelineRunner.Run(TrialSettings settings)
   at Microsoft.ML.AutoML.SweepablePipelineRunner.RunAsync(TrialSettings settings, CancellationToken ct)
   at Microsoft.ML.AutoML.AutoMLExperiment.RunAsync(CancellationToken ct)

@torronen
Contributor Author

Use of a sampling key did not have any impact.

The model it tries to train first seems to be small: 4 trees x 4 leaves. I just wonder why these settings appear in several parameters (e6, e7, e8) of the TrialSettings object; I would expect the FastTree parameters to be in only one place.

The number of rows is high; the binary IDV file is ~150 GB.

Pipeline: Unknown=>ReplaceMissingValues=>ConvertType=>OneHotEncoding=>Concatenate=>FastTreeBinary

(e6:Object {"NumberOfLeaves":4,"MinimumExampleCountPerLeaf":20,"NumberOfTrees":4,"MaximumBinCountPerFeature":254,"FeatureFraction":1,"LearningRate":0.09999999999999998,"LabelColumnName":"PredictedLabel","FeatureColumnName":"Features"})
e7:Object {"NumberOfTrees":4,"NumberOfLeaves":4,"FeatureFraction":1,"LabelColumnName":"PredictedLabel","FeatureColumnName":"Features"}
e8:Object {"NumberOfLeaves":4,"MinimumExampleCountPerLeaf":20,"LearningRate":1,"NumberOfTrees":4,"SubsampleFraction":1,"MaximumBinCountPerFeature":255,"FeatureFraction":1,"L1Regularization":2E-10,"L2Regularization":1,"LabelColumnName":"PredictedLabel","FeatureColumnName":"Features"}

@torronen
Contributor Author

torronen commented May 15, 2023

It would appear PR #6316 has fixed OOM for FastTree. Thanks @LittleLittleCloud for the good work! This is very helpful.

Questions for later (documentation etc.):

  • Is it required to set both FastTreeOption and SearchSpace? As I understand it, FastTreeOption is used for the first trial, but maybe some of its fields are copied to the SearchSpace internally(?)
  • Are there any ways the process could be sped up? For example, does it matter how the preceding pipeline is set up?
  • How much free disk space is needed for disk transpose?

Code to enable DiskTranspose:


      // Configure FastTree options; DiskTranspose makes the trainer transpose
      // the data on disk instead of holding it in memory.
      FastTreeOption option = new FastTreeOption();
      option.DiskTranspose = true;
      option.FeatureColumnName = "Features";
      option.LabelColumnName = LabelColumn;

      // Build the search space around these options so trials keep DiskTranspose.
      SearchSpace<FastTreeOption> fastTreeSS = new SearchSpace<FastTreeOption>(option);

      SweepablePipeline pipeline = ctx.Transforms
          .SelectColumns(columnsToKeep) // Keep only the specified columns
          .Append(ctx.Auto().Featurizer(data, columnInformation: columnInference.ColumnInformation))
          .Append(ctx.Auto().BinaryClassification(
              labelColumnName: columnInference.ColumnInformation.LabelColumnName,
              fastTreeOption: option,
              fastTreeSearchSpace: fastTreeSS,
              useFastForest: false, useLgbm: false, useLbfgsLogisticRegression: false,
              useFastTree: true));

      // Create AutoML experiment
      AutoMLExperiment.AutoMLExperimentSettings settings = new AutoMLExperiment.AutoMLExperimentSettings();
      AutoMLExperiment experiment = ctx.Auto().CreateExperiment();
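For completeness, roughly how the experiment can then be wired up and run; a minimal sketch only, where the metric, time budget, and the trainData/validationData names are my assumptions rather than the exact code:

      // Attach the sweepable pipeline, dataset, metric, and budget, then run;
      // all concrete values here are illustrative.
      experiment.SetPipeline(pipeline)
                .SetDataset(trainData, validationData)
                .SetBinaryClassificationMetric(BinaryClassificationMetric.Accuracy,
                    labelColumn: columnInference.ColumnInformation.LabelColumnName)
                .SetTrainingTimeInSeconds(3600);

      TrialResult result = await experiment.RunAsync();
      Console.WriteLine($"Best metric: {result.Metric}");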

Commentary
I am now getting OOM for LbfgsLogisticRegressionBinary and FastForestBinaryTrainer; the stack trace seems very similar. LightGBM did not complete yet due to other errors, but I would expect it to also complete for 0.5 TB. 1-2 TB sizes currently crash for other reasons (#6679).

I am now experimenting with bigger datasets than before, 0.5-2 TB in total from multiple CSV files. They are probably a bit too big for AutoML, because one trial may take 1-2 days. Adding more CPU cores will not help much because the most time-consuming parts do not consume much CPU. Currently, I feel I would limit dataset size to at most 50-150 GB. So far I can only guess, but I think making the dataset much bigger might not make a big difference for the current non-NN algorithms. Anyway, more is more :)

@tcaivano

tcaivano commented Jun 1, 2023

I just got an OOM using a similar setup to the above, after around 150 minutes of training on a 26 GB dataset.

@LittleLittleCloud
Contributor

@tcaivano @torronen this PR might be helpful (#6714) for training on super-large datasets. The gist is that more data might not result in a better metric. So maybe another solution for mitigating the OOM error is to start from a small portion of the dataset and slowly increase its scale until it hits the turning point or runs out of memory? (See the sketch below.)
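A minimal sketch of taking such a pilot subsample with the standard catalog APIs; the seed and row count are illustrative assumptions:

      // Shuffle first so the subsample is not biased by file order,
      // then take a fixed number of rows for a cheap pilot run.
      IDataView shuffled = ctx.Data.ShuffleRows(fullData, seed: 42);
      IDataView pilot = ctx.Data.TakeRows(shuffled, 1_000_000);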

@torronen
Contributor Author

torronen commented Jun 3, 2023

@LittleLittleCloud Nice, thanks. Do you know if #6710 is also relevant?

I agree that subsampling is the better approach most of the time. I just wonder if there should be some tools for making sure the subsample is representative, but that might depend on the case and require analysis outside the scope of Microsoft.ML.

It is just that finding the "OOM error point" is not a nice experience: creating, moving, converting, and beginning to run experiments with big datasets is time-consuming, and then finally hitting OOM is very discouraging. So I think fixing the root cause should be considered, even if it might not be the optimal approach (if not already covered by the PRs).

I like your suggestion. Would it even be feasible to make a "subsampling tuner"? It might start with 1% of the full data, get optimal parameters, then 5%, then 10%, then 30%, then 100%. At each subsample size it could start the search around the optimal values from the previous step (see the sketch after the two approaches below).

Approach 1: fixed time for each step. This would ensure there is time for the 100% sample size.
Approach 2: early stopping at the point where we can expect only minor improvements in the next 2x time. This might find the near-optimum even if only some of the sample sizes were completed, provided the dataset "subsamples well".
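A rough sketch of what such a tuner loop could look like; everything here is hypothetical, and RunPilotExperimentAsync is a placeholder for an AutoML run constrained to the given subsample and warm-started from the previous step's best parameters:

      // Hypothetical subsampling tuner: grow the training subsample and
      // stop early (Approach 2) when the metric gain between steps is small.
      double[] fractions = { 0.01, 0.05, 0.10, 0.30, 1.00 };
      long totalRows = 150_000_000; // assumed known row count
      double previousMetric = double.MinValue;

      foreach (double fraction in fractions)
      {
          IDataView subsample = ctx.Data.TakeRows(
              ctx.Data.ShuffleRows(fullData, seed: 42),
              (long)(totalRows * fraction));

          // Placeholder: run AutoML on the subsample, seeding the search
          // space with the best parameters found at the previous size.
          double metric = await RunPilotExperimentAsync(ctx, subsample);

          if (metric - previousMetric < 0.001)
              break; // only minor improvement expected; stop early
          previousMetric = metric;
      }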

@tcaivano

tcaivano commented Jun 3, 2023

I certainly agree with the analysis that we should probably be doing sampling most of the time; in fact, the research that I'm doing is an examination of that very claim. Regardless, it would be really useful for this to be functional in the future.

@torronen
Contributor Author

torronen commented Jun 5, 2023

@tcaivano That's very cool. If you have public results, and if you can, please share! I have a feeling it may also improve results in some cases, such as sensor data: the weather may stay the same for extended periods, and including 100x "sunny day data" may (probably) lead to overfitting. My assumption is that the data should include a reasonable amount from each kind of weather, while the edge cases should not be lost; for example, typhoon/hurricane data, or the transition from rain to sunny weather. But how do you ensure the subsample includes such data?

@ghost locked the conversation as resolved and limited it to collaborators on Jul 6, 2023