
[Issue, ML.net C#] 330GB csv file of data cause a OutOfMemoryException (2/2) #6297

Open
wil70 opened this issue Aug 19, 2022 · 12 comments
Labels
AutoML.NET Automating various steps of the machine learning process in-pr

Comments

@wil70

wil70 commented Aug 19, 2022

System Information (please complete the following information):

  • OS & Version: Win8, latest version as of this bug entry
  • ML.NET Version: 16.13.9
  • .NET Version: 6.0.303

Describe the bug
When I run AutoML from C#, I get an OutOfMemoryException once memory usage reaches the machine's 64GB maximum.
I have 64GB of RAM and a 330GB CSV file of data.

Note: I couldn't do this with the ML.NET CLI due to bug #6288, so I tried the C# AutoML package instead. I'm totally new to ML.NET, sorry in advance for the code quality.

To Reproduce
Steps to reproduce the behavior:

  1. Generate a 330GB file with 4,209 columns of random data
  2. Create a C# project and paste the code below
  3. See the error log at the end of this message, which ends in the OutOfMemoryException

Expected behavior
I expect to be able to handle 2TB files and 100K columns without any issue, both with the ML.NET CLI and from C#, on a 64GB RAM computer, by streaming the data instead of loading it all into memory.

Screenshots, Code, Sample Projects

Additional context

I have a 330GB file (64GB RAM). I tried the ML.NET CLI but hit a bug (see #6288).
So I'm now trying C#. This bug is different from the ML.NET CLI issue, as the C# path seems to try to load everything into memory:

IDataView trainingData = mlContext.Data.LoadFromTextFile<ModelInput>(
    @"c:\data.csv",  // verbatim string so the backslash is not an escape
    separatorChar: ',', hasHeader: true, trimWhitespace: true);

        var cts = new CancellationTokenSource();
        var experimentSettings = new MulticlassExperimentSettings();
        //experimentSettings.TrainingData = trainingData;
        experimentSettings.MaxExperimentTimeInSeconds = 3600;
        experimentSettings.CancellationToken = cts.Token;
        experimentSettings.CacheBeforeTrainer = CacheBeforeTrainer.Auto;

        // Cancel experiment after the user presses any key
        //CancelExperimentAfterAnyKeyPress(cts);
        experimentSettings.CacheDirectoryName = null;

        MulticlassClassificationExperiment experiment = mlContext.Auto().CreateMulticlassClassificationExperiment(experimentSettings);
        ExperimentResult<MulticlassClassificationMetrics> experimentResult = experiment.Execute(trainingData, "Entry(Text)");//, progressHandler: progressHandler);

.....
public class ModelInput
{
    [LoadColumn(0), NoColumn]
    public string _data0 { get; set; }

    [LoadColumn(1), NoColumn]
    public float ignoreData1 { get; set; }

    // A column range loads as a vector, so this must be float[] with a
    // VectorType sized to the range (columns 2..4205 = 4204 slots).
    [LoadColumn(2, 4205), VectorType(4204)]
    public float[] _data { get; set; }

    [LoadColumn(4206), NoColumn]
    public float _ignoreData4206 { get; set; }

    [LoadColumn(4207), NoColumn]
    public float _ignoreData4207 { get; set; }

    [LoadColumn(4208), NoColumn]
    public float _ignoreData4208 { get; set; }

    [LoadColumn(4209), ColumnName("Entry(Text)")]
    public string _label { get; set; }
}

An exception of type 'System.OutOfMemoryException' was thrown.
(new System.Collections.Generic.Mscorlib_CollectionDebugView<Microsoft.ML.AutoML.RunDetail<Microsoft.ML.Data.MulticlassClassificationMetrics>>(experimentResult.RunDetails).Items[0]).Exception.StackTrace
at Microsoft.ML.Internal.Utilities.OrderedWaiter.Wait(Int64 position, CancellationToken token)
at Microsoft.ML.Data.CacheDataView.GetPermutationOrNull(Random rand)
at Microsoft.ML.Data.CacheDataView.GetRowCursorSetWaiterCore[TWaiter](TWaiter waiter, Func`2 predicate, Int32 n, Random rand)
at Microsoft.ML.Data.CacheDataView.GetRowCursorSet(IEnumerable`1 columnsNeeded, Int32 n, Random rand)
at Microsoft.ML.Data.OneToOneTransformBase.GetRowCursorSet(IEnumerable`1 columnsNeeded, Int32 n, Random rand)
at Microsoft.ML.Data.DataViewUtils.TryCreateConsolidatingCursor(DataViewRowCursor& curs, IDataView view, IEnumerable`1 columnsNeeded, IHost host, Random rand)
at Microsoft.ML.Data.TransformBase.GetRowCursor(IEnumerable`1 columnsNeeded, Random rand)
at Microsoft.ML.Trainers.TrainingCursorBase.FactoryBase`1.Create(Random rand, Int32[] extraCols)
at Microsoft.ML.Trainers.OnlineLinearTrainer`2.TrainCore(IChannel ch, RoleMappedData data, TrainStateBase state)
at Microsoft.ML.Trainers.OnlineLinearTrainer`2.TrainModelCore(TrainContext context)
at Microsoft.ML.Trainers.TrainerEstimatorBase`2.TrainTransformer(IDataView trainSet, IDataView validationSet, IPredictor initPredictor)
at Microsoft.ML.Trainers.OneVersusAllTrainer.TrainOne(IChannel ch, ITrainerEstimator`2 trainer, RoleMappedData data, Int32 cls)
at Microsoft.ML.Trainers.OneVersusAllTrainer.Fit(IDataView input)
at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
at Microsoft.ML.AutoML.RunnerUtil.TrainAndScorePipeline[TMetrics](MLContext context, SuggestedPipeline pipeline, IDataView trainData, IDataView validData, String groupId, String labelColumn, IMetricsAgent`1 metricsAgent, ITransformer preprocessorTransform, FileInfo modelFileInfo, DataViewSchema modelInputSchema, IChannel logger)


An exception of type 'System.OutOfMemoryException' was thrown.
(new System.Collections.Generic.Mscorlib_CollectionDebugView<Microsoft.ML.AutoML.RunDetail<Microsoft.ML.Data.MulticlassClassificationMetrics>>(experimentResult.RunDetails).Items[0]).Exception.InnerException.StackTrace
at Microsoft.ML.Internal.Utilities.ArrayUtils.EnsureSize[T](T[]& array, Int32 min, Int32 max, Boolean keepOld, Boolean& resized)
at Microsoft.ML.Internal.Utilities.BigArray`1.AddRange(ReadOnlySpan`1 src)
at Microsoft.ML.Data.CacheDataView.ColumnCache.ImplVec`1.CacheCurrent()
at Microsoft.ML.Data.CacheDataView.Filler(DataViewRowCursor cursor, ColumnCache[] caches, OrderedWaiter waiter)
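
Both stack traces bottom out in CacheDataView, which suggests it is AutoML's row cache, not the text loader itself, that fills memory. As a sketch of a possible workaround (untested against a dataset of this size), the cache can be forced off in the experiment settings instead of leaving it on Auto:

```csharp
using System.Threading;
using Microsoft.ML;
using Microsoft.ML.AutoML;

var mlContext = new MLContext();
var cts = new CancellationTokenSource();

var experimentSettings = new MulticlassExperimentSettings
{
    MaxExperimentTimeInSeconds = 3600,
    CancellationToken = cts.Token,
    // Off stops AutoML from materializing the training data into an
    // in-memory CacheDataView before each trainer runs; trainers then
    // stream from the source, at the cost of slower passes over the data.
    CacheBeforeTrainer = CacheBeforeTrainer.Off,
};
```

Whether a given trainer can then truly stream (rather than re-cache internally) varies by trainer, so this only removes one of the in-memory copies of the data.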

@ghost ghost added the untriaged New issue has not been triaged label Aug 19, 2022
@wil70 wil70 changed the title [Issue, ML.net C#] 330GB csv file of data cause a OutOfMemoryException [Issue, ML.net C#] 330GB csv file of data cause a OutOfMemoryException (2/2) Aug 19, 2022
@luisquintanilla
Contributor

luisquintanilla commented Aug 19, 2022

@LittleLittleCloud
Contributor

@wil70 let us know whether @luisquintanilla's comments resolve your issue. One tip:

  • you can disable the LightGBM trainer, as it uses a lot of memory as the training data grows
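
For reference, a sketch of how a trainer can be excluded: the Trainers collection on the experiment settings is pre-populated with every available trainer, so removing an entry disables it (assuming the MulticlassExperimentSettings API used earlier in this thread):

```csharp
using Microsoft.ML.AutoML;

var settings = new MulticlassExperimentSettings();
// The Trainers collection starts out containing all multiclass trainers;
// removing LightGbm keeps AutoML from ever scheduling it.
settings.Trainers.Remove(MulticlassClassificationTrainer.LightGbm);
```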

@ghost

ghost commented Aug 25, 2022

This issue has been marked needs-author-action and may be missing some important information.

@wil70
Author

wil70 commented Aug 26, 2022

Hello,

Sorry for the delay; it takes time to reproduce the issue and capture the error messages with files this big.

  1. Every trainer completed fine with a 16GB CSV input file. I can start AutoML and run all the algorithms
    1.1. For the same input CSV file, I observed something strange: some algorithms have a confusion matrix that looks like this

     	ConfusionMatrix.PerClassPrecision: Results from All trainers with the exception of FastForestOva and FastTreeOva
     	- class 1: 952, 0, 0,
     	- class 2: 0, 971, 0,
     	- class 3: 0, 0, 805,
     
     	While other are like this
     	ConfusionMatrix.PerClassPrecision: Results from trainers FastForestOva and FastTreeOva
     	- class 1: 4487, 0, 0,
     	- class 2: 0, 4429, 0,
     	- class 3: 0, 0, 3667,
     
     1.1.a. This is a big difference in the confusion matrix. There are 125,766 lines in the input file (no duplicates), so the algorithms are ignoring a lot of rows, some much more than others.
     1.1.b. I have some NaN values here and there; I'm wondering if some of the trainers do not like those.
     What should I use instead of NaN?
      	1.1.b.1. For data not available, i.e. NaN?
      	1.1.b.2. For Infinity? I can use double.Max
      	1.1.b.3. For -Infinity?
    
  2. For files of 330GB there are a few issues
    2.1. I needed to run the algorithms one by one, as many crash and can't complete
    2.2. LightGBM is the only one that completed - yeah!

     	LightGbm
     	Processing LightGbm
     	BestRun TrainerName:        LightGbmMulti
     		- MicroAccuracy:        1
     		- MacroAccuracy:        1
     		- LogLoss:              3.589831285873676E-05
     		- LogLossReduction:     0.9999669040026492
     	
     	Class log loss:
     		- class 1: 4.208176463648898E-05
     		- class 2: 3.385601134085665E-05
     		- class 3: 3.3677191368241515E-05
     	
     	ConfusionMatrix.PerClassPrecision:
     	class 1: 1
     	class 2: 1
     	class 3: 1
     	
     	ConfusionMatrix.PerClassPrecision:
     		- class 1: 145805, 0, 0,
     		- class 2: 0, 211129, 0,
     		- class 3: 0, 0, 211780,
     	Saving model!!!
     	Done saving...
    
     	Duration: 04:25:05.6115268
    
             2.2.a. Looking at the confusion matrix, I can see it is missing a lot of rows; this is probably the same issue as 1.1.b (probably NaN)
    

    2.3. All the others crash without giving a stack trace, except for

     	Processing LbfgsMaximumEntropy
     	Exception of type 'System.OutOfMemoryException' was thrown.
     	at System.Linq.GroupedEnumerable`2.GetEnumerator()
     	at System.Linq.Enumerable.SelectEnumerableIterator`2.ToArray()
     	at System.Linq.Buffer`1..ctor(IEnumerable`1 source)
     	at System.Linq.OrderedEnumerable`1.ToArray()
     	at System.Linq.Buffer`1..ctor(IEnumerable`1 source)
     	at System.Linq.Enumerable.ReverseIterator`1.MoveNext()
     	at System.Linq.Enumerable.EnumerablePartition`1.MoveNext()
     	at System.Linq.Enumerable.SelectIPartitionIterator`2.MoveNext()
     	at System.Collections.Generic.HashSet`1.UnionWith(IEnumerable`1 other)
     	at System.Collections.Generic.HashSet`1..ctor(IEnumerable`1 collection, IEqualityComparer`1 comparer)
     	at System.Collections.Generic.HashSet`1..ctor(IEnumerable`1 collection)
     	at Microsoft.ML.AutoML.PipelineSuggester.OrderTrainersByNumTrials(IEnumerable`1 history, IEnumerable`1 selectedTrainers)
     	at Microsoft.ML.AutoML.PipelineSuggester.GetNextInferredPipeline(MLContext context, IEnumerable`1 history, DatasetColumnInfo[] columns, TaskKind task, Boolean isMaximizingMetric, CacheBeforeTrainer cacheBeforeTrainer, IChannel logger, IEnumerable`1 trainerAllowList)
     	at Microsoft.ML.AutoML.Experiment`2.Execute()
     	at Microsoft.ML.AutoML.ExperimentBase`2.Execute(ColumnInformation columnInfo, DatasetColumnInfo[] columns, IEstimator`1 preFeaturizer, IProgress`1 progressHandler, IRunner`1 runner)
     	at Microsoft.ML.AutoML.ExperimentBase`2.ExecuteTrainValidate(IDataView trainData, ColumnInformation columnInfo, IDataView validationData, IEstimator`1 preFeaturizer, IProgress`1 progressHandler)
     	at Microsoft.ML.AutoML.ExperimentBase`2.Execute(IDataView trainData, ColumnInformation columnInformation, IEstimator`1 preFeaturizer, IProgress`1 progressHandler)
     	at Microsoft.ML.AutoML.ExperimentBase`2.Execute(IDataView trainData, String labelColumnName, String samplingKeyColumn, IEstimator`1 preFeaturizer, IProgress`1 progressHandler)
     	at ML1.Program.ML4(MulticlassClassificationTrainer trainerID) in G:\Users\Wilhelm\dev\MachineLearning\ML2\Program.cs:line 76
     	
     	Duration: 01:41:49.7475104
    

    2.4. I tried to catch more detailed errors for the other algorithms in debug mode via Visual Studio 2022, but that doesn't go well; VS crashes, etc.
    I also tried with "Just My Code" debugging, but same result...

  3. I would love to be able to run on 2TB csv files with more than 20,000+ columns/attributes.

  4. It seems the CLI is having some memory issue too - see: [Issue, ML.net CLI] 330GB csv file of data cause a OutOfMemoryException (1/2) #6288
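
On the NaN question in 1.1.b: trainers generally cannot use rows whose features contain NaN, which would explain the dropped rows. A sketch of one option (assuming the features end up in a vector column named "Features", which is hypothetical here) is to replace missing values explicitly before training:

```csharp
using Microsoft.ML;
using Microsoft.ML.Transforms;

var mlContext = new MLContext();
// Replace NaN slots in the "Features" vector column. Mean imputes the
// per-slot mean; DefaultValue (zero), Minimum, and Maximum are alternatives.
var cleanFeatures = mlContext.Transforms.ReplaceMissingValues(
    outputColumnName: "Features",
    replacementMode: MissingValueReplacingEstimator.ReplacementMode.Mean);
```

There is no dedicated replacement mode for Infinity/-Infinity, so clamping those to a large finite sentinel before loading (as suggested with double.Max) is a reasonable pre-processing step.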

Thanks

Wilhelm

cc: @dakersnar @LittleLittleCloud @luisquintanilla @michaelgsharp

@luisquintanilla
Contributor

Thanks for that detailed report @wil70.

@LittleLittleCloud can you please take a look?

@LittleLittleCloud
Contributor

@wil70
I'm looking into this issue right now. I made a few changes in AutoML.NET which

  • disable the cache
  • enable diskConvert in FastTree

and hopefully resolve the OOM exception you're hitting. I launched an experiment yesterday to verify this and it's still running; I'll get back to this thread if the run succeeds.
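
If the diskConvert change refers to FastTree's disk-based transpose, the equivalent can be set by hand outside AutoML. A sketch, with option and trainer names as in the Microsoft.ML.FastTree package and untested at this scale:

```csharp
using Microsoft.ML;
using Microsoft.ML.Trainers.FastTree;

var mlContext = new MLContext();
var options = new FastTreeBinaryTrainer.Options
{
    LabelColumnName = "Label",
    FeatureColumnName = "Features",
    // Transpose the dataset on disk instead of in memory, trading
    // training speed for a much smaller working set.
    DiskTranspose = true,
};
// Multiclass via one-versus-all over the binary FastTree trainer.
var trainer = mlContext.MulticlassClassification.Trainers.OneVersusAll(
    mlContext.BinaryClassification.Trainers.FastTree(options));
```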

@wil70
Author

wil70 commented Sep 8, 2022

Nice, exciting! Let me know when it's done and I'll run a test.
Thank you @LittleLittleCloud

I'm wondering if that will also solve #6288

Note: My next steps after 330GB will be to aim for 2TB+ file - TY!

@michaelgsharp
Member

@LittleLittleCloud after your other PR goes in, is this issue good to be closed?

@michaelgsharp michaelgsharp added AutoML.NET Automating various steps of the machine learning process in-pr and removed needs-further-triage labels Oct 10, 2022
@LittleLittleCloud
Contributor

LittleLittleCloud commented Oct 31, 2022

@michaelgsharp Nope, still looking into that

@luisquintanilla maybe we should provide a memory-saving automl solution as we have a lot of similar issues on OOM error, in both model builder and automl.net

like this one dotnet/machinelearning-modelbuilder#2328

@luisquintanilla
Contributor

luisquintanilla commented Nov 1, 2022

> @michaelgsharp Nope, still looking into that
>
> @luisquintanilla maybe we should provide a memory-saving automl solution as we have a lot of similar issues on OOM error, in both model builder and automl.net
>
> like this one dotnet/machinelearning-modelbuilder#2328

@LittleLittleCloud control over, and more efficient, resource management is something we definitely want to address.

Not sure if I'm misremembering. I know you're able to inspect the amount of resources used by each of the trials. Is there a setting when you configure an experiment that lets you cap the amount of resources used?

As far as I know, today the options for managing resources are:

  • Choose a different algorithm / disable algorithms. I believe larger tree-based models consume more resources.
  • Customize search space. Similar to choosing algorithms, if you control the search space for the models that may take more resources, you can in a sense manage resource consumption.

Are there any others?

@torronen
Contributor

torronen commented Nov 7, 2022

Suggestions (guesses) for future features:

  • Optimizing the CSV-to-binary conversion, if not already optimized. It has been my bottleneck in Microsoft.ML. Currently I do the conversion on the highest-RAM machine; the other machines use the binary only.
  • Reusing the binary file when training from the same dataset (I think it currently re-creates the binary file on every restart of training).
  • Allowing the train-to-test ratio to be changed. If the dataset is huge and there is not enough memory, maybe users would be willing to move more data to the test set to save memory.
  • I am not sure, but I am guessing there might be something about automatic feature engineering of text that consumes lots of memory. Perhaps it could be disabled in some cases.

Maybe the docs could include a suggestion to add a fixed-size paging file. I had problems when I let Windows manage it, but after adding a fixed 500-600 GB paging file I no longer run out of physical memory with tree-based algorithms. In my datasets it could go as high as 650 GB. I now always put the paging file on 2 physical NVMe drives, and it has been OK. The paging file is able to stripe between the drives, so putting it on 2 drives is faster. I also tried for a few months with 1 TB and 2 TB RAM servers, but they did not have a big impact compared to NVMe paging files, especially considering the cost of such servers. I mostly used Microsoft.ML but also tried with Model Builder.
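
For anyone wanting to script the fixed-size paging file described above, a Windows cmd sketch using the classic wmic interface (sizes are in MB; the 512 GB value and C: location are examples, not recommendations, and a reboot is required afterwards):

```shell
:: Disable automatic paging-file management, then pin a fixed size.
wmic computersystem where name="%computername%" set AutomaticManagedPagefile=False
wmic pagefileset where name="C:\\pagefile.sys" set InitialSize=524288,MaximumSize=524288
```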
