
[Issue, ML.net C#] 330GB csv file of data cause a OutOfMemoryException (2/2) #6297

Open
wil70 opened this issue Aug 19, 2022 · 12 comments
Labels
AutoML.NET Automating various steps of the machine learning process in-pr

Comments

@wil70

wil70 commented Aug 19, 2022

System Information (please complete the following information):

  • OS & Version: Win8, latest version as of this bug entry
  • ML.NET Version: 16.13.9
  • .NET Version: 6.0.303

Describe the bug
When I run AutoML from C#, I get an OutOfMemoryException once memory usage reaches the machine's 64GB maximum.
I have 64GB of RAM and a 330GB CSV file of data.

Note: I couldn't do this with the ML.NET CLI due to bug #6288, so I tried the C# AutoML package instead. I'm totally new to ML.NET, sorry in advance for the code quality.

To Reproduce
Steps to reproduce the behavior:

  1. Generate a 330GB file with 4,209 columns of random data
  2. Create a C# project and paste the code below
  3. See the error log at the end of this message, which ends in the OutOfMemoryException

Expected behavior
I expect to be able to handle 2TB files and 100K columns without any issue, both with the ML.NET CLI and from C#, on a 64GB RAM computer, by streaming the data instead of loading it all into memory.

Screenshots, Code, Sample Projects

Additional context

I have a 330GB file (64GB RAM). I tried the ML.NET CLI but hit a bug (see #6288).
So I'm now trying C#. This bug is different from the ML.NET CLI issue, as the C# path seems to try to load everything into memory:

IDataView trainingData = mlContext.Data.LoadFromTextFile<ModelInput>(
    @"c:\data.csv",  // verbatim string so the backslash is not an escape
    separatorChar: ',', hasHeader: true, trimWhitespace: true);

        var cts = new CancellationTokenSource();
        var experimentSettings = new MulticlassExperimentSettings();
        //experimentSettings.TrainingData = trainingData;
        experimentSettings.MaxExperimentTimeInSeconds = 3600;
        experimentSettings.CancellationToken = cts.Token;
        experimentSettings.CacheBeforeTrainer = CacheBeforeTrainer.Auto;

        // Cancel experiment after the user presses any key
        //CancelExperimentAfterAnyKeyPress(cts);
        experimentSettings.CacheDirectoryName = null;

        MulticlassClassificationExperiment experiment = mlContext.Auto().CreateMulticlassClassificationExperiment(experimentSettings);
        ExperimentResult<MulticlassClassificationMetrics> experimentResult = experiment.Execute(trainingData, "Entry(Text)");//, progressHandler: progressHandler);

.....
public class ModelInput
{
    [LoadColumn(0), NoColumn]
    public string _data0 { get; set; }

    [LoadColumn(1), NoColumn]
    public float ignoreData1 { get; set; }

    // A column range loads as a vector, so this must be float[] with a
    // VectorType sized to the range (columns 2..4205 = 4204 slots).
    [LoadColumn(2, 4205), VectorType(4204)]
    public float[] _data { get; set; }

    [LoadColumn(4206), NoColumn]
    public float _ignoreData4206 { get; set; }

    [LoadColumn(4207), NoColumn]
    public float _ignoreData4207 { get; set; }

    [LoadColumn(4208), NoColumn]
    public float _ignoreData4208 { get; set; }

    [LoadColumn(4209), ColumnName("Entry(Text)")]
    public string _label { get; set; }
}

An exception of type 'System.OutOfMemoryException' was thrown.
(new System.Collections.Generic.Mscorlib_CollectionDebugView<Microsoft.ML.AutoML.RunDetail<Microsoft.ML.Data.MulticlassClassificationMetrics>>(experimentResult.RunDetails).Items[0]).Exception.StackTrace
at Microsoft.ML.Internal.Utilities.OrderedWaiter.Wait(Int64 position, CancellationToken token)
at Microsoft.ML.Data.CacheDataView.GetPermutationOrNull(Random rand)
at Microsoft.ML.Data.CacheDataView.GetRowCursorSetWaiterCore[TWaiter](TWaiter waiter, Func`2 predicate, Int32 n, Random rand)
at Microsoft.ML.Data.CacheDataView.GetRowCursorSet(IEnumerable`1 columnsNeeded, Int32 n, Random rand)
at Microsoft.ML.Data.OneToOneTransformBase.GetRowCursorSet(IEnumerable`1 columnsNeeded, Int32 n, Random rand)
at Microsoft.ML.Data.DataViewUtils.TryCreateConsolidatingCursor(DataViewRowCursor& curs, IDataView view, IEnumerable`1 columnsNeeded, IHost host, Random rand)
at Microsoft.ML.Data.TransformBase.GetRowCursor(IEnumerable`1 columnsNeeded, Random rand)
at Microsoft.ML.Trainers.TrainingCursorBase.FactoryBase`1.Create(Random rand, Int32[] extraCols)
at Microsoft.ML.Trainers.OnlineLinearTrainer`2.TrainCore(IChannel ch, RoleMappedData data, TrainStateBase state)
at Microsoft.ML.Trainers.OnlineLinearTrainer`2.TrainModelCore(TrainContext context)
at Microsoft.ML.Trainers.TrainerEstimatorBase`2.TrainTransformer(IDataView trainSet, IDataView validationSet, IPredictor initPredictor)
at Microsoft.ML.Trainers.OneVersusAllTrainer.TrainOne(IChannel ch, ITrainerEstimator`2 trainer, RoleMappedData data, Int32 cls)
at Microsoft.ML.Trainers.OneVersusAllTrainer.Fit(IDataView input)
at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
at Microsoft.ML.AutoML.RunnerUtil.TrainAndScorePipeline[TMetrics](MLContext context, SuggestedPipeline pipeline, IDataView trainData, IDataView validData, String groupId, String labelColumn, IMetricsAgent`1 metricsAgent, ITransformer preprocessorTransform, FileInfo modelFileInfo, DataViewSchema modelInputSchema, IChannel logger)


An exception of type 'System.OutOfMemoryException' was thrown.
(new System.Collections.Generic.Mscorlib_CollectionDebugView<Microsoft.ML.AutoML.RunDetail<Microsoft.ML.Data.MulticlassClassificationMetrics>>(experimentResult.RunDetails).Items[0]).Exception.InnerException.StackTrace
at Microsoft.ML.Internal.Utilities.ArrayUtils.EnsureSize[T](T[]& array, Int32 min, Int32 max, Boolean keepOld, Boolean& resized)
at Microsoft.ML.Internal.Utilities.BigArray`1.AddRange(ReadOnlySpan`1 src)
at Microsoft.ML.Data.CacheDataView.ColumnCache.ImplVec`1.CacheCurrent()
at Microsoft.ML.Data.CacheDataView.Filler(DataViewRowCursor cursor, ColumnCache[] caches, OrderedWaiter waiter)
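
Both stack traces bottom out in CacheDataView, which suggests it is AutoML's row cache, not the text loader itself, that fills memory. As a sketch of a possible workaround (untested against a dataset of this size), the cache can be forced off in the experiment settings instead of leaving it on Auto:

```csharp
using System.Threading;
using Microsoft.ML;
using Microsoft.ML.AutoML;

var mlContext = new MLContext();
var cts = new CancellationTokenSource();

var experimentSettings = new MulticlassExperimentSettings
{
    MaxExperimentTimeInSeconds = 3600,
    CancellationToken = cts.Token,
    // Off stops AutoML from materializing the training data into an
    // in-memory CacheDataView before each trainer runs; trainers then
    // stream from the source, at the cost of slower passes over the data.
    CacheBeforeTrainer = CacheBeforeTrainer.Off,
};
```

Whether a given trainer can then truly stream (rather than re-cache internally) varies by trainer, so this only removes one of the in-memory copies of the data.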

@ghost ghost added the untriaged New issue has not been triaged label Aug 19, 2022
@wil70 wil70 changed the title [Issue, ML.net C#] 330GB csv file of data cause a OutOfMemoryException [Issue, ML.net C#] 330GB csv file of data cause a OutOfMemoryException (2/2) Aug 19, 2022
@luisquintanilla
Contributor

luisquintanilla commented Aug 19, 2022

@LittleLittleCloud
Contributor

@wil70 let us know whether @luisquintanilla's comments resolve your issue. One tip:

  • you can disable the LightGBM trainer, as it uses a lot of memory as the training data grows
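
For reference, a sketch of how a trainer can be excluded: the Trainers collection on the experiment settings is pre-populated with every available trainer, so removing an entry disables it (assuming the MulticlassExperimentSettings API used earlier in this thread):

```csharp
using Microsoft.ML.AutoML;

var settings = new MulticlassExperimentSettings();
// The Trainers collection starts out containing all multiclass trainers;
// removing LightGbm keeps AutoML from ever scheduling it.
settings.Trainers.Remove(MulticlassClassificationTrainer.LightGbm);
```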

@ghost

ghost commented Aug 25, 2022

This issue has been marked needs-author-action and may be missing some important information.

@wil70
Author

wil70 commented Aug 26, 2022

Hello,

Sorry for the delay; it takes time to reproduce the issue and capture the error messages with files this big.

  1. Every trainer completed fine with a 16GB CSV input file. I can start AutoML and run all the algorithms
    1.1. For the same input CSV file, I observed something strange: some algorithms have a confusion matrix that looks like this

     	ConfusionMatrix.PerClassPrecision: Results from All trainers with the exception of FastForestOva and FastTreeOva
     	- class 1: 952, 0, 0,
     	- class 2: 0, 971, 0,
     	- class 3: 0, 0, 805,
     
     	While other are like this
     	ConfusionMatrix.PerClassPrecision: Results from trainers FastForestOva and FastTreeOva
     	- class 1: 4487, 0, 0,
     	- class 2: 0, 4429, 0,
     	- class 3: 0, 0, 3667,
     
     1.1.a. This is a big difference in the confusion matrix. There are 125,766 lines in the input file (no duplicates), so the algorithms are ignoring a lot of rows, some much more than others.
     1.1.b. I have some NaN values here and there; I'm wondering if some of the trainers do not like those.
     What should I use instead of NaN?
      	1.1.b.1. For data not available, i.e. NaN?
      	1.1.b.2. For Infinity? I can use double.Max
      	1.1.b.3. For -Infinity?
    
  2. For files of 330GB there are a few issues
    2.1. I needed to run the algorithms one by one, as many crash and can't complete
    2.2. LightGBM is the only one that completed - yeah!

     	LightGbm
     	Processing LightGbm
     	BestRun TrainerName:        LightGbmMulti
     		- MicroAccuracy:        1
     		- MacroAccuracy:        1
     		- LogLoss:              3.589831285873676E-05
     		- LogLossReduction:     0.9999669040026492
     	
     	Class log loss:
     		- class 1: 4.208176463648898E-05
     		- class 2: 3.385601134085665E-05
     		- class 3: 3.3677191368241515E-05
     	
     	ConfusionMatrix.PerClassPrecision:
     	class 1: 1
     	class 2: 1
     	class 3: 1
     	
     	ConfusionMatrix.PerClassPrecision:
     		- class 1: 145805, 0, 0,
     		- class 2: 0, 211129, 0,
     		- class 3: 0, 0, 211780,
     	Saving model!!!
     	Done saving...
    
     	Duration: 04:25:05.6115268
    
             2.2.a. Looking at the confusion matrix, I can see it is missing a lot of rows; this is probably the same issue as 1.1.b (probably NaN)
    

    2.3. All the others crash without giving a stack trace, except for

     	Processing LbfgsMaximumEntropy
     	Exception of type 'System.OutOfMemoryException' was thrown.
     	at System.Linq.GroupedEnumerable`2.GetEnumerator()
     	at System.Linq.Enumerable.SelectEnumerableIterator`2.ToArray()
     	at System.Linq.Buffer`1..ctor(IEnumerable`1 source)
     	at System.Linq.OrderedEnumerable`1.ToArray()
     	at System.Linq.Buffer`1..ctor(IEnumerable`1 source)
     	at System.Linq.Enumerable.ReverseIterator`1.MoveNext()
     	at System.Linq.Enumerable.EnumerablePartition`1.MoveNext()
     	at System.Linq.Enumerable.SelectIPartitionIterator`2.MoveNext()
     	at System.Collections.Generic.HashSet`1.UnionWith(IEnumerable`1 other)
     	at System.Collections.Generic.HashSet`1..ctor(IEnumerable`1 collection, IEqualityComparer`1 comparer)
     	at System.Collections.Generic.HashSet`1..ctor(IEnumerable`1 collection)
     	at Microsoft.ML.AutoML.PipelineSuggester.OrderTrainersByNumTrials(IEnumerable`1 history, IEnumerable`1 selectedTrainers)
     	at Microsoft.ML.AutoML.PipelineSuggester.GetNextInferredPipeline(MLContext context, IEnumerable`1 history, DatasetColumnInfo[] columns, TaskKind task, Boolean isMaximizingMetric, CacheBeforeTrainer cacheBeforeTrainer, IChannel logger, IEnumerable`1 trainerAllowList)
     	at Microsoft.ML.AutoML.Experiment`2.Execute()
     	at Microsoft.ML.AutoML.ExperimentBase`2.Execute(ColumnInformation columnInfo, DatasetColumnInfo[] columns, IEstimator`1 preFeaturizer, IProgress`1 progressHandler, IRunner`1 runner)
     	at Microsoft.ML.AutoML.ExperimentBase`2.ExecuteTrainValidate(IDataView trainData, ColumnInformation columnInfo, IDataView validationData, IEstimator`1 preFeaturizer, IProgress`1 progressHandler)
     	at Microsoft.ML.AutoML.ExperimentBase`2.Execute(IDataView trainData, ColumnInformation columnInformation, IEstimator`1 preFeaturizer, IProgress`1 progressHandler)
     	at Microsoft.ML.AutoML.ExperimentBase`2.Execute(IDataView trainData, String labelColumnName, String samplingKeyColumn, IEstimator`1 preFeaturizer, IProgress`1 progressHandler)
     	at ML1.Program.ML4(MulticlassClassificationTrainer trainerID) in G:\Users\Wilhelm\dev\MachineLearning\ML2\Program.cs:line 76
     	
     	Duration: 01:41:49.7475104
    

    2.4. I tried to catch more detailed errors for the other algorithms in debug mode via Visual Studio 2022, but that doesn't go well; VS crashes, etc.
    I also tried with "Just My Code" debugging, but same result...

  3. I would love to be able to run on 2TB csv files with more than 20,000+ columns/attributes.

  4. It seems the CLI is having some memory issue too - see: [Issue, ML.net CLI] 330GB csv file of data cause a OutOfMemoryException (1/2) #6288
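
On the NaN question in 1.1.b: trainers generally cannot use rows whose features contain NaN, which would explain the dropped rows. A sketch of one option (assuming the features end up in a vector column named "Features", which is hypothetical here) is to replace missing values explicitly before training:

```csharp
using Microsoft.ML;
using Microsoft.ML.Transforms;

var mlContext = new MLContext();
// Replace NaN slots in the "Features" vector column. Mean imputes the
// per-slot mean; DefaultValue (zero), Minimum, and Maximum are alternatives.
var cleanFeatures = mlContext.Transforms.ReplaceMissingValues(
    outputColumnName: "Features",
    replacementMode: MissingValueReplacingEstimator.ReplacementMode.Mean);
```

There is no dedicated replacement mode for Infinity/-Infinity, so clamping those to a large finite sentinel before loading (as suggested with double.Max) is a reasonable pre-processing step.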

Thanks

Wilhelm

cc: @dakersnar @LittleLittleCloud @luisquintanilla @michaelgsharp

@luisquintanilla
Contributor

Thanks for that detailed report @wil70.

@LittleLittleCloud can you please take a look?

@LittleLittleCloud
Contributor

@wil70
I'm looking into this issue right now. I made a few changes in AutoML.NET which

  • disable the cache
  • enable diskConvert in FastTree

and hopefully resolve the OOM exception you're hitting. I launched an experiment yesterday to verify this and it's still running; I'll get back to this thread if the run succeeds.
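
If the diskConvert change refers to FastTree's disk-based transpose, the equivalent can be set by hand outside AutoML. A sketch, with option and trainer names as in the Microsoft.ML.FastTree package and untested at this scale:

```csharp
using Microsoft.ML;
using Microsoft.ML.Trainers.FastTree;

var mlContext = new MLContext();
var options = new FastTreeBinaryTrainer.Options
{
    LabelColumnName = "Label",
    FeatureColumnName = "Features",
    // Transpose the dataset on disk instead of in memory, trading
    // training speed for a much smaller working set.
    DiskTranspose = true,
};
// Multiclass via one-versus-all over the binary FastTree trainer.
var trainer = mlContext.MulticlassClassification.Trainers.OneVersusAll(
    mlContext.BinaryClassification.Trainers.FastTree(options));
```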

@wil70
Author

wil70 commented Sep 8, 2022

Nice, exciting! Let me know when it's done and I'll run a test.
Thank you @LittleLittleCloud

I'm wondering if that will also solve #6288

Note: My next steps after 330GB will be to aim for 2TB+ file - TY!

@michaelgsharp
Member

@LittleLittleCloud after your other PR goes in, is this issue good to be closed?

@michaelgsharp michaelgsharp added AutoML.NET Automating various steps of the machine learning process in-pr and removed needs-further-triage labels Oct 10, 2022
@LittleLittleCloud
Contributor

LittleLittleCloud commented Oct 31, 2022

@michaelgsharp Nope, still looking into that

@luisquintanilla maybe we should provide a memory-saving automl solution as we have a lot of similar issues on OOM error, in both model builder and automl.net

like this one dotnet/machinelearning-modelbuilder#2328

@luisquintanilla
Contributor

luisquintanilla commented Nov 1, 2022

> @michaelgsharp Nope, still looking into that
>
> @luisquintanilla maybe we should provide a memory-saving automl solution as we have a lot of similar issues on OOM error, in both model builder and automl.net
>
> like this one dotnet/machinelearning-modelbuilder#2328

@LittleLittleCloud control over, and more efficient, resource management is something we definitely want to address.

Not sure if I'm misremembering. I know you're able to inspect the amount of resources used by each of the trials. Is there a setting when you configure an experiment that lets you cap the amount of resources used?

As far as I know, today the options for managing resources are:

  • Choose a different algorithm / disable algorithms. I believe larger tree-based models consume more resources.
  • Customize search space. Similar to choosing algorithms, if you control the search space for the models that may take more resources, you can in a sense manage resource consumption.

Are there any others?

@torronen
Contributor

torronen commented Nov 7, 2022

Suggestions (guesses) for future features:

  • Optimizing the CSV-to-binary conversion, if not already optimized. It has been my bottleneck in Microsoft.ML. Currently I do the conversion on the highest-RAM machine; the other machines use the binary only.
  • Reusing the binary file when training from the same dataset (I think it currently re-creates the binary file on every restart of training).
  • Allowing the train-to-test ratio to be changed. If the dataset is huge and there is not enough memory, maybe users would be willing to move more data to the test set to save memory.
  • I am not sure, but I am guessing there might be something about automatic feature engineering of text that consumes lots of memory. Perhaps it could be disabled in some cases.

Maybe the docs could include a suggestion to add a fixed-size paging file. I had problems when I let Windows manage it, but after adding a fixed 500-600 GB paging file I no longer run out of physical memory with tree-based algorithms. In my datasets it could go as high as 650 GB. I now always put the paging file on 2 physical NVMe drives, and it has been OK. The paging file is able to stripe between the drives, so putting it on 2 drives is faster. I also tried for a few months with 1 TB and 2 TB RAM servers, but they did not have a big impact compared to NVMe paging files, especially considering the cost of such servers. I mostly used Microsoft.ML but also tried with Model Builder.
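
For anyone wanting to script the fixed-size paging file described above, a Windows cmd sketch using the classic wmic interface (sizes are in MB; the 512 GB value and C: location are examples, not recommendations, and a reboot is required afterwards):

```shell
:: Disable automatic paging-file management, then pin a fixed size.
wmic computersystem where name="%computername%" set AutomaticManagedPagefile=False
wmic pagefileset where name="C:\\pagefile.sys" set InitialSize=524288,MaximumSize=524288
```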
