AutoML Regression Experiment Ends Unsuccessfully #6654

superichmann · 2023-05-04T14:55:14Z

System Information:

OS & Version: Windows 10
microsoft.ml\3.0.0-preview.23229.2
microsoft.ml.automl\0.21.0-preview.23229.2
microsoft.ml.onedal\0.21.0-preview.23229.2
updated from https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet-libraries/nuget/v3/index.json
.NET Version: e.g. .NET 6.0

Describe the bug
After updating to the latest build from the dotnet-libraries NuGet feed to pick up the fix for #6565, the same AutoML regression experiments that used to complete in 10 seconds no longer complete. I increased the time to 800 seconds and still receive the error below:
RegressionMetric.MeanAbsoluteError
train: 1667 rows
test: 16 rows
Message:

    System.AggregateException: One or more errors occurred. (One or more errors occurred. (Training time finished without completing a successful trial. Either no trial completed or the metric for all completed trials are NaN or Infinity)) [the identical parenthesized inner message repeats many more times] ---> System.AggregateException: One or more errors occurred. (Training time finished without completing a successful trial. Either no trial completed or the metric for all completed trials are NaN or Infinity) ---> System.TimeoutException: Training time finished without completing a successful trial. Either no trial completed or the metric for all completed trials are NaN or Infinity

  Stack Trace: 
    AutoMLExperiment.RunAsync(CancellationToken ct)
    --- End of inner exception stack trace ---
    Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
    Task`1.GetResultCore(Boolean waitCompletionNotification)
    AutoMLExperiment.Run()
    RegressionExperiment.Execute(IDataView trainData, IDataView validationData, ColumnInformation columnInformation, IEstimator`1 preFeaturizer, IProgress`1 progressHandler)

If there are best practices for the new version, please tell me what they are. Maybe I am doing something wrong.

Expected behavior
At least one model should be found.

Additional context
The exception message is this long because the trials are launched from Parallel.ForEach.


LittleLittleCloud · 2023-05-05T00:48:54Z

It's probably because your dataset is large and needs more time to run (maybe it has a lot of columns?), or because every trial's result is NaN. You can set MaxModels to 1 in the experiment settings to see how long a single trial takes to finish.

superichmann · 2023-05-05T07:16:24Z

I changed the number of simultaneous threads to 1 with a 12-second budget and it worked. The problem is that I have 2000 time series, so the training time will be 12*2000 seconds :\ It also produced a worse score than the old AutoML (non-nightly) with many threads, which finished running after 20 minutes.
I don't know what to do.

LittleLittleCloud · 2023-05-05T17:05:47Z

@superichmann Let me see if I understand your question correctly: the exception goes away when you use the latest nightly build, and it takes about 12 seconds for the first thread to finish. The problem is that the training result is really bad for the first trial, and you don't want to increase the budget because you have too many datasets to run?

superichmann · 2023-05-07T16:06:22Z

The 12-second trial is single-threaded and it finishes training every time. The problem is that some of the runs explored only 1 model, so I might need to raise the time.

In general, I would like to finish all of the training as fast as possible without hurting accuracy, so I am trying to run it on multiple threads, but that might be bad practice :/ What do you think?

LittleLittleCloud · 2023-05-07T18:46:30Z

It's questionable whether multi-threaded training would boost training speed. AFAIK some trainers already use multiple threads to train a model, so you might want to configure the maximum number of parallel threads carefully, to make sure the total number of running threads doesn't exceed the number of CPU threads.

Meanwhile, in the multi-threaded case, it might be helpful to set the budget using MaxModels instead of TrainingTimeInSeconds. 20 should be a good number for the CFO tuner to find decent parameters in most situations.

Also, may I ask for more detail about the 2000 time series dataset? Are they just different datasets that you want to use to train 2000 different models for your project?
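To make the two suggestions above concrete, here is a minimal sketch that caps the number of parallel experiments and budgets each one by model count rather than by time. It assumes a build in which MaxModels is publicly settable; `SliceTrainer`, `TrainAll`, and the "a quarter of the cores" cap are hypothetical choices for illustration, not recommendations from the ML.NET team.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.ML;
using Microsoft.ML.AutoML;

public static class SliceTrainer
{
    // Hypothetical helper: runs one AutoML regression experiment per data slice.
    public static void TrainAll(
        MLContext mlContext,
        IReadOnlyList<(IDataView Train, IDataView Validation)> slices,
        string labelColumn)
    {
        // Assumption: leave headroom because some trainers are multi-threaded
        // themselves. The divisor 4 is arbitrary and should be tuned.
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = Math.Max(1, Environment.ProcessorCount / 4),
        };

        Parallel.ForEach(slices, options, slice =>
        {
            // Budget by number of models instead of wall-clock time, so a busy
            // machine does not end an experiment before any trial completes.
            var settings = new RegressionExperimentSettings
            {
                MaxModels = 20,
                OptimizingMetric = RegressionMetric.MeanAbsoluteError,
            };

            var experiment = mlContext.Auto().CreateRegressionExperiment(settings);
            var result = experiment.Execute(slice.Train, slice.Validation, labelColumn);

            Console.WriteLine(
                $"{result.BestRun.TrainerName}: MAE = {result.BestRun.ValidationMetrics.MeanAbsoluteError}");
        });
    }
}
```

With a model-count budget, each slice explores the same number of candidates regardless of how loaded the machine is, which should also make runs easier to compare.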

superichmann · 2023-05-08T09:09:43Z

Thanks @LittleLittleCloud for the valuable info! I am running on the maximum number of CPU threads (16), but as I wrote here, it doesn't work well. Maybe I will reduce it to 4 or 8.
As far as I can see in ExperimentSettings.cs on the main branch, MaxModels is an internal int and cannot be set through the constructor :| In the default constructor it is set to MaxModels = int.MaxValue;

I am trying it out on this data from a Kaggle competition. It's store sales data covering ~40 stores that each have ~40 departments.
My current strategy is to slice it by store_nbr and department, which leaves us with ~1780 time series. Before I sliced the data, the score was much lower both in AutoML and on the Kaggle submission. What do you think?

I guess I should play with ForecastBySsa, but I really don't understand how it works :| Also, it does not have AutoML support, am I right?

LittleLittleCloud · 2023-05-08T19:33:00Z

@superichmann I just exposed MaxModels as public in this PR [#6663]. Sorry for the confusion.

> I guess I should play with ForecastBySsa, but I really don't understand how it works :| Also, it does not have AutoML support, am I right?

ForecastBySsa is used for univariate forecasting, while your dataset seems to be multivariate, so ForecastBySsa might not be the trainer you're looking for. Considering that categorical features like store_nbr and department are provided, maybe you can try a tree-based method like FastTree or LightGBM.

Data pre-processing

You can turn a time-series forecasting problem into a regression problem by using previous sales data as features. For example, if you want to predict the sales amount on day t, you can use the sales amounts on days t-1, t-2, ..., t-n as features. Adding meta-information such as dayOfWeek, isPaymentDay, and the sales amounts of related goods should also help in predicting the sales amount.
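The lag-feature idea above can be sketched in plain C#, independent of ML.NET. The `LagFeatures` class and `Build` method are hypothetical names; the sketch only shows how a single series becomes (features, label) rows, and meta-features such as dayOfWeek would be appended to each row in the same way.

```csharp
using System;
using System.Collections.Generic;

public static class LagFeatures
{
    // Turns one sales series into regression rows: the features are the
    // previous `lags` values and the label is the value on day t.
    public static List<(float[] Features, float Label)> Build(
        IReadOnlyList<float> sales, int lags)
    {
        if (lags < 1) throw new ArgumentOutOfRangeException(nameof(lags));

        var rows = new List<(float[] Features, float Label)>();

        // The first `lags` days have no complete history, so they are skipped.
        for (int t = lags; t < sales.Count; t++)
        {
            var features = new float[lags];
            for (int k = 1; k <= lags; k++)
            {
                // features[0] is day t-1, features[1] is day t-2, and so on.
                features[k - 1] = sales[t - k];
            }
            rows.Add((features, sales[t]));
        }

        return rows;
    }
}
```

For a series of 5 days and 2 lags, this produces 3 rows; the first row uses days 1 and 2 as features and day 3 as the label.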

Also, I like your idea of creating different models for different stores and families. But I'd probably try creating different models for different stores first, as the sales data for other product families might also be useful (for example, sales data for MEAT and DELI might be related).

superichmann · 2023-05-09T12:16:43Z

Thanks, I am now experimenting with MaxModels.

In some experiments, I see in my database (I use CreateDatabaseLoader) numerous SELECTs with exactly the same statement coming from AutoML. Maybe you should add a cache to AutoML across all experiments?

After checking, this particular hang was caused by RefitBestPipeline. I don't know why. I tried again and all is good.

Still, maybe it would be good to add a cache for the data from the database.

LittleLittleCloud · 2023-05-09T17:10:08Z

@superichmann Maybe you can try caching the dataset before sending it to AutoML, so it won't retrieve data from the database for each trial?

var cachedTrain = context.Data.Cache(input, ...);

superichmann · 2023-05-09T18:34:06Z

From a quick experiment, it works, and my runtime is also 3 times faster :] The problem is that the scores got worse :[ (I am talking about a further test that I run myself on later data, by fitting the final estimator to unseen data.)

Any explanation for that? Thanks again for all your help 💯

var experimentSettings = new RegressionExperimentSettings();
experimentSettings.MaxModels = 30;
experimentSettings.OptimizingMetric = RegressionMetric.MeanAbsoluteError;
experimentSettings.CacheBeforeTrainer = CacheBeforeTrainer.Off;
experimentSettings.CacheDirectoryName = null;
RegressionExperiment experiment = mlContext.Auto().CreateRegressionExperiment(experimentSettings);
idvTrain = mlContext.Data.Cache(idvTrain);
idvTest = mlContext.Data.Cache(idvTest);
experimentResult = experiment.Execute(idvTrain, idvTest, DEFAULT_TARGET_COL, preFeaturizer: preDoubleToSingle);

LittleLittleCloud · 2023-05-09T18:55:38Z

@superichmann Can you tell me how much worse it is? In the meantime, can you share the best result and trainer you got when using the old AutoML?

superichmann · 2023-05-09T19:25:51Z

I can't reproduce it for some reason :\ I will update when I can. Is there any other cache mechanism that I am not aware of? Is there anything I need to do when initializing MLContext? I just create it with new():
private static MLContext mlContext = new MLContext();
and I get different results for similar experiments all the time.

Another thing that happened: with RegressionMetric.MeanAbsoluteError, AutoML gives me a model with a score of 0 that predicts unseen data well. Any explanation?

LittleLittleCloud · 2023-05-09T19:35:03Z

> Is there anything I need to do when initializing MLContext? I just create it with new():
> private static MLContext mlContext = new MLContext();
> and I get different results for similar experiments all the time.

@superichmann You can pass a seed to the MLContext constructor.
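A minimal sketch of the seed suggestion follows. The value 42 is arbitrary. A fixed seed makes ML.NET's random components deterministic, but it may not remove every source of variation, for example when trials run in parallel or the budget is time-based.

```csharp
using Microsoft.ML;

// Fixed seed (42 is an arbitrary choice) so that repeated runs start from
// the same random state instead of a different one each time.
var mlContext = new MLContext(seed: 42);
```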

> Another thing that happened: with RegressionMetric.MeanAbsoluteError, AutoML gives me a model with a score of 0 that predicts unseen data well. Any explanation?

I notice that in your test dataset, all sales values are 0 except in the 12th row. That might be why the prediction MAE is 0.

michaelgsharp · 2024-01-24T18:04:08Z

Closing since it seems the issue is resolved.

ghost added the untriaged New issue has not been triaged label May 4, 2023

superichmann changed the title ~~AutoML Regression Experiment ends Without Successfull Trial~~ AutoML Regression Experiment ends With Unsuccessful Trial May 4, 2023

superichmann changed the title ~~AutoML Regression Experiment ends With Unsuccessful Trial~~ AutoML Regression Experiment Ends Unsuccessfully May 4, 2023

superichmann mentioned this issue May 4, 2023

AutoML Regression Experiment Crash #6644

Closed

JakeRadMSFT assigned LittleLittleCloud May 5, 2023

JakeRadMSFT added the AutoML.NET Automating various steps of the machine learning process label May 5, 2023

michaelgsharp closed this as completed Jan 24, 2024

ghost removed the untriaged New issue has not been triaged label Jan 24, 2024

github-actions bot locked and limited conversation to collaborators Feb 24, 2024
