AutoML Regression Experiment Ends Unsuccessfully #6654

Closed
superichmann opened this issue May 4, 2023 · 14 comments
Labels
AutoML.NET Automating various steps of the machine learning process

Comments

@superichmann

superichmann commented May 4, 2023

System Information:

Describe the bug
After updating to the latest build from the dotnet-libraries NuGet feed to pick up the fix for #6565, the same AutoML regression experiments that used to complete in 10 seconds now fail. I increased the training time to 800 seconds and still receive this error:
Optimizing metric: RegressionMetric.MeanAbsoluteError
Train: 1667 rows
Test: 16 rows
Message:

    System.AggregateException: One or more errors occurred. (One or more errors occurred. (Training time finished without completing a successful trial. Either no trial completed or the metric for all completed trials are NaN or Infinity)) (One or more errors occurred. (Training time finished without completing a successful trial. Either no trial completed or the metric for all completed trials are NaN or Infinity)) ... ---> System.AggregateException: One or more errors occurred. (Training time finished without completing a successful trial. Either no trial completed or the metric for all completed trials are NaN or Infinity) ---> System.TimeoutException: Training time finished without completing a successful trial. Either no trial completed or the metric for all completed trials are NaN or Infinity

  Stack Trace: 
    AutoMLExperiment.RunAsync(CancellationToken ct)
    --- End of inner exception stack trace ---
    Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
    Task`1.GetResultCore(Boolean waitCompletionNotification)
    AutoMLExperiment.Run()
    RegressionExperiment.Execute(IDataView trainData, IDataView validationData, ColumnInformation columnInformation, IEstimator`1 preFeaturizer, IProgress`1 progressHandler)

If there are best practices for the new version, please tell me what they are. Maybe I am doing something wrong.
Expected behavior
At least one model should be found.

Additional context
The exception message is long because the experiments are run from Parallel.ForEach, so one inner exception is collected for each failed experiment.
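The nested message above is what .NET produces when several failing experiments run inside Parallel.ForEach: each failure arrives wrapped in its own AggregateException, and Parallel.ForEach wraps all of them again. Below is a minimal plain .NET sketch (not ML.NET code) that collapses such an exception into one line per distinct message using AggregateException.Flatten(). The Demo method and its simulated TimeoutException are hypothetical stand-ins for the real experiments.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ExceptionSummary
{
    // Flatten() unwraps nested AggregateExceptions. Grouping by type and message
    // then turns many identical failures into a single line with a count.
    public static List<string> Summarize(AggregateException ex) =>
        ex.Flatten().InnerExceptions
          .GroupBy(e => $"{e.GetType().Name}: {e.Message}")
          .Select(g => $"{g.Count()} x {g.Key}")
          .ToList();

    // Hypothetical demo: every simulated experiment fails the same way. Each failure
    // is wrapped in an AggregateException, as blocking on a Task-based Run() does.
    public static List<string> Demo(int experimentCount)
    {
        try
        {
            Parallel.ForEach(Enumerable.Range(0, experimentCount), i =>
            {
                throw new AggregateException(new TimeoutException(
                    "Training time finished without completing a successful trial."));
            });
        }
        catch (AggregateException ex)
        {
            return Summarize(ex);
        }
        return new List<string>();
    }
}
```

The count prefix can vary from run to run, because Parallel.ForEach stops starting new iterations after the first failure. The distinct message stays the same.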

@ghost ghost added the untriaged New issue has not been triaged label May 4, 2023
@superichmann superichmann changed the title AutoML Regression Experiment ends Without Successfull Trial AutoML Regression Experiment ends With Unsuccessful Trial May 4, 2023
@superichmann superichmann changed the title AutoML Regression Experiment ends With Unsuccessful Trial AutoML Regression Experiment Ends Unsuccessfully May 4, 2023
@LittleLittleCloud
Contributor

LittleLittleCloud commented May 5, 2023

It's probably because your dataset is large and needs more time to run (maybe it has a lot of columns?), or because all trials' results are NaN. You can set MaxModels in the experiment settings to 1 to see how long it takes to finish one trial.

@superichmann
Author

I changed the number of threads to 1, with 12 seconds per experiment, and it worked. The problem is that I have 2000 time series, so training will take 12 × 2000 seconds (about 6.7 hours) :\ It also produced a worse score than the old (non-nightly) AutoML running with many threads, which finished after 20 minutes.
I don't know what to do.

@LittleLittleCloud
Contributor

@superichmann Let me see if I understand your question correctly: the exception goes away when you use the latest nightly build, and it takes about 12 seconds for the first thread to finish. The problem is that the training result for the first trial is really bad, and you don't want to increase the budget because you have too many datasets to run?

@JakeRadMSFT JakeRadMSFT added the AutoML.NET Automating various steps of the machine learning process label May 5, 2023
@superichmann
Author

The 12-second run is single-threaded, and it finishes training every time. The problem is that some runs explored only 1 model, so I might need to raise the time.

In general, I would like to finish all of the training as fast as possible without hurting accuracy, so I am trying to run it on multiple threads. That might be bad practice :/ What do you think?

@LittleLittleCloud
Contributor

LittleLittleCloud commented May 7, 2023

It's questionable whether multi-threaded training would boost training speed. AFAIK some trainers already use multiple threads to train a model. So you might want to configure the maximum number of parallel threads carefully, to make sure the total number of running threads won't exceed the number of CPU threads.
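One way to act on this advice is sketched below in plain .NET (not an ML.NET API). It caps how many experiments Parallel.ForEach runs at once, so that the number of experiments times the threads each trainer uses stays within Environment.ProcessorCount. The threadsPerExperiment value is an assumption you have to choose or measure yourself; the thread does not say how many threads a given trainer uses. The ThreadBudget class and its methods are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThreadBudget
{
    // Number of experiments to run side by side so that
    // experiments * threadsPerExperiment does not exceed the logical core count.
    // threadsPerExperiment is an assumption: measure or choose it yourself.
    public static int MaxConcurrentExperiments(int logicalCores, int threadsPerExperiment)
    {
        if (logicalCores < 1) throw new ArgumentOutOfRangeException(nameof(logicalCores));
        if (threadsPerExperiment < 1) throw new ArgumentOutOfRangeException(nameof(threadsPerExperiment));
        return Math.Max(1, logicalCores / threadsPerExperiment);
    }

    // Runs one (hypothetical) experiment per series under that cap and
    // returns the highest number of experiments observed running at once.
    public static int RunAll(int seriesCount, int threadsPerExperiment, Action<int> runExperiment)
    {
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = MaxConcurrentExperiments(Environment.ProcessorCount, threadsPerExperiment)
        };
        int running = 0, peak = 0;
        Parallel.ForEach(Enumerable.Range(0, seriesCount), options, i =>
        {
            int now = Interlocked.Increment(ref running);
            // Raise `peak` to `now` if it is higher, using a compare-and-swap loop.
            int seen;
            while (now > (seen = Volatile.Read(ref peak)) &&
                   Interlocked.CompareExchange(ref peak, now, seen) != seen)
            {
            }
            try { runExperiment(i); }
            finally { Interlocked.Decrement(ref running); }
        });
        return peak;
    }
}
```

For example, with 16 logical cores and trainers that each use about 4 threads, MaxConcurrentExperiments(16, 4) returns 4. That means 4 experiments run at a time instead of 16.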

Meanwhile, in the multi-threaded case, it might be helpful to set the budget using MaxModels instead of TrainingTimeInSeconds. 20 should be a good number for the CFO tuner to find decent parameters in most situations.

Also, may I ask for more detailed information about the 2000 time series datasets? Are they just different datasets that you want to use to train 2000 different models for your project?

@superichmann
Author

superichmann commented May 8, 2023

Thanks @LittleLittleCloud for the valuable info! I am running with the maximum CPU count (16), but as I wrote here, it doesn't work well. Maybe I will reduce it to 4 or 8.
As I saw in ExperimentSettings.cs on the main branch, MaxModels is an internal int and cannot be set through the constructor :| The default constructor sets MaxModels = int.MaxValue;

I am trying it out on this data from a Kaggle competition. It's store sales data covering ~40 stores, each with ~40 departments.
My current strategy is to slice it by store_nbr and department, which leaves about 1780 time series. Before I sliced the data, the score was much lower, both in AutoML and on the Kaggle submission. What do you think?

I guess I should play with ForecastBySsa, but I really don't understand how it works :| It also doesn't have AutoML support, am I right?

@LittleLittleCloud
Contributor

LittleLittleCloud commented May 8, 2023

@superichmann I just exposed MaxModels as public in this PR [#6663]. Sorry for the confusion.

I guess I should play with ForecastBySsa, but I really don't understand how it works :| It also doesn't have AutoML support, am I right?

ForecastBySsa is used for univariate forecasting, while your dataset seems to be multivariate, so ForecastBySsa might not be the trainer you're looking for. Since categorical features like store_nbr and department are provided, maybe you can try tree-based methods like FastTree or LightGBM.

Data pre-processing

You can turn a time-series forecasting problem into a regression problem by using previous sales data as features. For example, if you want to predict the sales amount on day t, you can use the sales amounts on days t-1, t-2, ..., t-n as features. Adding meta-information such as dayOfWeek, isPaymentDay, or the sales amounts of related goods should also help in predicting the sales amount.
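Below is a minimal plain C# sketch of that transformation. It assumes one gap-free daily series per store (or per store and family). LagRow and LagFeatures.Build are hypothetical names, not part of ML.NET. The resulting rows would still need to be loaded into an IDataView before AutoML can use them. An isPaymentDay flag or the sales of related products could be added to each row the same way as the day of week.

```csharp
using System;
using System.Collections.Generic;

// One regression row: lagged sales, a simple calendar feature, and the label.
public sealed record LagRow(DateTime Date, float[] Lags, float DayOfWeekIndex, float Label);

public static class LagFeatures
{
    // The label is sales on day t; the features are sales on days t-1 ... t-n
    // plus the day of week (Sunday = 0). Assumes sales[i] is the value for
    // start.AddDays(i), with no missing days.
    public static List<LagRow> Build(DateTime start, IReadOnlyList<float> sales, int n)
    {
        if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));
        var rows = new List<LagRow>();
        for (int t = n; t < sales.Count; t++)
        {
            var lags = new float[n];
            for (int k = 1; k <= n; k++)
                lags[k - 1] = sales[t - k]; // lags[0] is day t-1, lags[n-1] is day t-n
            var date = start.AddDays(t);
            rows.Add(new LagRow(date, lags, (float)date.DayOfWeek, sales[t]));
        }
        return rows;
    }
}
```

The first n days produce no rows, because their lags would reach back before the start of the series.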

Also, I like your idea of creating a different model for each store and family. But I'd probably try creating a separate model for each store first, as the sales data of other product families might also be useful (for example, sales of MEAT and DELI might be related).
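The two slicing strategies can be compared with a small LINQ sketch. SalesRow is a hypothetical row type loosely modeled on the store_nbr and family columns mentioned above. Each resulting list would become the training data for one model.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row shape for the store sales data.
public sealed record SalesRow(int StoreNbr, string Family, DateTime Date, float Sales);

public static class Partitioning
{
    // One training set per store. Every product family of that store stays together,
    // so a per-store model can use cross-family signals (e.g. MEAT vs DELI).
    public static Dictionary<int, List<SalesRow>> ByStore(IEnumerable<SalesRow> rows) =>
        rows.GroupBy(r => r.StoreNbr)
            .ToDictionary(g => g.Key, g => g.ToList());

    // Finer slicing: one training set per (store, family) pair, as in the original strategy.
    public static Dictionary<(int StoreNbr, string Family), List<SalesRow>> ByStoreAndFamily(IEnumerable<SalesRow> rows) =>
        rows.GroupBy(r => (r.StoreNbr, r.Family))
            .ToDictionary(g => g.Key, g => g.ToList());
}
```

With ~40 stores and ~40 families, ByStore yields about 40 models. ByStoreAndFamily yields about 1600, and each of those sees far less data.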

@superichmann
Author

superichmann commented May 9, 2023

Thanks, I am now experimenting with MaxModels.

In some experiments, I see numerous identical SELECT statements from AutoML in my database (I use CreateDatabaseLoader). Maybe you should add a cache to AutoML that is shared across all experiments?

After checking, this particular hang was caused by RefitBestPipeline. I don't know why. I tried again, and now everything is fine.

Still, it might be good to add a cache for data loaded from the database.

@LittleLittleCloud
Contributor

@superichmann Maybe you can try caching the dataset before sending it to AutoML, so it won't retrieve data from the database for each trial?

var cachedTrain = context.Data.Cache(input, ...);

@superichmann
Author

superichmann commented May 9, 2023

From a quick experiment, it works, and my runtime is also 3 times faster :] The problem is that the scores got worse :[ I'm talking about a further test that I run myself on later data, fitting the final estimator and evaluating it on unseen data.

Any explanation for that? Thanks again for all your help 💯

var experimentSettings = new RegressionExperimentSettings();
experimentSettings.MaxModels = 30;
experimentSettings.OptimizingMetric = RegressionMetric.MeanAbsoluteError;
experimentSettings.CacheBeforeTrainer = CacheBeforeTrainer.Off;
experimentSettings.CacheDirectoryName = null;
RegressionExperiment experiment = mlContext.Auto().CreateRegressionExperiment(experimentSettings);
idvTrain = mlContext.Data.Cache(idvTrain);
idvTest = mlContext.Data.Cache(idvTest);
experimentResult = experiment.Execute(idvTrain, idvTest, DEFAULT_TARGET_COL, preFeaturizer: preDoubleToSingle);

@LittleLittleCloud
Contributor

@superichmann Can you tell me how much worse it is? In the meantime, can you share the best result and trainer you got with the old AutoML?

@superichmann
Author

I can't reproduce it for some reason :\ I will update when I can. Is there any other cache mechanism that I am not aware of? Is there anything I need to do when initializing MLContext? I just create a new one:
private static MLContext mlContext = new MLContext();
and I get different results every time for similar experiments.

Another thing: with RegressionMetric.MeanAbsoluteError, AutoML gives me a model with a score of 0 that predicts unseen data well. Any explanation?

@LittleLittleCloud
Contributor

Is there anything I need to do when initializing mlcontext? I just create new()
private static MLContext mlContext = new MLContext();
and all the time I get different results for similar experiments.

@superichmann You can pass a seed to the MLContext constructor, for example new MLContext(seed: 1).

Another thing that happened, on RegressionMetric.MeanAbsoluteError AutoML I get model with score: 0 that predicts unseen data in a good way.. any explanation?

I notice that in your test dataset, all sales values are 0 except in the 12th row, so that might be why the MAE is 0.
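The effect follows from the MAE formula, MAE = (1/n) Σ |yᵢ − ŷᵢ|. In the sketch below, the 16-row test set is all zeros except the 12th row. The value 8 in that row is made up for illustration, since the thread does not give it.

```csharp
using System;
using System.Linq;

public static class MaeCheck
{
    // Mean absolute error: the average of |actual - predicted| over all rows.
    public static double MeanAbsoluteError(double[] actual, double[] predicted)
    {
        if (actual.Length == 0 || actual.Length != predicted.Length)
            throw new ArgumentException("Inputs must be non-empty and of equal length.");
        return actual.Zip(predicted, (y, p) => Math.Abs(y - p)).Average();
    }

    // Hypothetical 16-row test set: all zeros except the 12th row (index 11).
    public static double[] SampleTestLabels()
    {
        var labels = new double[16];
        labels[11] = 8.0; // made-up value for illustration
        return labels;
    }
}
```

A model that always predicts 0 misses only one row, so its MAE is 8 / 16 = 0.5. A model that also gets that one row right scores exactly 0. A 16-row, mostly-zero validation set therefore probably says little about how well a model will do on other unseen data.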

@michaelgsharp
Member

Closing since it seems the issue is resolved.

@ghost ghost removed the untriaged New issue has not been triaged label Jan 24, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Feb 24, 2024
4 participants