-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AggregateException/InvalidOperationException when training from IEnumerable backed by EF Core Context #2159
Comments
Thank you for reporting this.
this way we would read all cache all data in memory, which would speed up training, and also allow us properly shuffle data during training. |
OK, I see what AppendCacheCheckpoint does - it caches the whole Enumerable (gradually) so the cache is used for re-reads/shuffles. The 'Airlines' data is just some public test data I can use for demos. I have more than a billion rows of real data in an Azure SQL DB that I am using in my day job. Even if I'm training on a small subset of this (I'm using 1%) I don't think caching it in memory is a good idea... I'd like to be able to give the trainer a 100K rows at a time or somesuch and dispose of it before moving on to the next tranche. I have been trying to figure out a fix for myself but there is a lot of code to understand... Could it be that the IDataView needs to determine if CanShuffle is true for this source? But since I used 'Where' the underlying Enumerable is seen as a Linq object? So therefore (for now) ensure that CanShuffle is false for all Streaming IDataViews… Maybe later [insert clever code here] make more nuanced decisions... |
The glass is still cloudy, but I have noticed that the LinearTrainerBase.PrepareDataFromTrainingExamples method in SdcaBinary.cs always sets ForceShuffle to be true when constructing it's RowShufflingTransformer. It ignores the original streaming dataview's CanShuffle flag (which is always false for any streaming dataview). That might make sense - the RowShufflingTransformer is supposed to abstract away the implementation of shuffles. And it works for other streaming sources and the trainer WANTS shuffled data to work good. The actual problem is that EF Core DBContexts aren't thread-safe, so if cursors from multiple threads exhaust the pool and more than one tries to access the Enumerable to get more: 'boom'. Maybe the answer is to add another flag to the source IDataView ('IsNotThreadSafe') and then implement a lock when fetching more pool data (in that specific case)? |
Adding @eerhardt - Please, take a look to this issue. Not directly related to the current DatabaseLoader we're implementing, but we should fix LoadFromEnumerable() when possible, as well. Adding also @divega who mentioned this bug in a related discussion, as well. LoadFromEnumerable doesn't always work properly because it assumes that the Enumerable-backed IDataView is thread-safe - this is not always the case, like when using EF Core DbContext. Related specs PR: #3857 @endintiers proposal is the following: Thoughts? |
@CESARDELATORRE I wouldn't say it's totally unrelated to DatabaseLoader. In my mind the obvious thing is for DatabaseLoader to leverage the existing stack under LoadFromEnumerable and various parts of IDataView and the cache loader... Unless there is some reason that generic database as a source is fundamentally different? But, yes, depending on what the spec said, assuming that the IEnumerable implementation is thread-safe is a bug. |
The exception I get with the original code (using EF Core 2.2) says:
See https://github.com/aspnet/EntityFramework.Docs/blob/master/entity-framework/core/miscellaneous/configuring-dbcontext.md#avoiding-dbcontext-threading-issues for more details. In ML.NET you are using the SDCA algorithm, which uses multiple threads by default. This is in direct conflict with the data source that you are reading from which is prohibiting multi-threaded access. I don't see how something in ML.NET could solve this, there is an inherent mismatch between the algorithm and the data source. You will have to handle it on your own by doing one or more of the following:
As for it working in EF Core 3.0, I see it working as well, but I am not certain it is intended given the above dotnet/efcore#14455 @divega @ajcvickers - any comment here? Should EF Core 3.0 be throwing an exception when multiple threads are accessing the same DbContext, like it did in earlier versions? |
@eerhardt, @CESARDELATORRE we had a very long discussion with Tom and others about this when we were preparing the sample at https://github.com/dotnet/machinelearning-samples/tree/master/samples/csharp/getting-started/DatabaseIntegration. Notice that in https://github.com/dotnet/machinelearning-samples/blob/874a1fec13bbd51c543755048a4da6cfdd618d3a/samples/csharp/getting-started/DatabaseIntegration/DatabaseIntegration/Program.cs#L44-L60 we make sure each new enumeration of the data happen on a separate DbContext and DbConnection. I would recommend that we point customers to this sample when questions like this arise, since we already made the investment to work out all the major kinks there. Re the fact that ML.NET makes the assumption that IEnumerable<T> instances are thread safe, the more I think about it, the more I am convinced that it is just a bug. It is only true for a subset of implementations (e.g. in-memory collections that are not being modified) but as @ajcvickers likes to remind me, all abstractions in .NET are not thread-safe unless explicitly stated. I think besides pointing people to the existing sample, other things the ML.NET team could consider are:
|
Re how this impacts DatabaseLoader, although it shouldn't need to use LoadFromEnumerable() because going directly from a DbDataReader to an IDataView should be more efficient, we will run into the same fundamental problem because DbConnection implementations are not thread-safe, and even worse, many of them don't even support MARS (multiple active result sets). |
100% agree with @divega |
@eerhardt >> I don't see how something in ML.NET could solve this Something like that anyway :-) There has to be a queue for cache update requests. |
@divega @ajcvickers >> need to use LoadFromEnumerable() Although (thinking) since all that DatabaseLoader is doing is trying to make it easier for devs to create an IEnumerable (the end game here) from some database, adding an additional method at the IDataView level could be seen as needlessly making that interface more complex. The 'database wizard' stuff could easily sit on top of LoadFromEnumerable? (am I missing some subtlety here?) |
@endintiers Maybe I wasn't clear, but this is exactly what I meant when I said:
The thing is that with most databases using multiple connections can be a cheaper approach than marshalling, at least from the perspective the number of changes that user code or ML.NET may need (finding out if it is ultimately more efficient at runtime would require some testing, but database drivers are usually optimized to be used this way).
Possibly what you are missing is why we want to skip IEnumerable<T> and LoadFromEnumerable in the database loader: DbDataReader and IDataView are similar in that they work like sliding windows over the data. Neither has the concept of a .Current object as IEnumerator<T> does, hence when reading data there is no need to allocate and initialize an object to be returned by it on every iteration. Going from DbDataReader to IDataView directly can save a lot of runtime work and allocations. |
@eerhardt - But I think that documenting it and pointing to a sample is just a short term workaround. The fact that ML.NET makes the assumption that IEnumerable instances are thread safe feels completely like a bug. The way to fix a bug is by fixing the API implementation not making possible that users will get into trouble. Documenting a workaround for a bug is only a short term workaround. As suggested by Diego, I also believe we need to change/fix the API implementation so it won't be possible for the user to get into this trouble, either by:
Also, for better performance we're going to have the new DatabaseLoader (for relational databases). However, for non-SQL databases or other data sources we still need the LoadFromEnumerable() approach. It is not something we can simply deprecate. My point is that it is better if we create "fences/guards" in the API itself than just documenting it and show a sample because the chances that developers will get into this trouble (like when using EF if not following the example we created) are pretty high... |
@endintiers - Question related to SQL databases but in a different page researching possible customer scenarios and differences in features and issues depending on that: When you use a database to train ML.NET models, what RDBMS do you use? SQL Server? Other? Do you use that database in Azure such as using Azure SQL Database? Or do you use SQL Server or any other RDBMS on-prem? This question is not directly related to this issue so we can follow up by email if you'd like. Can you send me an email to 'cesardl at microsoft.com' or through Twitter to @CESARDELATORRE when possible? - Thanks! :) I'll remove this comment as soon as we're talking offline. :) |
This issue is already documented and we suggest that IEnumerable needs to be thread-safe, I would like to mark this as enhancement and mark this as P2 CC @harishsk |
System information
Issue
The EF Core problem could be addressed by allowing multiple dbcontexts to be provided but this is likely to be an issue with any non thread-safe IDataView source...
Source code / logs
Source at: https://github.com/endintiers/Unearth.Demo.ML.FromDB shows this working in 0.8, failing in 0.9 and a work-around for 0.9 (single-threading the training).
The text was updated successfully, but these errors were encountered: