-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Early Draft specs doc for DatabaseLoader in ML.NET #3857
Early Draft specs doc for DatabaseLoader in ML.NET #3857
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! I did a pass and left some questions.
docs/specs/mlnet-database-loader/mlnet-database-loader-specs.md
Outdated
Show resolved
Hide resolved
docs/specs/mlnet-database-loader/mlnet-database-loader-specs.md
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some observations. Mostly that one could build so that any database with an ADO.NET provider works and that it's possible for the library code to take a dependency only to the factory methods and code, such as DbCommand
, DbConnection and so on. Also pointers to concrete, functioning code that does this and seem provide the desired interface of var result = await db.QueryAsync<TResult>(SELECT * FROM Something);
.
Also, maybe this way building the retries could be the problem of external libraries such as Polly or whatever one wants to use. Or provide oneself a retry function (say, a sigmoid shaped one).
- `NpgsqlConnection` (PostgreSQL) | ||
- `OracleConnection` (Oracle) | ||
- `MySqlConnection` (MySql) | ||
- etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could just note DbConnection and note you are compatible with any database that has a ADO.NET provider, I think. It is https://github.com/dotnet/orleans/blob/d9a22985c6b066837c7e1383bf5b1c672f560f4a/src/AdoNet/Shared/Storage/RelationalStorage.cs#L255 much code to do plus some odd pieces to load the connector library and do like var result = await storage.ExecuteQueryAsync<TResult>(SELECT * FROM xyz);
.
You can see an example of ADO.NET at https://blogs.msdn.microsoft.com/adonet/2012/07/15/using-sqldatareaders-new-async-methods-in-net-4-5-part-2-examples/ that streams photos (that might be a case for this framework).
- `MySqlConnection` (MySql) | ||
- etc. | ||
|
||
The specific type (`SqlConnection`, `NpgsqlConnection`, `OracleConnection`, etc.) is specified as a generic parameter so references to the needed NuGet packages will be provided by the user when trying to compile. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add another comment to reiterate do not need to take a dependency on these concrete types in the libraries. Just in case. Only connection string is needed and some other piece of knowledge which connector library to load if the user doesn't want to provide it (you probably want to have the loading mechanism to be the same nevertheless).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those types would be specified in the user's code, not in ML.NET code. Our parameters are Interfaces or base classes.
We prefer the users to need to provide a type (such as SqlConnection
, NpgsqlConnection
, OracleConnection
) in their code so at that time they'll need to provide the dependency to their specific chosen database provider, that's the only reason we're specifying to provide a DbConnection type instead of a string with the provider such as "SQLServerProvider" or "OracleProvider", but then they'll need to know what NuGet assembly to add, etc. If not provided properly they'll get an execution exception. Instead, by providing a type (such as NpgsqlConnection
) it won't compile until you add the needed NuGet package.
MLContext mlContext = new MLContext(); | ||
|
||
string conString = YOUR_CONNECTION_STRING; | ||
using (SqlConnection connection = new SqlConnection(conString)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just want to point out that if you don't want to take a dependency on a specific ADO.NET library, you can use the factory functions and wrap it like at https://github.com/dotnet/orleans/blob/d9a22985c6b066837c7e1383bf5b1c672f560f4a/src/AdoNet/Shared/Storage/RelationalStorage.cs#L26. It looks to me that .LoadFromDbRead
looks like https://github.com/dotnet/orleans/blob/d9a22985c6b066837c7e1383bf5b1c672f560f4a/src/AdoNet/Shared/Storage/RelationalStorageExtensions.cs#L186, which would reduce this to new RelationalStorage
and then using it to do the queries like var result = await storage.ReadAsync(Select * from InputMLModelDataset
). There are test for timeout, streaming etc. too if you want to take a look. I don't say you need to do it like this, but it's close to minimum code you can do and the interface is what noted here. It can also support IAsyncEnumerable
trivially.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very interesting. 👍 Not sure if we want to go that path, but we'll evaluate it, for sure.
It might be interesting for our internal implementation code so we wouldn't need to take direct dependency on specific ADO.NET providers (as opposed to user's code where it is safer to take dependency to catch errors in compilation time). Thanks for these pointers to Orleans project! 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@CESARDELATORRE No problem. To be clear, I don't mean you need to use that code or do something similar. Maybe more like give ideas how it could look like for what I think is a fairly minimal way to create something that produces at the end strongly typed classes while supporting all the features the various databases can have.
I wrote that part of the code so I'm somewhat familiar with the problems. For instance, some drivers (or their versions) have bugs that they are not necessarily asynchronous. Or they just don't support streaming without some extra work. Maybe the largest thing here is that I indeed wouldn't like to see this tied to any specific ORM since what would happen is that I'd pipe plain SQL through it and use for serializing data and likely end up with some inefficiencies along the way (multiple resultsets?). I don't know if people want to bring their own ORMs or just write efficient SQL that likely involves analytic/windowing functions too.
That "efficient SQL" covers also extensions like https://www.pipelinedb.com/ (PostgreSQL) or PolyBase etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @CESARDELATORRE ,
I think that as a pre-requisite for this, we should fix IDataView LoadFromEnumerable so it works for all Enumerables. Developers can easily make Enumerables from almost anything - including any database.
Then we can create a set of helpers to create Enumerables from abstract 'Databases' (with concrete implementations). This sounds like the kind of problem that has been solved many times before? Once this is implemented then this can just feed the resulting Enumerables to LoadFromEnumerable... This is how AutoML would feed in training and test data (from databases or any kind of structured store?).
LoadFromEnumerable doesn't always work at the moment because code further on assumes that the Enumerable-backed IDataView is thread-safe - this is not always the case (ex. EF Core Context). See: https://github.com/endintiers/Unearth.Demo.ML.FromDB
I understand why limiting concurrency for trainers was a bad idea. We shouldn't. In the cases when training from huge amounts of data (in a DB for example) it is slooowww.
So I propose starting by adding a switch to LoadFromEnumerable 'singleThreaded=true' which would intercept IDataView cache update requests and force them onto a single thread. This may slow down the trainers a bit but nowhere near as much as single-threading them (where possible) and we can improve performance with eager loading.
Doing that first would remove a lot of complexity from the DatabaseLoader task (threading no longer an issue).
- Chris
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should get the initial draft in, as we are starting the work on this feature. We can iterate on it as we learn more doing the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's enough for me to get started (the objective is correct, some interesting thinking points). Bit cheeky but I'll be a heavy consumer. In any case: "code is truth" lets see what v0.1 looks like!
I add to the comments about some specific pain points I have had and seen others have:
So something like
Instead of using In that sense a small core and a set of extension methods seem to cut the problem and maybe lead one to the idea of extending the core (like having a heuristic on how to bulk load data). Not everyone likes composing with functions and extensions, however. |
* Early Draft specs doc for DatabaseLoader in ML.NET * Specifying implementaion as .NET Standard 2.0 * Support connect from within the RDBMS server * Updated changes related to embeded connection * Updates on Azure SQL DB support * Minor updates on SQL CLR * Minor typo fixed * Refactored some code naming
Early Draft specs doc for a
DatabaseLoader
component in ML.NET.Feel free to provide feedback. This specs document will be evolving significantly since it is in a very early draft state.
Main objective is to load data into an IDataView from relational databases (such as SQL Server, Azure SQL Database, PostgreSQL, MySQL, Oracle, etc.) in a very easy way, one line of code in most cases by simply specifying the connection string, the database
table/view/sql-sentence
and what database provider/connection type it is using.Currently in ML.NET 1.0 or 1.1 we can only do the following:
MLContext.Data.LoadFromTextFile()
, etc.MLContext.Data.LoadFromEnumerable()
, usually coming from a database, but the developer/user is responsible for how to load/query the database. It is not straightforward such as when reading from a file.High level design goal:
The intention of the new component is to make a lot easier to implement 'model training scenarios pulling/streaming data from large database tables' while transparently solving complicated cases such as 'transient errors' in the cloud (database connections broken as a result) when using Azure SQL Database.