Early Draft specs doc for DatabaseLoader in ML.NET #3857

CESARDELATORRE · 2019-06-12T19:17:58Z

Early Draft specs doc for a DatabaseLoader component in ML.NET.
Feel free to provide feedback. This specs document will be evolving significantly since it is in a very early draft state.

Main objective is to load data into an IDataView from relational databases (such as SQL Server, Azure SQL Database, PostgreSQL, MySQL, Oracle, etc.) in a very easy way, one line of code in most cases by simply specifying the connection string, the database table/view/sql-sentence and what database provider/connection type it is using.

Currently in ML.NET 1.0 or 1.1 we can only do the following:

Load data from files by using MLContext.Data.LoadFromTextFile(), etc.
Load data from an IEnumerable collection with MLContext.Data.LoadFromEnumerable(), usually coming from a database, but the developer/user is responsible for how to load/query the database. It is not straightforward such as when reading from a file.

High level design goal:
The intention of the new component is to make a lot easier to implement 'model training scenarios pulling/streaming data from large database tables' while transparently solving complicated cases such as 'transient errors' in the cloud (database connections broken as a result) when using Azure SQL Database.

markusweimer

Thanks for this! I did a pass and left some questions.

docs/specs/mlnet-database-loader/mlnet-database-loader-specs.md

veikkoeeva

Some observations. Mostly that one could build so that any database with an ADO.NET provider works and that it's possible for the library code to take a dependency only to the factory methods and code, such as DbCommand, DbConnection and so on. Also pointers to concrete, functioning code that does this and seem provide the desired interface of var result = await db.QueryAsync<TResult>(SELECT * FROM Something);.

Also, maybe this way building the retries could be the problem of external libraries such as Polly or whatever one wants to use. Or provide oneself a retry function (say, a sigmoid shaped one).

veikkoeeva · 2019-06-14T09:48:17Z

docs/specs/mlnet-database-loader/mlnet-database-loader-specs.md

+- `NpgsqlConnection` (PostgreSQL)
+- `OracleConnection` (Oracle)
+- `MySqlConnection` (MySql)
+- etc.


You could just note DbConnection and note you are compatible with any database that has a ADO.NET provider, I think. It is https://github.com/dotnet/orleans/blob/d9a22985c6b066837c7e1383bf5b1c672f560f4a/src/AdoNet/Shared/Storage/RelationalStorage.cs#L255 much code to do plus some odd pieces to load the connector library and do like var result = await storage.ExecuteQueryAsync<TResult>(SELECT * FROM xyz);.

You can see an example of ADO.NET at https://blogs.msdn.microsoft.com/adonet/2012/07/15/using-sqldatareaders-new-async-methods-in-net-4-5-part-2-examples/ that streams photos (that might be a case for this framework).

veikkoeeva · 2019-06-14T09:49:13Z

docs/specs/mlnet-database-loader/mlnet-database-loader-specs.md

+- `MySqlConnection` (MySql)
+- etc.
+
+The specific type (`SqlConnection`, `NpgsqlConnection`, `OracleConnection`, etc.) is specified as a generic parameter so references to the needed NuGet packages will be provided by the user when trying to compile.


I'll add another comment to reiterate do not need to take a dependency on these concrete types in the libraries. Just in case. Only connection string is needed and some other piece of knowledge which connector library to load if the user doesn't want to provide it (you probably want to have the loading mechanism to be the same nevertheless).

Those types would be specified in the user's code, not in ML.NET code. Our parameters are Interfaces or base classes.
We prefer the users to need to provide a type (such as SqlConnection, NpgsqlConnection, OracleConnection) in their code so at that time they'll need to provide the dependency to their specific chosen database provider, that's the only reason we're specifying to provide a DbConnection type instead of a string with the provider such as "SQLServerProvider" or "OracleProvider", but then they'll need to know what NuGet assembly to add, etc. If not provided properly they'll get an execution exception. Instead, by providing a type (such as NpgsqlConnection) it won't compile until you add the needed NuGet package.

veikkoeeva · 2019-06-14T09:56:44Z

docs/specs/mlnet-database-loader/mlnet-database-loader-specs.md

+MLContext mlContext = new MLContext();
+
+string conString = YOUR_CONNECTION_STRING;
+using (SqlConnection connection = new SqlConnection(conString))


Just want to point out that if you don't want to take a dependency on a specific ADO.NET library, you can use the factory functions and wrap it like at https://github.com/dotnet/orleans/blob/d9a22985c6b066837c7e1383bf5b1c672f560f4a/src/AdoNet/Shared/Storage/RelationalStorage.cs#L26. It looks to me that .LoadFromDbRead looks like https://github.com/dotnet/orleans/blob/d9a22985c6b066837c7e1383bf5b1c672f560f4a/src/AdoNet/Shared/Storage/RelationalStorageExtensions.cs#L186, which would reduce this to new RelationalStorage and then using it to do the queries like var result = await storage.ReadAsync(Select * from InputMLModelDataset). There are test for timeout, streaming etc. too if you want to take a look. I don't say you need to do it like this, but it's close to minimum code you can do and the interface is what noted here. It can also support IAsyncEnumerable trivially.

This is very interesting. 👍 Not sure if we want to go that path, but we'll evaluate it, for sure.
It might be interesting for our internal implementation code so we wouldn't need to take direct dependency on specific ADO.NET providers (as opposed to user's code where it is safer to take dependency to catch errors in compilation time). Thanks for these pointers to Orleans project! 👍

@CESARDELATORRE No problem. To be clear, I don't mean you need to use that code or do something similar. Maybe more like give ideas how it could look like for what I think is a fairly minimal way to create something that produces at the end strongly typed classes while supporting all the features the various databases can have.

I wrote that part of the code so I'm somewhat familiar with the problems. For instance, some drivers (or their versions) have bugs that they are not necessarily asynchronous. Or they just don't support streaming without some extra work. Maybe the largest thing here is that I indeed wouldn't like to see this tied to any specific ORM since what would happen is that I'd pipe plain SQL through it and use for serializing data and likely end up with some inefficiencies along the way (multiple resultsets?). I don't know if people want to bring their own ORMs or just write efficient SQL that likely involves analytic/windowing functions too.

That "efficient SQL" covers also extensions like https://www.pipelinedb.com/ (PostgreSQL) or PolyBase etc.

Hi @CESARDELATORRE ,

I think that as a pre-requisite for this, we should fix IDataView LoadFromEnumerable so it works for all Enumerables. Developers can easily make Enumerables from almost anything - including any database.

Then we can create a set of helpers to create Enumerables from abstract 'Databases' (with concrete implementations). This sounds like the kind of problem that has been solved many times before? Once this is implemented then this can just feed the resulting Enumerables to LoadFromEnumerable... This is how AutoML would feed in training and test data (from databases or any kind of structured store?).

LoadFromEnumerable doesn't always work at the moment because code further on assumes that the Enumerable-backed IDataView is thread-safe - this is not always the case (ex. EF Core Context). See: https://github.com/endintiers/Unearth.Demo.ML.FromDB

I understand why limiting concurrency for trainers was a bad idea. We shouldn't. In the cases when training from huge amounts of data (in a DB for example) it is slooowww.

So I propose starting by adding a switch to LoadFromEnumerable 'singleThreaded=true' which would intercept IDataView cache update requests and force them onto a single thread. This may slow down the trainers a bit but nowhere near as much as single-threading them (where possible) and we can improve performance with eager loading.

Doing that first would remove a lot of complexity from the DatabaseLoader task (threading no longer an issue).

Chris

eerhardt

We should get the initial draft in, as we are starting the work on this feature. We can iterate on it as we learn more doing the implementation.

endintiers

That's enough for me to get started (the objective is correct, some interesting thinking points). Bit cheeky but I'll be a heavy consumer. In any case: "code is truth" lets see what v0.1 looks like!

veikkoeeva · 2019-07-16T16:38:49Z

I add to the comments about some specific pain points I have had and seen others have:

Have a quick and easy way to get access to database.
Being able to write complex queries like row pattern recognition or graph/JSON.
Being able to insert data efficiently to the database. I mean something like here or OPEN ROWSET or something alike.
Ability to stream large blobs.

So something like

using(var db = new Database("connectionstring"))
{
    var res = db.QueryAsync<List<dynamic>>("SELECT * FROM Table1; SELECT * FROM Table2;");
}

Instead of using dynamic could be a class too (since there's multiple result sets, one might want to provide a Func to construct the objects (see the already linked pieces)). This way there's a quick way without too much scafolding and thinking to get working with a database and one can transport the queries across the tools.

In that sense a small core and a set of extension methods seem to cut the problem and maybe lead one to the idea of extending the core (like having a heuristic on how to bulk load data). Not everyone likes composing with functions and extensions, however.

* Early Draft specs doc for DatabaseLoader in ML.NET * Specifying implementaion as .NET Standard 2.0 * Support connect from within the RDBMS server * Updated changes related to embeded connection * Updates on Azure SQL DB support * Minor updates on SQL CLR * Minor typo fixed * Refactored some code naming

Early Draft specs doc for DatabaseLoader in ML.NET

f688503

markusweimer reviewed Jun 13, 2019

View reviewed changes

CESARDELATORRE added 7 commits June 13, 2019 13:43

Specifying implementaion as .NET Standard 2.0

410f6c1

Support connect from within the RDBMS server

3201078

Updated changes related to embeded connection

5bffa88

Updates on Azure SQL DB support

9991533

Minor updates on SQL CLR

2330e65

Minor typo fixed

b6d2703

Refactored some code naming

85261fe

veikkoeeva reviewed Jun 14, 2019

View reviewed changes

artidoro mentioned this pull request Jul 2, 2019

Q: ML.NET within SQLCLR #2571

Closed

CESARDELATORRE requested a review from eerhardt July 15, 2019 22:31

CESARDELATORRE mentioned this pull request Jul 15, 2019

AggregateException/InvalidOperationException when training from IEnumerable backed by EF Core Context #2159

Open

eerhardt approved these changes Jul 15, 2019

View reviewed changes

endintiers approved these changes Jul 15, 2019

View reviewed changes

CESARDELATORRE merged commit fb80f72 into dotnet:master Jul 17, 2019

ghost locked as resolved and limited conversation to collaborators Mar 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Early Draft specs doc for DatabaseLoader in ML.NET #3857

Early Draft specs doc for DatabaseLoader in ML.NET #3857

CESARDELATORRE commented Jun 12, 2019 •

edited

Loading

markusweimer left a comment

veikkoeeva left a comment •

edited

Loading

veikkoeeva Jun 14, 2019 •

edited

Loading

veikkoeeva Jun 14, 2019

CESARDELATORRE Jun 14, 2019 •

edited

Loading

veikkoeeva Jun 14, 2019 •

edited

Loading

CESARDELATORRE Jun 14, 2019

veikkoeeva Jun 17, 2019 •

edited

Loading

endintiers Jul 3, 2019 •

edited

Loading

eerhardt left a comment

endintiers left a comment

veikkoeeva commented Jul 16, 2019 •

edited

Loading

Early Draft specs doc for DatabaseLoader in ML.NET #3857

Early Draft specs doc for DatabaseLoader in ML.NET #3857

Conversation

CESARDELATORRE commented Jun 12, 2019 • edited Loading

markusweimer left a comment

Choose a reason for hiding this comment

veikkoeeva left a comment • edited Loading

Choose a reason for hiding this comment

veikkoeeva Jun 14, 2019 • edited Loading

Choose a reason for hiding this comment

veikkoeeva Jun 14, 2019

Choose a reason for hiding this comment

CESARDELATORRE Jun 14, 2019 • edited Loading

Choose a reason for hiding this comment

veikkoeeva Jun 14, 2019 • edited Loading

Choose a reason for hiding this comment

CESARDELATORRE Jun 14, 2019

Choose a reason for hiding this comment

veikkoeeva Jun 17, 2019 • edited Loading

Choose a reason for hiding this comment

endintiers Jul 3, 2019 • edited Loading

Choose a reason for hiding this comment

eerhardt left a comment

Choose a reason for hiding this comment

endintiers left a comment

Choose a reason for hiding this comment

veikkoeeva commented Jul 16, 2019 • edited Loading

CESARDELATORRE commented Jun 12, 2019 •

edited

Loading

veikkoeeva left a comment •

edited

Loading

veikkoeeva Jun 14, 2019 •

edited

Loading

CESARDELATORRE Jun 14, 2019 •

edited

Loading

veikkoeeva Jun 14, 2019 •

edited

Loading

veikkoeeva Jun 17, 2019 •

edited

Loading

endintiers Jul 3, 2019 •

edited

Loading

veikkoeeva commented Jul 16, 2019 •

edited

Loading