Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQLite RevEng: Sample data to determine CLR type #8824

Closed
Tracked by #22950 ...
bricelam opened this issue Jun 12, 2017 · 28 comments · Fixed by #30816
Closed
Tracked by #22950 ...

SQLite RevEng: Sample data to determine CLR type #8824

bricelam opened this issue Jun 12, 2017 · 28 comments · Fixed by #30816
Assignees
Labels
area-scaffolding area-sqlite closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. punted-for-3.0 punted-for-6.0 type-enhancement
Milestone

Comments

@bricelam
Copy link
Contributor

bricelam commented Jun 12, 2017

Reverse engineering SQLite column types is tricky due to it's dynamic typing. The column type hold no real meaning.

Today, we apply the same type affinity rules that SQLite does to the specified column type name to put it into one of the four primitive types (which map to long, double, string & byte[]).

I think we could do a much better job by sampling the data to determine a suitable CLR type. This would allow us to reverse engineer types that need to be coerced like decimal, DateTime and Guid. We could also provide a more natural numeric type like int if the data fits.

@bricelam
Copy link
Contributor Author

bricelam commented Jun 19, 2017

@ErikEJ was that a 👎 to doing this? or because SQLite is weird.

@ErikEJ
Copy link
Contributor

ErikEJ commented Jun 20, 2017

It was actually to doing it, it will give users a false sense of strong typing

@bricelam
Copy link
Contributor Author

bricelam commented Sep 25, 2017

I imagine something like this to get the average field type:

SELECT typeof(column1)
FROM table1
WHERE typeof(column1) != 'null'
GROUP BY typeof(column1)
ORDER BY count() DESC
LIMIT 1;

If INTEGER or REAL, use max() and min() to get the range.
If BLOB, use length() to see if it's a GUID.
If TEXT, use datetime() and time() to see if it parses. Maybe length() can indicate a GUID. Maybe CAST..AS REAL can indicate a decimal.

@jonreis
Copy link

jonreis commented Mar 12, 2018

Hello @bricelam, question regarding where this is going please. If I have the following table definition:

CREATE TABLE [Schema] (
[SchemaId] INTEGER NOT NULL
, [SchemaVersion] nvarchar(16) NOT NULL
, [DatabaseVersion] int NOT NULL
, [Expression] ntext NULL
, [Description] nvarchar(1024) NULL
, [ReadOnly] bit NOT NULL
, [ModifiedDateTime] datetime NOT NULL
, CONSTRAINT [sqlite_master_PK_Schema] PRIMARY KEY ([SchemaId])

In EF6
bit => bool
int => int
datetime => datetime

EF Core
bit => string
int => long
datetime => string

Are you saying this is how it will be in the future, or is it going to get fixed post 2.1 preview 1?

I'm assuming EF Core would follow the same rules as EF6, which seemed to work well in this regard:

https://system.data.sqlite.org/index.html/artifact/18f60c317bbdfb16

///


/// SQLite is inherently un-typed. All datatypes in SQLite are natively strings. The definition of the columns of a table
/// and the affinity of returned types are all we have to go on to type-restrict data in the reader.

@bricelam
Copy link
Contributor Author

bricelam commented Mar 12, 2018

I'm assuming EF Core would follow the same rules as EF6

No. In SQLite, column type names mean nothing*. System.Data.SQLite added additional semantics to them to map them CLR types.

Early on, we decided that Microsoft.Data.Sqlite and EF Core's SQLite provider should not add these additional semantics instead exposing the underlying SQLite behavior more directly.

This issue is about looking the actual data to determine the CLR type. The column type names would continue to just be application-defined strings with application-only semantics.

To illustrate the problem, EF6 and System.Data.SQLite map columns with type FLOAT to System.Double, but most database created outside of EF6 probably intended those columns to map to System.Single. For further discussion, see aspnet/Microsoft.Data.Sqlite#457

By continuing to ignore the column type and going directly to the data, we could see that a column contains only 1s and 0s and map it to System.Bool. The column type could be whatever the application thinks is best: BIT, BOOL, BOOLEAN, FLAG, YESNO, SWITCH, etc.

* Column type names are actually used by SQLite to determine type affinity, but that really only affects the on-disk format.

@bricelam
Copy link
Contributor Author

For additional reference, System.Data.SQLite's column to EDM type mapping is specified here.

@bricelam
Copy link
Contributor Author

The Microsoft.Data.Sqlite Data Type Mappings might also be interesting.

@jonreis
Copy link

jonreis commented Mar 12, 2018

Hi Brice. Thank you for the reply. Please bear with me I am new to EF Core and I just spent a couple of days attempting to port a relatively simple data model from EF6 to EF Core 2.1, just to see where the pain points will be. I was expecting a simpler port, but found that it was more difficult due not being able to be involved in the code generation (solved with EF Core Power Tools) and the changes to how data types are being presented by EF Core.

This issue is about looking the actual data to determine the CLR type. The column type names would continue to just be application-defined strings with application-only semantics.

So I can only reverse engineer a database that is fully populated? The old way of defining datatypes within the table definition seems less error-prone and more declarative. If I define a column as

, [ModifiedDateTime] datetime NOT NULL

If I use Dapper, EF6, or any other micro-ORM I get a datetime. Why would I want the data to come back as a string and write code to make the conversion manually? Also, according the Sqlite documentation, datetimes can be stored as REAL, INTEGER, or STRING in the database. How are you going to determine that a REAL, or INTEGER is really meant to be a datetime?

The column type names would continue to just be application-defined strings with application-only semantics.

What would the application want to do with the column type names? I'm pretty sure they want their ORM to use it to interpret the types for them rather than having to write the conversion code themselves?

One last point on this. With EF6, when I moved my application from SqlServerCe to Sqlite, I did not have to change anything. I was pretty much insulated by EF to the specifics of how things were stored in the database. Now what I hear is that this is no longer the case. We need to change our applications to conform to the semantics of the database. Is this correct?

@bricelam
Copy link
Contributor Author

bricelam commented Mar 13, 2018

You can change the CLR types after reverse engineering.

How are you going to determine that a REAL, or INTEGER is really meant to be a datetime?

We won't. You'll have to update the type manually if you know they're datetime values.

they want their ORM to use it to interpret the types for them

We could apply heuristics, but since the type names could literally be anything (including no type name), we would constantly be getting issues to update the heuristics to match/not match every new type name somebody's application decided to use. SQLite does not apply semantics to these names so we don't want to either.

Reverse Engineering is a best guess at what we think you want your domain model to look like. Since SQLite only has four primitive data types, there is A LOT of information lost when looking just at the database. This issue is about making a more educated guess, but at the end of the day it's still just a guess and you'll probably have to provide the missing information by manually updating the model after it's generated.

This is one place where reverse engineer templates can help. If you have the column type names, you can map to the appropriate CLR types in the template.

@jonreis
Copy link

jonreis commented Mar 13, 2018

Thank you for your patience on this Brice. I think we have been talking at cross-purposes here. I have no problem modifying my model after the reverse engineering. That makes sense to me. What I thought was been said is that I could not in general use a bool with SQLite because it doesn't support that type.

So here is what happens if I use a bool in my model. Perhaps it is because I am configuring things wrong.

SQLite Column Definition

[ReadOnly] bit NOT NULL

SELECT typeof(Readonly) FROM Schema returns 'integer'

Reverse engineering turns this to a string (not sure why its not an integer), I edit it and turn it to a bool

public bool ReadOnly {get;set;}

Configured in the context as:

            entity.Property(e => e.ReadOnly)
                .IsRequired()
                .HasColumnType("bit");

When I attempt to access this entity from the data model, I get the following:

System.InvalidOperationException: 'An exception occurred while reading a database value for property 'Schema.ReadOnly'. The expected type was 'System.Boolean' but the actual value was of type 'System.Boolean'.'

@bricelam
Copy link
Contributor Author

bricelam commented Mar 13, 2018

Oh weird. Yes, that should definitely work. Could you submit a new issue with a repro? It may be hitting a bug somewhere in query.

@jonreis
Copy link

jonreis commented Mar 13, 2018

Not sure why it is being reverse engineered as a string when the type in Sqlite is an integer.

I will log a new issue with a sample project.

Thanks for your help.

@jonreis
Copy link

jonreis commented Mar 13, 2018

@bricelam logged as #11258

@roji
Copy link
Member

roji commented Dec 11, 2022

Some notes on the table above:

  • CURRENCY, DECIMAL, MONEY, NUMBER, NUMERIC should get mapped to decimal anyway via type affinity (so they don't need to be listed in the table)?
  • Maybe add DateTimeOffset?
  • Should SINGLE be mapped to .NET double (as opposed to float)?
  • Support TIMESPAN for .NET TimeSpan alongside TIME?

So just to confirm... If I create a column with type DATETIME in SQLite, I get an affinity of NUMERIC affinity (rules), which means SQLite tries to convert inputs to NUMERIC (but stores them as-is if it can't convert). But according to the above we'd scaffold it as a .NET DateTime. That sounds OK (not pushing back) - just making sure.

BTW we should probably exempt STRICT tables from data sampling? Type ANY in strict table is particularly interesting, it really seems to correspond to CLR object, since it's meant to hold any type (vs. non-strict tables, where there's still a specific "preferred" column type, even if anything can still be stored...).

This is all quite intricate 😅

@ErikEJ
Copy link
Contributor

ErikEJ commented Dec 11, 2022

Add DateOnly/TimeOnly?

@roji
Copy link
Member

roji commented Dec 11, 2022

Absolutely!

@bricelam
Copy link
Contributor Author

CURRENCY, DECIMAL, MONEY, NUMBER, NUMERIC should get mapped to decimal anyway via type affinity (so they don't need to be listed in the table)?

No. They get mapped to the NUMERIC affinity which results in a mix of INTEGER, REAL, and TEXT values.

Maybe add DateTimeOffset?
Add DateOnly/TimeOnly?
Support TIMESPAN for .NET TimeSpan alongside TIME?

I don't think we should add explicit .NET type names here. There should be good precedence for an existing database using the type names before we think about handling them. Remember that looking at the actual data will catch most cases. These are just a last ditch effort to give you something better than the affinity--especially where the affinity for these types would be objectively wrong. It's also worth noting that all the type names here are handled by System.Data.SQLite and related .NET drivers.

Should SINGLE be mapped to .NET double (as opposed to float)?

I don't think it should. No other floating-point column would result in float. This would make for an odd case where only when you have no data in a column would it potentially get mapped to float (as opposed to `double)

I create a column with type DATETIME in SQLite, I get an affinity of NUMERIC affinity...

...yes, good so far.

...which means SQLite tries to convert inputs to NUMERIC (but stores them as-is if it can't convert).

There is no NUMERIC type--it's just an affinity. It results in INTEGER, REAL, or TEXT values depending on which one is lossless.

But according to the above we'd scaffold it as a .NET DateTime.

Yes. If there is no data, and the column type is DATETIME.

we should probably exempt STRICT tables

Probably.

Type ANY in strict table is particularly interesting, it really seems to correspond to CLR object

Yes. We need to handle this in more than just scaffolding. See #28628

This is all quite intricate

Very intricate. I'll try to present everything in a design meeting so we can really understand and discuss it. It's worth re-stating that the goal in all of this is just to produce better, more expected EF models. All of it is guesswork. We'll never be perfect; we're just trying to be better. At the end of the day, SQLite truly only supports four types.

@bricelam
Copy link
Contributor Author

bricelam commented Dec 13, 2022

Here are the initial rules based on the actual value (not column) types.

  • INTEGER
    • When min() = 0 and max() = 1 and cout() > 2
      • bool
    • When max() > int.MaxValue
      • long
    • Otherwise
      • int
  • REAL
    • double
  • TEXT
    • When format is 'yyyy-MM-dd'
      • DateOnly
    • When format is 'yyyy-MM-dd HH:mm:ss[.FFFFFFF]'
      • DateTime
    • When format is 'yyyy-MM-dd HH:mm:ss[.FFFFFFF]zzz'
      • DateTimeOffset
    • When format is '0.0###########################'
      • decimal
    • When format is '00000000-0000-0000-0000-000000000000'
      • Guid
    • When format is '[-][d.]hh:mm:ss[.fffffff]'
      • TimeSpan
    • Otherwise
      • string
  • BLOB
    • byte[]

Notes:

  • We'll never scaffold byte, sbyte, short, or ushort preferring int instead
  • We'll never scaffold uint preferring long instead
  • We'll never scaffold ulong because values just overflow in the database
  • We'll never scaffold float preferring double instead
  • We'll never scaffold char preferring string instead
  • We'll never scaffold TimeOnly because its format is ambiguous with TimeSpan
  • We'll never scaffold Guid for BLOB values

@roji
Copy link
Member

roji commented Dec 14, 2022

@bricelam no need to bring to design just on my account - I don't want to hold this back. As you say, this is only about improving scaffolding, and nothing we do here will be perfect anyway.

Here are the initial rules based on the actual value (not column) types.

BTW what would we do here if the column contains mixed values?

@bricelam
Copy link
Contributor Author

bricelam commented Dec 15, 2022

I don't think we support object properties today on SQLite (but we should as part of #28628), so I was just planning to pick a type.

@ajcvickers
Copy link
Contributor

@bricelam Just FYI. for SQL Server, we have special type mapping code to allow object properties mapped to SqlVariant types.

@bricelam
Copy link
Contributor Author

📝 Design meeting notes

  • Require at least three non-NULL values before choosing bool
  • After looking at the data, use the column name to further refine the type to things like short, float, etc.
  • Use .NET type names too if they don't conflict with known SQL types
  • Avoid aggregate (i.e. table scan) queries as much as possible for perf

@ErikEJ
Copy link
Contributor

ErikEJ commented Dec 17, 2022

Add some opt out/override?

bricelam added a commit to bricelam/efcore that referenced this issue May 2, 2023
…er CLR type

Here's a table highlighting some of the improvements.

SQL Type | Example Value                          | Old .NET Type | New .NET Type
-------- | -------------------------------------- | ------------- | -------------
BOOLEAN  | 0                                      | long          | bool
SMALLINT | 0                                      | long          | short
INT      | 0                                      | long          | int
BIGINT   | 0                                      | long          | long
TEXT     | '0.0'                                  | string        | decimal
TEXT     | '1970-01-01'                           | string        | DateOnly
TEXT     | '1970-01-01 00:00:00'                  | string        | DateTime
TEXT     | '00:00:00'                             | string        | TimeSpan
TEXT     | '00000000-0000-0000-0000-000000000000' | string        | Guid
STRING   | 'ABC'                                  | byte[]        | string

Resolves dotnet#8824
@bricelam bricelam added the closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. label May 2, 2023
@bricelam bricelam modified the milestones: Backlog, 8.0.0 May 2, 2023
@bricelam
Copy link
Contributor Author

bricelam commented May 2, 2023

Add some opt out/override?

No good way to do this, but I'll discuss with the team.

bricelam added a commit to bricelam/efcore that referenced this issue May 4, 2023
…er CLR type

Here's a table highlighting some of the improvements.

SQL Type | Example Value                          | Old .NET Type | New .NET Type
-------- | -------------------------------------- | ------------- | -------------
BOOLEAN  | 0                                      | long          | bool
SMALLINT | 0                                      | long          | short
INT      | 0                                      | long          | int
BIGINT   | 0                                      | long          | long
TEXT     | '0.0'                                  | string        | decimal
TEXT     | '1970-01-01'                           | string        | DateOnly
TEXT     | '1970-01-01 00:00:00'                  | string        | DateTime
TEXT     | '00:00:00'                             | string        | TimeSpan
TEXT     | '00000000-0000-0000-0000-000000000000' | string        | Guid
STRING   | 'ABC'                                  | byte[]        | string

Resolves dotnet#8824
bricelam added a commit to bricelam/efcore that referenced this issue May 9, 2023
…R type

Here's a table highlighting some of the improvements.

Column type | Sample value                           | Before | After
----------- | -------------------------------------- | ------ | -----
BOOLEAN     | 0                                      | byte[] | bool
SMALLINT    | 0                                      | long   | short
INT         | 0                                      | long   | int
BIGINT      | 0                                      | long   | long
TEXT        | '0.0'                                  | string | decimal
TEXT        | '1970-01-01'                           | string | DateOnly
TEXT        | '1970-01-01 00:00:00'                  | string | DateTime
TEXT        | '00:00:00'                             | string | TimeSpan
TEXT        | '00000000-0000-0000-0000-000000000000' | string | Guid
STRING      | 'ABC'                                  | byte[] | string

Resolves dotnet#8824
bricelam added a commit to bricelam/efcore that referenced this issue May 9, 2023
…R type

Here's a table highlighting some of the improvements.

Column type | Sample value                           | Before | After
----------- | -------------------------------------- | ------ | -----
BOOLEAN     | 0                                      | byte[] | bool
SMALLINT    | 0                                      | long   | short
INT         | 0                                      | long   | int
BIGINT      | 0                                      | long   | long
TEXT        | '0.0'                                  | string | decimal
TEXT        | '1970-01-01'                           | string | DateOnly
TEXT        | '1970-01-01 00:00:00'                  | string | DateTime
TEXT        | '00:00:00'                             | string | TimeSpan
TEXT        | '00000000-0000-0000-0000-000000000000' | string | Guid
STRING      | 'ABC'                                  | byte[] | string

Resolves dotnet#8824
@ajcvickers ajcvickers modified the milestones: 8.0.0, 8.0.0-preview4, 8.0.0-preview5 May 26, 2023
@ajcvickers ajcvickers modified the milestones: 8.0.0-preview5, 8.0.0 Nov 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-scaffolding area-sqlite closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. punted-for-3.0 punted-for-6.0 type-enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants