Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct Sequential GUID generation #20528

Closed
wants to merge 13 commits into from
Closed

Correct Sequential GUID generation #20528

wants to merge 13 commits into from

Conversation

peroyhav
Copy link

@peroyhav peroyhav commented Apr 4, 2020

Fix issue #20527

@dnfclas
Copy link

dnfclas commented Apr 4, 2020

CLA assistant check
All CLA requirements met.

@peroyhav peroyhav changed the title Correct Sequential GUID generation WIP: Correct Sequential GUID generation Apr 6, 2020
@peroyhav
Copy link
Author

peroyhav commented Apr 6, 2020

After reviewing #19124 I moved this fix to WIP, since it only solves parts of that as wel.
It will ensure that a single instance will now generate sequential GUID's however, after digging a bit further, it looks like SQL server uses a deterministic part of the GUID's and that the sequentiality of the GUID's break when changing between query windows and executing queries.

This PR increments the first bytes of the GUIDs in the same order as SQL server, but it looks like there is a bit more to it than my first assumption.

To solve the deterministic portion of the GUID is somewhat harder, we either need access to some hardware identifier that can be hashed, or a way to store the identifier per application instance. In adition the generator seems to run from a singleton instance in SQL server, that is invalidated whenever context changes.
The following SQL, shows the order these sequential GUID's are generated across databases.

CREATE TABLE TestDB.dbo.tmp(ID UNIQUEIDENTIFIER NOT NULL PRIMARY KEY DEFAULT NEWSEQUENTIALID())
INSERT INTO TestDB.dbo.tmp VALUES(DEFAULT)
CREATE TABLE TestDB2.dbo.tmp(ID UNIQUEIDENTIFIER NOT NULL PRIMARY KEY DEFAULT NEWSEQUENTIALID())
INSERT INTO TestDB2.dbo.tmp VALUES(DEFAULT)
INSERT INTO TestDB.dbo.tmp VALUES(DEFAULT)
INSERT INTO TestDB2.dbo.tmp VALUES(DEFAULT)

SELECT * FROM TestDB.dbo.tmp

SELECT * FROM TestDB.dbo.tmp
UNION SELECT * FROM TestDB2.dbo.tmp

DROP TABLE TestDB.dbo.tmp
DROP TABLE TestDB2.dbo.tmp

When inserting a set of values sequentially with a bit of time between, it seems that the GUID's also drift with time, so there is some time aspect to the sequence as well.

From what I can infer, it looks like the last 10 bytes of the GUID is constant for any given server instance even throughout restarts of the server. This is inferred based on that I've only witnessed change to the first 6 bytes.
Further on, it looks like byte 6-7 always spells "EA11" for all examples I've come across, not sure whether this is a coincidence or if it is a part of the standard for SQL server sequential ID's.
Maybe @roji, @ajcvickers or someone else can provide some insights?

@peroyhav
Copy link
Author

peroyhav commented Apr 7, 2020

I've looked a bit into this today, and Have landed on two possible proposals for solving the "constant per machine part" of the GUID
alt 1:
Create a new Pseudo random generator seeded with the hash value of the Macine name from Environment.MachineName concatenated with the assembly name, this could provide a simple solution, but testing for correctness is a bit hard, since the test would be a duplicate of the same algorithm to ensure multiple computers can execute and pass the test.
alt 2:
Same as above, but use the connection instead of the machine name in the seed and keep a dictionary of counters matched to the connection string, I'm not sure how to pass this to the instance, but this would have the benefit of allowing different "node ID's" per Database / Connection, allowing more spread in the sequences.

Neither solution is optimal, but I believe they will be adequate. Regardless, the counter fields should probably be shared across instances(read static variables) to ensure correctness in sequential generation across instances.

…nd implemented static instantiation, All current tests pass again.

Missing test for validation of _nodeId generation.
@peroyhav
Copy link
Author

peroyhav commented Apr 7, 2020

I made the changes according to alternative 1 now and pushed to the branch.
Not fully pleased with the fact I ended up with two static variables, but it was a quick and dirty fix that should look familiar to the values generated by SQL server.

@peroyhav
Copy link
Author

peroyhav commented Apr 7, 2020

This solution should now match the behavior of SQL server and thereby fix #19124 and #20527.

@peroyhav peroyhav changed the title WIP: Correct Sequential GUID generation Correct Sequential GUID generation Apr 7, 2020
@ajcvickers
Copy link
Member

@peroyhav Thanks for your work on this. @roji will take a look in more detail, but it might take a few days.

Copy link
Member

@roji roji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peroyhav thanks for this in-depth analysis and work. I've taken a look and here are some thoughts - keep in mind I'm not at all an SQL Server expert, and probably don't know any more than you.

tl;dr although we should fix the sequentiality to match SQL Server, I'm not convinced we need to generate the same GUIDs which SQL Server would have generated.

So first, there's the performance issue with the GUIDs generated by EF Core, as @lauxjpn mentioned in #19124 (comment) - while SQL Server increments GUIDs on the left side, the EF generator increments them much more to the right. This most probably causes inefficiencies with GUID indexing in SQL Server, and we need to fix that.

Regardless of that, our current sequential GUID generator doesn't make use of any host-specific data in the GUIDs it generates (MAC address seems to be the standard for this as opposed to hostname, as in UUID/GUID Version 1 & 2). I can't see any evidence that SQL Server cares about this at all - after all, nobody is forced to use NEWSEQUENTIALID. Do you have any particular reason to believe we should be trying to precisely match the SQL Server generation structure? If not, I'm assuming that any host-specific data is inserted by SQL Server purely to reduce/eliminate collisions, where multiple machines generating GUIDs at the same time would generate the same GUID.

Now, when GUIDs are only being generated inside SQL Server, having a part of the GUID that represents host-specific data is just a constant, modulu server failovers etc. (I admit I don't see a compelling reason to do so, because GUIDs are being generated in only one place, so there isn't any actual risk of collision). But when generating at the client, there can be multiple clients generating at the same time. On the one hand, including a MAC address would reduce/eliminate the chances of a collision, but on the other hand would fragment the GUIDs being generated (N sequential blocks instead of 1), so potentially bad for perf again. This can be exacerbated by the client programs going down and up frequently, which doesn't happen with the server. If we really want to push this logic to the edge, then the process ID should also be included in the GUID to avoid two processes on the same machine from generating duplicate GUIDs, thereby fragmenting the GUID space even more (but I don't think I mean that seriously).

AFAIK we've never received a complaint of GUID collision. I suspect this is because there are typically "few" EF Core application instances running against the same database - the MAC address scheme inside the GUID was created to achieve global uniqueness, which could be considered an overkill here. FWIW NHibernate has a similar sequential GUID generation routine without any host-specific data (a lot of discussion seems to point back to this article). Here's what I suggest:

  • I'll try to get more information internally NEWSEQUENTIALID and what SQL Server actually cares about.
  • Let's see what the team thinks about the importance of inserting host-specific data to avoid collisions between simultaneously-running EF Core application instances, /cc @dotnet/efcore. On second look it seems clear this isn't important - we're already integrating random data into the GUID, so a risk of collision is virtually nil.
  • I really think at this point that this class doesn't belong in EFCore, but should move to the SQL Server provider (breaking change).

Let me know what you think of the above. In the meantime, see some minor review comments on your PR (though let's wait and decide what to do before you continue work on it).

{
// Assembly name and machine name is assumed to be be constant enough for this use.
// In adition, this means that multiple assemblies will hopefully generate GUIDs with different Node ID's.
var assemblyName = System.Reflection.Assembly.GetExecutingAssembly().GetName().Name;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the value of including the assembly name, given that it's always expected to be constant (i.e. Microsoft.EntityFrameworkCore)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @roji the executing Assembly name should be the same as the entry assembly, not the current assembly. If it is always the same, it's not as intended.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You were maybe thinking of GetEntryAssembly as opposed to GetExecutingAssembly? The latter should return the assembly of the executing code, in this case the SequentialGuidValueGenerator (so it should also return Microsoft.EntityFrameworkCore).

But stepping back, as you wrote below there isn't really any actual danger of collisions here, so I'm not sure what value there is in including the entry assembly (or even the hostname).

src/EFCore/ValueGeneration/SequentialGuidValueGenerator.cs Outdated Show resolved Hide resolved
var guidBytes = Guid.NewGuid().ToByteArray();
var counterBytes = BitConverter.GetBytes(Interlocked.Increment(ref _counter));
var currentCount = _counter = ((++_counter & 0x0000FFFFFFFFFFFF) | (0xEA110000L << 32));
Span<long> guidValue = stackalloc long[] { currentCount, _nodeId };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the memory reduction you did here.

There's a good chance this could be simplified a bit by just using Span<byte> and MemoryMarshal.Write to write what you want. For extra credit, BinaryPrimitives also contains nice and efficient methods for endian conversion.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, but I didn't intend to compress memory footprint though. I just think it's more readable to manipulate small slices of a span value instead of copying to byte arrays and reversing in order to get the correct byte ordering compared to SQL server. The same result could have been gained by copying byte by byte to an array, but I prefer the brevity of doing it this way as I believed it would reduce clutter and thereby increase readability. Since a .NET GUID consists of a 32 bit integer, followed by two 16 bit integers and then a 64 bit binary sequence, I'm not sure BinaryPrimitives would be a good way of manipulating the values without splitting it in the class requiring more safeguards for the increment part.

@roji
Copy link
Member

roji commented Apr 12, 2020

One last comment after some additional research... Our current GUID generator generates GUIDs as follows:

64623a52-0b19-4b5e-7a75-08d7df10a0da
36ea360f-8e86-4255-7a76-08d7df10a0da
d8ad6e5b-d453-4285-7a77-08d7df10a0da
80399355-def3-4d45-7a78-08d7df10a0da
102e53a3-32a8-4694-7a79-08d7df10a0da

There's reason to believe that the only thing that matters for SQL Server perf is the last portion of the GUID; the Jimmy Nilsson's COMB algorithm, which once again is what NHibernate uses, simply makes the last 6 bytes be the timestamp, which is also what we do. In other words, our current algorithm may be just fine, unless we decide we care about integrating host-specific information in there to avoid collisions.

I'll simply run some benchmarks in the coming days and report back.

@peroyhav
Copy link
Author

One last comment after some additional research... Our current GUID generator generates GUIDs as follows:

64623a52-0b19-4b5e-7a75-08d7df10a0da
36ea360f-8e86-4255-7a76-08d7df10a0da
d8ad6e5b-d453-4285-7a77-08d7df10a0da
80399355-def3-4d45-7a78-08d7df10a0da
102e53a3-32a8-4694-7a79-08d7df10a0da

There's reason to believe that the only thing that matters for SQL Server perf is the last portion of the GUID; the Jimmy Nilsson's COMB algorithm, which once again is what NHibernate uses, simply makes the last 6 bytes be the timestamp, which is also what we do. In other words, our current algorithm may be just fine, unless we decide we care about integrating host-specific information in there to avoid collisions.

I'll simply run some benchmarks in the coming days and report back.

I don't think there is any danger of collisions here, since the UIDs have 8 random bytes in the begining. and even with a time resolution of one second, there would need to be a ridiculous amount of IDs generated before even the birthday paradox becomes relevant.

But since the documentation states it is generated like SQL server, one should at least change the documentation to state what the implementation actually does.

Regardless, there might be a possibility of the work I've done to be implemented in a SequentialRFC4122GuidValueGenerator class if not in the SequentialGuidValueGenerator, as the implementation in peroyhav:20200404_Issue20527_01 only requires a few adjustments in order to conform to RFC4122. Since SQL server also seems to, for the most part, conform to RFC4122; Might there be some optimization regarding indexing with multiple nodes?

@roji
Copy link
Member

roji commented Apr 13, 2020

But since the documentation states it is generated like SQL server, one should at least change the documentation to state what the implementation actually does.

I agree that the documentation is currently wrong, in that we don't at all generate the GUIDs as SQL Server's NEWSEQUENTIALID. We should fix this.

Regardless, there might be a possibility of the work I've done to be implemented in a SequentialRFC4122GuidValueGenerator class if not in the SequentialGuidValueGenerator, as the implementation in peroyhav:20200404_Issue20527_01 only requires a few adjustments in order to conform to RFC4122. Since SQL server also seems to, for the most part, conform to RFC4122; Might there be some optimization regarding indexing with multiple nodes?

IMO this would make sense if some advantage can be shown to it, e.g. better performance - but I don't think that's the case.

Here are some benchmark results I've run some benchmarks comparing NEWID, NEWSEQUENTIALID, and EF Core's current value generation. I basically ran the same benchmarks detailed in this old post. Here are the index statistics for each scenario after inserting 5000 rows:

NEWID

index_type_desc index_depth page_count avg_page_space_used_in_percent avg_fragmentation_in_percent record_count
CLUSTERED INDEX 2 270 67.01168766987892 99.62962962962963 5000

NEWSEQUENTIALID

index_type_desc index_depth page_count avg_page_space_used_in_percent avg_fragmentation_in_percent record_count
CLUSTERED INDEX 2 186 97.28619965406475 0.5376344086021506 5000

Current EF Core SequentialGuidValueGenerator

index_type_desc index_depth page_count avg_page_space_used_in_percent avg_fragmentation_in_percent record_count
CLUSTERED INDEX 2 35 99.79526809982704 2.857142857142857 5000

Looking at avg_page_space_used_in_percent, it would seem that the EF Core generator is even slightly more efficient than SQL Server's NEWSEQUENTIALID (although fragmentation seems slightly worse).

Query for extracting index statistics
select index_type_desc, index_depth, page_count, avg_page_space_used_in_percent, avg_fragmentation_in_percent, record_count
from sys.dm_db_index_physical_stats(db_id(), object_id(@TableName), null, null, 'detailed')
where index_level = 0;
T-SQL code for NEWID and NEWSEQUENCEID
-- newid
create table TestTable (
   id uniqueidentifier default newid() not null primary key clustered,
   sequence int not null,
   data char(250) not null default '');

-- newsequentialid
create table TestTable (
   id uniqueidentifier default newsequentialid() not null primary key clustered,
   sequence int not null,
   data char(250) not null default '');

declare @count int;
set @count = 0;
while @count < 5000 begin
    insert TestTable (sequence)
    values (@count);

    set @count = @count + 1;
end;
EF Core insertion code
class Program
{
    static void Main(string[] args)
    {
        using var ctx = new BlogContext();
        ctx.Database.EnsureDeleted();
        ctx.Database.EnsureCreated();

        for (var i = 0; i < 5000; i++)
        {
            ctx.Blogs.Add(new Blog { Name = "Foo" + i });
            ctx.SaveChanges();
        }
    }
}

public class BlogContext : DbContext
{
    public DbSet<Blog> Blogs { get; set; }

    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
        => optionsBuilder.UseSqlServer(@...);
}

public class Blog
{
    public Guid Id { get; set; }
    public string Name { get; set; }
}

If my benchmarks are in any way non-exhaustive or if you think there's another reason to change our current GUID generation, please don't hesitate to post that and we can continue the conversation. But barring that, I don't see a reason to change anything (except for correcting the docs and possibly moving the generator into the SQL Server provider).

PS there's a bit of value in reducing the allocations in the current generator by using stack allocation and spans (even if it otherwise stays the same), if you'd like to submit that.

@lauxjpn
Copy link
Contributor

lauxjpn commented Apr 13, 2020

@roji Do you have statistics about the time it takes to insert those 5000 rows using the different methods? Since this test uses a clustered index, this might be interesting to compare.

Running the tests (including time) against a non-clustered index might be interesting as well.

Generating actual sequential GUIDs (or at least sequential in batches/a specific range) should be much faster for clustered indices than only partially sequential ones.

E.g. if the MAC (or something similar) is used as the main inter-machine collision avoidance mechanism, and the current date/time (ticks) are used as the main sequential mechanism and only the rest is padded with random bits (to avoid collisions below the resolution of the hardware clock etc.), then inserting rows should be quite efficient and fast on clustered indices as long as the bytes are stored and compared for the index in this order.

This might be important, because SQL Server (or SSMS?) creates clustered indices for primary keys by default, even if they column type is uniqueidentifier.

@roji
Copy link
Member

roji commented Apr 13, 2020

@lauxjpn

@roji Do you have statistics about the time it takes to insert those 5000 rows using the different methods? Since this test uses a clustered index, this might be interesting to compare.

Surprisingly, in my tests insertion times were almost identical between NEWID and NEWSEQUENTIALID (17s460ms and 17s979ms respectively, for inserting 10k rows), even though the resulting index had very different characteristics... I'm not nearly enough of an SQL Server expert to understand what that means, but I'm assuming that with NEWID the bad characteristics are likely to result in suboptimal perf later, when querying.

Running the tests (including time) against a non-clustered index might be interesting as well.

I ran this, and the results seem identical across all three algorithms (as expected):

NEWID:

index_type_desc index_depth page_count avg_page_space_used_in_percent avg_fragmentation_in_percent record_count
HEAP 1 186 97.28619965406475 0 5000

NEWSEQUENTIALID

index_type_desc index_depth page_count avg_page_space_used_in_percent avg_fragmentation_in_percent record_count
HEAP 1 186 97.28619965406475 0 5000

EF Core

(the clustered index is an unrelated INT with IDENTITY)

index_type_desc index_depth page_count avg_page_space_used_in_percent avg_fragmentation_in_percent record_count
CLUSTERED INDEX 2 38 98.41728440820361 2.631578947368421 5000
NONCLUSTERED INDEX 2 25 98.8139362490734 4 5000

Generating actual sequential GUIDs (or at least sequential in batches/a specific range) should be much faster for clustered indices than only partially sequential ones.

I understand the logic of that, but that's precisely what NEWSEQUENTIALID is doing and it's not faring (much) better than the current EF Core partially-sequential algorithm. It's possible that the making the 1st 6 bytes sequential (as EF Core does) is enough to achieve most (or practically all) of the benefit here, as there aren't likely to be very many values sharing that. The 2.3% fragmentation difference in NEWSEQUENTIALID's favor may be there because it's doing a slightly better job on this, although at the same time EF Core has a better avg_page_space_used_in_percent (and I'm not sure whether 2.3% is significant in any way).

At the end of the day, if we're considering changing the EF Core behavior, it would be good to see benchmark results that show an actual improvement for that proposal vs. the current implementation (am very willing to help with that, of course)

@ErikEJ
Copy link
Contributor

ErikEJ commented Apr 13, 2020

@roji The interesting part is the page count - assume your tested with 100% fill factor. As you show, EF Core algo managed to squeeze 5000 rows into only 35 pages! Less pages = less buffer space required. Fragmentation only matters if you are fetching numerous rows, not for individual rows, but obviuosly high fragmentation means more pages in memory over time, and low PLE (page life expectancy). EF Core current algo looks like a clear winner to me!

@lauxjpn
Copy link
Contributor

lauxjpn commented Apr 13, 2020

At the end of the day, if we're considering changing the EF Core behavior, it would be good to see benchmark results that show an actual improvement for that proposal vs. the current implementation

I agree. If it ain't broken, don't fix it.

There's reason to believe that the only thing that matters for SQL Server perf is the last portion of the GUID

That is quite interesting. So if SQL Server uses the MAC address as the last part of the GUID and EF Core just uses the timestamp, there really is not much practical difference, but with EF Core's algorithm, there is the benefit of no potential information leak in form of the MAC address.

@peroyhav
Copy link
Author

peroyhav commented Apr 13, 2020

I'm not sure what is the MSB vs LSB parts of the GUIDS as stored in SQL server, but would assume that the curent increment increments the LSB part of the table, and that we might get fragmentation first when we have multiple inserts from multiple locations at approximately the same time. The chances of collisions are minute regardless. How does the index look if one creates 500 records from 100 programs approximately simultaneously? I would assume that is the reason for how sequential IDs are generated in the first place. Since it is supposed to maintain better performance in DB clusters as well. Collisions are more or less non existent regardless, so it must be a reason that SQL server uses the compute method in the RFC.

@roji
Copy link
Member

roji commented Apr 13, 2020

Thanks for jumping in @ErikEJ and @lauxjpn, good to have more eyes on this.

How does the index look if one creates 500 records from 100 programs approximately simultaneously?

The nice thing about the current EF Core generator, is that the 1st 6 bytes (which are significant for the clustering) represent the time, which is the same across all clients (assuming good time sync of course), so things should work out relatively well. I think this is another argument in favor of that approach, rather than anything integrating host-specific data in the significant parts, which would cause more discontinuity.

I'm thinking this is different from the SQL Server clustering situation; although there's discontinuity on failover (because it's a different machine), those events are likely to be more rare than client applications starting and stopping. Perhaps more importantly, with clustering you have a very low total number of servers, whereas with clients you could have many...

The chances of collisions are minute regardless.

Agreed.

@peroyhav at this point do you think there's a good reason to change anything in the way we generate GUIDs?

@peroyhav
Copy link
Author

@roji The only reason I can see that might be left, is whether to follow the RFC standard or not.
I believe the GUIDs in the framework already conforms to it. As Guid.NewGuid() returns GUIDs of version 4 from the standard, and both SQL server and other official .NET framework structures follow the RFC.
Even though SQL server either seems to have reordered the 7th and 8th byte according to the standard for sequential IDs, or that both that have them reordered for random IDs.

If so, the documentation should be changed to something like this:

/// <summary>
///     Generates sequential <see cref="Guid" /> values index optimized for Microsoft SQL server.
///     This is useful when entities are being saved to a database where sequential
///     GUIDs will provide a performance benefit. The generated values are non-temporary, meaning they will
///     be saved to the database.
/// </summary>

@roji
Copy link
Member

roji commented Apr 14, 2020

As you say, SQL Server seems to orders bytes around in any case, so it doesn't seem that NEWSEQUENTIALID is RFC-conforming in any case. Note the docs on NEWSEQUENTIALID, with a link to this .NET shuffling code to be applied to the output of Windows UuidCreateSequential. Even if NEWSEQUENTIALID did generate RFC-comformant GUIDs, I'm not sure that on its own would be a sufficient reason for us to change the EF Core implementation...

Am waiting on some more internal feedback on this question, but in the absence of any further comments/arguments/data I think we can keep things as they are (and change the docs).

@roji
Copy link
Member

roji commented Apr 20, 2020

I haven't received any new information for the moment, so I'll go ahead and close this - if we get any evidence that the value generator is sub-optimal, we can revisit.

I've submitted #20696 to correct the documentation for SequentialGuidValueGenerator.

Thanks for your considerable work and analysis on this, @peroyhav - this has been very helpful even if nothing ultimately needed to get merged!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants