-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Performance] GetManyByKeys - Am I missing something? It´s fast! #29676
Comments
@mkalinski93 Can you put the code in the a small, runnable application so we can dee exactly how it is used? |
Here you are. |
@mkalinski93 Thanks; I'm quite surprised this is fast. I will discuss with the team and get back to you. |
Note for triage: SQL generated is:
|
Procedure cache pollution |
I have noticed the generated query, too. It´s definitely a large number and it increases its size by the amount of keys you´re querying for. And if it can´t be more difficult as it is,... I have to deal with several database providers, too. Since we are generating that huge amount of sql statements, I can try rewrite the query that it actually is parameterized and chunk it down to the sql parameter limit. That way we would avoid cache pollution and still keep the benefits - but may lose performance. Benchmarks will tell. |
I´ve tried it in a really simple (maybe stupid manner) to see how parametrized queries would impact on the performance. IEnumerable<object[]> keyChunk = missingKeys.Chunk(2000);
foreach (object[] objects in keyChunk) {
string query = $"SELECT * FROM {tableName} WHERE [{primaryKeyName}] IN ({string.Join(", ", objects.Select((t, i) => $"@P{i}"))})";
retrievedValues.AddRange(await _context.Set<TEntity>().FromSqlRaw(query, objects).AsNoTracking(asNoTracking).ToListAsync(cancellationToken: cancellationToken));
//Expression<Func<TEntity, bool>> expression = ExpressionTreeResolver.GenerateWhereInExpression<TEntity>(objects,primaryKeyName);
//retrievedValues.AddRange(await _context.Set<TEntity>().AsNoTracking(asNoTracking).Where(expression).ToListAsync(cancellationToken));
} Generated Query for a single batch (I don´t know why it is limited to 1000 - even though I have a chunk size of 2000)
I still would consider the performance as good |
@mkalinski93 I'm a bit confused - your latest benchmarks tests a single query with 1000 parameterized values, whereas the original one seems to test multiple queries with a far larger number. If you're trying to see how parameterization affects performance, you really need to compare two pieces of code which do exactly the same thing, except for one parameterizing and one not. In addition, in the 1st sample you seem to have quite a bit of code interacting with EF's local view - but apparently not in the 2nd (that again invalidates any comparison). If you're interested in the effect of parameterization, I'd recommend focusing on that and removing any local view-related code. To summarize, when doing any performance measurements, it's usually a good idea to isolate exactly what it is that you want to measure, and remove anything else. |
hey @roji, it´s literally the same code, except the part I´ve attached above. I´m sorry for not including the entirety of the code. Initial Code public virtual async ValueTask<ICollection<TEntity>> GetObjectsByKeysAsync<TEntity>(ICollection keys, bool asNoTracking, CancellationToken cancellationToken = default) where TEntity : BaseEntity, new() {
//Check for local entries first and exclude them from the query
LocalView<TEntity> localSet = _context.Set<TEntity>().Local;
List<TEntity> localEntities = localSet.Where(x => keys.OfType<object>().Any(y => x.KeyValue.Equals(y))).ToList();
List<object> missingKeys = keys.OfType<object>().Where(x => !localEntities.Any(y => y.KeyValue.Equals(x))).ToList();
if (!missingKeys.Any())
return localEntities;
//some keys are still missing
string primaryKeyName = _sessionConfigurationProvider.MetadataStore.GetKeyName<TEntity>();
List<TEntity> retrievedValues = new List<TEntity>();
//this chunk is irrelevant to the query, because it seems that efcore generates multiple queries when reaching a certain amount of objects
//I´ve kept it, because otherwise at around 100.000 objects, the dbcommand runs into a timeout
IEnumerable<object[]> keyChunk = missingKeys.Chunk(50000);
foreach (object[] objects in keyChunk) {
Expression<Func<TEntity, bool>> expression = ExpressionTreeResolver.GenerateWhereInExpression<TEntity>(objects,primaryKeyName);
retrievedValues.AddRange(await _context.Set<TEntity>().AsNoTracking(asNoTracking).Where(expression).ToListAsync(cancellationToken));
}
return retrievedValues;
} Prototyped code for parameters (current) public virtual async ValueTask<ICollection<TEntity>> GetObjectsByKeysAsync<TEntity>(ICollection keys, bool asNoTracking, CancellationToken cancellationToken = default) where TEntity : BaseEntity, new() {
//Check for local entries first and exclude them from the query
LocalView<TEntity> localSet = _context.Set<TEntity>().Local;
List<TEntity> localEntities = localSet.Where(x => keys.OfType<object>().Any(y => x.KeyValue.Equals(y))).ToList();
List<object> missingKeys = keys.OfType<object>().Where(x => !localEntities.Any(y => y.KeyValue.Equals(x))).ToList();
if (!missingKeys.Any())
return localEntities;
//some keys are still missing
string primaryKeyName = _sessionConfigurationProvider.MetadataStore.GetKeyName<TEntity>();
string tableName = _sessionConfigurationProvider.MetadataStore.GetTableName<TEntity>();
List<TEntity> retrievedValues = new List<TEntity>();
//this chunk is irrelevant to the query, because it seems that EfCore generates multiple queries when reaching a certain amount of objects
//I´ve kept it, because otherwise at around 100.000 objects, the DbCommand runs into a timeout
IEnumerable<object[]> keyChunk = missingKeys.Chunk(2000);
foreach (object[] objects in keyChunk) {
string query = $"SELECT * FROM {tableName} WHERE [{primaryKeyName}] IN ({string.Join(", ", objects.Select((t, i) => $"@P{i}"))})";
retrievedValues.AddRange(await _context.Set<TEntity>().FromSqlRaw(query, objects).AsNoTracking(asNoTracking).ToListAsync(cancellationToken: cancellationToken));
//Expression<Func<TEntity, bool>> expression = ExpressionTreeResolver.GenerateWhereInExpression<TEntity>(objects,primaryKeyName);
//retrievedValues.AddRange(await _context.Set<TEntity>().AsNoTracking(asNoTracking).Where(expression).ToListAsync(cancellationToken));
}
return retrievedValues;
} I´m still using the same values, the same amount and so on. Basically this was just a reaction to ErikEj's comment. |
So am I understanding you correctly that you're seeing the parameterized query performing worse than the non-parameterized one, where both are exactly the same except for the parameterization? |
@roji Note that in the benchmarks in the attached project, the context is never actually tracking any entities, so while the code to look up local values would probably be quite slow, it is not measured in these benchmarks, which essentially just do a single no-tracking query with a big Contains. I was going to talk about this with the team in triage. :-) |
This perf benefit is why we don't parameterize, even though both would cause cache pollution. :-) Remember you were going to look into this. |
Yeah, that's why I'm surprised here... In my mental model there's no reason for a parameter to perform worse than a context (but apparently my mental model is wrong, at least for SQL Server...). I definitely want to play around with this a bit to see. |
(though this may no longer be relevant specifically for the IN problem (#13617), assuming we go with a string/JSON inner join... |
Generally speaking, I´m not saying that x is faster than y. I´m just trying to implement a function and hope to achieve a good balance between performance and "low wtf count". As I´ve achieved the speed, I was surprised - this is why I´ve created this issue in the first place. To be precise, I´m no expert in any way. I can just test something and have an opinion. But this opinion can be flawed - this is why I´m asking you. Erik commented and pointed out that it produces cache pollution. So I was digging into it and tried to alter the code to see some differences. That´s why I´ve posted the previous comments.
It actually should be tracked This is the benchmark [Benchmark]
public async Task FORM_GetByKeys25000Async() {
await EfCoreUnitOfWork.GetObjectsByKeysAsync<Product>(KeysProvider.TestKeys25000, false //asNoTracking);
} In the GetObjectsByKeys method there is this line //AsNoTracking is false
retrievedValues.AddRange(await _context.Set<TEntity>().AsNoTracking(asNoTracking).Where(expression).ToListAsync(cancellationToken)); It is using an extension method namespace GGM.ORM.Extensions;
public static class IQueryableExtensions {
public static IQueryable<TEntity> AsNoTracking<TEntity>(this IQueryable<TEntity> queryable, bool value) where TEntity : BaseEntity {
//if value is true, then use AsNoTracking() - otherwise don´t
return value ? queryable.AsNoTracking() : queryable;
}
} The reason I´m checking for local entities is that I want to extend this method in the future. |
@mkalinski93 sure thing - and thanks for your comments (I wasn't trying to imply anything was wrong/problematic in what you wrote). I think we went off on a tangent here with the parameterization question, which doesn't seem to have anything to do with your original question. I'll do some minimal benchmarking specifically for the parameterization, but we should concentrate on your original code, whether it's correct and whether the performance indicates anything to be concerned about. |
Final note on the parameterization: I did the benchmarking and posted all the results in #13617 (comment). tl;dr in SQL Server parameterization indeed has a significant negative impact in this scenario. |
Thanks @ErikEJ, I forgot about that - will rerun the benchmarks for SQL Server with and without that flag. |
I have filed #29841 for the parameter optimization. Thanks, Erik! @mkalinski93 A couple of observations after discussing with the team:
|
@mkalinski93 There is also the STRING_SPLIT option: https://erikej.github.io/efcore/sqlserver/2021/11/22/efcore-sqlserver-cache-pollution.html |
Hey @ajcvickers, thank you for your response. The linear search on the tracked entities however, can be refactored to be more efficient. I have got a few ideas in mind already. |
thanks, I´m going to look into it, maybe this can help - at least for mssql. |
Hey,
I´ve recently posted some question here and there in the efcore repo.
After some testing/benchmarking and refactoring... I came to the conclusion that the performance is huge!
We took our current ORM to a comparison. Same Keys, same data, same structure and so on.
We generated 100.000 Product objects with no navigation properties or whatsoever.
Also we took AsNoTracking and Tracking into account which wasn´t such of a big deal.
*WithTracking - 25.000 objects - no Includes
As you can see, the first two rows represents a custom implementation of GetByKeys in efcore. In this scenario, we´ve took 25.000 keys! (I did it with 50.000 and 100.000 items, too)
It´s literally 3 times faster than our current ORM!
Holy smoke!
Now, I came to the conclusion that if a price is too good to be true, then there might be something off. That´s why I am here to ask you, if my implementation cries to be refactored asap.
This is the code
This is the ExpressionTree
It just works. But I´m literally scared to use it :D
I would really like to get to know if that thing smells or could actually be used.
The text was updated successfully, but these errors were encountered: