-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incorrect string matching behavior on SQL Server with whitespace patterns #19402
Comments
@nesterenko-kv Please post a small, runnable project or code listing that reproduces the behavior you are seeing so that we can investigate. |
@ajcvickers repro. Let me know, if you need more details. |
Note for team: I was able to reproduce this with 3.1. More minimal repro and output: public class Venue
{
public Guid Id { get; set; }
public string Name { get; set; }
}
public class BloggingContext : DbContext
{
private readonly ILoggerFactory Logger
= LoggerFactory.Create(c => c.AddConsole());//.SetMinimumLevel(LogLevel.Debug));
protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
optionsBuilder
.UseLoggerFactory(Logger)
.EnableSensitiveDataLogging()
.UseSqlServer(@"Server=(localdb)\mssqllocaldb;Database=Test;ConnectRetryCount=0");
}
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
modelBuilder.Entity<Venue>();
}
}
public class Program
{
public static async Task Main()
{
using (var context = new BloggingContext())
{
context.Database.EnsureDeleted();
context.Database.EnsureCreated();
context.Add(new Venue { Name = "Test" });
await context.SaveChangesAsync();
}
using (var context = new BloggingContext())
{
var test = " ";
var venues = await context.Set<Venue>().Where(x => x.Name.Contains(test)).ToListAsync();
Console.WriteLine($"Matched {venues.Count}");
}
}
}
|
Notes from team discussion:
|
Notes from further team discussion:
|
SummaryOn SQL Server, However, We should ideally verify also on Sqlite and MySQL, though I'd have to spend more time to understand how to analyze queries there. We can postpone dealing with this to see whether we end up doing #17598 for 5.0 (which I'd really like to). SQL ServerBehavior around trailing spacesCREATE TABLE text_tests (id INT IDENTITY(1,1) PRIMARY KEY, name NVARCHAR(10));
INSERT INTO text_tests (name) VALUES ('');
INSERT INTO text_tests (name) VALUES (' ');
SELECT COUNT(*) FROM text_tests WHERE name=''; -- Returns both
SELECT LEN(name) FROM text_tests; -- Returns 0 for both
SELECT COUNT(*) FROM text_tests WHERE name LIKE ''; -- Returns only the 1st
SELECT '|' + name + '|' FROM text_tests; -- Returns || for the first, | | for the second Index usage-- Create schema and insert data
CREATE TABLE data (id INT IDENTITY(1,1) PRIMARY KEY, name NVARCHAR(10));
CREATE INDEX IX_name ON data(name);
BEGIN TRANSACTION;
DECLARE @i INT = 0;
WHILE @i < 50000
BEGIN
INSERT INTO data (name) VALUES (CAST(@i AS NVARCHAR(MAX)));
SET @i = @i + 1;
END;
INSERT INTO data (name) VALUES ('');
INSERT INTO data (name) VALUES (' ');
DECLARE @j INT = @i;
WHILE @j < 100000
BEGIN
INSERT INTO data (name) VALUES (CAST(@j AS NVARCHAR(MAX)));
SET @j = @j + 1;
END;
COMMIT;
-- Queries
SET SHOWPLAN_ALL ON;
SELECT id FROM data WHERE name=''; -- Does an index seek with IX_name, TotalSubtreeCost=0.0032842
SELECT id FROM data WHERE LEN(name) = 0; -- Scans the entire table, TotalSubtreeCost=0.48661754
SELECT id FROM data WHERE name LIKE ''; -- Does an index scan with IX_name, TotalSubtreeCost=0.0032842
SET SHOWPLAN_ALL OFF; PostgreSQLBehavior around trailing spacesDROP TABLE IF EXISTS text_types;
CREATE TABLE text_types (
id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
text TEXT,
varchar VARCHAR(10),
char VARCHAR(10));
INSERT INTO text_types (text, varchar, char) VALUES ('', '', '');
INSERT INTO text_types (text, varchar, char) VALUES (' ', ' ', ' ');
SELECT * FROM text_types WHERE text = ''; -- Returns 1st row only
SELECT * FROM text_types WHERE varchar = ''; -- Returns 1st row only
SELECT * FROM text_types WHERE char = ''; -- Returns 1st row only
SELECT LENGTH(text), LENGTH(varchar), LENGTH(char) FROM text_types; -- 1st row all zeros, 2nd row all 2s
SELECT '|' || text || '|' || varchar || '|' || char || '|' FROM text_types; -- 1st row no spaces, 2nd row 2 spaces everywhere Index usage-- Create schema and insert data
DROP TABLE IF EXISTS data;
CREATE TABLE data (id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY , name TEXT);
CREATE INDEX IX_name ON data(name);
DO $$BEGIN
FOR i IN 1..50000 LOOP
INSERT INTO data (name) VALUES (i::TEXT);
END LOOP;
INSERT INTO data (name) VALUES ('');
INSERT INTO data (name) VALUES (' ');
FOR i IN 50001..100000 LOOP
INSERT INTO data (name) VALUES (i::TEXT);
END LOOP;
END$$;
-- Queries - same index usage results as SQL Server
EXPLAIN SELECT * FROM data WHERE name='';
EXPLAIN SELECT * FROM data WHERE LENGTH(name) = 0;
EXPLAIN SELECT * FROM data WHERE name LIKE ''; SqliteBehavior around trailing spacesCREATE TABLE text_tests (id INT, name TEXT);
INSERT INTO text_tests (id, name) VALUES (1, '');
INSERT INTO text_tests (id, name) VALUES (2, ' ');
SELECT id FROM text_tests WHERE name=''; -- Returns only id=1
SELECT LENGTH(name) FROM text_tests; -- Returns 0 and 2
SELECT id FROM text_tests WHERE name LIKE ''; -- Returns only id=1
SELECT '|' || name || '|' FROM text_tests; -- Returns || for the first, | | for the second Index usageCREATE TABLE data (id INTEGER PRIMARY KEY, name TEXT);
CREATE INDEX IX_name ON data(name);
-- Insert 1..50000
INSERT INTO data
SELECT x, x FROM (
WITH RECURSIVE
cnt(x) AS (
SELECT 1
UNION ALL
SELECT x+1 FROM cnt
LIMIT 50000
)
SELECT x FROM cnt
);
INSERT INTO data (id, name) VALUES (200000, '');
INSERT INTO data (id, name) VALUES (200001, ' ');
-- Insert 50001..100000
INSERT INTO data
SELECT x, x FROM (
WITH RECURSIVE
cnt(x) AS (
SELECT 50001
UNION ALL
SELECT x+1 FROM cnt
LIMIT 50000
)
SELECT x FROM cnt
);
SELECT COUNT(*) FROM data;
EXPLAIN QUERY PLAN SELECT * FROM data WHERE name=''; -- SEARCH TABLE data USING COVERING INDEX IX_name (name=?)
EXPLAIN QUERY PLAN SELECT * FROM data WHERE LENGTH(name) = 0; -- SCAN TABLE data
EXPLAIN QUERY PLAN SELECT * FROM data WHERE name LIKE ''; -- SCAN TABLE data |
Did the research on Sqlite behavior (added above):
@smitpatel @maumar any thoughts here? |
Why cannot we just convert |
Great workaround, thanks. If we go down this route, shouldn't we:
|
Since Equality can be generated outside of SqlTranslator.VisitBinary too. So SqlTranslator wouldn't work. Option would be either SqlExpressionFactory.Equal generates different for string or post-processor would take care of rewriting. PostProcessor would be slightly more preferable as it avoids .Equal not returning SqlBinary. |
Post-processor indeed sounds better to me (but should be before relational command cache etc.). Will prepare this. |
Since on SQL Server the equality operator ignores trailing whitespace, we can use LIKE when comparing to constants. Fixes #19402 Note: this is WIP, two tests are failing.
Since on SQL Server the equality operator ignores trailing whitespace, we can use LIKE when comparing to constants. Fixes #19402 Note: this is WIP, two tests are failing.
Since on SQL Server the equality operator ignores trailing whitespace, we can use LIKE when comparing to constants. Fixes #19402 Note: this is WIP, two tests are failing.
I did some research on Contains (what this issue is about originally). It turns out that PostgreSQL, MySQL and Sqlite find empty strings at the beginning of all strings, just like .NET - but not SQL Server. Empty string searches across databasesSQL ServerSELECT CHARINDEX('f', 'foo'); -- 1
SELECT CHARINDEX('x', 'foo'); -- 0
SELECT CHARINDEX('', 'foo'); -- 0!
SELECT CHARINDEX('', ''); -- 0!
SELECT CHARINDEX('', ' '); -- 0! SqliteSELECT INSTR('foo', 'f'); -- 1
SELECT INSTR('foo', 'x'); -- 0
SELECT INSTR('foo', ''); -- 1
SELECT INSTR('', ''); -- 1
SELECT INSTR(' ', ''); -- 1 MySQLSELECT INSTR('foo', 'f'); -- 1
SELECT INSTR('foo', 'x'); -- 0
SELECT INSTR('foo', ''); -- 1
SELECT INSTR('', ''); -- 1
SELECT INSTR(' ', ''); -- 1 PostgreSQLSELECT STRPOS('foo', 'f'); -- 1
SELECT STRPOS('foo', 'x'); -- 0
SELECT STRPOS('foo', ''); -- 1
SELECT STRPOS('', ''); -- 1
SELECT STRPOS(' ', ''); -- 1
SELECT CASE WHEN LEFT('f', LEN(' ')) = ' ' THEN 1 ELSE 0 END; -- returns true, since whitespace is disregarded We can't use LIKE to take the whitespace into account because operands may be non-constants. So unless someone has a new idea we can say we won't fix this. |
EF6 translation for Contains on SQL Server: SELECT
[Extent1].[Id] AS [Id],
[Extent1].[Name] AS [Name]
FROM [dbo].[Venues] AS [Extent1]
WHERE [Extent1].[Name] LIKE @p__linq__0 ESCAPE N'~' |
Thanks! Any idea what happens if the pattern parameter has wildcards? This could only work if EF6 escapes them client side before sending the parameter? This could correspond to the "client-side parameter transformations" I mentioned in #17598, we it's definitely another approach to the problem. |
@roji What query (or queries) should I run to get an idea of the wildcards issue? |
Anything where the pattern being searched contains a LIKE wildcard. For example: |
Set up a quick EF6 test project for Contains (with .NET Core on Linux!), results are below. tl;dr EF6 seems to be doing the right thing, always using LIKE for Contains, escaping wildcards for both constants and parameters. For columns it falls back to CHARINDEX, but does not compensate for empty string (so it has a false negative bug). For EF Core, here's what I think we should do:
Constant pattern:
Parameter pattern:
Parameterized pattern with wildcards:
Column pattern:
|
Design decisions:
|
It is disappointing to see that the design decision is that this won't be fixed any time soon. I'm guessing there is a good reason that the translation logic can't test for the specific scenario of trailing whitespace in an otherwise not empty pattern and then implement a workaround for the sql len() with trailing whitespace issue, as seen on stack overflow?: Example row: FirstName column value: 'St-phen' Translates to false positive: But this would translate correctly when using this StartsWith translation: LEN(CAST(@__fn_1 AS nvarchar(MAX)) + 'x') - 1 On the plus side, this Linq workaround seems to work for me, for my use case, by causing the translation to not use len() when it ends with a space. Just be aware you need a string.IsNullOrEmpty check beforehand, in that case you need a different query anyway: |
@zachrybaker my message above may have been a bit inaccurate - in some cases, the translations of Contains, StartsWith and EndsWith will work correctly with regards to trailing whitespace (i.e. when patterns are constant we will generate LIKE, which doesn't ignore trailing whitespace). Once we implement client-side parameter transformations (which we do intend to do), the same technique can be used for parameter patterns as well. Re other workaround with len... First, your stack overflow link seems broken. Second, that elaborate workaround still returns a false positive when the value is shorter than the pattern: SELECT CASE WHEN LEFT('h', LEN(CAST('h ' AS nvarchar(MAX)) + 'x') - 1) = 'h ' THEN 1 ELSE 0 END; -- 1 It's possible to add a condition for that too, but at some point that kind of complexity does not seem appropriate for translating a very common .NET method like StartsWith (and properly has some perf overhead, with all the casting, concatenation, etc. etc). In any case, as you say, it's always possible for people to express this themselves with EF.Functions.Like. The larger picture here is that it's impossible to handle this well for much more basic cases without a significant perf impact - I'm thinking about simple equality. When users compare two VARCHAR columns, trailing SQL will be truncated. Any attempt to compensate for that with elaborate logic such as the above would prevent index usage - which is the last thing you want to do for the default text equality operator. Since it's not possible to fix this for basic equality, I don't think it makes sense to go so far for StartsWith either. |
* Translate to LIKE instead of CHARINDEX for constant patterns * Change empty string compensation from equality to LIKE Fixes #19402
* Translate to LIKE instead of CHARINDEX for constant patterns * Change empty string compensation from equality to LIKE Fixes #19402
* Translate to LIKE instead of CHARINDEX for constant patterns * Change empty string compensation from equality to LIKE Fixes #19402
* Translate to LIKE instead of CHARINDEX for constant patterns * Change empty string compensation from equality to LIKE Fixes #19402
The current SQL Server is capable of seeking on an index with a parameterized This is implemented by SQL Server inserting an internal function into the query plan. This function takes the (dynamic) I'd be curious if I believe that For a simple equality comparison ( |
Yes, that is well-known (see discussion above, especially #19402 (comment)). Note that we only generate The reason we can't use LIKE for non-constant parameters, is that if the pattern contains contains wildcard characters, incorrect results will be returns. So if you attempt to search for names starting with an underscore, you will get all row since underscore will be interpreted by LIKE as a wilcard. For constants we can escape these character in the pattern, but not for non-constants. There's a plan going forward (#17598) to allow us to transform parameter values before sending them to SQL Server, which would allow us to escape wildcard characters in parameters, similar to what we do today with constants. However, that will not happen for 5.0; note also that we'll never be able to do that for column patterns (but that's probably a rare case). Hope that explains the situation. |
Re perf, note that as far as I'm aware, indexes can be used in some cases when the LIKE pattern matches at the beginning of the string, but indexes are never used for other patterns (e.g. generalized Contains); that makes sense when you think about how indexes work. However, I haven't personally done any testing on this. Regardless, the idea in general is to try to use LIKE where possible, but that requires escaping as explained above. |
Thanks for explaining, that makes sense. Right, I guess only parameter expressions can feasibly be escaped. |
I'm getting wrong results from database when trying to filter string (
Contains
,StartsWith
,EndsWith
, etc) with spaces.Steps to reproduce
I'm build query like:
Which gets translated to:
This query returns all rows from database when
@__test_0
contains spaces.Further technical details
EF Core version: 3.1.0
Database provider: Microsoft.EntityFrameworkCore.SqlServer
Target framework: .NET Core 3.1
The text was updated successfully, but these errors were encountered: