Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated Collations and Case Sensitivity #3743

Merged
merged 3 commits into from
Feb 23, 2022
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -14,13 +14,13 @@ Text processing in databases can be complex, and requires more user attention th

## Introduction to collations

A fundamental concept in text processing is the *collation*, which is a set of rules determining how text values are ordered and compared for equality. For example, while a case-insensitive collation disregards differences between upper- and lower-case letters for the purposes of equality comparison, a case-sensitive collation does not. However, since case-sensitivity is culture-sensitive (e.g. `i` and `I` represent different letters in Turkish), there exist multiple case-insensitive collations, each with its own set of rules. The scope of collations also extends beyond case-sensitivity, to other aspects of character data; in German, for example, it is sometimes (but not always) desirable to treat `ä` and `ae` as identical. Finally, collations also define how text values are *ordered*: while German places `ä` after `a`, Swedish places it at the end of the alphabet.
A fundamental concept in text processing is the *collation*, which is a set of rules determining how text values are ordered and compared for equality. For example, while a case-insensitive collation disregards differences between upper- and lower-case letters for the purposes of equality comparison, a case-sensitive collation does not. However, since case-sensitivity is culture-sensitive (e.g. `i` and `I` represent different letters in Turkish), multiple case-insensitive collations exist, each with there own set of rules. The scope of collations also extends beyond case-sensitivity, to other aspects of character data; in German, for example, it is sometimes (but not always) desirable to treat `ä` and `ae` as identical. Finally, collations also define how text values are *ordered*: while German places `ä` after `a`, Swedish places it at the end of the alphabet.
jessegood marked this conversation as resolved.
Show resolved Hide resolved

All text operations in a database use a collation - whether explicitly or implicitly - to determine how the operation compares and orders strings. The actual list of available collations and their naming schemes is database-specific; consult [the section below](#database-specific-information) for links to relevant documentation pages of various databases. Fortunately, database do generally allow a default collation to be defined at the database or column level, and to explicitly specify which collation should be use for specific operations in a query.
All text operations in a database use a collation - whether explicitly or implicitly - to determine how the operation compares and orders strings. The actual list of available collations and their naming schemes is database-specific; consult [the section below](#database-specific-information) for links to relevant documentation pages of various databases. Fortunately, databases generally allow a default collation to be defined at the database or column level, and to explicitly specify which collation should be used for specific operations in a query.
jessegood marked this conversation as resolved.
Show resolved Hide resolved

## Database collation

In most database systems, a default collation is defined at the database level; unless overridden, that collation implicitly applies to all text operations occurring within that database. The database collation is typically set at database creation time (via the `CREATE DATABASE` DDL statement), and if not specified, defaults to a some server-level value determined at setup time. For example, the default server-level collation in SQL Server is `SQL_Latin1_General_CP1_CI_AS`, which is a case-insensitive, accent-sensitive collation. Although database systems usually do permit altering the collation of an existing database, doing so can lead to complications; it is recommended to pick a collation before database creation.
In most database systems, a default collation is defined at the database level; unless overridden, that collation implicitly applies to all text operations occurring within that database. The database collation is typically set at database creation time (via the `CREATE DATABASE` DDL statement), and if not specified, defaults to a some server-level value determined at setup time. For example, the default server-level collation in SQL Server for the "English (United States)" locale is `SQL_Latin1_General_CP1_CI_AS`, which is a case-insensitive, accent-sensitive collation. Although database systems usually do permit altering the collation of an existing database, doing so can lead to complications; it is recommended to pick a collation before database creation.
jessegood marked this conversation as resolved.
Show resolved Hide resolved

When using EF Core migrations to manage your database schema, the following in your model's `OnModelCreating` method configures a SQL Server database to use a case-sensitive collation:

Expand Down Expand Up @@ -61,7 +61,7 @@ Note that some databases allow the collation to be defined when creating an inde

In .NET, string equality is case-sensitive by default: `s1 == s2` performs an ordinal comparison that requires the strings to be identical. Because the default collation of databases varies, and because it is desirable for simple equality to use indexes, EF Core makes no attempt to translate simple equality to a database case-sensitive operation: C# equality is translated directly to SQL equality, which may or may not be case-sensitive, depending on the specific database in use and its collation configuration.

In addition, .NET provides overloads of [`string.Equals`](/dotnet/api/system.string.equals) accepting a [`StringComparison`](/dotnet/api/system.stringcomparison) enum, which allows specifying case-sensitivity and culture for the comparison. By design, EF Core refrains from translating these overloads to SQL, and attempting to use them will result in an exception. For one thing, EF Core does know not which case-sensitive or case-insensitive collation should be used. More importantly, applying a collation would in most cases prevent index usage, significantly impacting performance for a very basic and commonly-used .NET construct. To force a query to use case-sensitive or case-insensitive comparison, specify a collation explicitly via `EF.Functions.Collate` as [detailed above](#explicit-collations-and-indexes).
In addition, .NET provides overloads of [`string.Equals`](/dotnet/api/system.string.equals) accepting a [`StringComparison`](/dotnet/api/system.stringcomparison) enum, which allows specifying case-sensitivity and a culture for the comparison. By design, EF Core refrains from translating these overloads to SQL, and attempting to use them will result in an exception. For one thing, EF Core does know not which case-sensitive or case-insensitive collation should be used. More importantly, applying a collation would in most cases prevent index usage, significantly impacting performance for a very basic and commonly-used .NET construct. To force a query to use case-sensitive or case-insensitive comparison, specify a collation explicitly via `EF.Functions.Collate` as [detailed above](#explicit-collations-and-indexes).

## Additional resources

Expand Down