Description
Describe the bug
Scenario: SqlBulkCopy
Source SQL Server >= 2019 with UTF-8 support.
Target SQL Server >= 2019 with UTF-8 support.
Source schema: a UTF-8 encoded VARCHAR(MAX) column with one of the UTF-8 collations.
Target schema: same as the source.
When the source table is copied to the target table with SqlBulkCopy and streaming enabled, a UTF-8 BOM is prefixed to the data, which changes the data stored in the target. This causes unexpected side effects when working with the data, since the original data has lost its fidelity.
Scope of the problem: a UTF-8 encoded source table with a VARCHAR(MAX) column, copied to a target SQL Server using streaming.
To reproduce
Set up the source and target tables with the SQL below, then run the C# program that follows.
drop table if exists utf8_data, utf8_data_bak;
create table utf8_data (
str_col varchar(max) collate Latin1_General_100_CS_AS_KS_WS_SC_UTF8
);
create table utf8_data_bak (
str_col varchar(max) collate Latin1_General_100_CS_AS_KS_WS_SC_UTF8
);
insert into utf8_data select top(10) name from sys.tables;
using System;
using Microsoft.Data.SqlClient;

class Program
{
    static void Main(string[] args)
    {
        string sourceConnectionString = @"Server=172.20.107.10;Database=ScrapDb;user id=<>;encrypt=false";
        string targetConnectionString = @"Server=172.20.107.10;Database=ScrapDb;user id=<sa>;encrypt=false";

        SqlConnection sqlsourceConnection = new SqlConnection(sourceConnectionString);
        SqlConnection sqlTargetConnection = new SqlConnection(targetConnectionString);
        sqlsourceConnection.Open();
        sqlTargetConnection.Open();

        // Reader command against the source table.
        SqlCommand sqc = sqlsourceConnection.CreateCommand();
        sqc.CommandText = "SELECT str_col FROM utf8_data;";

        const int bulktimeout = 3600;

        // Clear the destination table before the bulk copy.
        SqlCommand sqd = sqlsourceConnection.CreateCommand();
        sqd.CommandText = "TRUNCATE TABLE utf8_data_bak;";
        sqd.ExecuteNonQuery();

        // Bulk copy with streaming enabled -- this is the path that introduces the BOM.
        SqlBulkCopy sbc = new SqlBulkCopy(sqlTargetConnection);
        sbc.DestinationTableName = "utf8_data_bak";
        sbc.BulkCopyTimeout = bulktimeout;
        sbc.BatchSize = 1000;
        sbc.EnableStreaming = true;

        long crows = 0;
        int ccols = 0;
        try
        {
            using (SqlDataReader sdr = sqc.ExecuteReader())
            {
                ccols = sdr.FieldCount;
                sbc.WriteToServer(sdr);
                //crows += sbc.RowsCopied64; // available only in newer SqlClient versions
            }
            sbc.Close();
            Console.WriteLine("Columns count: {0}, Rows Inserted: {1}", ccols, crows);

            // Dump the source rows and their byte lengths.
            Console.WriteLine("Source Table: {0}", "utf8_data");
            sqc.CommandText = "SELECT str_col, datalength(str_col) FROM utf8_data;";
            using (SqlDataReader sdr = sqc.ExecuteReader())
            {
                Console.WriteLine("str_col type: {0}", sdr.GetDataTypeName(0));
                Console.WriteLine();
                while (sdr.Read())
                {
                    Console.WriteLine("{0},{1}", sdr.GetSqlString(0), sdr.GetSqlInt64(1));
                }
            }

            // Dump the destination rows -- with the bug, each row is 3 bytes longer because of the prefixed BOM.
            Console.WriteLine();
            Console.WriteLine("Destination Table: {0}", "utf8_data_bak");
            sqd.CommandText = "SELECT str_col, datalength(str_col) FROM utf8_data_bak;";
            using (SqlDataReader sdr = sqd.ExecuteReader())
            {
                Console.WriteLine("str_col type: {0}", sdr.GetDataTypeName(0));
                Console.WriteLine();
                while (sdr.Read())
                {
                    Console.WriteLine("{0},{1}", sdr.GetSqlString(0), sdr.GetSqlInt64(1));
                }
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
        finally
        {
            sqlsourceConnection.Close();
            sqlTargetConnection.Close();
        }
    }
}
The above code copies the data from the source table to the target table; however, looking at the bytes of the copied data, a UTF-8 BOM has been prefixed.
To validate, run the following query on the target table:
select cast(str_col as varbinary) from utf8_data_bak;
The output is
0xEFBBBF757466385F64617461
0xEFBBBF757466385F646174615F62616B
The first three bytes, EF BB BF, are unexpected: this BOM should not be prefixed to the data.
For comparison, the same query on the source table yields:
select cast(str_col as varbinary) from utf8_data;
0x757466385F64617461
0x757466385F646174615F62616B
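For completeness, the same check can be done from the client side. The helper below is not part of the original repro; it is a sketch that slots into the repro program above and counts the rows whose stored bytes begin with the UTF-8 BOM (EF BB BF). With the behaviour described here, it returns 0 for utf8_data and the full row count for utf8_data_bak.
static int CountBomPrefixedRows(SqlConnection connection, string tableName)
{
    using (SqlCommand cmd = connection.CreateCommand())
    {
        // CAST to VARBINARY(MAX) so the raw stored bytes (including any BOM) come back unchanged.
        // tableName is only ever one of the fixed names from the repro script above.
        cmd.CommandText = "SELECT CAST(str_col AS VARBINARY(MAX)) FROM " + tableName + ";";
        int bomRows = 0;
        using (SqlDataReader reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                byte[] bytes = (byte[])reader[0];
                if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
                {
                    bomRows++;
                }
            }
        }
        return bomRows;
    }
}
Example usage after the bulk copy:
Console.WriteLine("Source rows with BOM: {0}", CountBomPrefixedRows(sqlsourceConnection, "utf8_data"));
Console.WriteLine("Target rows with BOM: {0}", CountBomPrefixedRows(sqlTargetConnection, "utf8_data_bak"));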
I tested this with MDS 1.0 through the latest release; the source of this problem is the addition of UTF-8 support for SQL Server 2019.
Expected behavior
No BOM should be prefixed to the data in the target table.
Further technical details
Microsoft.Data.SqlClient version: tested on 1.0 through 6.x; the issue exists in all of them.
.NET target: .NET Framework 4.6.2, .NET 8.0
SQL Server version: SQL Server 2022, SQL Server 2019
Operating system: client on Windows 11
Additional context
This is likely happening because the UTF-8 encoding adds a BOM by default. The data is sent to SQL Server through a StreamWriter created with the encoding of the column received from the server. The server reports UTF-8, which is translated to the .NET UTF-8 encoding, and that same encoding is used to transmit the data back to SQL Server.
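As a standalone illustration (independent of SqlClient), the default UTF-8 encoding advertises the EF BB BF preamble, and a StreamWriter built on it writes that preamble ahead of the first payload on a fresh stream:
using System;
using System.IO;
using System.Text;

class Utf8BomDemo
{
    static void Main()
    {
        // Encoding.UTF8 (UTF8Encoding.UTF8) carries a 3-byte preamble: EF BB BF.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetPreamble())); // EF-BB-BF

        using (MemoryStream ms = new MemoryStream())
        {
            using (StreamWriter writer = new StreamWriter(ms, Encoding.UTF8, 1024, leaveOpen: true))
            {
                writer.Write("utf8_data");
            }
            // EF-BB-BF-75-74-66-38-5F-64-61-74-61 -- the same EF BB BF prefix seen in utf8_data_bak.
            Console.WriteLine(BitConverter.ToString(ms.ToArray()));
        }
    }
}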
In TdsParser.cs:
using (ConstrainedTextWriter writer = new ConstrainedTextWriter(new StreamWriter(new TdsOutputStream(this, stateObj, null), encoding), size))
A likely fix could be to check whether the encoding is UTF-8 and, if so, reinitialize it so that no BOM is emitted:
if (encoding == UTF8Encoding.UTF8)
{
    encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false); // no BOM
}
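The same kind of standalone check shows that an encoding constructed this way emits no preamble (again independent of SqlClient; a sketch of the behavioural difference, not the actual TdsParser patch):
using System;
using System.IO;
using System.Text;

class Utf8NoBomCheck
{
    static void Main()
    {
        UTF8Encoding noBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
        Console.WriteLine(noBom.GetPreamble().Length); // 0 -- no preamble advertised

        using (MemoryStream ms = new MemoryStream())
        {
            using (StreamWriter writer = new StreamWriter(ms, noBom, 1024, leaveOpen: true))
            {
                writer.Write("utf8_data");
            }
            // 75-74-66-38-5F-64-61-74-61 -- payload only, matching the source table bytes.
            Console.WriteLine(BitConverter.ToString(ms.ToArray()));
        }
    }
}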