Improve performance of Environment.GetEnvironmentVariable #1725
Conversation
This change improves the performance of Environment.GetEnvironmentVariable in a release build by about 25%. To do that, I have heavily refactored the UTF8ToUnicode and UnicodeToUTF8 functions and replaced the malloc in GetEnvironmentVariableW with the recently checked-in StackString. To make sure the UTF8ToUnicode and UnicodeToUTF8 functions work properly, I have added two PAL tests that exercise the conversion extensively, hopefully covering all corner cases.
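(For illustration, the corner cases of a UTF8 converter are mostly the sequence-length boundaries and surrogate handling. A minimal C# round-trip sketch over those boundaries follows; the actual PAL tests are C++ and this is not their code.)

using System.Text;

static class Utf8BoundaryCheck
{
    // Boundary code points between the 1-, 2-, 3- and 4-byte UTF-8 forms,
    // including a value that needs a UTF-16 surrogate pair; illustrative data only.
    static readonly string[] Samples =
    {
        "\u007F",     // last 1-byte sequence
        "\u0080",     // first 2-byte sequence
        "\u07FF",     // last 2-byte sequence
        "\u0800",     // first 3-byte sequence
        "\uFFFF",     // last 3-byte sequence
        "\U00010000", // first 4-byte sequence (surrogate pair in UTF-16)
        "\U0010FFFF", // last valid code point
    };

    public static bool AllRoundTrip()
    {
        foreach (string s in Samples)
        {
            byte[] utf8 = Encoding.UTF8.GetBytes(s);
            if (Encoding.UTF8.GetString(utf8) != s)
                return false;
        }
        return true;
    }
}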
@adityamandaleeka can you take a look please?
FYI: @ianhays, @stephentoub, @ellismg
It may be better to take the managed UTF8 encoder/decoder implementation from src\mscorlib\src\System\Text\UTF8Encoding.cs (with minimal changes to make it C++, and otherwise fit here). From a cursory look, the src\mscorlib\src\System\Text\UTF8Encoding.cs implementation is faster than what you have; and it is also pretty well tested.
@jkotas I did some testing of the managed decoder vs. mine, which I've slightly enhanced based on the idea of reading/writing multiple characters at once.
Large document:
Based on these results, I'd prefer keeping my refactored version, especially since we use these functions to translate filenames and environment variable names/contents, where I'd expect mostly ASCII characters with the occasional two-byte sequence. Does it make sense?
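(For context, the "reading/writing multiple characters at once" trick is typically an ASCII fast path. A minimal C# sketch of the idea follows; the names are illustrative and this is not the actual PAL code, which is C++. It assumes unaligned 32-bit reads are acceptable on the target platform and requires compiling with /unsafe.)

static class AsciiFastPath
{
    // Widen UTF-8 bytes to UTF-16 chars, taking a 4-bytes-at-a-time
    // shortcut for runs of pure ASCII (high bit clear in every byte).
    public static unsafe int WidenAscii(byte* src, char* dst, int length)
    {
        int i = 0;
        while (i <= length - 4)
        {
            // One 32-bit read tests four bytes for the ASCII property at once.
            uint quad = *(uint*)(src + i);
            if ((quad & 0x80808080u) != 0)
                break; // non-ASCII byte ahead: let the general decoder take over

            dst[i] = (char)src[i];
            dst[i + 1] = (char)src[i + 1];
            dst[i + 2] = (char)src[i + 2];
            dst[i + 3] = (char)src[i + 3];
            i += 4;
        }
        return i; // number of bytes consumed; the caller decodes the rest
    }
}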
These numbers do not make sense. Can you share the code for your benchmark? There are a number of factors that can explain them - code quality differences between the JIT and the C/C++ compiler, ... - hard to tell without seeing the actual benchmark.
Here is the managed code:

using System;
using System.Diagnostics;
using System.IO;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.WriteLine("Usage: [unicode|utf8] Utf8PerfTestManaged file");
            return;
        }

        bool sourceIsUtf8 = args[0] == "utf8";

        using (FileStream fs = File.OpenRead(args[1]))
        {
            byte[] source = new byte[fs.Length];
            fs.Read(source, 0, (int)fs.Length);

            byte[] destination = null;
            var sw = new Stopwatch();
            sw.Restart();
            if (sourceIsUtf8)
            {
                destination = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, source);
            }
            else
            {
                destination = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, source);
            }
            Console.WriteLine(sw.Elapsed);

            using (FileStream fs2 = File.OpenWrite(@"D:\test\Utf8PerfTestManaged\targetmanaged.txt"))
            {
                fs2.Write(destination, 0, destination.Length);
            }
        }
    }
}
This is a low-performance convenience method - it will run the UTF8 encoder twice and allocate two big arrays on the GC heap along the way. I doubt your unmanaged equivalent does that... You should be benchmarking the high-performance entrypoints like …
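(For reference, Encoding.Convert behaves roughly like the sketch below - not the actual framework source - which makes the two codec runs and the two GC heap allocations explicit.)

using System.Text;

static class ConvertSketch
{
    // Roughly what Encoding.Convert(srcEncoding, dstEncoding, bytes) does:
    // decode everything into a temporary char[], then encode that into a
    // fresh byte[] - the codec runs twice and both buffers land on the GC heap.
    public static byte[] Convert(Encoding srcEncoding, Encoding dstEncoding, byte[] bytes)
    {
        char[] chars = srcEncoding.GetChars(bytes); // allocation #1
        return dstEncoding.GetBytes(chars);         // allocation #2
    }
}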
Maintaining two structurally different optimized variants of the core UTF8 encoding/decoding algorithm does not make sense. The core algorithm used in UTF8Encoding.cs has been bug-fixed and fine-tuned by a number of people over the years, and it is replicated in a number of places in different contexts (across Microsoft codebases). I want us to:
@jkotas ok, I understand your point about unifying the implementation, and it makes sense. The goal of my work was to do a simple refactoring to speed it up, but I ended up making small incremental changes over the days in spare moments, and in the end it got to the completely refactored state. Out of curiosity, not to push my solution, I've changed the managed benchmark to use the low-level functions (well, not the ones with the pointers, since they are not public, but the ones that take arrays and just pin them before calling the low-level stuff). Here are the results:
Large document:
The new source is here. It does exactly what the native one does - it goes through the data twice, once to get the necessary size of the destination buffer and again to perform the conversion.

using System;
using System.Diagnostics;
using System.IO;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.WriteLine("Usage: [unicode|utf8] Utf8PerfTestManaged file");
            return;
        }

        bool sourceIsUtf8 = args[0] == "utf8";
        if (sourceIsUtf8)
        {
            byte[] source = File.ReadAllBytes(args[1]);
            char[] destination = new char[source.Length * 4];
            var sw = new Stopwatch();
            UTF8Encoding utf8Encoding = new UTF8Encoding();
            for (int i = 0; i < 10; i++)
            {
                sw.Restart();
                // Sizing pass followed by the conversion pass, mirroring the native code.
                int c = utf8Encoding.GetCharCount(source, 0, source.Length);
                utf8Encoding.GetChars(source, 0, source.Length, destination, 0);
                Console.WriteLine(sw.Elapsed);
            }
        }
        else
        {
            string source = File.ReadAllText(args[1]);
            byte[] destination = new byte[source.Length * 4];
            var sw = new Stopwatch();
            UTF8Encoding utf8Encoding = new UTF8Encoding();
            for (int i = 0; i < 10; i++)
            {
                sw.Restart();
                int c = utf8Encoding.GetByteCount(source);
                utf8Encoding.GetBytes(source, 0, source.Length, destination, 0);
                Console.WriteLine(sw.Elapsed);
            }
        }
    }
}
@janvorli, is this PR still relevant?
No, it is not, since @wtgodbe has already ported the C# version. |