State / Direction of C# as a High-Performance Language #10378
I believe this was also requested in #161
I don't believe that these issues can be solved without direct CLR support. The CLR limits reference types to the heap; even C++/CLI is forced to abide by that restriction, and its stack-semantics syntax still allocates on the heap. The GC also provides no facility to directly target specific instances. I wonder how much C# could make a […]
I've added some more related issues to the above list, which hadn't been mentioned yet.
I am in a very similar position to @ilexp: I'm generally interested in the performance of my code and in knowing how to write efficient code, so I'd second the importance of this discussion. I also think the summary and points in the original post are quite good, and have nothing to add at the moment. Small note on using […] About memory locality: I was under the impression that if I […] Looking forward to seeing what others have to say on this topic!
There was a really nice prototype done by @xoofx showing the perf improvements of allowing […]
Microsoft Research many years ago experimented with using some unused bits on each object as access counters. The research hacked the heap to re-organize the most-used objects so that they ended up on the same page. He showed in a sample XML parser that C# code was faster than optimized C++; the talk he gave on it was called "Making C# faster than C++". The researcher that developed the technique left MS, and the research apparently died with him. He had a long list of other, similar improvements that he was planning on trying, none of which, I believe, saw daylight. Perhaps this work should be resuscitated so that the promise made in the beginning (remember how the JITer was going to ultra-optimize for your hardware?) can be realized.
We are in the crowded boat of using C# with Unity3D, which may finally be moving toward a newer CLR sometime soon, so this discussion is of great interest to us. Thanks for starting it. The request to have at least some hinting to the GC, even if not direct control, is at the top of our list. As programmers, we are in a position to declaratively "help" the GC, but we have no opportunity to do so.
"game development... has a habit of gladly abandoning the usual ways of safe code design for that 0.1% of the bottleneck code in favor of maximum efficiency. Unfortunately, there are cases where C# gets in the way of that last bit of optimization." C# gets in the way because that's what it was designed to do. If you want to write code that disregards correctness in favour of performance, you should be writing that code in a language that doesn't enforce correctness (C/C++), not trying to make a correctness-enforcing language less so. Especially since scenarios where performance is preferable to correctness are an extremely tiny minority of C# use cases.
@IanKemp that's a very narrow view of C#. There are languages like Rust that try to maximize correctness without run-time overhead, so it's not one vs the other. While C# is a garbage-collected language by design, with all the benefits and penalties that it brings, there's no reason why we cannot ask for performance-oriented improvements, like cache-friendly allocations of collections of reference types or deterministic deallocation, for example. Even LOB applications have performance bottlenecks, not just computer games or science-related scripts. |
@IanKemp Are you saying that […]
Hey, people... try this: write a function that will result in no garbage collections, something with a bunch of math in it, for example. Write the exact same code in C++. See which is faster. The C++ compiler will always generate code that is as fast or faster (usually faster), and the Intel compiler is most often faster still; it has nothing to do with the language. For example, I wrote a PCM audio mixer in C# and C++ and compiled it with the .NET, MS, and Intel compilers. The code in question had no GC, no bounds checks, no excuses. C#: slowest. In this example the Intel compiler recognized that the computation could be replaced by SSE2 instructions; the Microsoft compiler wasn't so smart, but it was smarter than the .NET compiler/JITer. So I keep hearing talk about adding extensions to the language to help the GC do things more efficiently, but it seems to me the language isn't the problem. Even if those suggestions are taken, we're still hamstrung by an intentionally slow code-generating compiler/jitter. It's the compiler and the GC that should be doing a better job. See: #4331 I'm really tired of the C++ guys saying "we don't use it because it's too slow" when there is *very little reason* for it to be slow. BTW: I'm in the camp of people that doesn't care how long the JITer takes to do its job. Most of the world's code runs on servers... why isn't it optimized to do so?
I completely agree with all of the mentioned improvements; in my opinion they are absolutely mandatory. Using C# in high-performance applications is the right way, and code would be much easier to read with at least some of the suggested improvements. Currently we have to "leave" the language for C++ or C to create things that are not possible in C#, and I don't mean assembler instructions but very simple pointer operations on blittable data types or generics. So, to avoid leaving the language, I created unreadable code fragments just to avoid unmanaged code, because otherwise I'd be dependent on x86 vs. x64.
From a gamedev perspective, it would be neat if there was a way to tell the runtime to perform extended JIT optimization using framework API. Let's say by default, there is only the regular, fast optimization, the application starts up quickly and all behaves as usual. Then I enter the loading screen, because I'll have to load levels and assets anyway - now would be an excellent time to tell the runtime to JIT optimize the heck out of everything, because the user is waiting anyway and expecting to do so. This could happen on a per-method, per-class or per-Assembly level. Maybe you don't need 90% of the code to be optimized that well, but that one method, class or Assembly should be. As far as server applications go, they could very well do the same in the initialization phase. Same for audio, image and video processing software. Extended JIT optimization could be a very powerful opt-in and on runtimes that do not support this, the API commands can still just fall back to not having any effect. Maybe it would even be possible to somehow cache the super-optimized machine code somewhere, so it doesn't need to be re-done at the next startup unless modified or copied to a different machine. Maybe partial caches would be possible, so even if not all code is super-JITed yet, at least the parts that are will be available. Which would be a lot more convenient and portable than pre-compiling an Assembly to native machine code, simply because Assemblies can run anywhere and native machine code can not. All that said, I think both allowing the JIT to do a better job and allowing developers to write more efficient code in the first place would be equally welcome. I don't think this should be an either / or decision. |
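Something in this direction already exists: `System.Runtime.ProfileOptimization` lets an application cache JIT profile data across runs and background-compile the recorded methods on later startups. It is not the "optimize harder now" knob described above, but it sketches how such an opt-in API could look (the profile directory below is an assumption for this example):

```csharp
using System.Runtime;

class Game
{
    static void Main()
    {
        // Record which methods get JITed and cache that profile on disk;
        // subsequent startups replay the profile on background threads,
        // overlapping JIT work with the loading screen.
        ProfileOptimization.SetProfileRoot(@"C:\MyGame\JitProfiles");
        ProfileOptimization.StartProfile("startup.profile");

        // ...level and asset loading happens here...
    }
}
```

An "optimize the heck out of everything" request, as proposed above, would presumably follow the same shape: a no-op on runtimes that don't support it.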
Having advocated for many years about performance in C#, I completely concur that it would be great to see more investment in this area, most notably on the following 3 axes:
Unfortunately, there are also some breaking-change scenarios that would require forking the language/runtime to correctly address some of the intrinsic weaknesses of the current language/runtime model (e.g. things that were done for Midori, such as their Error Model or safe native code, etc.)
@SunnyWar I think there's enough room to optimize both code generation for math and the GC. As to which one should have higher priority, keep in mind that it's relatively easy to work around bad performance in math by P/Invoking native code or using […] And since you mention servers, a big part of their performance is things like "how long does it take to allocate a buffer", not "how long does it take to execute math-heavy code".
I'm adding JIT tiering to the list of features I see as required to make C# a truly high-performance language. It is one of the highest-impact changes that can be done at the CLR level. JIT tiering has impact on the C# language design (counter-intuitively): a strong second-tier JIT can optimize away abstractions, which can cause C# features to become truly cost-free. For example, if escape analysis and stack allocation of ref types were consistently working, the C# language could take a more liberal stance on allocations. If devirtualization were working better (right now: not at all in RyuJIT), abstractions such as […] I imagine records and pattern matching are features that tend to cause more allocations and more runtime type tests. These are very amenable to advanced optimizations.
Born out of a recent discussion with others, I think it's time to review the "unsafe" syntax. The discussion can be summarized as "Does 'unsafe' even matter anymore?" .NET is moving "out of the security business" with CoreCLR. In a game development scenario, most of the work involves pointers to blocks of data. It would help if there was less syntactic verbosity in using pointers directly.
This is completely useless on the billions of ARM devices out there in the world. With regard to the GC discussion, I do not think that further GC abuse/workarounds are the solution. Instead there needs to be a deterministic alloc/ctor/dtor/free pattern; typically this is done with reference counting. Today's systems are multi-core, and today's programs are multi-threaded. "Stop the world" is a very expensive operation. In conclusion, what is actually desired is the C# language and libraries on top of a next-generation runtime better suited to the needs of "real-time" (deterministic) development such as games. That is currently beyond the scope of CoreCLR. However, with everything finally open source, it's now possible to gather a like-minded group to pursue research into it as a different project.
I'm doing a lot of high-perf / low-latency work in C#. One thing that would be "the killer feature" for perf work is for them to get .NET Native fully working. I know it's close, but the recent community standups have said that it won't be part of the v1.0 RTM and that they're rethinking its usage. The VS C++ compiler is amazing at auto-vectorizing, dead-code elimination, constant folding, etc. It just does this better than I can hand-optimize C# in its limited ways. I believe traditional JIT compiling (not just RyuJIT) simply doesn't have enough time to do all of those optimizations at run-time. I would be in favor of giving up additional compile time, portability, and reflection in exchange for better runtime performance, and I suspect those contributing to this thread probably feel the same way. For those that aren't, you still have RyuJIT. Second, it would help if there were some tuning knobs available for the CLR itself.
Adding a proposal for heap objects with a custom allocator and explicit delete. That way latency-sensitive code can take control of allocation and deallocation while integrating nicely with an otherwise safe managed application. It's basically a nicer and more practical […]
@OtherCrashOverride @GSPP Destructible Types? #161 |
Ideally, we want to get rid of IDisposable entirely and directly call the dtor (finalizer) when the object is no longer in use (garbage). Without this, the GC still has to stop all threads of execution to trace object use and the dtor is always called on a different thread of execution. This implies we need to add reference counting and modify the compiler to increment and decrement the count as appropriate such as when a variable is copied or goes out of scope. You could then, for example, hint that you would like to allocate an object on the stack and then have it automatically 'boxed' (promoted) to a heap value if its reference count is greater than zero when it goes out of scope. This would eliminate "escape analysis" requirements. Of course, all this is speculation at this point. But the theoretical benefits warrant research and exploration in a separate project. I suspect there is much more to gain from redesigning the runtime than there is from adding more rules and complication to the language. |
@OtherCrashOverride I've also come to the conclusion that a reference-counting solution is critical for solving a number of problems. For example, some years ago I wrote a message-passing service using an Actor model. The problem I ran into right away is that I was allocating millions of small objects (for messages coming in), and the GC pressure to clean up after they went out of scope was horrid. I ended up wrapping them in a reference-counting object to essentially cache them. It solved the problem, BUT I was back to the old, ugly COM days of having to ensure every Actor behaved and did an AddRef/Release for every message it processed. It worked... but it was ugly, and I still dream of a day I can have a CLR-managed reference-countable object with an overloadable OnRelease, so that I can put it back in the queue when the count == 0 rather than let it be GC'd.
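A minimal sketch of the hand-rolled pattern described above, assuming a pool-backed message type (all names here are illustrative, not an existing API):

```csharp
using System.Collections.Concurrent;
using System.Threading;

// Messages are reference-counted by hand; when the count drops to zero
// the object goes back to the pool instead of becoming garbage.
class PooledMessage
{
    private int _refCount;
    private readonly ConcurrentBag<PooledMessage> _pool;

    public PooledMessage(ConcurrentBag<PooledMessage> pool) => _pool = pool;

    public void AddRef() => Interlocked.Increment(ref _refCount);

    public void Release()
    {
        if (Interlocked.Decrement(ref _refCount) == 0)
            _pool.Add(this);   // the "OnRelease" hook: recycle, don't collect
    }
}
```

Every actor touching the message must pair AddRef/Release correctly, which is exactly the COM-era fragility the comment laments; a CLR-managed count would move that burden into the compiler.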
Don't want to detail the rest of it in this general overview thread, just regarding this specific point of @OtherCrashOverride's posting:
As a general direction of design with regard to future "efficient code" additions, I think it would be a good thing to keep most or even all of them - both language features and specialized API - hidden away just enough so nobody can stumble upon them accidentally, following the overall "pit of success" rule if you will. I would very much like to avoid a situation where improving 0.1% of performance critical code would lead to an overall increase in complexity and confusion for the 99.9% of regular code. Removing the safety belt in C# needs to be a conscious and (ideally) local decision, so as long as you don't screw up in that specific code area, it should be transparent to all the other code in your project, or other projects using your library. |
That would require you to find and update all existing references to that object. While the GC already does that when compacting, I doubt doing it potentially at every method return would be efficient. |
Today, the system imposes limitations on us that are purely historical and in no way limit how things can be done in the future. |
Joe Duffy has a great talk that covers (amongst other things) what would need to be done to optimise LINQ, escape analysis, stack allocation etc (slides are available if you don't want to watch the whole thing) |
I have a general question about for loops. I was trying to optimize my mathematical routines, which mostly operate on arrays of different types. As discussed very often here, the problem is that I cannot have pointers to generics, so I had to duplicate my functions for all primitive types. However, I have accepted, or rather resigned myself, on this topic, since it seems it will never come. Nevertheless I have also tried the same, as also discussed here, with IL code, which works fine for my solution; but there it would be nice to have some IL inline assembler, like the old asm {} keyword in C++, and I guess that will never come either. What currently bothers me is how a for loop is converted into IL code. From my old assembler knowledge there was the LOOP instruction, where a simple addition was done on AX/BX with CX as the count register. In IL it seems that all loops are converted to IF...GOTO statements, which I feel very uncomfortable with, since I think no jitter will ever recognize that an IF...GOTO statement can be converted to the LOOP construct on the x86 architecture. I guess that doing loops with IF...GOTO costs much more than the x86 LOOP. What does the jitter do to optimize loops? Am I right or wrong on this?
@msedi By building all loops in IL roughly the same way, the jitter can search for a common pattern to optimize. Indeed CoreCLR (and I assume desktop as well) does identify a number of such possible loops. For example:
@msedi Apparently, […] And finding loops is easy for the JIT: you just have to find a cycle in the control flow graph generated from the IL.
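For what it's worth, the IF...GOTO shape is not a pessimization: modern x86 CPUs execute a compare-and-branch pair faster than the legacy `LOOP` instruction, and the JIT's recognition of that shape is also what enables bounds-check elimination. A small illustration (the codegen comments describe typical current JIT behavior, not a guarantee):

```csharp
static int Sum(int[] values)
{
    int sum = 0;
    // Because i is provably within [0, values.Length), the JIT can remove
    // the per-element bounds check, and the loop compiles down to a
    // compare-and-branch (cmp/jl), not the slow legacy x86 LOOP instruction.
    for (int i = 0; i < values.Length; i++)
        sum += values[i];
    return sum;
}
```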
Wonderful talk by Joe Duffy. I felt happy to hear that they're [apparently] tackling all those problems we're discussing here. And geez, I was at least impressed to hear that some applications from Microsoft (!) are 60% of the time in GC. 60%!! My god. |
@andre-ss6 hits the nail on the head. Of course not all performance issues are due to allocations. But unlike most performance issues, which have sane solutions in C#, if you run into 99% time spent in GC then you're pretty much stuffed. What are your options at this stage? In C# as it stands today, pretty much the only option is to use arrays of structs. But any time you need to refer to one of those structs, you either go unsafe and use pointers, or you write extremely unreadable code. Both options are bad. If C# had AST macros, the code to access such "references" could be vastly more readable without any performance penalty added by the abstraction. One of the bigger improvements on code that's already well-optimized comes from abandoning all the nice and convenient features like List<T>, LINQ or the foreach loop. The fact that these are expensive in tight code is unfortunate, but what is worse is that there is no way to rewrite these in a way that's comparable in readability - and that's another thing AST macros could help with. Obviously the AST macros feature would need to be designed very carefully and would require a major time investment. But if I had a vote on the subject of the one single thing that would make fast C# less of a pain, AST macros would get my vote. P.S. I was replying to Andre's comment from almost a month ago. What are the chances he'd comment again minutes before me?! |
@rstarkov Hmm, I would object to calling a codebase that's using LINQ "well-optimized." That's basically saying, "I'm not allocating anything, except for all these allocations!" :) |
I'm happy to see ValueTask. I hope they make it into the Dataflow blocks. I wrote an audio router a few years ago; after profiling, I found it spent most of its time in the GC cleaning up tasks... and there was nothing I could do about it without completely throwing out the Dataflow pipeline (basically the guts of the whole thing).
@rstarkov You can use ref returns and locals in C# 7 with Visual Studio "15" Preview 4, though alas you can't use them with .NET Core currently. However, it is coming and should address this particular issue.
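A minimal sketch of how ref returns address the arrays-of-structs pain point discussed above (the type and member names are illustrative):

```csharp
struct Particle { public float X, Y; }

class World
{
    private readonly Particle[] _particles = new Particle[1024];

    // Hand out a direct reference into the backing array, so callers
    // mutate the element in place: no struct copy, no unsafe pointers.
    public ref Particle GetParticle(int index) => ref _particles[index];
}

// Usage:
//   ref Particle p = ref world.GetParticle(42);
//   p.X += 1.0f;   // writes through to the array element
```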
@SunnyWar Were your transforms synchronous or asynchronous? If they were asynchronous, then you probably can't avoid allocating |
@benaadams Technically the NuGet packages are available, but it'd probably require building the .NET CLI repo from source |
@agocke compiling it is one thing and important for CI; but development work doesn't flow so well when the UI tooling doesn't understand it very well and highlights errors :-/ |
@benaadams Duh, I totally forgot about the IDE :-p |
@SunnyWar, @svick also if you aren't careful in dataflow you can wind up with many allocations related to closures, lambdas and function pointers even if they were synchronous (it seems pretty impossible to avoid at least some in any case; sometimes it might even be reasonable to hold on to references intentionally to lighten GC in particular places). |
@rstarkov […]
@agocke The fact that a codebase which uses LINQ is not "well-optimized" is exactly the problem. There's no reason in principle why, at least in the more simple cases (which are probably the majority of cases), the compiler couldn't do stack allocation, in-lining, loop fusion, and so forth to produce fast, imperative code. Broadly speaking, isn't that (a big part of) why we have a compiler - so we can write expressive, maintainable code, and let the machine rewrite it as something ugly and fast? Don't get me wrong, I'm not expecting the compiler to completely free me from having to optimize, but optimizing some of the most common uses of Linq2Objects seems like relatively low-hanging fruit that would benefit a huge number of C# devs. |
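To make the point concrete, here is the kind of mechanical rewrite being asked of the compiler: the two computations below are equivalent, but only the first allocates enumerators and delegates (a sketch of the argument, not a claim about any planned optimization):

```csharp
using System.Linq;

class LinqVsLoop
{
    static void Main()
    {
        int[] data = { 1, -2, 3, -4, 5 };

        // Expressive version: allocates enumerator and delegate objects.
        int linqSum = data.Where(x => x > 0).Select(x => x * 2).Sum();

        // Hand-fused equivalent: zero allocations, same result.
        int loopSum = 0;
        for (int i = 0; i < data.Length; i++)
            if (data[i] > 0)
                loopSum += data[i] * 2;

        System.Console.WriteLine(linqSum == loopSum); // both are 18
    }
}
```

Stack allocation of the enumerators plus delegate inlining and loop fusion would let the compiler derive the second form from the first in simple cases like this one.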
@mattwarren That Joe Duffy talk is amazing, thanks for sharing! To what degree is this work already in progress with the C# compiler, as opposed to just in experimental projects like Midori? In particular, the stuff he's talking about at around 23:00 seems a lot like what people here are asking for as far as LINQ optimizations. Is there an issue in this GitHub repo that tracks the progress on that? |
@timgoodman there are things here and there dotnet/coreclr#6653 |
@benaadams Thanks. I guess I'm not sure why this sort of thing would be under coreclr. The kinds of changes that Joe Duffy was describing seem like compiler optimizations - shouldn't they belong in roslyn or maybe llilc? |
Ah, never mind, I hadn't realized that the coreclr repo contains the JIT compiler. I guess that's where this sort of optimization would need to happen for it to apply to calls to System.Linq methods. |
Great effort! It sounds a bit silly to point out, but I've noticed only one register allocator in the codebase: the LSRA (linear scan) one. Is it possible, at least for methods with flags like AggressiveInlining, to use a different register allocator? Maybe the backtracking one (the new LLVM one) or a full register allocator?
It would be great to have at least minimal CHA (class hierarchy analysis): at minimum, for sealed classes to be devirtualized, or for internal classes in an assembly that are never overridden to be treated as sealed, and for that information to be used to devirtualize methods (more aggressively). Very often calls like ToString cannot be safely devirtualized because of the possibility that the methods are overridden; but in many assemblies private/internal classes are easy to track for overrides, especially as assemblies keep their types and relations local. This analysis might increase start-up time a bit, but it could be enabled in a "performance mode" tier.
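A small example of the sealed-class case (illustrative types; whether a given JIT actually performs this today is exactly what's being asked for):

```csharp
class Animal { public virtual string Speak() => "..."; }
sealed class Dog : Animal { public override string Speak() => "Woof"; }

static class Demo
{
    // Dog is sealed, so Speak() cannot be overridden further: the JIT
    // may replace the virtual dispatch with a direct, inlinable call.
    public static string Call(Dog d) => d.Speak();
}
```

The proposal above extends the same reasoning to internal classes: if no type in the assembly derives from them, they are effectively sealed even without the keyword.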
Hi all, I mostly did this because I couldn't see Java score better than C#, but my mental issues are not the subject of this issue. The main improvement with this version (i.e. where most of the fat came off) is the use of a ref-return dictionary instead of the .NET […] Try as I might, I couldn't find a proper discussion of adding new data structures, or new functionality to existing data structures, that would add ref-return APIs to […] Is anyone here aware of a discussion / decision regarding this? It feels too weird for the Roslyn team to ship this very nice new language feature and leave the whole data-structures part of the BCL out of it, so I feel I should ask whether anyone here, whom I assume to be very knowledgeable about the hi-perf situation of .NET / C#, could elaborate on where we currently stand...?
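For illustration, the shape of the missing BCL API looks roughly like this; `CounterMap` and `GetRef` are hypothetical names, not an existing type (the real `Dictionary<TKey,TValue>` indexer returns a copy for struct values, forcing a second lookup to write back):

```csharp
class CounterMap
{
    private readonly int[] _counts = new int[256];

    // A ref-return lookup: one probe, and the caller can read or
    // write the slot in place, e.g. map.GetRef(key)++.
    public ref int GetRef(byte key) => ref _counts[key];
}

// Usage:
//   var map = new CounterMap();
//   map.GetRef(7)++;   // increment with a single lookup, no copy
```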
@jcouv Yep, have been excitedly watching the new developments in C# and they definitely address some of the points. Others still remain to be discussed or addressed, but the big unsafe / slice / span part is done and discussion has been diverted to the individual issues in CoreCLR and CSharpLang. Closing this, looking forward to future improvements. |
I worked on highly optimized code, including hand optimized assembly code, in the video game industry for 13 years. |
I've been following recent development of C# as a language and it seems that there is a strong focus on providing the means to write code more efficiently. This is definitely neat. But what about providing ways to write more efficient code?
For context, I'm using C# mostly for game development (as in "lowlevel / from scratch") which has a habit of gladly abandoning the usual ways of safe code design for that 0.1% of the bottleneck code in favor of maximum efficiency. Unfortunately, there are cases where C# gets in the way of that last bit of optimization.
Issues related to this:
Other sentiments regarding this:
This is probably more of a broader discussion, but I guess my core question is: Is there a general roadmap regarding potential improvements for performance-focused code in C#?