Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance of Cursor's CursorChildren (CXCursor.VisitChildren) #260

Closed
danieljennings opened this issue Aug 16, 2021 · 4 comments · Fixed by #272
Closed

Performance of Cursor's CursorChildren (CXCursor.VisitChildren) #260

danieljennings opened this issue Aug 16, 2021 · 4 comments · Fixed by #272

Comments

@danieljennings
Copy link
Contributor

When profiling our application that uses ClangSharp, I've noticed that the Cursor class's CursorChildren lazy property evaluation takes a significant amount of time on profiles, specifically inside of the CXCursor.VisitChildren call.

It looks like 75% of that time is being spent inside of MarshalNative::GetFunctionPointerForDelegateInternal (of which nearly all of the time is inside of COMDelegate::ConvertToCallback).

Given the fact that this is likely to be a hotspot across many applications, is there an opportunity to avoid such a penalty by avoiding using the more "generic" CXCursor.VisitChildren in lieu of a custom native method that is P/Invoked to populate a List of the immediate children of a given Cursor using a native function (therefore avoiding interop of the otherwise-uninteresting delegate)?

If that's feasible, am I correct in my understanding of the project hierarchy here that such code would live within the libClangSharp project?

Thanks,

Daniel Jennings
Valve

@danieljennings
Copy link
Contributor Author

I've since realized that the fix is on the C# side, so here's a proof of concept solution (TranslationUnit_GetOrCreate.Value is just a lazy lookup of the MethodInfo for TranslationUnit.GetOrCreate<Cursor>, which you would clearly not need internal to ClangSharp). TLDR is instead of creating a new delegate for each invocation of Cursor.Children (and calling Marshal.GetFunctionPointerForDelegate for each one) we create a "static" delegate and pass the required fields inside of CursorChildClientData via the client data pointer.

In our application this cuts the total time to parse and traversing an entire AST via Cursor.Children by 40%, which demonstrates how significant of a penalty Marshal.GetFunctionPointerForDelegate is in practice.

private class CursorChildrenClientData
{
	public List<Cursor> Cursors;
	public TranslationUnit TranslationUnit;
}

private static unsafe CXCursorVisitor CursorChildrenDelegate = delegate ( CXCursor child, CXCursor parent, void *pClientData )
{
	GCHandle cursorsHandle = GCHandle.FromIntPtr( new IntPtr( pClientData ) );
	CursorChildrenClientData clientData = (CursorChildrenClientData)cursorsHandle.Target;
	Cursor orCreate = TranslationUnit_GetOrCreate.Value( clientData.TranslationUnit, child );
	clientData.Cursors.Add( orCreate );
	return CXChildVisitResult.CXChildVisit_Continue;
};

private static IntPtr CursorChildrenDelegatePointer = Marshal.GetFunctionPointerForDelegate( CursorChildrenDelegate );

private static unsafe IReadOnlyList<Cursor> CursorChildrenFastNoCache( Cursor cursor )
{
	CursorChildrenClientData clientData = new CursorChildrenClientData();
	clientData.Cursors = new List<Cursor>();
	clientData.TranslationUnit = cursor.TranslationUnit;

	GCHandle clientDataHandle = GCHandle.Alloc( clientData );
	clang.visitChildren( cursor.Handle, CursorChildrenDelegatePointer, GCHandle.ToIntPtr( clientDataHandle ).ToPointer() );
	clientDataHandle.Free();

	return clientData.Cursors;
}

@Perksey
Copy link
Member

Perksey commented Aug 28, 2021

Happy to contribute a fix which will be even faster on .NET 5+ :)

@tannergooding
Copy link
Member

Sorry for the delayed response, I thought I had already responded to this one. I'm happy to take performance fixes here, including by caching delegates where appropriate.

This is likely a case that might also benefit from using function pointers directly, but the requisite UnmanagedCallersOnly is .NET 5+ (I assume this is what @Perksey was alluding to) and so there would need to be a couple of small #ifdef to handle the difference for full framework.

@tannergooding
Copy link
Member

Have a fix up in #272. It saves some time, but other things like getting the Spelling profiled as significantly hotter on my box so I fixed those as well.

There's likely a few things that could be done differently/better here as separate work items. Initial thoughts include reducing the number of uncached P/Invokes that are "hot" and switching over to a form of "value lazy" to reduce allocations/indirections. However, we're already doing a pretty good job of caching things and there ends up being some cost here regardless given how Clang exposes everything.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants