Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Efficient API to stream from a MetadataReader to a MetadataBuilder with modifications #30443

Open
jnm2 opened this issue Jul 31, 2019 · 10 comments
Labels
area-System.Reflection.Metadata help wanted [up-for-grabs] Good issue for external contributors
Milestone

Comments

@jnm2
Copy link
Contributor

jnm2 commented Jul 31, 2019

Rewriting IL using System.Reflection.Metadata seems to bit a bit of a mind-melting endeavor. Due to the order-dependent relations between the tables, and most binary blobs needing row references to be patched, I haven't been able to find a scalable pattern for a single-pass read and write. It feels like something like this should be in theory possible. Maybe with something like a visitor pattern?

I'd like to do some things that are purely additive, such as inserting a module initializer, and others that subtract or modify rows, like generating ref assemblies from third party framework implementation assemblies. Some of these things will slow down every build because I want to put them in post-processing NuGet packages, so one goal is absolutely minimal overhead.

Does this sound like something that can be made generally useful and that you'd be willing to accept in SR.Metadata and keep up to date?

@nguerrera
Copy link
Contributor

This feels like the domain of a higher level API that can offer you a sort of DOM that you can read/edit/write. I think Mono.Cecil would be more appropriate.

@tmat
Copy link
Member

tmat commented Jul 31, 2019

I agree with Nick's Mono.Cecil recommendation.

That said I have a prototype for a low-level code that efficiently hydrates PEBuilder from PEReader. It's however more complicated than it looks. The blocker at this point is the lack of representation for native resources on PEReader, which is needed in order to be able to round-trip assemblies.

@jnm2 If you'd like to help move it forward I would be happy to accept PR with support for native resource reading and serialization. This is also something that multiple other systems could use (e.g. we could remove the native resource serialization logic from Roslyn, F#, etc.). Just fair warning - it's likely quite a bit of work (including public API design).

Once the hydration is done there is of course the question of additions/updates. I don't think we want to support arbitrary metadata updates. That'd mean building a DOM etc. and Mono.Cecil already does that. What would be potentially useful though would be the ability to apply EnC delta metadata after the MetadataBuilder has been hydrated from the reader. That might have some quite interesting applications.

@jnm2
Copy link
Contributor Author

jnm2 commented Jul 31, 2019

Would you expect Mono.Cecil to be slower than hydrating a MetadataBuilder from a MetadataReader in one pass, skipping parts you want to remove?

I don't think I want a DOM for this. If additions, modifications and deletions can be done in a general single-pass way with no intermediate representation between MetadataReader and MetadataBuilder, that's what I'm after. Could that converge with your prototype, @tmat?

@tmat
Copy link
Member

tmat commented Aug 1, 2019

Would you expect Mono.Cecil to be slower than hydrating a MetadataBuilder from a MetadataReader in one pass, skipping parts you want to remove

Yes.

If additions, modifications and deletions can be done in a general single-pass way with no intermediate representation between MetadataReader and MetadataBuilder, that's what I'm after.

I understand. Some operations can be done without DOM. It also depends what kind of APIs do you expect to be exposed for these operations. As I said, I can imagine an efficient API that applies EnC delta to a fully hydrated MetadataBuilder. I'm guessing most of the operations you want to do could be encoded as an EnC delta.

First step is to implement round-tripping. That's what my prototype is trying to do. And that itself is non-trivial.

@jnm2
Copy link
Contributor Author

jnm2 commented Aug 1, 2019

Could something like this be both generally useful and efficient? (uses Dictionary, could be swapped for a more efficient map)

public class MetadataVisitor
{
    private readonly Dictionary<Handle, Handle> handleMap = new Dictionary<Handle, Handle>();

    protected MetadataVisitor(MetadataReader reader, MetadataBuilder builder)
    {
        Reader = reader;
        Builder = builder ?? throw new ArgumentNullException(nameof(builder));
    }

    protected MetadataReader Reader { get; }
    protected MetadataBuilder Builder { get; }

    public void VisitAll()
    {
        VisitAssemblyDefinition();

        foreach (var handle in Reader.AssemblyFiles)
            VisitAssemblyFile(handle);

        foreach (var handle in Reader.AssemblyReferences)
            VisitAssemblyReference(handle);

        foreach (var handle in Reader.CustomDebugInformation)
            VisitCustomDebugInformation(handle);

        // ...
    }

    private Handle MapHandle(Handle readerHandle, Func<Handle, Handle> visitor)
    {
        if (readerHandle.IsNil) return default;

        if (!handleMap.TryGetValue(readerHandle, out var mappedHandle))
            handleMap.Add(readerHandle, mappedHandle = visitor.Invoke(mappedHandle));

        return mappedHandle;
    }

    public StringHandle MapString(StringHandle readerHandle)
    {
        return (StringHandle)MapHandle(readerHandle, h =>
            Builder.GetOrAddString(Reader.GetString((StringHandle)h)));
    }

    public GuidHandle MapGuid(GuidHandle readerHandle)
    {
        return (GuidHandle)MapHandle(readerHandle, h =>
            Builder.GetOrAddGuid(Reader.GetGuid((GuidHandle)h)));
    }

    public BlobHandle MapBlob(BlobHandle readerHandle)
    {
        return (BlobHandle)MapHandle(readerHandle, h =>
            Builder.GetOrAddBlob(Reader.GetBlobBytes((BlobHandle)h)));
    }

    public EntityHandle MapEntityHandle(EntityHandle readerHandle)
    {
        return readerHandle.Kind switch
        {
            HandleKind.AssemblyDefinition => EntityHandle.AssemblyDefinition,
            HandleKind.AssemblyFile => (EntityHandle)MapAssemblyFile((AssemblyFileHandle)readerHandle),
            HandleKind.AssemblyReference => MapAssemblyReference((AssemblyReferenceHandle)readerHandle),
            HandleKind.CustomDebugInformation => MapCustomDebugInformation((CustomDebugInformationHandle)readerHandle),
            _ => throw new NotImplementedException()
        };
    }

    protected virtual AssemblyDefinitionHandle VisitAssemblyDefinition()
    {
        var assemblyDefinition = Reader.GetAssemblyDefinition();

        return Builder.AddAssembly(
            MapString(assemblyDefinition.Name),
            assemblyDefinition.Version,
            MapString(assemblyDefinition.Culture),
            MapBlob(assemblyDefinition.PublicKey),
            assemblyDefinition.Flags,
            assemblyDefinition.HashAlgorithm);
    }

    public AssemblyFileHandle MapAssemblyFile(AssemblyFileHandle readerHandle)
    {
        return (AssemblyFileHandle)MapHandle(readerHandle, h => VisitAssemblyFile((AssemblyFileHandle)h));
    }

    protected virtual AssemblyFileHandle VisitAssemblyFile(AssemblyFileHandle readerHandle)
    {
        var assemblyFile = Reader.GetAssemblyFile(readerHandle);

        return Builder.AddAssemblyFile(
            MapString(assemblyFile.Name),
            MapBlob(assemblyFile.HashValue),
            assemblyFile.ContainsMetadata);
    }

    public AssemblyReferenceHandle MapAssemblyReference(AssemblyReferenceHandle readerHandle)
    {
        return (AssemblyReferenceHandle)MapHandle(readerHandle, h => VisitAssemblyReference((AssemblyReferenceHandle)h));
    }

    protected virtual AssemblyReferenceHandle VisitAssemblyReference(AssemblyReferenceHandle readerHandle)
    {
        var assemblyReference = Reader.GetAssemblyReference(readerHandle);

        return Builder.AddAssemblyReference(
            MapString(assemblyReference.Name),
            assemblyReference.Version,
            MapString(assemblyReference.Culture),
            MapBlob(assemblyReference.PublicKeyOrToken),
            assemblyReference.Flags,
            MapBlob(assemblyReference.HashValue));
    }

    public CustomDebugInformationHandle MapCustomDebugInformation(CustomDebugInformationHandle readerHandle)
    {
        return (CustomDebugInformationHandle)MapHandle(readerHandle, h => VisitCustomDebugInformation((CustomDebugInformationHandle)h));
    }

    protected virtual CustomDebugInformationHandle VisitCustomDebugInformation(CustomDebugInformationHandle readerHandle)
    {
        var customDebugInformation = Reader.GetCustomDebugInformation(readerHandle);

        return Builder.AddCustomDebugInformation(
            MapEntityHandle(customDebugInformation.Parent),
            MapGuid(customDebugInformation.Kind),
            MapBlob(customDebugInformation.Value)); // Maybe this becomes invalid and should be dropped if not parsed?
    }
}

@tmat
Copy link
Member

tmat commented Aug 1, 2019

Copying the rows from reader to builder is the easy part. I already have very efficient code for that. It's not useful on its own though without the rest of the re-hydration code all the way up to PEReader, which includes native resources.

@tmat
Copy link
Member

tmat commented Aug 1, 2019

@jnm2
Copy link
Contributor Author

jnm2 commented Aug 1, 2019

@tmat It sounds like your prototype starts by copying everything rather than planning for some things to be added, modified or skipped in the initial copy. I'm not sure I see that as the most efficient way to add and modify and delete. How does deleting rows work when row handles are already serialized in blobs? Let's say for example that I want to strip out private types, modify private fields, and leave most tables uncopied altogether, but preserve attributes which are referencing many of the remaining types in the now-smaller table.

Does calculating an EnC diff solve this problem? Is calculating and applying that diff close enough in efficiency to only copying what is needed in the first place?
You know a lot more than I do. Sorry if these are dumb questions!

@tmat
Copy link
Member

tmat commented Aug 1, 2019

It's an early prototype :) Obviously, this requires more work. I started with the goal to roundtrip any assembly without modifying the metadata. Already turned out to be non-trivial problem.

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the Future milestone Feb 1, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@ericstj ericstj removed the untriaged New issue has not been triaged by the area owner label Jun 29, 2020
@wasabii
Copy link

wasabii commented Jun 16, 2022

@tmat Did anything become of your rewriter? I've found myself in need of similar.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.Reflection.Metadata help wanted [up-for-grabs] Good issue for external contributors
Projects
No open projects
Development

No branches or pull requests

7 participants