Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a GetEncodings method to System.Text.EncodingProvider to support enumerating available character encodings #25819

Closed
mklement0 opened this issue Apr 9, 2018 · 25 comments · Fixed by #37528
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Text.Encoding
Milestone

Comments

@mklement0
Copy link

mklement0 commented Apr 9, 2018

This API proposal arose out of #25804.

This proposal was approved before but we discovered the original proposal was returning EncodingInfo object while we didn't expose any public constructor for this type.
The delta change in this proposal is exposing a public constructor for EncodingInfo to allow creating such objects

Rationale and Use Cases

As of this writing:

However, it is desirable to have the ability to enumerate all available encodings:

  • Even in the abstract it seems like an awkward omission not to be able to reflect on the available encodings; more concretely, consider the following use cases, both related to PowerShell Core:

  • PowerShell Core commands such as Get-Content for reading text files support an -Encoding parameter that directly accepts System.Text.Encoding instances, so users expect to be able to discover all available encodings and/or be assisted in selecting a specific encoding with tab completion:

Get-Content file.txt -Encoding <tab-keypress>  # should show all available encodings
  • Similarly, a green-lighted, but not yet implemented Convert-TextFile command will allow conversion between all available character encodings, so their discovery / ease of selection is important there too.

Proposed API

Add a GetEncodings() method to abstract class System.Text.EncodingProvider:

namespace System.Text
{
    public abstract class EncodingProvider
    {
      // ...
      // New virtual method, analogous to System.Text.Encoding.GetEncodings()
      public virtual System.Text.EncodingInfo[] GetEncodings ();
      // ...
    }
    
    public sealed class EncodingInfo // already exposed class 
    {
       // This is internal constructor and we are making it public
        public EncodingInfo(int codePage, string name, string displayName); 
    }
}

And System.Text.CodePagesEncodingProvider would then implement that method to return the encodings it provides.

Specifically, a call to System.Text.CodePagesEncodingProvider.Instance.GetEncodings() would then return the array of System.Text.EncodingInfo instances representing that provider's encodings.

As with System.Text.Encodings.GetEncodings(), no particular enumeration order is guaranteed.


Integration into System.Text.Encodings.GetEncodings()

Once System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance) has been called, System.Text.Encodings.GetEncodings() will enumerate the additional encodings too, using the newly introduced method on the provider if registered.

In other words: System.Text.Encodings.GetEncodings() will return whatever encodings are currently available - whether just the default set by default or the union of the default set and the additional encodings after provider registration.

As with the existing enumeration, the encodings (EncodingInfo instances) are returned in no guaranteed order.

If there is more than one registered provider that supports a given encoding, the returned list will contain only one entry for it.

Updates (most recent ones first)

  • Clarification re returning only distinct encodings.
  • Return behavior clarified again - no particular enumeration order is guaranteed
  • Made new GetEncodings method virtual rather than abstract.
  • Open question removed in favor of unconditionally outputting all then-available encodings.
  • Return behavior clarified (ordering).
@mklement0 mklement0 changed the title Add a GetEncodings method to System.Text.EncodingProvider Add a GetEncodings method to System.Text.EncodingProvider to support enumerating available character encodings Apr 9, 2018
@jkotas
Copy link
Member

jkotas commented Apr 9, 2018

public abstract System.Text.EncodingInfo[] GetEncodings()

This cannot be abstract method. Adding abstract method to existing type is a breaking change.

Either way, a new overload is needed for GetEncodings()

I do not think we need a new overload for GetEncodings(). GetEncodings() should just return all encodings available.

@tarekgh
Copy link
Member

tarekgh commented Apr 9, 2018

@mklement0 thanks for the proposal.

For the open question, System.Text.Encodings.GetEncodings() should report all encodings including the provider encodings. I am not sure why this will be a compatibility issue? I cannot think in any breaking scenario here as we are going to return a superset from what we used to return. could you please elaborate more on that?

@mklement0
Copy link
Author

@tarekgh, @jkotas:

To be clear, my preference too is for System.Text.Encodings.GetEncodings() to return all available encodings, at least by default.

If there's no concern about backward compatibility (see below) and no need to support discovering just the default encodings while additional ones happen to be registered, then I'm happy to

  • make do without a new System.Text.Encodings.GetEncodings() overload.

  • let System.Text.Encodings.GetEncodings() unconditionally return whatever encodings are currently available.

I'm personally not worried about backward compatibility in this case, but conceivably there is code out there that relies on System.Text.Encodings.GetEncodings() only ever returning the default encodings; in other words: someone may rely on System.Text.Encodings.GetEncodings() as a reliable way to discover the default encodings.

So: Do wee need to worry about retaining a predictable, invariant way to discover the set of default encodings?

@jkotas:

This cannot be abstract method. Adding abstract method to existing type is a breaking change.

Thanks for pointing that out; I need guidance as to what to propose instead.

@tarekgh
Copy link
Member

tarekgh commented Apr 9, 2018

please remove the open question then.

Thanks for pointing that out; I need guidance as to what to propose instead.

make the method virtual instead of abstract.

@mklement0
Copy link
Author

@tarekgh: Thanks - done. I've also added details re return behavior (enumeration order); let me know if that makes sense.

@tarekgh
Copy link
Member

tarekgh commented Apr 9, 2018

@mklement0

System.Text.CodePagesEncodingProvider would then have to implement that method to return the encodings it provides, in ascending order of the CodePage property values of the EncodingInfo instances, analogous to System.Text.Encodings.GetEncodings().

System.Text.Encodings.GetEncodings documentation doesn't promise for any sort order. From where you got that?
https://msdn.microsoft.com/en-us/library/system.text.encoding.getencodings(v=vs.110).aspx#Remarks

@mklement0
Copy link
Author

@tarekgh:

You're right - I didn't check the docs, I just inferred it from the de facto behavior.

Do you want me to say that the order is unspecified? What about when forming the union between default and registered encodings?

@tarekgh
Copy link
Member

tarekgh commented Apr 9, 2018

I prefer not mentioning anything about the order. we just promise to return the whole set supported by the core and the providers.

@mklement0
Copy link
Author

@tarekgh: Done (I've added a note that no particular enumeration order is guaranteed).

@tarekgh
Copy link
Member

tarekgh commented Apr 9, 2018

Thanks @mklement0

I have added the following statement too to the proposal integration section. let me know if you have any concern with that.

If there is more than one registered provider support same specific encoding, the returned list will contain only one entry for that specific encoding.

@tarekgh
Copy link
Member

tarekgh commented Apr 9, 2018

CC @krwq

@mklement0
Copy link
Author

Thanks, @tarekgh - good addition (I've streamlined the wording a bit).

@iSazonov
Copy link
Contributor

There may be another scenario if the user wants to get only the default list:

public virtual System.Text.EncodingInfo[] GetEncodings (bool IncludeDefaultsOnly = false);

@mklement0
Copy link
Author

@iSazonov: This was discussed above, but both @jkotas and @tarekgh think it's not necessary.

(On a side note: My guess is that separate method overloads are preferred to optional parameters; may be worth adding a clarification to the coding style document.)

@terrajobst
Copy link
Contributor

terrajobst commented Apr 17, 2018

Video

  • EncodingProvider.GetEncodings() should be IEnumerable<EncodingInfo>
  • EncodingProvider.GetEncodings() should be virtual and return an empty list (instead of throwing). While this isn't correct it means that one provider that isn't updated doesn't spoil the enumeration for everyone else.
  • Encoding.GetEncodings() should return all registered encodings across all providers and de-dupe them if necessary.

@tarekgh
Copy link
Member

tarekgh commented Apr 17, 2018

@mklement0 do you want to help in the implementation?

@mklement0
Copy link
Author

@tarekgh: Sounds good; I'll get started soon (unlike last time, where Anipik thankfully eventually jumped in).

@iSazonov
Copy link
Contributor

@mklement0 Ping me too if you want.

@mklement0
Copy link
Author

mklement0 commented May 10, 2018

I gave it a shot, but hit roadblocks (see below - some may require renewed discussion and sign-off).

Given my inexperience, I suggest someone more experienced take this on - @iSazonov, please feel free to take over.


Implementing this API requires CoreCLR (CoreLib) changes too, and I don't know how to coordinate that (even just adding a new virtual System.Text.EncodingProvider.GetEncodings() method breaks CoreFX-only builds):

  • Encoding.cs, whose Encoding.GetEncodings() method needs modifying, is part of CoreLib.

  • Encoding.GetEncodings() would need to iterate over all registered encoding providers in order to ask each for the encodings it provides, but EncodingProvider currently exposes no method for enumerating them.

On the CoreFX side:

@tarekgh
Copy link
Member

tarekgh commented May 10, 2018

@mklement0 thanks for trying. we'll try to get into this at some point.

@tarekgh
Copy link
Member

tarekgh commented May 10, 2018

t's unclear how to instantiate EncodingInfo: There's only an internal constructor

I think you have raised a good point too. we may need to revisit discussing EncodingInfo class to enable constructing it outside coreclr and provide the needed functionality. we'll try to think more about this one.

    public sealed partial class EncodingInfo
    {
        internal EncodingInfo() { }
        public int CodePage { get { throw null; } }
        public string DisplayName { get { throw null; } }
        public string Name { get { throw null; } }
        public override bool Equals(object value) { throw null; }
        public System.Text.Encoding GetEncoding() { throw null; }
        public override int GetHashCode() { throw null; }
    }

@iSazonov
Copy link
Contributor

@mklement0 I haven't an expirience to contribute in CoreCLR/CoreFX 😕

breaks CoreFX-only builds

I believe it will be two PRs - in CoreCLR repo and in CoreFX repo. After the first one will be merged it will be automatically replicated in Core FX repo. After that the second PR in CoreFX can be continue.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@iSazonov
Copy link
Contributor

@tarekgh Have you any news about EncodingInfo?

@tarekgh
Copy link
Member

tarekgh commented Feb 16, 2020

Have you any news about EncodingInfo?

No update yet because it is kind of low priority for now. But I suggest adding the following public constructor to EncodingInfo

 public EncodingInfo(int codePage, string name, string displayName) { }

@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@tarekgh tarekgh added api-ready-for-review and removed untriaged New issue has not been triaged by the area owner api-needs-work API needs work before it is approved, it is NOT ready for implementation labels Mar 9, 2020
@tarekgh tarekgh added the blocking Marks issues that we want to fast track in order to unblock other important work label May 19, 2020
@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review blocking Marks issues that we want to fast track in order to unblock other important work labels May 26, 2020
@terrajobst
Copy link
Contributor

terrajobst commented May 26, 2020

Video

  • EncodingProvider.GetEncodings() should be IEnumerable<EncodingInfo>
  • EncodingProvider.GetEncodings() should be virtual and return an Enumerable.Empty<EncodingInfo>() (instead of throwing). While this isn't correct it means that one provider that isn't updated doesn't spoil the enumeration for everyone else.
  • Encoding.GetEncodings() should return all registered encodings across all providers and de-dupe them if necessary.
  • EncodingInfo.GetEncoding() calls the static Encoding.GetEncoding() method, which means getting encoding infos and using EncodingInfo.GetEncoding() might create an encoding that isn't tied to that encoding provider.
    • EncodingInfo should take an EncodingProvider that we'll use in EncodingInfo.GetEncoding() and call EncodingProvider.GetEncoding(codePage).
namespace System.Text
{
    public partial class EncodingProvider
    {
        public virtual IEnumerable<EncodingInfo> GetEncodings();
    }
    
    public partial class EncodingInfo
    {
        public EncodingInfo(EncodingProvider provider, int codePage, string name, string displayName);
    }
}

@ghost ghost locked as resolved and limited conversation to collaborators Dec 17, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Text.Encoding
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants