nsentinel/EncodingMixToUtf8

EncodingMixToUtf8 is a command-line tool that detects file encodings and transforms files to UTF-8.

Content

Why?

Visual Studio 2015 RTM "introduces" a compiler bug in processing source files saved in non-UTF-8 encodings. This tool converts source files to UTF-8 as a quick workaround. I've used it on our repositories and hope it will be useful to someone else.

Known workarounds

  • Jared Parsons (jaredpar) suggestion:

You can explicitly specify the code page in the project settings using the CodePage element in the csproj file:

<CodePage>1251</CodePage>
  • Kevin Pilch-Bisson (Pilchie) suggestion:

You can also run msbuild with the /p:CodePage=1251 command line option when building on the command line, or set the environment variable CodePage to 1251 for VS to pick it up as well.

  • KoHHeKT suggestion for manual conversion:

Tools - Options - Environment - Documents - Save Document as Unicode if cannot be saved in Codepage

  • You can also use any text editor that can detect the encoding and save as UTF-8, e.g. Notepad++.

How it works

Technically, the tool is simple and straightforward, except for encoding auto-detection.

For encoding detection I've used the SimpleHelpers.FileEncoding package by Khalid Salomão for convenience. It wraps a C# port of an encoding detection algorithm based on the Mozilla Universal Charset Detector.

I wish to thank all the authors for their hard work. It greatly simplifies the task.
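
To make the conversion step concrete, here is a minimal Python sketch. It is not the tool's C# code. It assumes detection has already produced the source encoding, and `convert_to_utf8` is a hypothetical helper named only for this illustration. This README does not say whether the tool writes a byte order mark, so the sketch writes none.

```python
def convert_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known legacy encoding as UTF-8 (no BOM).

    Hypothetical helper for illustration only; `source_encoding` is
    assumed to come from an earlier detection step.
    """
    # Strict decoding raises UnicodeDecodeError on bytes that are not
    # valid in the assumed encoding, instead of silently replacing them.
    text = data.decode(source_encoding, errors="strict")
    return text.encode("utf-8")


if __name__ == "__main__":
    legacy = "// Привет".encode("cp1251")  # a comment saved as windows-1251
    converted = convert_to_utf8(legacy, "cp1251")
    print(converted.decode("utf-8"))
```

Strict decoding only catches bytes that are undefined in the assumed encoding. A wrong guess between two similar single-byte code pages usually decodes without error, which is why checking the detection log first matters.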

Command Line Options

  -p, --path                Required. Root directory path to start scanning
                            from

  -m, --search-extension    (Default: .cs;.vb;.settings;.resx) Search file
                            extensions

  -c, --codepage            Restrictions on converting from selected codepage
                            only. Highly recommended to convert only selected
                            codepages instead of everything due to chances of
                            false encoding detection.

  -o, --override            Override encoding detection to correct detection
                            errors for similar encodings (e.g. windows-1251 vs
                            x-mac-cyrillic). Format:
                            detected_code_page=mapped_code_page; e.g.
                            10007=1251; to map x-mac-cyrillic on windows-1251

  -t, --transform           Do text transformation (by default is check only
                            mode)

  -b, --backup              Backup path (modified files go there before
                            processing)

  -l, --log                 (Default: log.txt) Logging file

  --help                    Display this help screen.
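
The `detected_code_page=mapped_code_page;` format of `--override` can be read as a small lookup table. The Python sketch below shows one way to parse it. `parse_overrides` is a hypothetical name, and accepting several `;`-separated pairs is an assumption; the tool's actual parser may differ.

```python
def parse_overrides(spec: str) -> dict[int, int]:
    """Parse 'detected=mapped;' pairs into {detected: mapped}.

    Illustrative only: the accepted grammar is an assumption based on
    the help text, not taken from the tool's source.
    """
    mapping: dict[int, int] = {}
    for pair in spec.split(";"):
        pair = pair.strip()
        if not pair:
            # Tolerate the trailing ';' shown in the help text.
            continue
        detected, sep, mapped = pair.partition("=")
        if not sep:
            raise ValueError(f"expected detected=mapped, got {pair!r}")
        mapping[int(detected)] = int(mapped)
    return mapping


if __name__ == "__main__":
    # x-mac-cyrillic (10007) is remapped to windows-1251 (1251).
    print(parse_overrides("10007=1251;"))
```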

Known issues

  • I strongly recommend running the tool the first time without -t, --transform and without -c, --codepage (to get the encodings of all source files), then checking the resulting log file (log.txt by default) to see how detection went. Checking the detection results before transforming helps you avoid bad conversions in cases where auto-detection goes wrong. You can convert such files manually beforehand or specify the -o, --override option.

  • I also highly recommend converting only from a specified codepage via the -c, --codepage option. You can repeat the conversion for other encodings one by one and verify the results. At the very least, carefully check the encoding distribution in the log file (in scan-only mode) before doing any transformations.

  • Similar encodings can be falsely detected, e.g. windows-1251 can be detected as x-mac-cyrillic. Use the -o, --override option to specify the replacement encoding in such cases.

  • If you do not use a version control system, specify a backup path via -b, --backup to avoid losing source text during conversion. The backup preserves the original folder structure.

  • Any error (e.g. IO) during the transformation aborts the conversion and may leave files in a partially modified state.
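
The windows-1251 versus x-mac-cyrillic confusion mentioned above can be reproduced with Python's built-in `cp1251` and `mac_cyrillic` codecs. This illustrates the problem only; it is not the detector the tool uses. In Python's codec tables the two encodings share the byte values for most lowercase Cyrillic letters, so the same bytes decode without error under either one and only some characters come out wrong.

```python
def shared_high_bytes() -> list[int]:
    """Byte values >= 0x80 that decode to the same character in both codecs."""
    shared = []
    for value in range(0x80, 0x100):
        raw = bytes([value])
        # cp1251 leaves one byte undefined, so decode with replacement
        # there to keep the comparison from raising.
        if raw.decode("cp1251", errors="replace") == raw.decode("mac_cyrillic"):
            shared.append(value)
    return shared


text = "Привет"
raw = text.encode("cp1251")          # bytes as saved in windows-1251

as_1251 = raw.decode("cp1251")       # correct interpretation
as_mac = raw.decode("mac_cyrillic")  # what a misdetection would yield

if __name__ == "__main__":
    print("correct:     ", as_1251)
    print("misdetected: ", as_mac)
    print("shared high bytes:", len(shared_high_bytes()))
```

Because neither decode raises an error, a misdetected file converts "successfully" into wrong text. That is the case the `-o, --override` mapping is meant to correct.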

Usage examples

  • Check encodings in w:\Src\Path
> EncodingMixToUtf8.exe -p w:\Src\Path
  • Check for windows-1251 encoding only in w:\Src\Path. A full scan first is recommended to check for false detections.
> EncodingMixToUtf8.exe -p w:\Src\Path -c 1251
  • Check for windows-1251 encoding only, with x-mac-cyrillic forcibly overridden to windows-1251, in w:\Src\Path. The log file will contain only files with windows-1251 encoding.
> EncodingMixToUtf8.exe -p w:\Src\Path -c 1251 -o 10007=1251
  • Transform source files in w:\Src\Path from windows-1251 encoding (mapping x-mac-cyrillic to windows-1251), backing up files to w:\Src\Backup
> EncodingMixToUtf8.exe -p w:\Src\Path -t -c 1251 -o 10007=1251 -b w:\Src\Backup
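
If you want to script the scan-then-transform sequence shown above, the command lines can be assembled programmatically. The Python sketch below uses only the flags documented in this README. `build_args` is a hypothetical helper, and the executable is assumed to be on the PATH.

```python
def build_args(path, *, transform=False, codepage=None,
               override=None, backup=None) -> list[str]:
    """Assemble an EncodingMixToUtf8 command line from documented flags."""
    args = ["EncodingMixToUtf8.exe", "-p", path]
    if transform:
        args.append("-t")               # without -t the tool only checks
    if codepage is not None:
        args += ["-c", str(codepage)]   # restrict to one source codepage
    if override:
        args += ["-o", override]        # e.g. "10007=1251"
    if backup:
        args += ["-b", backup]          # originals are copied here first
    return args


if __name__ == "__main__":
    # Step 1: scan only. Step 2: transform once the log looks right.
    scan = build_args(r"w:\Src\Path")
    convert = build_args(r"w:\Src\Path", transform=True, codepage=1251,
                         override="10007=1251", backup=r"w:\Src\Backup")
    print(" ".join(scan))
    print(" ".join(convert))
```

Each list can be passed to `subprocess.run(args, check=True)` on a machine where the tool is installed. Review the log from the scan step before running the transform step.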
