EncodingMixToUtf8 is a command line tool that detects file encodings and converts files to UTF-8.
Visual Studio 2015 RTM "introduced" a compiler bug when processing source files with non-UTF-8 encodings, so this is a tool to convert source files to UTF-8 as a quick workaround. I've used it on our repositories; hope it will be useful for someone else.
Other suggested workarounds:
- Jared Parsons (jaredpar) suggestion: you can explicitly specify the code page in the project settings using the `CodePage` element in the csproj file:
  `<CodePage>1251</CodePage>`
- Kevin Pilch-Bisson (Pilchie) suggestion: you can also pass the `/p:CodePage=1251` option to msbuild when building on the command line, or set the environment variable `CodePage` to `1251` for VS to pick it up as well (see the example after this list).
- KoHHeKT suggestion for manual conversion in Visual Studio: Tools -> Options -> Environment -> Documents -> "Save Document as Unicode if it cannot be saved in Codepage".
- You can also use any text editor that can detect the encoding and save as UTF-8, e.g. Notepad++.
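For example, the msbuild workaround might look like this (a sketch; `MySolution.sln` is a placeholder for your own solution or project file):

`> msbuild MySolution.sln /p:CodePage=1251`
`> set CodePage=1251`

The second line sets the `CodePage` environment variable in a cmd prompt, so a Visual Studio instance started from that environment can pick it up.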
Technically the tool is pretty simple and straightforward, except for encoding auto-detection. For that I've used the SimpleHelpers.FileEncoding package by Khalid Salomão for convenience; it wraps a C# port of the encoding detection algorithm based on Mozilla Universal Charset Detector.
I wish to thank all the authors for their hard work; it greatly simplifies the task.
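Once the source encoding is known, the conversion itself boils down to decoding the file with that codepage and re-encoding it as UTF-8. A minimal sketch of that step (not the tool's actual code; detection is delegated to SimpleHelpers.FileEncoding, so here the source codepage is passed in explicitly):

```csharp
using System.IO;
using System.Text;

static class Utf8Converter
{
    // Re-encode a single file from a known source codepage (e.g. 1251 for
    // windows-1251) to UTF-8. In the real tool the codepage would come from
    // encoding detection (constrained by -c/--codepage and corrected by
    // -o/--override); here it is a plain argument.
    public static void ConvertToUtf8(string path, int sourceCodePage)
    {
        Encoding source = Encoding.GetEncoding(sourceCodePage);
        string text = File.ReadAllText(path, source);           // decode with the original encoding
        File.WriteAllText(path, text, new UTF8Encoding(true));  // write back as UTF-8 (with BOM here;
                                                                // the tool's own choice may differ)
    }
}
```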
Command line options:
- `-p, --path`: Required. Root directory path to start scanning from.
- `-m, --search-extension`: Search file extensions (default: `.cs;.vb;.settings;.resx`).
- `-c, --codepage`: Restrict conversion to the selected codepage only. It is highly recommended to convert only selected codepages instead of everything, due to the chance of false encoding detection.
- `-o, --override`: Override encoding detection to correct detection errors for similar encodings (e.g. windows-1251 vs x-mac-cyrillic). Format: `detected_code_page=mapped_code_page`, e.g. `10007=1251` to map x-mac-cyrillic to windows-1251.
- `-t, --transform`: Do the text transformation (by default the tool runs in check-only mode).
- `-b, --backup`: Backup path (modified files are copied there before processing).
- `-l, --log`: Log file (default: `log.txt`).
- `--help`: Display this help screen.
Recommendations:
- I strongly recommend running the tool the first time without `-t, --transform` and without `-c, --codepage` (to get encodings for all source files) and checking the resulting log file (`log.txt` by default) to see how detection went. Reviewing the detection results before doing any transformation helps you avoid potentially bad conversions in cases where auto-detection goes wrong. You can pre-convert such files manually or specify the `-o, --override` option.
- I also highly recommend converting only a specified codepage via the `-c, --codepage` option. You can repeat the conversion for other encodings one by one and verify the results. At the very least, carefully check the log file with the encoding distribution (in scan-only mode) before doing any transformations.
- There can be false detection of similar encodings, e.g. windows-1251 can be detected as x-mac-cyrillic. You can use the `-o, --override` option to specify a replacement encoding in such cases (see the sketch after this list).
- If you do not use a version control system, specify a backup path via `-b, --backup` to avoid losing source text during conversion. It preserves the original folder structure.
- Any error (e.g. an IO error) during transformation aborts the conversion and can leave files in a partially modified state.
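For illustration, a hypothetical parser for the `-o, --override` value (not the tool's actual code; it assumes pairs in the form `detected_code_page=mapped_code_page`, optionally separated by `;`, and skips validation for brevity):

```csharp
using System;
using System.Collections.Generic;

static class OverrideMap
{
    // Parse an override value such as "10007=1251" into a codepage mapping.
    // Hypothetical helper; the real tool's parsing rules may differ.
    public static Dictionary<int, int> Parse(string option)
    {
        var map = new Dictionary<int, int>();
        foreach (var pair in option.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
        {
            var parts = pair.Split('=');
            map[int.Parse(parts[0].Trim())] = int.Parse(parts[1].Trim());
        }
        return map;
    }

    // Apply the mapping: if the detected codepage has an override, use it instead.
    public static int Apply(Dictionary<int, int> map, int detectedCodePage)
    {
        return map.TryGetValue(detectedCodePage, out var mapped) ? mapped : detectedCodePage;
    }
}
```

With `-o 10007=1251`, a file detected as x-mac-cyrillic (codepage 10007) is treated as windows-1251 (codepage 1251) when converting.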
Examples:
- Check encodings in `w:\Src\Path`:
  `> EncodingMixToUtf8.exe -p w:\Src\Path`
- Check for the `windows-1251` encoding only in `w:\Src\Path` (it is recommended to do a full scan first to check for false detections):
  `> EncodingMixToUtf8.exe -p w:\Src\Path -c 1251`
- Check for the `windows-1251` encoding only, with forced overriding of `x-mac-cyrillic` to `windows-1251`, in `w:\Src\Path`. The log file will contain files with the `windows-1251` encoding only:
  `> EncodingMixToUtf8.exe -p w:\Src\Path -c 1251 -o 10007=1251`
- Transform source files in `w:\Src\Path` from the `windows-1251` encoding (substituting `x-mac-cyrillic` with `windows-1251`) and back up files to `w:\Src\Backup`:
  `> EncodingMixToUtf8.exe -p w:\Src\Path -t -c 1251 -o 10007=1251 -b w:\Src\Backup`