Add support of UTF-8 for Windows #728

SirLynix · 2017-04-02T15:03:53Z

Hi there,

We're using Premake at my job and we had some troubles when trying to copy files from/to locations containing unicode characters, the reason for this is that Premake uses the ASCII version of the Windows API.

I made a fix by adding two helper functions (utf8_fromwide and utf8_towide) and using them with the Unicode Windows API everywhere I could (I just didn't do it for the os.getpass function, making the Windows console working with unicode is not an easy thing to do).

Anyway, tell me if I need to fix some stuff, as I'm not sure I did everything in the "right" (premake-team) way 😄

…de_file_operations

SirLynix · 2017-04-02T19:27:54Z

src/host/os_isdir.c

@@ -42,6 +44,10 @@ int os_isdir(lua_State* L)
 	{
 		lua_pushboolean(L, 0);
 	}
+
+#ifdef _WIN32
+	lua_pop(L, -2); /* pop wide string */


Oops, should be lua_remove, can't fix it right now

tvandijck · 2017-04-03T16:10:55Z

src/host/utf8handling.c

+	lua_pushstring(L, unicode_str);
+	free(unicode_str);
+
+	return unicode_str;


you just free'd that pointer.... should really not return that, as that will be a "use after free" access violation.

Indeed, in my original design I wasn't returning anything. Will change that.

tvandijck · 2017-04-03T16:12:36Z

src/host/utf8handling.c

+#include "stdlib.h"
+
+#ifdef PLATFORM_WINDOWS
+const char* utf8_fromwide(lua_State* L, const wchar_t* wstr)


I would prefer a size_t utf8_fromwide(char* destination, size_t destSize, const wchar_t* wstr);

which would not need to allocate memory, would allow for the call to WideCharToMultiByte to simply be called once. And wouldn't need the LuaState... pushing the string to the state would be up to the caller.

I though using Lua GC to manage memory was a good idea, but I will make the changes.

Mmh, actually a wrapper function is useless if all it does is calling WideCharToMultiByte, since we can do that in places where we are calling this function (since we're already checking for the Windows part).

What if I make something like that:

const char* utf8_fromwide(const wchar_t* wstr) { int size_required = WideCharToMultiByte(CP_UTF8, 0, wstr, -1, NULL, 0, NULL, NULL); char* unicode_str = (char*) malloc(size_required * sizeof(char)); WideCharToMultiByte(CP_UTF8, 0, wstr, -1, unicode_str, size_required, NULL, NULL); return unicode_str; }

be a better solution? Of course the caller has to free the buffer then (which I dislike).

I agree that that is not preferable... In general I was just hoping that a malloc for this isn't something we need... mostly because these match functions are used ALOT, and so have quite a bit of performance impact... on our project we're already hitting 2GB of memory use.

if that removes the need for this method entirely, then I'm OK with just calling it in place to be entirely honest, and simply use stack for this...

tvandijck · 2017-04-03T16:16:00Z

Other then the use after free, I have no real objections... I think API wise relying less on lua and more on just C would be nice, but it's no breaking point for me.. we can always refactor that further if we wish for that..

SirLynix · 2017-04-03T18:28:45Z

I'll try to think of something that will not impact performances much then, especially there. I'll use the stack wherever I can and remove those two functions then, I hope I can make it in a not-so-dirty way, despite the Win32 calls. Le 3 avr. 2017 8:24 PM, "Tom van Dijck" <notifications@github.com> a écrit :

…

***@***.**** commented on this pull request. ------------------------------ In src/host/utf8handling.c <#728 (comment)>: > @@ -0,0 +1,33 @@ +/** + * \file utf8handking.c + * \brief Handles conversions between Unicode (UTF-8) and system native encoding (wide chars on Windows) + * \author Copyright (c) 2017 Jérôme Leclercq and the Premake project + */ + +#include "premake.h" +#include "stdlib.h" + +#ifdef PLATFORM_WINDOWS +const char* utf8_fromwide(lua_State* L, const wchar_t* wstr) I agree that that is not preferable... In general I was just hoping that a malloc for this isn't something we need... mostly because these match functions are used ALOT, and so have quite a bit of performance impact... on our project we're already hitting 2GB of memory use. if that removes the need for this method entirely, then I'm OK with just calling it in place to be entirely honest, and simply use stack for this... — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#728 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AC3QXXZ5ZGwbTQqZrj32mBKwtbFqfRQAks5rsTldgaJpZM4Mw07h> .

fun4jimmy · 2017-04-03T18:33:22Z

Some of this stuff was discussed as part of #629. I never got round to doing anything that was proposed but I think this might fix that issue as well.

SirLynix · 2017-04-04T07:08:07Z

I remade this using only stack memory (and I'm wondering why I didn't do that in the first place..).
In case any conversion fails, it will trigger a Lua error, as discussed in #629.

I'm wondering, should I apply the same changes to the integrated Lua library to support unicode operations? (Talking about io library)

tvandijck · 2017-04-04T15:57:36Z

Thank you very much, that looks pretty clean to me... ready to merge as far as I'm concerned.

I'm wondering, should I apply the same changes to the integrated Lua library to support unicode operations? (Talking about io library)

We could, but lets make that a separate pull request.. I'm not too fond of actually changing lua itself to be honest, since that makes moving between versions of Lua really cumbersome.. So I'd rather see extensions to the io library to support this... like an io.open_utf8(...) or something of that kind, I don't know how that would look really.. so I don't really have a very well thought out opinion on this, but it should be fairly straightforward to 'override' the standard io library with methods that support utf8 instead.

I would imagine that really all you need to override is the io.open method? all the other methods work on stream handles (FILE*, etc), so those should not be affected.

SirLynix · 2017-04-04T16:53:46Z

No problem with making another pull request.
I can override the files functions of io without touching the Lua library, but I think I'll have to override more than one function (since Windows Unicode stuff doesn't work with FILE* handles AFAIK, I'll check that).

I don't think making a separate Unicode open function will work in this case because of the io metatable (for calls like handle:read()), so I'll probably override, except if I find a way to make Unicode works with FILE* handles.

SirLynix added 4 commits April 2, 2017 15:39

Merge remote-tracking branch 'refs/remotes/premake/master'

00aa7e3

Add support of unicode to all file operations

98c3814

Merge remote-tracking branch 'refs/remotes/premake/master' into unico…

443536d

…de_file_operations

Fix problematic assert

0cc98c5

SirLynix commented Apr 2, 2017

View reviewed changes

Fix Lua stack corruption with os.isdir

d83689d

tvandijck suggested changes Apr 3, 2017

View reviewed changes

Rework unicode encoding/decoding (Windows)

9c4c998

tvandijck approved these changes Apr 4, 2017

View reviewed changes

tvandijck merged commit 21184d4 into premake:master Apr 4, 2017

tvandijck mentioned this pull request Apr 11, 2017

Do not support chinese path of source file #629

Closed

SirLynix mentioned this pull request May 14, 2017

Unicode file functions for Windows #785

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support of UTF-8 for Windows #728

Add support of UTF-8 for Windows #728

SirLynix commented Apr 2, 2017

SirLynix Apr 2, 2017

tvandijck Apr 3, 2017

SirLynix Apr 3, 2017

tvandijck Apr 3, 2017

SirLynix Apr 3, 2017

SirLynix Apr 3, 2017

tvandijck Apr 3, 2017

tvandijck commented Apr 3, 2017

SirLynix commented Apr 3, 2017 via email

fun4jimmy commented Apr 3, 2017

SirLynix commented Apr 4, 2017

tvandijck commented Apr 4, 2017

SirLynix commented Apr 4, 2017

Add support of UTF-8 for Windows #728

Add support of UTF-8 for Windows #728

Conversation

SirLynix commented Apr 2, 2017

SirLynix Apr 2, 2017

Choose a reason for hiding this comment

tvandijck Apr 3, 2017

Choose a reason for hiding this comment

SirLynix Apr 3, 2017

Choose a reason for hiding this comment

tvandijck Apr 3, 2017

Choose a reason for hiding this comment

SirLynix Apr 3, 2017

Choose a reason for hiding this comment

SirLynix Apr 3, 2017

Choose a reason for hiding this comment

tvandijck Apr 3, 2017

Choose a reason for hiding this comment

tvandijck commented Apr 3, 2017

SirLynix commented Apr 3, 2017 via email

fun4jimmy commented Apr 3, 2017

SirLynix commented Apr 4, 2017

tvandijck commented Apr 4, 2017

SirLynix commented Apr 4, 2017