-
Notifications
You must be signed in to change notification settings - Fork 292
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Windows code to work with non-native code pages #57
base: master
Are you sure you want to change the base?
Conversation
Filenames may have up to 32K chars (and eben more bytes) if they use UNC names like "?\UNC-SHARE..." |
If I follow correctly, this makes the dir() function always return UTF-8 strings. If that is the case, this is a breaking change — what happens when returning files with the native codepage and it happens to not be UTF-8? Shouldn't we be doing this converting based on the current locale? |
Windows uses utf16 to unicode API. And if you use unicode then you can use path names longer than MAX_PATH |
Cant we have a function argument on all lfs functions for using old behaviour (native codepage paths) or new one (utf-8 paths)? |
@hishamhm "ñ" character will be returned in utf-8 as a two byte string and in current implementation as one byte only, but it is not a breaking change because it is "always" returning the same two bytes for "ñ". If the path with "ñ" comes from other source and is in unicode instead of utf8, we could provide a function for converting it to utf8 Also: in my code editor (Notepad++) when I write "bañera" as a dir name the "ñ" character is encoded in utf8 instead of an ASCii extension. |
@Adriatic1 wchar_t name[260]; is used for file name without prepending path but full path length is 32760 |
The original code has MAX_PATH buffer limit, so I guess I adapted to it (see typedef struct dir_data definition). Feel free to build upon the patch and add modifications for longer paths, I am low on free times nowadays... |
I would be pleased to apply unicode to utf8 changes but only if @hishamhm agrees with that. |
https://github.com/sonoro1234/luafilesystem/tree/unicode |
That would be a great addition, but I think the right thing to do would be to apply these changes only if the active encoding (e.g. detected via setlocale?) is using UTF-8. This way we could support UTF-8 and keep compatibility for those using legacy codepages. |
Using unicode you have the benefit of not depending on code page (Microsoft recomends using unicode for getting access to characters in whatever code page) |
I understand that UTF-8 in the Lua side + Unicode functions in the Windows API is better. My concern is with backward compatibility. If I understand the Windows side of things in LuaFileSystem correctly, the current behavior is that, if a user is using Windows-1251 (aka Latin1, ISO8859-1) in the system locale, and they give to lfs a string encoded using it (e.g. Could we add a flag to switch lfs to UTF-8 mode in LFS 1.x, and then make it the default in LuaFileSystem 2.0? |
this is a test on windows7 with "olé" and other file with caracter outside code page: first cmd dir
with this script
script output
changing lfs for lfs_ffi (unicode)
1-lfs 2-io.open is trickier: io.open with lfs letra_?_letra.txt fails a replacement of io.open unicode would succed with lfs_ffi but would fail with lfs on letra_?_letra.txt and with olé.txt (but olé.txt could be made to work with conversion function) |
@sonoro1234 Thank you for the tests. This confirm that simply adopting the UTF-8 conversion approach of lfs_ffi is not possible for lfs 1.x, because it breaks compatibility with io.open(). And even in lfs 2.x, I think most users will complain if the results of lfs.dir() are not directly usable with io.open() without a conversion step. I read some more about the ugly world of Windows codepages, and my original plan wouldn't work because Windows does not support UTF-8 codepages, so we can't detect "UTF-8 mode" from setlocale. I also noted that in the first script output it displays "olÚ.txt" instead of "olé.txt". This means the Lua interpreter is running with the DOS codepage (cp-850) rather than the Windows-1252 codepage (that I think (based on this) that if you put Could you run another test, please? What happens if you run that script with lfs adding |
os.setlocale(".65001") -> nil I was wondering: Is io.open on linux being able to open any utf8 path? |
perhaps provide lfs.open unicode-utf8 aware? |
For folks finding this discussion from search results, start with the initial problem report in #56. |
What is the current state of affairs with LFS and Windows? As I understand it LFS on the Windows platform tends to use Windows 1252 character encoding in a US-centric environment which is a super-set of the the ISO 8859-1/Latin1 in that they both use a single 8-bit number to encode a range of characters that extend upward from the standard ASCII 7-bit (0-127) numbers although other parts of the World may use a different selection of characters for the range of numbers from 128 to 255. UTF-8 uses the same numbers for the Unicode code-points as Windows 1252 but the way that the non-ASCII numbers (above 127) are encoded means that for characters with number from 128 to 255 (in fact all the way to 2047) use two bytes not one per characters. The normal Windows i/o functions use whatever 8-bit codepage (mapping of character numbers 128-255 to characters) is currently selected but that is very fragile because if any values (above 127) are created by a different system/user for a file or path name the mapping between 8-bit values and characters - what The only reliable way to pass non-ASCII file and path names to Windows i/o functions is to use the so-called wide_character versions that require/return UTF-16 arguments and which always use the same "number" (or sequence of numbers) to represent the same "character". It is not possible to use UTF-16 with the current Q.E.D. LFS is broken for Windows for non-ASCII characters and fixing it means that previous code that sort of worked when fed/returning file/path names that used non-ASCII characters that were present in the selected code-page on WIndows because the system created and using those file/path name had the same encoding, will fail to work when the same data is read on a new version that uses UTF-8 internally to match how other OSes do it. OTOH That file/path name data is more likely to be readable by those other systems that are already using UTF-8 internally. |
I think it's rather simple and such solution is already used ib a lot of software including mine: library works with UTF-8. When it calls WinAPI, UTF-8 strings are recoded into UTF-16. When it calls Unix API, strings are used as is. |
@SlySven as far as I can tell this is also how all file I/O in the Lua standard library works. Should we go the direction suggested by @sonoro1234 and provide UTF-8-enabled replacements like lfs.open and lfs.remove? I wouldn't be opposed to that in a LFS 2.0, but we'd need someone to contribute that code as I don't do Windows development myself. Would we also need a lfs.toutf8 function for converting local-codepage strings into UTF-8 strings so that code that currently works using the local code page could have an easy path fpr converting? (Otherwise they would need some sort of WinAPI bindings defeating the purpose of LFS). LFS is designed to be a portable interface. Should we ensure UTF-8 interfaces on non-Windows platforms as well? Most Linux systems currently run on UTF-8 but not all. Should a toutf8 function be available there as well, binding libiconv? That wouldn't be hard to do. |
I'm particularly tied to Lua 5.1 because the project I code for cannot afford to move to a later minor version without breaking things for our current users who will have scripts and fragments of lua code that they may not personally be able to revise and we have given an undertaking not to break our own API in a non-backwards compatible way. That being the case and because we are aiming to achieve I18n for our next major version we have only recently started to include a recommendation of a UTF8 string module - although I understand that the equivalent is already in 5.3 if not 5.2 . I am not sure what processes are needed to be available to transcode between UTF-8 and UTF-16 but it seems that the file i/o library will have to do this as close to the Windows calls as possible... it looks like luawinfile does this. |
I recently discovered luawinfile. It's similar to lfs, but converts strings from UTF-8 to UTF-16 and uses the wide-string versions of the various file-related functions. It seems to work; I was able to open a file whose name contained non-ASCII characters. Perhaps it can be used as a guide for making luafilesystem Unicode-friendly on Windows. |
I found that luawinfile uses Lua 5.3 functions and I gave not been able to backport it to 5.1 - I'm just hoping someone else has the skills (and inclination) to do it... 🙏 |
I created a (very messy) fork that compiles under Lua 5.1, but doesn't quite work. 😬 |
https://github.com/sonoro1234/luafilesystem/blob/unicode/lfs_ffi.lua#L184 |
9cd10fe
to
f77f248
Compare
Fixes the Windows OS issue with some code failing to list some files where name is given in non-native code page.