Fix Windows code to work with non-native code pages #57

Adriatic1 · 2016-01-02T07:20:26Z

Fixes the Windows OS issue with some code failing to list some files where name is given in non-native code page.

Bulat-Ziganshin · 2016-10-16T22:09:43Z

Filenames may have up to 32K chars (and eben more bytes) if they use UNC names like "?\UNC-SHARE..."

hishamhm · 2016-10-19T05:53:08Z

If I follow correctly, this makes the dir() function always return UTF-8 strings. If that is the case, this is a breaking change — what happens when returning files with the native codepage and it happens to not be UTF-8? Shouldn't we be doing this converting based on the current locale?

moteus · 2016-10-19T06:00:14Z

Windows uses utf16 to unicode API. And if you use unicode then you can use path names longer than MAX_PATH
This is related link
https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247(v=vs.85).aspx#maxpath

sonoro1234 · 2017-02-13T11:42:50Z

Cant we have a function argument on all lfs functions for using old behaviour (native codepage paths) or new one (utf-8 paths)?
Perhaps less verbose would be a library runtime flag for swtching behaviour.

sonoro1234 · 2017-02-17T10:25:52Z

@hishamhm "ñ" character will be returned in utf-8 as a two byte string and in current implementation as one byte only, but it is not a breaking change because it is "always" returning the same two bytes for "ñ".

If the path with "ñ" comes from other source and is in unicode instead of utf8, we could provide a function for converting it to utf8

Also: in my code editor (Notepad++) when I write "bañera" as a dir name the "ñ" character is encoded in utf8 instead of an ASCii extension.

sonoro1234 · 2017-02-17T14:02:03Z

@Adriatic1 wchar_t name[260]; is used for file name without prepending path but full path length is 32760
according to
https://msdn.microsoft.com/en-us/library/windows/desktop/ee681827(v=vs.85).aspx

Adriatic1 · 2017-02-17T15:41:01Z

The original code has MAX_PATH buffer limit, so I guess I adapted to it (see typedef struct dir_data definition). Feel free to build upon the patch and add modifications for longer paths, I am low on free times nowadays...

sonoro1234 · 2017-02-18T14:04:28Z

I would be pleased to apply unicode to utf8 changes but only if @hishamhm agrees with that.

sonoro1234 · 2017-02-22T16:43:16Z

https://github.com/sonoro1234/luafilesystem/tree/unicode
is the ffi lfs version I am using for this now

hishamhm · 2017-02-22T18:40:36Z

I would be pleased to apply unicode to utf8 changes but only if @hishamhm agrees with that.

That would be a great addition, but I think the right thing to do would be to apply these changes only if the active encoding (e.g. detected via setlocale?) is using UTF-8. This way we could support UTF-8 and keep compatibility for those using legacy codepages.

sonoro1234 · 2017-02-22T19:32:24Z

Using unicode you have the benefit of not depending on code page (Microsoft recomends using unicode for getting access to characters in whatever code page)
And then, having translation between unicode and utf8, we can use utf8 as a convenient unicode replacement in the Lua side.

hishamhm · 2017-02-22T19:48:32Z

I understand that UTF-8 in the Lua side + Unicode functions in the Windows API is better.

My concern is with backward compatibility. If I understand the Windows side of things in LuaFileSystem correctly, the current behavior is that, if a user is using Windows-1251 (aka Latin1, ISO8859-1) in the system locale, and they give to lfs a string encoded using it (e.g. "olé" as 6F 6C E9), they get the expected file.

Could we add a flag to switch lfs to UTF-8 mode in LFS 1.x, and then make it the default in LuaFileSystem 2.0?

sonoro1234 · 2017-02-23T09:19:24Z

this is a test on windows7 with "olé" and other file with caracter outside code page:

first cmd dir

c:\luaGL\frames_anima\luajit_test\dir>dir
 El volumen de la unidad C no tiene etiqueta.
 El número de serie del volumen es: 182F-9872

 Directorio de c:\luaGL\frames_anima\luajit_test\dir

23/02/2017  09:30    <DIR>          .
23/02/2017  09:30    <DIR>          ..
23/02/2017  09:06                 0 asc.txt
12/02/2017  17:15                 0 letra_?_letra.txt
23/02/2017  08:54                 0 olé.txt
               3 archivos              0 bytes
               2 dirs  69.161.799.680 bytes libres

with this script

local sep = package.config:sub(1,1)
local dir = [[c:\luaGL\frames_anima\luajit_test\dir]]
lfs = require"lfs"
for file in lfs.dir(dir) do
    if file~="." and file~=".." then
        print()
        print(file)
        print(file:byte(1,#file))
        print(lfs.attributes(dir..sep..file))
        local handle,err = io.open(dir..sep..file,"r")
        print(handle,err)
        if handle then handle:close() end
    end
end

script output

asc.txt
97      115     99      46      116     120     116
table: 0x00321200
file (0x76072960)       nil

letra_?_letra.txt
108     101     116     114     97      95      63      95      108     101     116     114     97      46
116     120     116
nil     cannot obtain information from file `c:\luaGL\frames_anima\luajit_test\dir\letra_?_letra.txt'
nil     c:\luaGL\frames_anima\luajit_test\dir\letra_?_letra.txt: Invalid argument

olÚ.txt
111     108     233     46      116     120     116
table: 0x00329428
file (0x76072960)       nil

changing lfs for lfs_ffi (unicode)

asc.txt
97      115     99      46      116     120     116
table: 0x002b7490
file (0x76072960)       nil

letra_Ãó_letra.txt
108     101     116     114     97      95      199     162     95      108     101     116     114     97
46      116     120     116
table: 0x002bcd18
nil     c:\luaGL\frames_anima\luajit_test\dir\letra_Ãó_letra.txt: No such file or directory

ol├®.txt
111     108     195     169     46      116     120     116
table: 0x002bcca0
nil     c:\luaGL\frames_anima\luajit_test\dir\ol├®.txt: No such file or directory

1-lfs
lfs fails with letra_Ãó_letra.txt, lfs_ffi does not fail in any file.
so unicode seems better.

2-io.open is trickier:
io.open expects ascii extended with code page so:

io.open with lfs letra_?_letra.txt fails
io.open with lfs_ffi fails on letra_Ãó_letra.txt and ol├®.txt
with a function for utf8 to ascii extended olé.txt would succed.

a replacement of io.open unicode would succed with lfs_ffi but would fail with lfs on letra_?_letra.txt and with olé.txt (but olé.txt could be made to work with conversion function)

hishamhm · 2017-02-23T14:07:57Z

@sonoro1234 Thank you for the tests.

This confirm that simply adopting the UTF-8 conversion approach of lfs_ffi is not possible for lfs 1.x, because it breaks compatibility with io.open(). And even in lfs 2.x, I think most users will complain if the results of lfs.dir() are not directly usable with io.open() without a conversion step.

I read some more about the ugly world of Windows codepages, and my original plan wouldn't work because Windows does not support UTF-8 codepages, so we can't detect "UTF-8 mode" from setlocale.

I also noted that in the first script output it displays "olÚ.txt" instead of "olé.txt". This means the Lua interpreter is running with the DOS codepage (cp-850) rather than the Windows-1252 codepage (that dir used above). (This table helped me confirm this.)

I think (based on this) that if you put os.setlocale(".1252") at the beginning of the script, it would display "olé.txt".

Could you run another test, please? What happens if you run that script with lfs adding os.setlocale(".65001") in the beginning? Thank you!

sonoro1234 · 2017-02-23T15:42:10Z

os.setlocale(".65001") -> nil
so it is not working
os.setlocale(".1252") lets me see olé.txt from lua

I was wondering: Is io.open on linux being able to open any utf8 path?

sonoro1234 · 2017-02-23T17:11:51Z

perhaps provide lfs.open unicode-utf8 aware?

cxw42 · 2017-11-08T20:00:38Z

For folks finding this discussion from search results, start with the initial problem report in #56.

SlySven · 2018-04-28T23:16:43Z

What is the current state of affairs with LFS and Windows?

As I understand it LFS on the Windows platform tends to use Windows 1252 character encoding in a US-centric environment which is a super-set of the the ISO 8859-1/Latin1 in that they both use a single 8-bit number to encode a range of characters that extend upward from the standard ASCII 7-bit (0-127) numbers although other parts of the World may use a different selection of characters for the range of numbers from 128 to 255. UTF-8 uses the same numbers for the Unicode code-points as Windows 1252 but the way that the non-ASCII numbers (above 127) are encoded means that for characters with number from 128 to 255 (in fact all the way to 2047) use two bytes not one per characters.

The normal Windows i/o functions use whatever 8-bit codepage (mapping of character numbers 128-255 to characters) is currently selected but that is very fragile because if any values (above 127) are created by a different system/user for a file or path name the mapping between 8-bit values and characters - what string.byte() and string.code() use - can break either because a character in the original file/path name code page is not included in the end user's system or it is in a different place (in the range 128-255).

The only reliable way to pass non-ASCII file and path names to Windows i/o functions is to use the so-called wide_character versions that require/return UTF-16 arguments and which always use the same "number" (or sequence of numbers) to represent the same "character".

It is not possible to use UTF-16 with the current string library because the latter is predicated around 8-bit values for characters; it is possible to use UTF-8 with the current string library (for post Lua 5.1 versions) and for 5.1 and some prior versions, additional libraries can fill in the gaps in the library at those versions.

Q.E.D. LFS is broken for Windows for non-ASCII characters and fixing it means that previous code that sort of worked when fed/returning file/path names that used non-ASCII characters that were present in the selected code-page on WIndows because the system created and using those file/path name had the same encoding, will fail to work when the same data is read on a new version that uses UTF-8 internally to match how other OSes do it. OTOH That file/path name data is more likely to be readable by those other systems that are already using UTF-8 internally.

Bulat-Ziganshin · 2018-04-28T23:30:37Z

I think it's rather simple and such solution is already used ib a lot of software including mine: library works with UTF-8. When it calls WinAPI, UTF-8 strings are recoded into UTF-16. When it calls Unix API, strings are used as is.

hishamhm · 2018-04-29T00:52:43Z

@SlySven as far as I can tell this is also how all file I/O in the Lua standard library works.

Should we go the direction suggested by @sonoro1234 and provide UTF-8-enabled replacements like lfs.open and lfs.remove?

I wouldn't be opposed to that in a LFS 2.0, but we'd need someone to contribute that code as I don't do Windows development myself.

Would we also need a lfs.toutf8 function for converting local-codepage strings into UTF-8 strings so that code that currently works using the local code page could have an easy path fpr converting? (Otherwise they would need some sort of WinAPI bindings defeating the purpose of LFS).

LFS is designed to be a portable interface. Should we ensure UTF-8 interfaces on non-Windows platforms as well? Most Linux systems currently run on UTF-8 but not all. Should a toutf8 function be available there as well, binding libiconv? That wouldn't be hard to do.

SlySven · 2018-04-29T06:01:18Z

I'm particularly tied to Lua 5.1 because the project I code for cannot afford to move to a later minor version without breaking things for our current users who will have scripts and fragments of lua code that they may not personally be able to revise and we have given an undertaking not to break our own API in a non-backwards compatible way. That being the case and because we are aiming to achieve I18n for our next major version we have only recently started to include a recommendation of a UTF8 string module - although I understand that the equivalent is already in 5.3 if not 5.2 .

I am not sure what processes are needed to be available to transcode between UTF-8 and UTF-16 but it seems that the file i/o library will have to do this as close to the Windows calls as possible... it looks like luawinfile does this.

Erutuon · 2018-11-01T00:47:01Z

I recently discovered luawinfile. It's similar to lfs, but converts strings from UTF-8 to UTF-16 and uses the wide-string versions of the various file-related functions. It seems to work; I was able to open a file whose name contained non-ASCII characters. Perhaps it can be used as a guide for making luafilesystem Unicode-friendly on Windows.

SlySven · 2018-11-02T17:04:45Z

I found that luawinfile uses Lua 5.3 functions and I gave not been able to backport it to 5.1 - I'm just hoping someone else has the skills (and inclination) to do it... 🙏

Erutuon · 2018-11-02T20:10:32Z

I created a (very messy) fork that compiles under Lua 5.1, but doesn't quite work. 😬

sonoro1234 · 2018-11-02T20:23:14Z

https://github.com/sonoro1234/luafilesystem/blob/unicode/lfs_ffi.lua#L184
are the experiments (succesfull for me) that I did on this LuaJIT ffi binding from spacewander.
When module.unicode is true, all functions (which emulate lfs) do the necessary conversions between unicode and utf-8.

Fix Windows code to work with non-native code pages

b99cf78

sonoro1234 mentioned this pull request Dec 6, 2019

proposal for windows unicode #132

Open

hishamhm force-pushed the master branch 2 times, most recently from 9cd10fe to f77f248 Compare June 24, 2022 15:10

cuavas mentioned this pull request Oct 27, 2024

Does not compile in Windows in Unicode (WCHAR) mode #170

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Windows code to work with non-native code pages #57

Fix Windows code to work with non-native code pages #57

Adriatic1 commented Jan 2, 2016

Bulat-Ziganshin commented Oct 16, 2016

hishamhm commented Oct 19, 2016

moteus commented Oct 19, 2016 •

edited

Loading

sonoro1234 commented Feb 13, 2017

sonoro1234 commented Feb 17, 2017 •

edited

Loading

sonoro1234 commented Feb 17, 2017

Adriatic1 commented Feb 17, 2017

sonoro1234 commented Feb 18, 2017

sonoro1234 commented Feb 22, 2017

hishamhm commented Feb 22, 2017

sonoro1234 commented Feb 22, 2017

hishamhm commented Feb 22, 2017

sonoro1234 commented Feb 23, 2017

hishamhm commented Feb 23, 2017

sonoro1234 commented Feb 23, 2017

sonoro1234 commented Feb 23, 2017

cxw42 commented Nov 8, 2017

SlySven commented Apr 28, 2018

Bulat-Ziganshin commented Apr 28, 2018

hishamhm commented Apr 29, 2018

SlySven commented Apr 29, 2018

Erutuon commented Nov 1, 2018 •

edited

Loading

SlySven commented Nov 2, 2018

Erutuon commented Nov 2, 2018 •

edited

Loading

sonoro1234 commented Nov 2, 2018

Fix Windows code to work with non-native code pages #57

Are you sure you want to change the base?

Fix Windows code to work with non-native code pages #57

Conversation

Adriatic1 commented Jan 2, 2016

Bulat-Ziganshin commented Oct 16, 2016

hishamhm commented Oct 19, 2016

moteus commented Oct 19, 2016 • edited Loading

sonoro1234 commented Feb 13, 2017

sonoro1234 commented Feb 17, 2017 • edited Loading

sonoro1234 commented Feb 17, 2017

Adriatic1 commented Feb 17, 2017

sonoro1234 commented Feb 18, 2017

sonoro1234 commented Feb 22, 2017

hishamhm commented Feb 22, 2017

sonoro1234 commented Feb 22, 2017

hishamhm commented Feb 22, 2017

sonoro1234 commented Feb 23, 2017

hishamhm commented Feb 23, 2017

sonoro1234 commented Feb 23, 2017

sonoro1234 commented Feb 23, 2017

cxw42 commented Nov 8, 2017

SlySven commented Apr 28, 2018

Bulat-Ziganshin commented Apr 28, 2018

hishamhm commented Apr 29, 2018

SlySven commented Apr 29, 2018

Erutuon commented Nov 1, 2018 • edited Loading

SlySven commented Nov 2, 2018

Erutuon commented Nov 2, 2018 • edited Loading

sonoro1234 commented Nov 2, 2018

moteus commented Oct 19, 2016 •

edited

Loading

sonoro1234 commented Feb 17, 2017 •

edited

Loading

Erutuon commented Nov 1, 2018 •

edited

Loading

Erutuon commented Nov 2, 2018 •

edited

Loading