Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Windows code to work with non-native code pages #57

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Adriatic1
Copy link

Fixes the Windows OS issue with some code failing to list some files where name is given in non-native code page.

@Bulat-Ziganshin
Copy link

Filenames may have up to 32K chars (and eben more bytes) if they use UNC names like "?\UNC-SHARE..."

@hishamhm
Copy link
Member

If I follow correctly, this makes the dir() function always return UTF-8 strings. If that is the case, this is a breaking change — what happens when returning files with the native codepage and it happens to not be UTF-8? Shouldn't we be doing this converting based on the current locale?

@moteus
Copy link
Contributor

moteus commented Oct 19, 2016

Windows uses utf16 to unicode API. And if you use unicode then you can use path names longer than MAX_PATH
This is related link
https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247(v=vs.85).aspx#maxpath

@sonoro1234
Copy link

Cant we have a function argument on all lfs functions for using old behaviour (native codepage paths) or new one (utf-8 paths)?
Perhaps less verbose would be a library runtime flag for swtching behaviour.

@sonoro1234
Copy link

sonoro1234 commented Feb 17, 2017

@hishamhm "ñ" character will be returned in utf-8 as a two byte string and in current implementation as one byte only, but it is not a breaking change because it is "always" returning the same two bytes for "ñ".

If the path with "ñ" comes from other source and is in unicode instead of utf8, we could provide a function for converting it to utf8

Also: in my code editor (Notepad++) when I write "bañera" as a dir name the "ñ" character is encoded in utf8 instead of an ASCii extension.

@sonoro1234
Copy link

@Adriatic1 wchar_t name[260]; is used for file name without prepending path but full path length is 32760
according to
https://msdn.microsoft.com/en-us/library/windows/desktop/ee681827(v=vs.85).aspx

@Adriatic1
Copy link
Author

The original code has MAX_PATH buffer limit, so I guess I adapted to it (see typedef struct dir_data definition). Feel free to build upon the patch and add modifications for longer paths, I am low on free times nowadays...

@sonoro1234
Copy link

I would be pleased to apply unicode to utf8 changes but only if @hishamhm agrees with that.

@sonoro1234
Copy link

https://github.com/sonoro1234/luafilesystem/tree/unicode
is the ffi lfs version I am using for this now

@hishamhm
Copy link
Member

I would be pleased to apply unicode to utf8 changes but only if @hishamhm agrees with that.

That would be a great addition, but I think the right thing to do would be to apply these changes only if the active encoding (e.g. detected via setlocale?) is using UTF-8. This way we could support UTF-8 and keep compatibility for those using legacy codepages.

@sonoro1234
Copy link

Using unicode you have the benefit of not depending on code page (Microsoft recomends using unicode for getting access to characters in whatever code page)
And then, having translation between unicode and utf8, we can use utf8 as a convenient unicode replacement in the Lua side.

@hishamhm
Copy link
Member

I understand that UTF-8 in the Lua side + Unicode functions in the Windows API is better.

My concern is with backward compatibility. If I understand the Windows side of things in LuaFileSystem correctly, the current behavior is that, if a user is using Windows-1251 (aka Latin1, ISO8859-1) in the system locale, and they give to lfs a string encoded using it (e.g. "olé" as 6F 6C E9), they get the expected file.

Could we add a flag to switch lfs to UTF-8 mode in LFS 1.x, and then make it the default in LuaFileSystem 2.0?

@sonoro1234
Copy link

this is a test on windows7 with "olé" and other file with caracter outside code page:

first cmd dir

c:\luaGL\frames_anima\luajit_test\dir>dir
 El volumen de la unidad C no tiene etiqueta.
 El número de serie del volumen es: 182F-9872

 Directorio de c:\luaGL\frames_anima\luajit_test\dir

23/02/2017  09:30    <DIR>          .
23/02/2017  09:30    <DIR>          ..
23/02/2017  09:06                 0 asc.txt
12/02/2017  17:15                 0 letra_?_letra.txt
23/02/2017  08:54                 0 olé.txt
               3 archivos              0 bytes
               2 dirs  69.161.799.680 bytes libres

with this script

local sep = package.config:sub(1,1)
local dir = [[c:\luaGL\frames_anima\luajit_test\dir]]
lfs = require"lfs"
for file in lfs.dir(dir) do
    if file~="." and file~=".." then
        print()
        print(file)
        print(file:byte(1,#file))
        print(lfs.attributes(dir..sep..file))
        local handle,err = io.open(dir..sep..file,"r")
        print(handle,err)
        if handle then handle:close() end
    end
end

script output

asc.txt
97      115     99      46      116     120     116
table: 0x00321200
file (0x76072960)       nil

letra_?_letra.txt
108     101     116     114     97      95      63      95      108     101     116     114     97      46
116     120     116
nil     cannot obtain information from file `c:\luaGL\frames_anima\luajit_test\dir\letra_?_letra.txt'
nil     c:\luaGL\frames_anima\luajit_test\dir\letra_?_letra.txt: Invalid argument

olÚ.txt
111     108     233     46      116     120     116
table: 0x00329428
file (0x76072960)       nil

changing lfs for lfs_ffi (unicode)

asc.txt
97      115     99      46      116     120     116
table: 0x002b7490
file (0x76072960)       nil

letra_Ãó_letra.txt
108     101     116     114     97      95      199     162     95      108     101     116     114     97
46      116     120     116
table: 0x002bcd18
nil     c:\luaGL\frames_anima\luajit_test\dir\letra_Ãó_letra.txt: No such file or directory

ol├®.txt
111     108     195     169     46      116     120     116
table: 0x002bcca0
nil     c:\luaGL\frames_anima\luajit_test\dir\ol├®.txt: No such file or directory

1-lfs
lfs fails with letra_Ãó_letra.txt, lfs_ffi does not fail in any file.
so unicode seems better.

2-io.open is trickier:
io.open expects ascii extended with code page so:

io.open with lfs letra_?_letra.txt fails
io.open with lfs_ffi fails on letra_Ãó_letra.txt and ol├®.txt
with a function for utf8 to ascii extended olé.txt would succed.

a replacement of io.open unicode would succed with lfs_ffi but would fail with lfs on letra_?_letra.txt and with olé.txt (but olé.txt could be made to work with conversion function)

@hishamhm
Copy link
Member

@sonoro1234 Thank you for the tests.

This confirm that simply adopting the UTF-8 conversion approach of lfs_ffi is not possible for lfs 1.x, because it breaks compatibility with io.open(). And even in lfs 2.x, I think most users will complain if the results of lfs.dir() are not directly usable with io.open() without a conversion step.

I read some more about the ugly world of Windows codepages, and my original plan wouldn't work because Windows does not support UTF-8 codepages, so we can't detect "UTF-8 mode" from setlocale.

I also noted that in the first script output it displays "olÚ.txt" instead of "olé.txt". This means the Lua interpreter is running with the DOS codepage (cp-850) rather than the Windows-1252 codepage (that dir used above). (This table helped me confirm this.)

I think (based on this) that if you put os.setlocale(".1252") at the beginning of the script, it would display "olé.txt".

Could you run another test, please? What happens if you run that script with lfs adding os.setlocale(".65001") in the beginning? Thank you!

@sonoro1234
Copy link

os.setlocale(".65001") -> nil
so it is not working
os.setlocale(".1252") lets me see olé.txt from lua

I was wondering: Is io.open on linux being able to open any utf8 path?

@sonoro1234
Copy link

perhaps provide lfs.open unicode-utf8 aware?

@cxw42
Copy link

cxw42 commented Nov 8, 2017

For folks finding this discussion from search results, start with the initial problem report in #56.

@SlySven
Copy link

SlySven commented Apr 28, 2018

What is the current state of affairs with LFS and Windows?

As I understand it LFS on the Windows platform tends to use Windows 1252 character encoding in a US-centric environment which is a super-set of the the ISO 8859-1/Latin1 in that they both use a single 8-bit number to encode a range of characters that extend upward from the standard ASCII 7-bit (0-127) numbers although other parts of the World may use a different selection of characters for the range of numbers from 128 to 255. UTF-8 uses the same numbers for the Unicode code-points as Windows 1252 but the way that the non-ASCII numbers (above 127) are encoded means that for characters with number from 128 to 255 (in fact all the way to 2047) use two bytes not one per characters.

The normal Windows i/o functions use whatever 8-bit codepage (mapping of character numbers 128-255 to characters) is currently selected but that is very fragile because if any values (above 127) are created by a different system/user for a file or path name the mapping between 8-bit values and characters - what string.byte() and string.code() use - can break either because a character in the original file/path name code page is not included in the end user's system or it is in a different place (in the range 128-255).

The only reliable way to pass non-ASCII file and path names to Windows i/o functions is to use the so-called wide_character versions that require/return UTF-16 arguments and which always use the same "number" (or sequence of numbers) to represent the same "character".

It is not possible to use UTF-16 with the current string library because the latter is predicated around 8-bit values for characters; it is possible to use UTF-8 with the current string library (for post Lua 5.1 versions) and for 5.1 and some prior versions, additional libraries can fill in the gaps in the library at those versions.

Q.E.D. LFS is broken for Windows for non-ASCII characters and fixing it means that previous code that sort of worked when fed/returning file/path names that used non-ASCII characters that were present in the selected code-page on WIndows because the system created and using those file/path name had the same encoding, will fail to work when the same data is read on a new version that uses UTF-8 internally to match how other OSes do it. OTOH That file/path name data is more likely to be readable by those other systems that are already using UTF-8 internally.

@Bulat-Ziganshin
Copy link

I think it's rather simple and such solution is already used ib a lot of software including mine: library works with UTF-8. When it calls WinAPI, UTF-8 strings are recoded into UTF-16. When it calls Unix API, strings are used as is.

@hishamhm
Copy link
Member

@SlySven as far as I can tell this is also how all file I/O in the Lua standard library works.

Should we go the direction suggested by @sonoro1234 and provide UTF-8-enabled replacements like lfs.open and lfs.remove?

I wouldn't be opposed to that in a LFS 2.0, but we'd need someone to contribute that code as I don't do Windows development myself.

Would we also need a lfs.toutf8 function for converting local-codepage strings into UTF-8 strings so that code that currently works using the local code page could have an easy path fpr converting? (Otherwise they would need some sort of WinAPI bindings defeating the purpose of LFS).

LFS is designed to be a portable interface. Should we ensure UTF-8 interfaces on non-Windows platforms as well? Most Linux systems currently run on UTF-8 but not all. Should a toutf8 function be available there as well, binding libiconv? That wouldn't be hard to do.

@SlySven
Copy link

SlySven commented Apr 29, 2018

I'm particularly tied to Lua 5.1 because the project I code for cannot afford to move to a later minor version without breaking things for our current users who will have scripts and fragments of lua code that they may not personally be able to revise and we have given an undertaking not to break our own API in a non-backwards compatible way. That being the case and because we are aiming to achieve I18n for our next major version we have only recently started to include a recommendation of a UTF8 string module - although I understand that the equivalent is already in 5.3 if not 5.2 .

I am not sure what processes are needed to be available to transcode between UTF-8 and UTF-16 but it seems that the file i/o library will have to do this as close to the Windows calls as possible... it looks like luawinfile does this.

@Erutuon
Copy link

Erutuon commented Nov 1, 2018

I recently discovered luawinfile. It's similar to lfs, but converts strings from UTF-8 to UTF-16 and uses the wide-string versions of the various file-related functions. It seems to work; I was able to open a file whose name contained non-ASCII characters. Perhaps it can be used as a guide for making luafilesystem Unicode-friendly on Windows.

@SlySven
Copy link

SlySven commented Nov 2, 2018

I found that luawinfile uses Lua 5.3 functions and I gave not been able to backport it to 5.1 - I'm just hoping someone else has the skills (and inclination) to do it... 🙏

@Erutuon
Copy link

Erutuon commented Nov 2, 2018

I created a (very messy) fork that compiles under Lua 5.1, but doesn't quite work. 😬

@sonoro1234
Copy link

https://github.com/sonoro1234/luafilesystem/blob/unicode/lfs_ffi.lua#L184
are the experiments (succesfull for me) that I did on this LuaJIT ffi binding from spacewander.
When module.unicode is true, all functions (which emulate lfs) do the necessary conversions between unicode and utf-8.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

8 participants