Cannot open file with non-ASCII characters in the path, such as Japanese. #1786

Closed
bombipappoo opened this issue Jul 9, 2020 · 12 comments · Fixed by #1794

Comments

@bombipappoo
Contributor

On Windows, with HDF5 1.10.6 or later, nc_open() does not work properly for filenames containing non-ASCII characters.
Unfortunately, #1668 does not resolve this issue.

This is because netCDF treats the filename as ANSI-encoded, whereas HDF5 now treats it as UTF-8. Here "ANSI" means the locale-specific character set, that is, the active Windows code page.

If we pass an ANSI-encoded filename to nc_open(), check_file_type() is called and opens the file with fopen(). At this point the filename is treated as ANSI, so the file can be opened.
If the file format is NC4, H5Fopen() is then called. HDF5 treats the filename as UTF-8, converts it from UTF-8 to UTF-16, and opens the file with _wopen(). Because the bytes are actually ANSI, this conversion produces the wrong name and _wopen() fails.

On the other hand, if we pass a UTF-8 encoded filename to nc_open(), the open fails in check_file_type(), because fopen() interprets the UTF-8 bytes as ANSI, which yields a different (or invalid) filename.
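The mismatch described above can be made concrete with a structural UTF-8 check: a name encoded in a locale code page is usually not well-formed UTF-8, which is why a UTF-8 to UTF-16 conversion of such a name goes wrong. This is only a sketch. `is_valid_utf8` is a hypothetical helper, not a netCDF or HDF5 function, and it deliberately skips some checks (certain overlong forms and surrogate code points).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if s is structurally well-formed UTF-8,
   0 otherwise. Simplified: it checks lead bytes and continuation bytes,
   but does not reject every overlong form or surrogate code point. */
int is_valid_utf8(const char *s)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    while (*p) {
        int extra; /* number of continuation bytes expected */
        if (*p < 0x80) extra = 0;
        else if (*p >= 0xC2 && *p <= 0xDF) extra = 1;
        else if (*p >= 0xE0 && *p <= 0xEF) extra = 2;
        else if (*p >= 0xF0 && *p <= 0xF4) extra = 3;
        else return 0; /* stray continuation byte or invalid lead byte */
        p++;
        while (extra-- > 0) {
            /* A NUL terminator also fails this test, so we never read
               past the end of the string. */
            if ((*p & 0xC0) != 0x80) return 0;
            p++;
        }
    }
    return 1;
}
```

For example, "é.nc" in Windows-1252 is the bytes E9 2E 6E 63. The byte E9 announces a three-byte UTF-8 sequence, but the next byte is an ASCII '.', so the check fails.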

To solve this problem, netCDF should convert the filename from ANSI to UTF-8 before calling HDF5 functions when HDF5 is 1.10.6 or later.
Additionally, it would be nice to add a new API that accepts UTF-8 or UTF-16, such as nc_openW() or nc_open_utf8(), so that Unicode filenames can be accessed.
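A minimal sketch of the proposed ANSI to UTF-8 conversion follows. It goes through UTF-16 using the Win32 functions MultiByteToWideChar and WideCharToMultiByte. The helper name `ansi_path_to_utf8` is hypothetical and is not part of netCDF. On non-Windows platforms the sketch assumes the path is already UTF-8 and simply copies it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper (not part of netcdf-c): returns a newly allocated
   UTF-8 copy of a path given in the active ANSI code page.
   The caller must free() the result. Returns NULL on failure. */
char *ansi_path_to_utf8(const char *ansi)
{
    assert(ansi != NULL);
#ifdef _WIN32
    /* Step 1: ANSI -> UTF-16. With length -1 the terminator is included. */
    int wlen = MultiByteToWideChar(CP_ACP, 0, ansi, -1, NULL, 0);
    if (wlen <= 0) return NULL;
    wchar_t *wide = malloc((size_t)wlen * sizeof(wchar_t));
    if (wide == NULL) return NULL;
    char *utf8 = NULL;
    if (MultiByteToWideChar(CP_ACP, 0, ansi, -1, wide, wlen) > 0) {
        /* Step 2: UTF-16 -> UTF-8. */
        int ulen = WideCharToMultiByte(CP_UTF8, 0, wide, -1, NULL, 0, NULL, NULL);
        if (ulen > 0 && (utf8 = malloc((size_t)ulen)) != NULL) {
            if (WideCharToMultiByte(CP_UTF8, 0, wide, -1, utf8, ulen, NULL, NULL) <= 0) {
                free(utf8);
                utf8 = NULL;
            }
        }
    }
    free(wide);
    return utf8;
#else
    /* Assumption: on non-Windows systems the path is already UTF-8. */
    size_t n = strlen(ansi) + 1;
    char *copy = malloc(n);
    if (copy != NULL) memcpy(copy, ansi, n);
    return copy;
#endif
}
```

Under this proposal, the converted string would be passed to H5Fopen() in place of the original path, while check_file_type() would keep using the ANSI path.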

@DennisHeimbigner
Collaborator

What version of netcdf are you using?

@bombipappoo
Contributor Author

bombipappoo commented Jul 9, 2020

I'm using 4.7.4 and the latest master.

@DennisHeimbigner
Collaborator

DennisHeimbigner commented Jul 9, 2020

I think you are being misled by the fact that the type of the path
argument is char* rather than unsigned char*. It technically should
be unsigned, but our code generally treats the two as interchangeable
and does not, for example, truncate to 7-bit ASCII.
Attached is a test program that shows that it seems to work ok
with utf 8 file names. See if it works for you. If not, then we need
to investigate to find out why.
testutf8.zip
=Dennis Heimbigner

@DennisHeimbigner
Collaborator

Oops just recalled you said Windows. Let me test that. I could well believe
that it fails there.

@bombipappoo
Contributor Author

nc_create() works and the file is created correctly, but nc_open() returns 2 (ENOENT).

@DennisHeimbigner
Collaborator

Ok, this is going to take a while to fix. For the record, I need to do this:

  1. Find any remaining bare open(), fopen(), etc. calls and convert them to my
    wrapped versions that already exist.
  2. Modify the wrappers to call the Windows function MultiByteToWideChar to convert UTF-8
    to UTF-16 (Windows wide characters).
  3. Invoke e.g. _wfopen etc. on the wide-character versions.
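The three steps above can be sketched as a single wrapper. This is an illustration under stated assumptions, not the wrapper that exists in netcdf-c: the name `fopen_utf8` is hypothetical, and on non-Windows platforms the sketch just forwards to fopen(), on the assumption that the C library already accepts UTF-8 paths there.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _WIN32
#include <stdlib.h>
#include <windows.h>
#endif

/* Hypothetical wrapper: opens a file whose path is given in UTF-8.
   Returns NULL on failure, like fopen(). */
FILE *fopen_utf8(const char *path, const char *mode)
{
    assert(path != NULL && mode != NULL);
#ifdef _WIN32
    FILE *fp = NULL;
    /* Ask for the required UTF-16 lengths (terminator included);
       MB_ERR_INVALID_CHARS makes malformed UTF-8 an error. */
    int plen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, path, -1, NULL, 0);
    int mlen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, mode, -1, NULL, 0);
    if (plen <= 0 || mlen <= 0) return NULL;
    wchar_t *wpath = malloc((size_t)plen * sizeof(wchar_t));
    wchar_t *wmode = malloc((size_t)mlen * sizeof(wchar_t));
    if (wpath != NULL && wmode != NULL
        && MultiByteToWideChar(CP_UTF8, 0, path, -1, wpath, plen) > 0
        && MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, mlen) > 0) {
        /* Open using the wide-character variant of fopen(). */
        fp = _wfopen(wpath, wmode);
    }
    free(wpath); /* free(NULL) is a no-op */
    free(wmode);
    return fp;
#else
    /* Assumption: the platform's fopen() already handles UTF-8 paths. */
    return fopen(path, mode);
#endif
}
```

The same pattern would apply to open() with _wopen(), and to the other file calls mentioned in step 1.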

@DennisHeimbigner
Collaborator

DennisHeimbigner commented Jul 9, 2020

So now I am confused. I just did a build and test of netcdf using Visual
Studio. The tests ncdump/test_unicode.sh and ncdump/tst_netcdf4.sh both
passed, which implies that Unicode is OK. But the comment above about ENOENT
seems to imply it fails. [Edit] netcdf4 works, but other file-based dispatchers, including
netcdf-3 (and pnetcdf?), will fail.

@bombipappoo
Contributor Author

Will nc_open() etc. treat filenames as UTF-8?
If so, that is one of the best solutions for me, since all characters can then be used in filenames.
However, it is incompatible with the current version, so all applications using netcdf are affected.
For example, ncgen and ncdump would need to convert the filename from ANSI to UTF-8 before calling nc_open(), since argv is encoded as ANSI.

I don't think most Unix programmers and English-language Windows programmers will deal with this, because everything works correctly in their environment without doing anything.
Therefore, I think it is desirable that nc_open() etc. keep working the same as before, with no changes required from callers.

To achieve that, how about the following?
nc_open() etc. treat the filename as ANSI and convert it to UTF-8 internally.
Additionally, newly added functions such as nc_open_utf8() treat the filename as UTF-8.
Also, if HDF5 1.10.5 or earlier is used, the filename needs to be converted back from UTF-8 to ANSI before calling HDF5 functions.

@bombipappoo
Contributor Author

It seems that those tests are for variable or dimension names, not for filenames.

test_unicode_directory.sh is for UTF-8 filenames.
The byte sequence '\xe6\xb5\xb7' represents the following:

  • '海' (UTF-8)
  • 'æµ·' (Windows-1252)

Assume your code page is Windows-1252:
nc_create("\xe6\xb5\xb7.nc") will create the file "æµ·.nc" for netcdf-3 or netcdf-4 (HDF5 <= 1.10.5),
or the file "海.nc" for netcdf-4 (HDF5 >= 1.10.6).
nc_open("\xe6\xb5\xb7.nc") will always try to open "æµ·.nc" in check_file_type(), so it fails for netcdf-4 (HDF5 >= 1.10.6).
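The two readings of the same three bytes can be checked with a short sketch. Both function names are hypothetical. The second function handles only ASCII and the byte range 0xA0 to 0xFF, where a Windows-1252 byte value equals its Unicode code point; bytes 0x80 to 0x9F would need a lookup table and are rejected here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decodes the first UTF-8 sequence in s and returns
   its code point, or -1 if the sequence is malformed. Simplified: no
   overlong or surrogate checks. */
long utf8_first_codepoint(const char *s)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    long cp;
    int extra;
    if (p[0] < 0x80) return p[0];
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; extra = 1; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; extra = 2; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; extra = 3; }
    else return -1;
    for (int i = 1; i <= extra; i++) {
        /* A NUL terminator fails this test, so we stop at the string end. */
        if ((p[i] & 0xC0) != 0x80) return -1;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    return cp;
}

/* Hypothetical helper: reads `in` as Windows-1252 and writes the UTF-8
   encoding to `out`. Returns the number of bytes written (excluding the
   terminator), or -1 on an unsupported byte or insufficient space. */
int cp1252_high_to_utf8(const char *in, char *out, size_t outsz)
{
    assert(in != NULL && out != NULL);
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)in; *p; p++) {
        if (*p < 0x80) {
            if (n + 1 >= outsz) return -1;
            out[n++] = (char)*p;
        } else if (*p >= 0xA0) {
            /* Code points U+00A0..U+00FF take two UTF-8 bytes. */
            if (n + 2 >= outsz) return -1;
            out[n++] = (char)(0xC0 | (*p >> 6));
            out[n++] = (char)(0x80 | (*p & 0x3F));
        } else {
            return -1; /* 0x80..0x9F: not handled in this sketch */
        }
    }
    if (n >= outsz) return -1;
    out[n] = '\0';
    return (int)n;
}
```

Read as UTF-8, the bytes E6 B5 B7 decode to the single code point U+6D77 ('海'). Read as Windows-1252, they are the three characters U+00E6, U+00B5, U+00B7, whose UTF-8 form is C3 A6 C2 B5 C2 B7, so the two interpretations name different files.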

@DennisHeimbigner
Collaborator

Bash and the Linux API in general handle UTF-8 OK, as near as I can tell.
You can look at ncdump/test_unicode.sh to see this.
However the same cannot be said for windows or cygwin.

@DennisHeimbigner
Collaborator

DennisHeimbigner commented Jul 10, 2020

> that nc_open() etc. works the same as before

I do not understand what you mean. The netcdf library
and utilities have supported UTF-8 for a long time now.
We do not test that as thoroughly as we should, so sometimes
we get regressions.

@bombipappoo
Contributor Author

I know that nc_open() itself works the same as before.

The cause is that the behavior of HDF5 has changed.
Specifically, the definition of HDopen changed from _open to Wopen_utf8 between these two versions:
HDF5 1.10.5
HDF5 1.10.6

So the simplest solution is, as I first suggested, to convert the filename from ANSI to UTF-8 before calling HDF5 functions if HDF5 >= 1.10.6.
