Cannot open file with non-ASCII characters in the path, such as Japanese. #1786

bombipappoo · 2020-07-09T02:53:24Z

On Windows, with HDF5 1.10.6 or later, nc_open() does not work properly for filenames containing non-ASCII characters.
Unfortunately, #1668 does not resolve this issue.

This is because netCDF treats the filename encoding as ANSI, whereas HDF5 now treats it as UTF-8. Here "ANSI" means the locale-specific 8-bit character set (the active Windows code page).

If we pass an ANSI-encoded filename to nc_open(), check_file_type() is called first and opens the file with fopen(). At this point the filename is treated as ANSI, so the file can be opened.
If the file format is NC4, H5Fopen() is then called. HDF5 treats the filename as UTF-8, converts it from UTF-8 to UTF-16, and opens the file with _wopen(). Because the bytes are really ANSI, this conversion yields the wrong name, so _wopen() fails.

On the other hand, if we pass a UTF-8-encoded filename to nc_open(), opening the file fails in check_file_type(), because fopen() interprets the UTF-8 bytes as ANSI, where they are invalid or name a different file.

To solve this problem, netCDF should convert the filename from ANSI to UTF-8 before calling HDF5 functions when HDF5 is 1.10.6 or later.
Additionally, it would be nice to add a new API that accepts UTF-8 or UTF-16, such as nc_openW() or nc_open_utf8(), for accessing Unicode filenames.
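
The re-encoding step proposed above can be illustrated outside the library. This is a minimal Python sketch, not netCDF code: `ansi_to_utf8` is a hypothetical helper, and code page 932 (Japanese) is assumed purely as an example of an ANSI code page.

```python
# Minimal sketch (not netCDF code): re-encode a filename from an "ANSI"
# code page to UTF-8 before handing it to an API that expects UTF-8.
# ansi_to_utf8 is a hypothetical helper; cp932 is only an example code page.

def ansi_to_utf8(name: bytes, codepage: str = "cp932") -> bytes:
    """Decode bytes from the given code page and re-encode them as UTF-8."""
    return name.decode(codepage).encode("utf-8")


# What a program running under a Japanese locale would pass as the filename.
ansi_name = "海.nc".encode("cp932")

# What an API that expects UTF-8 needs to receive instead.
utf8_name = ansi_to_utf8(ansi_name)

# Without the conversion, the ANSI bytes read as UTF-8 give a different name.
misread = ansi_name.decode("utf-8", errors="replace")
```

In C on Windows the same round trip would presumably go through MultiByteToWideChar with CP_ACP followed by WideCharToMultiByte with CP_UTF8; that pairing is an assumption about how a fix could be written, not a description of the merged code.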


DennisHeimbigner · 2020-07-09T03:05:02Z

What version of netcdf are you using?

bombipappoo · 2020-07-09T03:30:06Z

I'm using 4.7.4 and the latest master.

DennisHeimbigner · 2020-07-09T03:56:49Z

I think you are being misled by the fact that the type of the path
argument is char* rather than unsigned char*. It technically should
be unsigned, but our code generally treats the two as interchangeable
and does not, for example, truncate to 7-bit ASCII.
Attached is a test program that shows that it seems to work ok
with utf 8 file names. See if it works for you. If not, then we need
to investigate to find out why.
testutf8.zip
=Dennis Heimbigner

DennisHeimbigner · 2020-07-09T04:07:14Z

Oops just recalled you said Windows. Let me test that. I could well believe
that it fails there.

bombipappoo · 2020-07-09T04:44:43Z

nc_create() works and the file is created correctly, but nc_open() returns 2 (ENOENT).

DennisHeimbigner · 2020-07-09T18:08:25Z

Ok, this is going to take a while to fix. For the record, I need to do this:

1. Find any remaining bare open(), fopen(), etc. calls and convert them to my wrapped versions that already exist.
2. Modify the wrappers to call the Windows function MultiByteToWideChar to convert UTF-8 to UTF-16 (Windows wide characters).
3. Invoke e.g. _wfopen etc. on the wide-char versions.
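
The conversion step in that plan can be sketched as follows. This Python snippet only shows the UTF-8 to UTF-16 re-encoding that a call such as MultiByteToWideChar would perform for valid input; `utf8_to_utf16le` is a hypothetical helper, not part of netCDF or the Windows API.

```python
# Sketch of the conversion step only: UTF-8 bytes -> UTF-16 code units,
# the form that Windows wide-character calls such as _wfopen() take.
# utf8_to_utf16le is a hypothetical helper used for illustration.

def utf8_to_utf16le(path: bytes) -> bytes:
    """Strictly decode UTF-8 and re-encode as little-endian UTF-16."""
    return path.decode("utf-8").encode("utf-16-le")


# "海.nc" encoded as UTF-8.
wide = utf8_to_utf16le(b"\xe6\xb5\xb7.nc")

# Split the result into 16-bit code units, as a wchar_t array would hold them.
code_units = [
    int.from_bytes(wide[i:i + 2], "little") for i in range(0, len(wide), 2)
]
```

Here `code_units` holds one 16-bit unit per character of the name, starting with U+6D77 for 海. The strict decode raises on bytes that are not valid UTF-8, which is one way a wrapper could detect an ANSI name passed by mistake.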

DennisHeimbigner · 2020-07-09T18:24:44Z

So now I am confused. I just did a build and test of netcdf using Visual
Studio. The tests ncdump/test_unicode.sh and ncdump/tst_netcdf4.sh both
passed, which implies that Unicode is OK. But the comment above about ENOENT
seems to imply it fails. [Edit] netcdf4 works, but other file-based dispatchers, including
netcdf-3 (and pnetcdf?), will fail.

bombipappoo · 2020-07-10T03:34:55Z

Will nc_open() etc. treat filenames as UTF-8?
If so, that is one of the best solutions for me, since all characters can then be used in filenames.
However, it is incompatible with the current version, so every application using netcdf is affected.
For example, ncgen and ncdump would need to convert the filename from ANSI to UTF-8 before calling nc_open(), since argv is encoded as ANSI.

I do not think most Unix programmers and English-locale Windows programmers will deal with this, because everything works correctly in their environment without doing anything.
Therefore, I think it is desirable that nc_open() etc. keep working as before without any changes by the caller.

To do so, how about the following?
nc_open() etc. treat the filename encoding as ANSI and convert it from ANSI to UTF-8.
Additionally, newly added functions such as nc_open_utf8() treat the filename encoding as UTF-8.
Also, if HDF5 1.10.5 or earlier is used, the filename needs to be converted back from UTF-8 to ANSI before calling HDF5 functions.

bombipappoo · 2020-07-10T03:39:46Z

It seems that those tests are for variable or dimension names, not for filenames.

test_unicode_directory.sh is the test for UTF-8 filenames.
The byte sequence '\xe6\xb5\xb7' represents the following:

'海' (UTF-8)
'æµ·' (Windows-1252)

Assume your code page is Windows-1252:
nc_create("\xe6\xb5\xb7.nc") will create the file "æµ·.nc" with netcdf-3 or netcdf-4 (HDF5 <= 1.10.5),
or the file "海.nc" with netcdf-4 (HDF5 >= 1.10.6).
nc_open("\xe6\xb5\xb7.nc") will always open the file "æµ·.nc" in check_file_type(), so it fails with netcdf-4 (HDF5 >= 1.10.6).
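
The two readings of that byte sequence can be checked with a short Python snippet. It is an illustration of the encodings only and does not call netCDF or HDF5.

```python
# The same three bytes decoded under the two encodings discussed above.
raw = b"\xe6\xb5\xb7"

# Read as UTF-8, the three bytes form a single character (U+6D77).
as_utf8 = raw.decode("utf-8")

# Read as Windows-1252, each byte is its own character (U+00E6, U+00B5, U+00B7).
as_cp1252 = raw.decode("cp1252")

print(repr(as_utf8), len(as_utf8))
print(repr(as_cp1252), len(as_cp1252))
```

A layer that reads the bytes as UTF-8 therefore refers to a one-character name, while a layer that reads them as Windows-1252 refers to a three-character name, which is why the two layers end up looking for different files.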

DennisHeimbigner · 2020-07-10T04:28:09Z

Bash and the Linux API in general handle UTF-8 OK as near as I can tell.
You can look at ncdump/test_unicode.sh to see this.
However the same cannot be said for windows or cygwin.

DennisHeimbigner · 2020-07-10T04:30:24Z

> that nc_open() etc. works the same as before

I do not understand what you mean. The netcdf library
and utilities have supported utf8 for a long time now.
We do not test that as thoroughly as we should so sometimes
we get regressions.

bombipappoo · 2020-07-10T05:41:22Z

I understand that nc_open() itself behaves the same as before.

The cause is that the behavior of HDF5 has changed:
the definition of HDopen changed from _open to Wopen_utf8 between HDF5 1.10.5 and HDF5 1.10.6.

So the simplest solution is, as I first showed, to convert the filename from ANSI to UTF-8 before calling HDF5 functions if HDF5 >= 1.10.6.

magnusuMET mentioned this issue Jul 10, 2020

All paths should accept and return OsString georust/netcdf#67

Closed

bombipappoo mentioned this issue Jul 14, 2020

Convert filename from ANSI to UTF-8 before calling HDF5. #1794

Merged

rouault mentioned this issue Jul 14, 2020

netcdf subdataset utf-8 path fails to open on windows OSGeo/gdal#2763

Closed

WardF closed this as completed in #1794 Jul 14, 2020
