String Handling
Strings appear in HDF5 (file or API) in the following places:
1. Link names (HDF5 path names)
2. Attribute names
3. Attribute or dataset elements
4. File names
   - Of HDF5 files in general
   - In the definition of external links
   - In the description of a dataset's external layout
   - The log file name in the LOG VFD
   - The file name extension patterns in the SPLIT VFD
   - The file names in Virtual Datasets (VDS)
5. Field names of compound datatypes
6. String constants in enumerated datatypes
7. Tags of opaque datatypes
8. The names of user-defined properties
9. Data transformation expressions
10. External link prefixes
11. Error messages
12. Comments (Deprecated!)
Every string in HDF5 is represented as a byte sequence whose character encoding is either recorded explicitly as metadata or known a priori. Currently, HDF5 supports two character encodings, ASCII (default) and UTF-8, for certain items:
- The character encoding of items 1 and 2 is stored in their creation properties.
- The character encoding of item 3 is stored in the attribute's or dataset's datatype.
- The character encoding of items 4 to 12 is currently limited to ASCII and is NOT stored explicitly.
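As a rough sketch of where the encoding is recorded for items 1 to 3, the following assumes the HDF.PInvoke package and its native HDF5 libraries are available. The class and method names (EncodingSketch, CreateUtf8LinkProps, CreateUtf8AttributeProps, CreateUtf8StringType) are made up for illustration, and the hid_t alias assumes an HDF5 1.10 build.

```csharp
using System;
using HDF.PInvoke;

// Assumption: HDF5 1.10 build of HDF.PInvoke, where hid_t is 64-bit.
// With a 1.8 build, alias System.Int32 instead.
using hid_t = System.Int64;

// Hypothetical helper class, not part of HDF.PInvoke.
static class EncodingSketch
{
    // Item 1: the encoding of link names is a link creation property.
    public static hid_t CreateUtf8LinkProps()
    {
        hid_t lcpl = H5P.create(H5P.LINK_CREATE);
        H5P.set_char_encoding(lcpl, H5T.cset_t.UTF8);
        return lcpl;
    }

    // Item 2: the encoding of attribute names is an attribute creation property.
    public static hid_t CreateUtf8AttributeProps()
    {
        hid_t acpl = H5P.create(H5P.ATTRIBUTE_CREATE);
        H5P.set_char_encoding(acpl, H5T.cset_t.UTF8);
        return acpl;
    }

    // Item 3: the encoding of string elements is part of the datatype.
    public static hid_t CreateUtf8StringType()
    {
        hid_t dtype = H5T.copy(H5T.C_S1);       // start from the C string type
        H5T.set_size(dtype, H5T.VARIABLE);      // variable-length strings
        H5T.set_cset(dtype, H5T.cset_t.UTF8);   // mark elements as UTF-8
        return dtype;
    }

    static void Main()
    {
        hid_t lcpl = CreateUtf8LinkProps();
        hid_t acpl = CreateUtf8AttributeProps();
        hid_t dtype = CreateUtf8StringType();

        // Read the datatype's character set back to confirm it was recorded.
        Console.WriteLine(H5T.get_cset(dtype));

        // Release the handles when done.
        H5T.close(dtype);
        H5P.close(acpl);
        H5P.close(lcpl);
    }
}
```

Nothing comparable exists for items 4 to 12, since their encoding is not stored anywhere.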
In the C API, strings are represented as char* pointers. Counterparts on the .NET side include byte[] and string (StringBuilder). .NET string values are UTF-16 encoded Unicode strings. PInvoke supports declarative string marshaling for ASCII and Unicode (UTF-16) strings.
In HDF.PInvoke, we have to strike a balance between the convenience of .NET string values and HDF5's default ASCII encoding on the one hand, and PInvoke's lack of support for marshaling UTF-8 encoded strings on the other.
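To make the trade-off concrete, here is a small sketch in plain .NET that needs no HDF5 at all. It compares the byte representations of the same string and shows what is lost on an ASCII-only path. Encoding.ASCII is used only as a stand-in: the actual behaviour of CharSet.Ansi marshaling depends on the platform and code page. EncodingDemo and RoundTrip are illustrative names.

```csharp
using System;
using System.Text;

// Hypothetical demo class, for illustration only.
static class EncodingDemo
{
    // Encodes a string to bytes and decodes it again with the same encoding.
    public static string RoundTrip(string s, Encoding enc)
    {
        return enc.GetString(enc.GetBytes(s));
    }

    static void Main()
    {
        string name = "日本語";

        // UTF-16 (the in-memory form of a .NET string): 2 bytes per character here.
        Console.WriteLine(Encoding.Unicode.GetByteCount(name)); // 6
        // UTF-8 (what HDF5 expects for Unicode names): 3 bytes per character here.
        Console.WriteLine(Encoding.UTF8.GetByteCount(name));    // 9

        // ASCII cannot represent these characters; each becomes '?'.
        Console.WriteLine(RoundTrip(name, Encoding.ASCII) == name);       // False
        // UTF-8 preserves them.
        Console.WriteLine(RoundTrip(name, Encoding.UTF8) == name);        // True
        // Pure ASCII names survive either way.
        Console.WriteLine(RoundTrip("attr1", Encoding.ASCII) == "attr1"); // True
    }
}
```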
Let's consider an example:
hid_t H5Acreate_by_name
(hid_t loc_id, const char *obj_name, const char *attr_name, hid_t type_id,
hid_t space_id, hid_t acpl_id, hid_t aapl_id, hid_t lapl_id)
A suitable PInvoke declaration is:
namespace HDF.PInvoke
{
    public unsafe sealed class H5A
    {
        ...
        [DllImport(Constants.DLLFileName, EntryPoint = "H5Acreate_by_name",
            CallingConvention = CallingConvention.Cdecl),
        SuppressUnmanagedCodeSecurity, SecuritySafeCritical]
        public extern static hid_t create_by_name
            (hid_t loc_id, byte[] obj_name, byte[] attr_name, hid_t type_id,
            hid_t space_id, hid_t acpl_id = H5P.DEFAULT,
            hid_t aapl_id = H5P.DEFAULT, hid_t lapl_id = H5P.DEFAULT);
        ...
    }
}
With the right attribute creation property list
hid_t acpl = H5P.create(H5P.ATTRIBUTE_CREATE);
H5P.set_char_encoding(acpl, H5T.cset_t.UTF8);
this can be used to create attributes with UTF-8 encoded Unicode string names:
hid_t att = H5A.create_by_name(group,
Encoding.UTF8.GetBytes("Ελληνικά"),
Encoding.UTF8.GetBytes("日本語"),
H5T.IEEE_F32LE, H5S.SCALAR, acpl);
Of course, this would be a little tedious for ASCII encoded strings, and that's why we provide a second PInvoke declaration as follows:
[DllImport(Constants.DLLFileName, EntryPoint = "H5Acreate_by_name",
    CharSet = CharSet.Ansi,
    CallingConvention = CallingConvention.Cdecl),
SuppressUnmanagedCodeSecurity, SecuritySafeCritical]
public extern static hid_t create_by_name
    (hid_t loc_id, string obj_name, string attr_name, hid_t type_id,
    hid_t space_id, hid_t acpl_id = H5P.DEFAULT,
    hid_t aapl_id = H5P.DEFAULT, hid_t lapl_id = H5P.DEFAULT);
For ASCII strings, this simplifies the attribute creation to:
hid_t att = H5A.create_by_name(group, "dset", "attr1", H5T.IEEE_F32LE, H5S.SCALAR);
As tempting as it might appear to use this shortcut, we strongly discourage you from using these functions. Unless you are certain that you'll never leave the world of ASCII, don't do it! And even if you are certain, think of the person who might one day inherit your software. They deserve better.
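If the byte[] overloads feel tedious, a small helper can take most of the friction away. The sketch below is an assumption on our part, not something HDF.PInvoke ships: Utf8Names, ToUtf8Z, and FromUtf8Z are hypothetical names. It also appends a terminating NUL byte, because the C API expects NUL-terminated char* strings and Encoding.UTF8.GetBytes does not add one.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of HDF.PInvoke.
static class Utf8Names
{
    // Encodes a .NET string as UTF-8 and appends a terminating NUL byte.
    public static byte[] ToUtf8Z(string s)
    {
        int n = Encoding.UTF8.GetByteCount(s);
        byte[] buf = new byte[n + 1];               // zero-initialized, so buf[n] == 0
        Encoding.UTF8.GetBytes(s, 0, s.Length, buf, 0);
        return buf;
    }

    // Decodes a UTF-8 buffer up to the first NUL byte (or the whole buffer).
    public static string FromUtf8Z(byte[] buf)
    {
        int len = Array.IndexOf(buf, (byte)0);
        if (len < 0) len = buf.Length;
        return Encoding.UTF8.GetString(buf, 0, len);
    }

    static void Main()
    {
        byte[] name = ToUtf8Z("日本語");
        Console.WriteLine(name.Length);                 // 10: 9 UTF-8 bytes plus NUL
        Console.WriteLine(FromUtf8Z(name) == "日本語");  // True
    }
}
```

With such a helper, a name argument becomes Utf8Names.ToUtf8Z("日本語") in place of Encoding.UTF8.GetBytes("日本語"), and the same call works unchanged for pure ASCII names.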
External reference (might be a little outdated but still useful): https://www.hdfgroup.org/HDF5/doc1.8/Advanced/UsingUnicode/index.html