savetxt and loadtxt are not portable #505
Comments
Some guidelines on writing output in formats wide enough for the different precision kinds would be very useful (maybe even a function returning a "standardized" format string?). In the past I made the mistake of not writing enough decimals, which led to false conclusions while benchmarking a CFD code.
Does
Consulting N2146, Section 13.7.5, on generalized editing of the forms
which means behavior could differ between compilers. It also does not solve the issue for
Thanks. Another solution might be to keep the number of decimals fixed and deal with different kinds using
Here's what I used: https://en.wikipedia.org/wiki/IEEE_754#Basic_and_interchange_formats I took the decimal digits value for each real kind from the table, and rounded up, so we get 8, 16, and 35 for sp, dp, and qp, respectively. Then for full width I added 7 to that (it's possible that's more than what's needed):
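As a rough cross-check of the arithmetic above (not the code from the original comment), the decimal counts can be derived from the significand size: a binary significand of p bits needs ceiling(p*log10(2)) + 1 significant decimal digits to round-trip, which an ES descriptor supplies as one leading digit plus ceiling(p*log10(2)) decimals. The program name and the helper `decimals` are hypothetical, and the binary128 size is hard-coded as an assumption so the sketch compiles without a quadruple-precision kind.

```fortran
program roundtrip_widths
  use, intrinsic :: iso_fortran_env, only: sp => real32, dp => real64
  implicit none
  integer :: d_sp, d_dp, d_qp

  ! digits() returns the number of significand bits of a kind
  ! (24 and 53 for IEEE binary32 and binary64). 113 is hard-coded for
  ! IEEE binary128 so that no quadruple-precision kind is required.
  d_sp = decimals(digits(1.0_sp))
  d_dp = decimals(digits(1.0_dp))
  d_qp = decimals(113)

  ! Field width following the "decimals + 7" rule described above.
  print '(a,i0,a,i0)', 'sp: d = ', d_sp, ', w = ', d_sp + 7
  print '(a,i0,a,i0)', 'dp: d = ', d_dp, ', w = ', d_dp + 7
  print '(a,i0,a,i0)', 'qp: d = ', d_qp, ', w = ', d_qp + 7

contains

  ! Number of decimals after the point for an ES descriptor.
  ! With the leading digit this gives ceiling(p*log10(2)) + 1
  ! significant digits in total.
  pure function decimals(nbits) result(d)
    integer, intent(in) :: nbits
    integer :: d
    d = ceiling(nbits * log10(2.0_dp))
  end function decimals

end program roundtrip_widths
```

On IEEE hardware this reproduces the 8, 16, and 35 decimals quoted above. Whether a width of decimals + 7 is always enough depends on the exponent, which the sketch does not address.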
Even if
We can test for exact values and we should be able to reproduce them for sp, dp, or qp. Current tests for
and that works for all real kinds. Granted, I didn't test with very large or very small numbers, and we probably should. Though I have some basic awareness of how floating-point numbers work, I don't have enough experience to know all the caveats to test for. So any help with that would be very welcome.
I'm not an expert in this area either, but I thought that exact comparison of floating-point numbers is always a bad idea 🤷♂️
Exactly comparing values that you know should be exactly the same is fine. On the other hand, comparing a floating-point value with the result of a floating-point expression can be problematic, so people use tolerant comparisons there. What I'm unsure of is the range of values to test, and whether a total width of decimals + 7 is good enough in all cases. Currently I test with 100 iterations of
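A minimal sketch of such an exact round-trip check, assuming double precision and the format `es23.16` (16 decimals, width 16 + 7). It uses an internal file instead of savetxt and loadtxt, so it only exercises the edit descriptor, not the library routines. The program name, the variable names, and the value range are illustrative assumptions, not the actual stdlib test.

```fortran
program roundtrip_check
  use, intrinsic :: iso_fortran_env, only: dp => real64
  implicit none
  ! Assumed format: 16 decimals, width 16 + 7. It leaves room for
  ! two-digit exponents only, which is all this value range needs.
  character(len=*), parameter :: fmt = '(es23.16)'
  character(len=23) :: buffer
  real(dp) :: x(100), y
  integer :: i, nfail

  call random_number(x)
  x = (x - 0.5_dp) * 1.0e10_dp       ! cover both signs and several magnitudes

  nfail = 0
  do i = 1, size(x)
    write(buffer, fmt) x(i)          ! value -> text
    read(buffer, fmt) y              ! text -> value
    ! Exact comparison is intended: the value read back should be
    ! bit-for-bit the value that was written.
    if (y /= x(i)) nfail = nfail + 1
  end do

  print '(a,i0)', 'round-trip mismatches: ', nfail
end program roundtrip_check
```

With a compiler that rounds correctly on both output and input, the mismatch count should be zero. The sketch deliberately avoids very large and very small magnitudes, so it does not answer the open question about three-digit exponents.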
Thank you for doing this, it will definitely give a much more solid answer. @kargl does not seem to be very active on GitHub, but he writes excellent messages on Discourse in this domain. I wish he could advise.
Is the purpose of savetxt and loadtxt to provide a round-trip transfer of data? There are many reasons to transfer an array to a data stream: to inspect it, to format it for use with other applications (plotting, entry into databases and spreadsheets (usually CSV), presentation in a document (converting to HTML, LaTeX, *roff, ...)), archiving, transfer to another process, and so on. There are libraries for self-describing data (NetCDF, HDF5, ...), serialization standards (TOML, JSON, ...), formats for transfer to other processes or platforms (xdr, ...), and more. For simple round trips a hexadecimal format is often used (the GPF M_matrix utility uses that in its "save" and "load" commands as a simple example). xdr is an old standby that compromises between the efficiency of binary transfers, with all the problems they entail (byte size of values, endianness, mantissa types, ...), and cross-platform portability. I have never seen a formal declaration of which of these or other uses savetxt and loadtxt are targeting in the short and/or long term. The answer is very different depending on the intended use. Hexadecimal values are far more reliable for a round trip but far less intelligible; something as simple as CSV is better than a plain table for importing into many applications; HTML is often importable and also useful with a browser; most people prefer a row-column, decimal-justified format for small tables to be viewed manually; any kind of data archiving is hopefully using a self-describing format, where a large number of digits can be inappropriate and imply often non-existent precision; and so on. NAMELIST should not be forgotten, as it is self-describing, part of the Fortran standard, simple to use, and supports user-defined types better than anything else I can think of.
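To illustrate the hexadecimal option mentioned above, here is a hedged sketch that writes the bit pattern of a double-precision value as 16 hex digits and reads it back. It is not how M_matrix does it; the program and variable names are hypothetical, the sketch assumes a 64-bit real and a 64-bit integer kind, and it only uses positive values.

```fortran
program hex_roundtrip
  use, intrinsic :: iso_fortran_env, only: dp => real64, int64
  implicit none
  real(dp) :: x(3), y
  integer(int64) :: bits
  character(len=16) :: buffer
  integer :: i, nfail

  x = [0.1_dp, 1.0_dp / 3.0_dp, huge(1.0_dp)]

  nfail = 0
  do i = 1, size(x)
    ! Reinterpret the 64 bits of the real as an integer, print them in hex.
    write(buffer, '(z16.16)') transfer(x(i), 0_int64)
    ! Read the hex digits back and reinterpret the bits as a real.
    read(buffer, '(z16)') bits
    y = transfer(bits, 0.0_dp)
    if (y /= x(i)) nfail = nfail + 1
    print '(a)', buffer
  end do

  print '(a,i0)', 'round-trip mismatches: ', nfail
end program hex_roundtrip
```

No decimal conversion takes place, so nothing depends on how many digits were written. The price is the one named above: the file is no longer readable by a person, and other tools need to know the byte layout.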
Thanks @urbanjost, good question. Let's discuss, I think it's up to us to decide. My understanding is that the main purposes are:
1 and 2 are very useful for development and prototyping, or to easily send a 2-d array to a friend or colleague so they can read it from any tool or language. 3 is important for use in the early development of simulations (toy models), especially the chaotic kind that are sensitive to initial conditions. Restarting such simulations would need exact round-trip IO. Production code will likely use some more optimal self-describing binary format like NetCDF or HDF5, but the target audience for 3 is to an extent at odds with 2) for real and complex numbers. To preserve the data, you need to store all the digits. The more digits, the less readable. If 3 is not important, list-directed IO is still not optimal because most compilers will still output all significant digits for that type kind. We can consider having the user provide the format as an argument to If 3) is not important, then my marking this as a "bug" is not charitable, of course. CC also @certik and @jvdp1 who were the original implementors. Something really confuses me though. I see that there was a PR #89 that made a format fix for ifort in
@milancurcic "The problem here is that to load the data, the user would need to specify the format, or we'd need to rely on list-directed IO for input because we can't use "zero-width" descriptors like g0 in read statements." Are there, or will there be, functions in stdlib to read a single real, integer, or logical variable from a string? If there are, one could first decompose a line of text into tokens and then read one of the basic types from each token.
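The token-based approach suggested here could look roughly like the sketch below: split a line on blanks with the `verify` and `scan` intrinsics, then convert each token with an internal read. This is an illustration under stated assumptions, not an existing stdlib routine; the program name, the sample line, and the fixed-size `vals` array are all hypothetical.

```fortran
program parse_tokens
  use, intrinsic :: iso_fortran_env, only: dp => real64
  implicit none
  character(len=*), parameter :: line = '  1.5   -2.25E+01 3.0e2 '
  real(dp) :: vals(3)
  integer :: n, pos, first, last, ios

  n = 0
  pos = 1
  do while (pos <= len(line) .and. n < size(vals))
    ! Locate the first non-blank character at or after pos.
    first = verify(line(pos:), ' ')
    if (first == 0) exit                 ! only blanks remain
    first = pos + first - 1
    ! Locate the blank that ends the token, if there is one.
    last = scan(line(first:), ' ')
    if (last == 0) then
      last = len(line)
    else
      last = first + last - 2
    end if
    ! Convert this single token with an internal read.
    n = n + 1
    read(line(first:last), *, iostat=ios) vals(n)
    if (ios /= 0) error stop 'could not parse token'
    pos = last + 1
  end do

  print '(i0,a)', n, ' values read'
  print '(3(es24.16e3,1x))', vals(1:n)
end program parse_tokens
```

The conversion of each token still goes through the compiler's own reader, so this only replaces the splitting step. A hand-written number parser would be a separate and larger piece of work.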
I don't think we have these functions yet, but we will eventually. This could certainly be an approach to this problem. It'd be interesting to see how it compares performance-wise--I think we'd effectively be comparing our home-cooked text parsing with that of the compiler.
I have a routine that does that for reading tables. It massages the input data quite a bit and assumes the file is small enough to be held in memory, so I would certainly not use it for large files, but for relatively small sizes the time is negligible.
could definitely tune it up, but for a million values I got
So reading a million values with list-directed I/O took about 0.34 (ifort) to 0.50 (gfortran) wall-clock seconds, while an untuned routine that does the parsing took 4.4 to 3.4+ seconds. That is significantly slower, but surely it could be improved. For what I use that routine for, 1 000 000 values is towards the high end, but if someone were reading a file in the gigabytes that would be very significant. I have some old routines that do not use internal reads and are faster than list-directed I/O, so it definitely could be sped up, and it is not horrible for small files; what I use it for would rarely be in the tens of thousands of values and is typically < 2 000.
I don't participate here, for a very simple reason. While the goals of Fortran stdlib appear noble, the complete lack of a low-level developers guide is a major problem in my mind. If Fortran stdlib is intended as a place to prototype features that might be subsumed into the Fortran standard, then the prototype should consider the requirements of the Fortran standard. One major failure is that Fortran stdlib lacks guidance on generic interfaces and portability. PS: I would appreciate not being CC'd on Fortran standard discussions, because I'm now being CC'd on all messages about loadtxt and savetxt. If one needs portable file IO, then use HDF5.
Here are the updated edit descriptors that work:
These formats allow exact round-trip of data for
a.txt:
b.txt:
c.txt:
This is now part of #494 if anybody would like to review it.
Thanks, Steve. Would you participate once we made such a low-level developers guide? What information specifically do you find missing about generic interfaces and portability? We currently use the fypp preprocessor to generate specific procedures for a number of types and kinds. Is the lacking guidance then about how a new contributor should use this to create generic procedures?
It is unlikely that I'll participate as I know neither FORD, FYPP, nor CMAKE, and I have only I have gone into detail in Fortran Discourse about data mining
Does Fortran stdlib build? If no, the errors will be obvious. Neither Just a spot check of stdlib_specialfunctions.f90 and stdlib_specialfunctions_legendre.f90 shows
So, to answer your last question. No. This is not about a lack of guidance for a new
There has been some discussion of fpm(1) automatically building a module and/or INCLUDE file for each build that would include So I don't think there is a perfect solution for the problem with kinds, but I think we can reduce the problem a little bit, especially via fpm(1) enhancements, as the problem potentially can vex any Fortran application.
Why do you need this format:
It seems that
Is the issue with exponents? This is a common issue, and we should have definitive guidelines on how to print real numbers (single, double, quadruple) without losing precision. Yes, we can allow the user to configure.
@urbanjost, I agree with much of what you wrote. I will, however, point out that we are discussing modern Fortran.
Yes, but it's not about losing precision. See #505 (comment) where I used Take this program:
which produces this output:
The issue came up with
which produces this output:
In summary, you need to specify exponent width in case of
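The effect of the exponent width can be reproduced with a small sketch (not the program from the comment above, which did not survive in this copy of the thread). For an `ES w.d` descriptor without an explicit exponent width, the Fortran standard prescribes that a three-digit exponent is written without the exponent letter, which many non-Fortran tools cannot parse. The program name, the value `1.0e-100_dp`, and the width 24 are assumptions chosen for illustration.

```fortran
program exponent_width
  use, intrinsic :: iso_fortran_env, only: dp => real64
  implicit none
  real(dp), parameter :: x = 1.0e-100_dp
  character(len=24) :: plain, fixed

  write(plain, '(es24.16)')   x    ! exponent width left unspecified
  write(fixed, '(es24.16e3)') x    ! exponent width fixed at three digits

  print '(a)', plain
  print '(a)', fixed

  ! Report whether each field still carries the exponent letter.
  print '(a,l1)', 'es24.16   has exponent letter: ', index(plain, 'E') > 0
  print '(a,l1)', 'es24.16e3 has exponent letter: ', index(fixed, 'E') > 0
end program exponent_width
```

Quadruple precision needs four exponent digits, and without an explicit exponent width such values cannot be represented in the field at all, so the same reasoning applies there with a wider exponent.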
Fixed by #494.
@milancurcic maybe it is too late, but this Discourse thread might help with this issue and the linked PR.
Yes, looks like we nailed it. :)
Considering that these formats are more useful than only for
@epagone yes it would. I always struggle to remember the exact format. Every time I print some results to a human-readable file, I like to print the full precision.
That's exactly my use case too, @certik
While porting stdlib_io tests to the new testing framework (PR in progress in #494), it occurred to me that the loadtxt tests were not testing values, but merely loading the data and printing them to stdout. When I rewrote the tests to compare the values, the tests passed for GFortran but failed for ifort (see awvwgk#87).
We should be able to round-trip the data without loss of information, i.e.
Currently it seems that doesn't generally work because both loadtxt and savetxt use list-directed input and output, which is not generally portable. We should use specific formats with field widths wide enough so that we can both round-trip the data and so that the files are in the same format between different compilers.