Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

savetxt and loadtxt are not portable #505

Closed
milancurcic opened this issue Sep 1, 2021 · 29 comments
Closed

savetxt and loadtxt are not portable #505

milancurcic opened this issue Sep 1, 2021 · 29 comments
Assignees
Labels
bug Something isn't working topic: IO Common input/output related features

Comments

@milancurcic
Copy link
Member

While porting stdlib_io tests to the new testing framework (PR in progress in #494), it occurred to me that the loadtxt tests were not testing values, but merely loading the data and printing them to stdout. When I re-wrote the tests to compare the values, though the tests passed for GFortran, they failed for ifort (see awvwgk#87).

We should be able to round-trip the data without loss of information, i.e.

call random_number(a)
call savetxt('data.txt', a)
call loadtxt('data.txt', b)
print *, all(a == b) ! should print T

Currently it seems that doesn't generally work because both loadtxt and savetxt use list-directed input and output, which is not generally portable. We should use specific formats with field widths wide enough so that we can both round-trip the data and so that the files are in the same format between different compilers.

@milancurcic milancurcic added the bug Something isn't working label Sep 1, 2021
@milancurcic milancurcic self-assigned this Sep 1, 2021
@ivan-pi
Copy link
Member

ivan-pi commented Sep 1, 2021

Some guidelines on output in formats with sufficient width for different precision kinds would be very useful (maybe even having a function returning a "standardized" format string?). In the past I made the mistake of not writing enough decimals, leading to false conclusions during benchmarking a CFD code.

@epagone
Copy link

epagone commented Sep 1, 2021

Some guidelines on output in formats with sufficient width for different precision kinds would be very useful (maybe even having a function returning a "standardized" format string?)

Does g0 address this or it's not guaranteed to work in this context?

@ivan-pi
Copy link
Member

ivan-pi commented Sep 1, 2021

Consulting with N2146, Section 13.7.5, about Generalized editing of the forms Gw , Gw.d , and G w.dEe, the standard says,

When w is zero the processor selects the field width.

which means behavior could differ between compilers. It also does not solve the issue for loadtxt, since the standard says further

On input, w shall not be zero. [...] Note that w cannot be zero for input editing.

@epagone
Copy link

epagone commented Sep 1, 2021

Thanks. Another solution might be to keep the number of decimals fixed and deal with different kinds using is_close (and related issues therein specified). However, it might be a bit too risky.

@milancurcic
Copy link
Member Author

Some guidelines on output in formats with sufficient width for different precision kinds would be very useful

Here's what I used:

https://en.wikipedia.org/wiki/IEEE_754#Basic_and_interchange_formats

I took the decimal digits value for each real kind from the table, and rounded up, so we get 8, 16, and 35 for sp, dp, and qp, respectively. Then for full width I added 7 to that (it's possible that's more than what's needed):

! Format strings with edit descriptors for each type and kind
character(*), parameter :: &
    FMT_INT = '(*(i0,1x))', &
    FMT_REAL_SP = '(*(es15.8,1x))', &
    FMT_REAL_DP = '(*(es23.16,1x))', &
    FMT_REAL_QP = '(*(es42.35,1x))', &
    FMT_COMPLEX_SP = '(*(es15.8,1x,es15.8))', &
    FMT_COMPLEX_DP = '(*(es23.16,1x,es23.16))', &
    FMT_COMPLEX_QP = '(*(es42.35,1x,es42.35))'

Even if g0 were to be portable between compilers (I tried that first), the real problem I ran into is what Ivan mentioned, you can't use it on input. For integers I use i0 on output (could be g0 as well), and list-directed (*) on input.

Thanks. Another solution might be to keep the number of decimals fixed and deal with different kinds using is_close (and related issues therein specified).

We can test for exact values and we should be able to reproduce them for sp, dp, or qp. Current tests for loadtxt in #506 do this:

        do n = 1, 100
            call random_number(input)
            call savetxt('test_dp.txt', input)
            call loadtxt('test_dp.txt', expected)
            call check(error, all(input == expected))
            if (allocated(error)) return
        end do

and that works for all real kinds. Granted, I didn't test with very large or very small numbers, and we probably should.

Though I have some basic awareness of how floating-point numbers work, I don't have enough experience to know about all caveats to test for. So any help with that would be very welcome.

@epagone
Copy link

epagone commented Sep 1, 2021

I'm not an expert of this area too but I thought that exact comparisons of floating point numbers is always a bad idea 🤷‍♂️

@milancurcic
Copy link
Member Author

Exactly comparing values that you know should be exactly the same is fine. On the other side, comparing a floating point value with the result of a floating point expression can be problematic, thus people use tolerant comparisons there.

What I'm unsure of is the ranges of values to test for, and whether the total width of decimal + 7 is good enough in all cases. Currently I test with a 100 iterations of random_number() for a 10x10 matrix, and that works for all reals. Perhaps I also need a series of tests with very large and very small numbers. I'll try that next.

@epagone
Copy link

epagone commented Sep 1, 2021

I'll try that next.

Thank you for doing this, it will definitely give a much more solid answer. @kargl does not seem to be much active on GitHub but he writes excellent messages on Discourse in this domain. I wish he could advise.

@urbanjost
Copy link

urbanjost commented Sep 1, 2021

Is the purpose of savetxt and loadtxt to provide for a round-trip transfer of data? There are many reasons to transfer an array to a data stream -- to inspect it, to format it for use with other applications (plotting, entry into databases and spreadsheets (usually CSV), presentation in a document (converting to HTML, LaTex, *roff, ...), archiving, transfer to another process, ....

There are libraries for self-describing data (NetCFD, HDF5, ...) serialized standards (TOML, JSON, ...) , transfer to other processes or platforms (xdr, ...) ... For simple round-trips hexadecimal format is often used (the GPF M_matrix utility uses that in "save" and "load" commands as a simple example).

xdr is an old standby for a compromise between the efficiency of binary transfers but all the problems that entails (byte size of values, endian, mantissa types, ...) and cross-platform portability.

I have never seen a formal declaration of which of these or other uses savetxt and loadtxt were targetting short and/or long-term. The answer is very different depending on the intended use. Hexadecimal values are far more reliable for a round trip but far less intelligible, something as simple as CSV is better than a plain table for importing into many applications, HTML is often importable and also useful for use with a browser, most people prefer a row-column decimal-justified column format for small tables to be viewed manually, any kind of data archiving is hopefully using a self-describing format , where a large number of digits can be inappropriate and imply often non-existent precision to the values ... and so on. NAMELIST should not be forgotten, as it is self-describing, a part of the Fortran standard, simple to use and supports user-defined types better than anything else I can think of.

@milancurcic
Copy link
Member Author

milancurcic commented Sep 2, 2021

Is the purpose of savetxt and loadtxt to provide for a round-trip transfer of data?

Thanks @urbanjost, good question. Let's discuss, I think it's up to us to decide. My understanding is that the main purposes are:

  1. Easy one-liners for reading and writing arrays
  2. Human readable
  3. Bit-to-bit reproducible transfer of data? Maybe; I think so.

1 and 2 are very useful for development and prototyping, or to easily send a 2-d array to a friend or colleague and they can read it from any tool or language.

3 is important for use in the early development of simulations (toy models), especially the chaotic kind that are sensitive to initial conditions. Restarting such simulations would need exact round-trip IO. Production code will likely use some more optimal self-describing binary format like NetCDF or HDF5, but the target audience for savetxt and loadtxt is I think different then those of mature production code.

3 is to an extent at odds with 2) for real and complex numbers. To preserve the data, you need to store all the digits. The more digits the less readable.

If 3 is not important, list-directed IO is still not optimal because most compilers will still output all significant digits for that type kind.

We can consider having the user provide the format as an argument to savetxt, like in NumPy.savetxt. Then the user could specify fmt='f5.2' and write pi as 3.14. The problem here is that to load the data, the user would need to specify the format, or we'd need to rely on list-directed IO for input because we can't use "zero-width" descriptors like g0 in read statements.

If 3) is not important, then my marking this as a "bug" is not charitable, of course.

CC also @certik and @jvdp1 who were the original implementors.

Something really confuses me though. I see that there was a PR #89 that made a format fix for ifort in qsavetxt, however the current stdlib_io.fypp doesn't have qsavetxt. Does anybody know what PR removed it? I can't find it.

@Beliavsky
Copy link

Beliavsky commented Sep 2, 2021

@milancurcic "The problem here is that to load the data, the user would need to specify the format, or we'd need to rely on list-directed IO for input because we can't use "zero-width" descriptors like g0 in read statements."

Are there or will there be functions in stdlib to read a single real, integer, or logical variable from a string? If there is, one could first decompose a line of text into tokens and then read one of the basic types from the tokens.

@milancurcic
Copy link
Member Author

Is there or will there be functions in stdlib to read a single real, integer, or logical variable from a string? If there is, one could first decompose a line of text into tokens and then read one of the basic types from the tokens.

I don't think we have these functions yet, but we will eventually. This could certainly be an approach to this problem. It'd be interesting to see how it compares performance-wise--I think we'd be effectively comparing our home-cooked text parsing with that of the compiler.

@urbanjost
Copy link

I have a routine that does that for reading tables that massages the input data quite a bit and assumes the file is small enough to be held in memory that I would certainly not use for large files, but for relatively small sizes the time is negligible.
read_tables. It is just a convenience routine for reading data out of messy input files, but I will time it as a worse-case example, as it does some extra work to clean up some unfriendly input (but it is not bullet-proof and goes wrong on filenames with numbers in them in the file and such).

@urbanjost
Copy link

urbanjost commented Sep 3, 2021

could definitely tune it up, but for a million values I got

 (test1) urbanjs@venus:~/github/fpm-tools$ xx
 list-directed read  0.3437500    
 read_table_d   4.296875    
 size=            1000000
 size(dim=1)=      100000
 size=(dim=2)          10
(test1) urbanjs@venus:~/github/fpm-tools$ ffpm /tmp/xx.ff -compiler gfortra
(test1) urbanjs@venus:~/github/fpm-tools$ xx
 list-directed read  0.500000000    
 read_table_d   3.34375000    
 size=            1000000
 size(dim=1)=      100000

so reading a million values with list-directed I/O I got about 0.34(ifort) to 0.50(gfortran) wall-clock seconds; and using an untuned routine that does parsing I got 4.4 to 3.4+ seconds. So significantly slower, but sure that could be improved. For what I use that routine for 1 000 000, values is towards the high end, but if someone was reading in a file in the gigabytes that would be very significant. But I have some old routines that do not use internal reads that are faster than list-directed I/O so it definitely could be sped up and not horrible for small files; what I use it for rarely would be in the tens of thousands of values and is typically < 2 000.

@kargl
Copy link

kargl commented Sep 4, 2021

I'll try that next.

Thank you for doing this, it will definitely give a much more solid answer. @kargl does not seem to be much active on GitHub but he writes excellent messages on Discourse in this domain. I wish he could advise.

I don't participate here, for a very simple reason. While the goals of Fortran stdlib appear noble, the complete lack of a low level developers guide is a major problem in my mind. If Fortran stdlib is intended as a place to prototype features that might be subsumed into the Fortran standard, then the prototype should consider requirements of the Fortran standard. One major failure is that Fortran stdlib lacks guidance on generic interfaces and portability.

PS: I would appreciate not being CC'd on Fortran standard discussions, because I'm now being cc'd on all messages about loadtxt and savetxt. If one needs portable file IO, then use HDF5.

@milancurcic
Copy link
Member Author

Here are the updated edit descriptors that work:

  character(*), parameter :: &
    FMT_INT = '(*(i0,1x))', &
    FMT_REAL_SP = '(*(es15.8e2,1x))', &
    FMT_REAL_DP = '(*(es24.16e3,1x))', &
    FMT_REAL_QP = '(*(es44.35e4,1x))', &
    FMT_COMPLEX_SP = '(*(es15.8e2,1x,es15.8e2))', &
    FMT_COMPLEX_DP = '(*(es24.16e3,1x,es24.16e3))', &
    FMT_COMPLEX_QP = '(*(es44.35e4,1x,es44.35e4))'

These formats allow exact round-trip of data for real32, real64, and real128 with gfortran and ifort, with minimum whitespace between the columns. Exact format also allows aligned columns for easier reading.

program savetxt_example
  use stdlib_kinds, only: sp, dp, qp
  use stdlib_io, only: savetxt
  implicit none
  real(sp) :: a(4,3)
  real(dp) :: b(4,3)
  real(qp) :: c(4,3)

  call random_number(a)
  call random_number(b)
  call random_number(c)

  call savetxt('a.txt', a)
  call savetxt('b.txt', b)
  call savetxt('c.txt', c)

end program savetxt_example

a.txt:

 9.33669031E-01  2.64528692E-01  2.54230320E-01
 2.96062171E-01  9.26160216E-01  2.47726083E-01
 2.87877381E-01  7.95032144E-01  4.12473977E-01
 8.26243699E-01  1.06937587E-01  7.78156042E-01

b.txt:

 2.6968407684770512E-001  8.2420001809307941E-001  3.0919805643459708E-001
 5.4808561725895077E-001  5.5357863003309327E-001  3.5734971596028231E-001
 4.9961482000271629E-001  6.2205647695416277E-001  5.3462213146864401E-001
 6.5247240291741648E-001  2.4842211148445337E-001  6.7894369306342506E-001

c.txt:

 2.08361307985962562610788302110365700E-0001  5.69439675947036657838952732681504237E-0001  1.55528153936312621447178835266931698E-0001
 9.18999690461143604511254997530596456E-0001  4.15992702082243546961360276875804139E-0001  5.84807448198961432169743015969827754E-0001
 8.75648724434417664249649052750668218E-0001  4.37633315084581150934259523607018412E-0001  9.58784781634355742058251023194250353E-0001
 4.56441615622529378587037179851793261E-0001  8.63972797258206240440791712871309709E-0001  8.16966148769914006149497083124331801E-0001

This is now part of #494 if anybody would like to review it.

@milancurcic
Copy link
Member Author

I don't participate here, for a very simple reason. While the goals of Fortran stdlib appear noble, the complete lack of a low level developers guide is a major problem in my mind. If Fortran stdlib is intended as a place to prototype features that might be subsumed into the Fortran standard, then the prototype should consider requirements of the Fortran standard. One major failure is that Fortran stdlib lacks guidance on generic interfaces and portability.

Thanks, Steve. Would you participate once we made such a low-level developers guide? What information specifically do you find missing about generic interfaces and portability? We currently use the fypp preprocessor to generate specific procedures for a number of types kinds. Is the lacking guidance then about how a new contributor should use this to create generic procedures?

@kargl
Copy link

kargl commented Sep 4, 2021

I don't participate here, for a very simple reason. While the goals of Fortran stdlib appear noble, the complete lack of a low level developers guide is a major problem in my mind. If Fortran stdlib is intended as a place to prototype features that might be subsumed into the Fortran standard, then the prototype should consider requirements of the Fortran standard. One major failure is that Fortran stdlib lacks guidance on generic interfaces and portability.

Thanks, Steve. Would you participate once we made such a low-level developers guide? What information specifically do you find missing about generic interfaces and portability? We currently use the fypp preprocessor to generate specific procedures for a number of types kinds. Is the lacking guidance then about how a new contributor should use this to create generic procedures?

It is unlikely that I'll participate as I know neither FORD, FYPP, nor CMAKE, and I have only
a fleeting knowledge about git.

I have gone into detail in Fortran Discourse about data mining iso_fortran_env to get the available
kind type parameters. Currently Fortran stdlib does not data mine and simply assumes real32,
real64, and real128 are valid kind type parameters. (Same applies to at least integer named
constants. logical and character kind type parameters appear to be ignored.) The assumption
seems to be that real32, real64, and real128 are the same as IEEE 754 binary32, binary64,
and binary128. This may or may not be the case, but the use of these real kind type parameter
excludes the Intel 80-bit extended double format, which is the type mapped to C_LONG_DOUBLE
on the most popular CPUs on the planet. The only guarantee is that real32 occupies 32 bits, real64
occupies 64 bits, and if it exists real128 occupies 128 bits. What happens if real128 = -1 or
real128 = -2? You can test this with a simple change to stdlib_kinds.f90.

 use iso_fortran_env, only: sp=>real32, dp=>real64   ! , qp=>real128
 integer, parameter :: qp = -1

Does Fortran stdlib build? If no, the errors will be obvious. Neither REAL(QP) nor 1._e0_qp
are valid. If yes, then Fortran stdlib does not use qp anywhere, which means that there are
no portable generic interfaces within stdlib (because on a system with a valid real128, no
code will be generated for this kind type parameter).

Just a spot check of stdlib_specialfunctions.f90 and stdlib_specialfunctions_legendre.f90 shows
that the LEGENDRE function is not generic. In fact, two functions LEGENDRE and
DLEGENDRE are defined for the exact same type, REAL(DP), and both functions are PUBLIC!
With the proviso that LEGENDRE likely should be defined in terms of REAL(SP), then LEGENDRE
and DLEGENDRE are specific names (See for example SIN and DSIN). Now, let's look at
the Fortran 2018 standard.

 B.3 Obsolescent features
 ...
 (11) Specific names for intrinsic functions ­ see B.3.12.

 B.3.12 Specific names for intrinsic functions

 The specific names of the intrinsic functions are often obscure and hinder portability.
 They have been redundant   since Fortran 90. Use generic names for references to
 intrinsic procedures.

So, to answer your last question. No. This is not about a lack of guidance for a new
contributor. It is about a lack of guidance in general, and the lack of any hope of
portable generic interfaces in Fortran stdlib, because the available types for a given
processor are not enumerator from iso_fortran_env.

@urbanjost
Copy link

There has been some discussion of fpm(1) automatically building a module and/or INCLUDE file for each build that would include
compiler-specific information, and for it to possibly include a #DEFINE file for a similar purpose (or whatever format a default preprocessor might require (which is another ongoing discussion). As fpm(1) so far goes a long way towards not needing CMake, deep knowledge of git(1), and preprocessing other than what is supported by the selected compiler, including the mined information for supported types is a stand-out of example of how that could be useful. The problem with different compilers is not specific to stdlib but to any Fortran application. Without some type of conditional compilation supported by the standard I do not see it going away anytime soon. Without some unlikely change to Fortran itself like the compiler skipping SELECT TYPE sections for unsupported types (which would only be a partial solution anyway) and language support for templating I don't see any near-term solution that does not include preprocessing. And of course if a problem requires 128bit values and the compiler does not support them no amount of preprocessing helps. Perhaps a fallback TYPE that is arbitrary precision for unsupported precisions would at least let all code at least compile, but would probably open a Pandora's box of other issues, and isn't likely anyway. So Fortran's standard support of just about any type a compiler can process is the feature that produces the problem, and I don't think anyone is going to call for that to be removed. Isn't this a generic problem that only a routine that is part of the standard and so provided by the compiiler provider can avoid this? I would not blame stdlib too much for this, as the problem seems intractable in general terms. It is making me think that making releases of stdlib into fpm(1) packages (which could then use the aforementioned mined data files) is worth some thought. But everything I think of keeps hitting a catch-22.

So I don't think there is a perfect solution for the problem with kinds, but I think we can reduce the problem a little bit, especially via fpm(1) enhancements, as the problem potentially can vex any Fortran application.

@certik
Copy link
Member

certik commented Sep 5, 2021

Why do you need this format:

    FMT_REAL_DP = '(*(es24.16e3,1x))', &

It seems that es23.16 should be enough to not lose accuracy:

Is the issue with exponents? This is a common issue and we should have definitive guidelines how to print real numbers (single, double, quadruple) that does not lose precision.

Yes, we can allow the user to configure.

@kargl
Copy link

kargl commented Sep 5, 2021

@urbanjost , I agree with much of what you wrote. I will however point we are discussing modern Fortran.
A modern Fortran processor has a Companion C compiler. A Companion C compiler will have a pre-processor.
Therefore, a Fortran processor does not need a Fortran specific pre-processor. It can use the C compiler's
pre-processor. Fortran stdlib should leverage this C pre-processor.

@milancurcic
Copy link
Member Author

Is the issue with exponents? This is a common issue and we should have definitive guidelines how to print real numbers (single, double, quadruple) that does not lose precision.

Yes, but it's not about losing precision. See #505 (comment) where I used es23.16. That was before I ran into exponents issue.

Take this program:

use iso_fortran_env, only: qp => real128
real(qp) :: a

print '(es44.35)', 0.5 * huge(a) 
print '(es44.35e4)', 0.5 * huge(a) 

end

which produces this output:

********************************************
 5.94865747678615882542879663314003508E+4931

The issue came up with real128 numbers, and specifying exponent width fixes it. But I chose to use the exponent width for all kinds so that the decimal and exponent parts align on output for easier reading. Here's another program:

use iso_fortran_env, only: dp => real64
real(dp) :: a

print *, 'Using es23.16'
print '(es23.16)', 0.5 * tiny(a) 
print '(es23.16)', 0.123456789_dp
print '(es23.16)', 0.5 * huge(a) 
print *
print *, 'Using es24.16e3'
print '(es24.16e3)', 0.5 * tiny(a) 
print '(es24.16e3)', 0.123456789_dp
print '(es24.16e3)', 0.5 * huge(a) 

end

which produces this output:

 Using es23.16
 1.1125369292536007-308
 1.2345678900000000E-01
 8.9884656743115785+307

 Using es24.16e3
 1.1125369292536007E-308
 1.2345678900000000E-001
 8.9884656743115785E+307

In summary, you need to specify exponent width in case of real128 numbers, and for smaller kinds it's for appearance and readability, as far as I can tell.

@awvwgk awvwgk added the topic: IO Common input/output related features label Sep 18, 2021
@milancurcic
Copy link
Member Author

Fixed by #494.

@epagone
Copy link

epagone commented Nov 16, 2021

@milancurcic maybe it is too late, but this Discourse thread might help with this issue and linked PR.

@milancurcic
Copy link
Member Author

Thanks @epagone, the discussion there seems consistent with the implementation in #494.

@certik
Copy link
Member

certik commented Nov 17, 2021

Yes, looks like we nailed it. :)

@epagone
Copy link

epagone commented Nov 17, 2021

Considering that these formats are more useful than only for savetxt and loadtxt, would it make sense to make them available on their own in stdlib_io and simply use them? I am thinking, for example, of any sort of I/O where round-trip accuracy is required (e.g. writing to a CSV, JSON, TOML, YAML, etc...) to override a default behaviour that might be less strict.

@certik
Copy link
Member

certik commented Nov 17, 2021

@epagone yes it would. I always struggle to remember the exact format. Every time I print some results to a file that is human readable, I like to print the full precision.

@epagone
Copy link

epagone commented Nov 18, 2021

That's exactly my same use case @certik

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working topic: IO Common input/output related features
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants