Addition of variance function in stdlib_experimental_stats #144

jvdp1 · 2020-02-06T00:03:03Z

Based on #137

With this PR, I propose to add a function for computing the variance of elements in arrays, using the same API as stdlib::mean. The algorithm used is a two-pass algorithm (as discussed in #3).
Based on #3 and #114, I avoided using the function mean (and creating a new function center for doing x - mean), to avoid a loss in performance.
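
For readers unfamiliar with the two-pass approach mentioned above, here is a minimal sketch for a rank-1 real(dp) array. The helper name var_two_pass is hypothetical and is not the routine added by this PR; it only illustrates computing the mean in a first pass and the sum of squared deviations in a second pass.

```fortran
program two_pass_variance
  use, intrinsic :: iso_fortran_env, only: dp => real64
  implicit none
  real(dp) :: x(5)

  x = [1._dp, 2._dp, 3._dp, 4._dp, 5._dp]
  print *, var_two_pass(x)

contains

  ! Hypothetical helper for illustration only (not the stdlib routine).
  pure function var_two_pass(x) result(res)
    real(dp), intent(in) :: x(:)
    real(dp) :: res
    real(dp) :: n, mean

    n = real(size(x), dp)
    ! First pass: the mean.
    mean = sum(x) / n
    ! Second pass: sum of squared deviations, divided by n - 1 (sample variance).
    res = sum((x - mean)**2) / (n - 1._dp)
  end function var_two_pass

end program two_pass_variance
```

For the values 1 to 5 the mean is 3 and the sample variance is 10/4 = 2.5.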

How can we avoid the select case statement in:

        select case(dim)
          #:for fi in range(1, rank+1)
          case(${fi}$)
            n = real(size(x, dim), ${k1}$)
            mean = sum(x, dim) / n
            do i = 1, size(x, dim)
              res = res + (x${rankindice(':', 'i', rank, fi )}$ - mean)**2
            end do
          #:endfor
          case default
            call error_stop("ERROR (mean): wrong dimension")
        end select

and

        select case(dim)
          #:for fi in range(1, rank+1)
          case(${fi}$)
            n = real(count(mask, dim), ${k1}$)
            mean = sum(x, dim, mask) / n
            do i = 1, size(x, dim)
              res = res + merge( (x${rankindice(':', 'i', rank, fi )}$ - mean)**2,&
              #:if t1[0] == 'r'
                                  0._${k1}$,&
              #:else
                                  cmplx(0._${k1}$, 0._${k1}$, ${k1}$),&
              #:endif
                                  mask${rankindice(':', 'i', rank, fi)}$)
            end do
          #:endfor
          case default
            call error_stop("ERROR (mean): wrong dimension")
        end select

I am probably missing something obvious!

Another issue is the compilation time needed with the Makefiles in the CI!

Note: Each new statistical function in stdlib_stats will potentially include 600 additional functions. It really illustrates the issue of having no templates.

jvdp1 · 2020-02-06T09:47:53Z

@fiolj Could you check the implementation for complex numbers? Currently there are no tests implemented, to limit the number of tests (but I can implement some tests in a later commit/PR).

fiolj · 2020-02-06T16:34:13Z

I'll do it, but I can't before tonight or tomorrow morning.

certik

This PR is fine. Thanks.

In the long term, I wonder if fypp can be improved so that there is not as much repetition. There are still 8 different blocks to declare the signature of the function (real/integer and (x, mask)/(x, dim, mask) and mask = scalar/array, all combinations 2x2x2=8). And even more long term, the Fortran language itself should be improved so that fypp is not needed.

fiolj · 2020-02-07T06:22:07Z

I found it really good overall, but I think the variance of a complex array is always a real number
(wikipedia link).
Thus, for instance in var_all, we should replace:

    @@ -16,7 +16,7 @@ contains
     module function ${RName}$(x, mask) result(res)
       ${t1}$, intent(in) :: x${ranksuffix(rank)}$
       logical, intent(in), optional :: mask
    -  ${t1}$ :: res
    +  real(${k1}$) :: res

and

    @@ -28,7 +28,7 @@ contains
       n = real(size(x, kind = int64), ${k1}$)
       mean = sum(x) / n
    -  res = sum((x - mean)**2) / (n - 1._${k1}$)
    +  res = sum(abs(x - mean)**2) / (n - 1._${k1}$)

and in rname("var",rank, t1, k1):

    -  ${t1}$ :: res${reduced_shape('x', rank, 'dim')}$
    +  real(${k1}$) :: res${reduced_shape('x', rank, 'dim')}$

and later:

    -  res = res + (x${rankindice(':', 'i', rank, fi )}$ - mean)**2
    +  res = res + abs(x${rankindice(':', 'i', rank, fi )}$ - mean)**2

Also, if res is real, we don't need the conversion to real to return ieee_nan:

    -  res = ieee_value(real(res, kind=${k1}$), ieee_quiet_nan)
    +  res = ieee_value(1._${k1}$, ieee_quiet_nan)

Coincidentally, I was thinking that my solution to the ieee for complex was unnecessarily complex.
The above line: res = ieee_value(1._${k1}$, ieee_quiet_nan) should work also when res is a complex variable.

Finally, I can look into it in more detail tomorrow, but I was thinking that maybe, even for real numbers, we don't have to match the kind of the input. Indeed, if the input is a large quantity of real(sp) values, it may be desirable that the variance (and mean) be calculated in double precision. I don't know if I am missing something about the availability of double precision on some machines, but I would think that nowadays that should not be an issue.

milancurcic

Thanks @jvdp1, looks good to me, pending any revisions to address complex/real suggestions by @fiolj.

jvdp1 · 2020-02-07T20:26:18Z

Thanks @fiolj for your review.
I will look tomorrow at your proposals for the variance of complex numbers.
Below are some other comments.

Coincidentally, I was thinking that my solution to the ieee for complex was unnecessarily complex.
The above line: res = ieee_value(1._${k1}$, ieee_quiet_nan) should work also when res is a complex variable.

This is indeed a good idea, much simpler than the previous implementation. I will implement it here and for mean too.
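
A small sketch of why the simplified call also works for a complex result, assuming dp is real64: ieee_value returns a real NaN, and assigning a real to a complex variable converts it, so the real part becomes NaN while the imaginary part is set to zero. The program and variable names are illustrative only.

```fortran
program nan_for_real_and_complex
  use, intrinsic :: iso_fortran_env, only: dp => real64
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan, ieee_is_nan
  implicit none
  real(dp) :: r
  complex(dp) :: z

  ! Only the type and kind of the first argument matter, not its value.
  r = ieee_value(1._dp, ieee_quiet_nan)
  ! Real-to-complex assignment: the real part is NaN, the imaginary part is 0.
  z = ieee_value(1._dp, ieee_quiet_nan)

  print *, ieee_is_nan(r)
  print *, ieee_is_nan(real(z)), ieee_is_nan(aimag(z))
end program nan_for_real_and_complex
```

Note that only the real part of z is NaN here; code that tests a complex result for NaN should check the real part (or both parts).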

Finally, Tomorrow I can look into it in more detail, but I was thinking that may be, even for real numbers we don't have to match the kind of the input.

The idea was to get the same API as sum. So the result is of the same type as the real array, as in sum. This API also avoids internal conversion from, e.g., sp to dp, before calling sum (since it would make little sense to do it after the sum operations IMO).
Also, having all results in dp, will not reduce the number of generated functions.

Indeed if the input is a large quantity of real(sp) it may be desirable that the variance (and mean) be calculated in double precision.

I think it is the responsibility of the user to check that (as is already the case with sum).
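
As a sketch of what leaving this to the user could look like, the caller can promote real(sp) data to dp before reducing, exactly as one would do with the intrinsic sum. The variable names below are illustrative, and the variance is written out by hand rather than calling the stdlib function.

```fortran
program promote_before_reducing
  use, intrinsic :: iso_fortran_env, only: sp => real32, dp => real64
  implicit none
  real(sp) :: x(4)
  real(dp) :: n, mean_dp, var_dp

  x = [1._sp, 2._sp, 3._sp, 4._sp]
  n = real(size(x), dp)

  ! Convert each element to dp first, so both passes accumulate in dp.
  mean_dp = sum(real(x, dp)) / n
  var_dp = sum((real(x, dp) - mean_dp)**2) / (n - 1._dp)

  print *, mean_dp, var_dp
end program promote_before_reducing
```

The conversion creates a dp temporary of the same shape as x, which is the cost the API avoids imposing on every caller.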

jvdp1 · 2020-02-07T23:03:58Z

Following @fiolj's comments, I pushed commits for:

- simplifying the ieee_value call (as proposed by @fiolj)
- implementing a corrected computation of the variance of complex numbers
- implementing tests for complex (I tested them against Octave)

Regarding the implementation, I did e.g.,

    #:if t1[0] == 'r'
      res = sum((x - mean)**2) / (n - 1._${k1}$)
    #:else
      res = sum(abs(x - mean)**2) / (n - 1._${k1}$)
    #:endif

to avoid the abs function in the real var function (it is useless there and costs some efficiency).

@fiolj could you confirm it was what you suggested, please?
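
A minimal sketch of the complex case discussed here, with made-up data and a hypothetical helper var_real: the variance computed with abs(x - mean)**2 is real, and it equals the sum of the variances of the real and imaginary parts, as the commit message for this change suggests.

```fortran
program complex_variance
  use, intrinsic :: iso_fortran_env, only: dp => real64
  implicit none
  complex(dp) :: z(3), mean
  real(dp) :: n, var_abs, var_parts

  z = [cmplx(1._dp, 2._dp, dp), cmplx(3._dp, 0._dp, dp), cmplx(2._dp, 4._dp, dp)]
  n = real(size(z), dp)
  mean = sum(z) / n

  ! Variance of complex data is real: use the squared modulus of the deviations.
  var_abs = sum(abs(z - mean)**2) / (n - 1._dp)
  ! Equivalent form: variance of the real parts plus variance of the imaginary parts.
  var_parts = var_real(real(z)) + var_real(aimag(z))

  print *, var_abs, var_parts

contains

  ! Hypothetical helper: two-pass sample variance of a real rank-1 array.
  pure function var_real(x) result(res)
    real(dp), intent(in) :: x(:)
    real(dp) :: res
    real(dp) :: n, mean

    n = real(size(x), dp)
    mean = sum(x) / n
    res = sum((x - mean)**2) / (n - 1._dp)
  end function var_real

end program complex_variance
```

For this data the mean is (2, 2), the squared moduli of the deviations are 1, 5 and 4, and both expressions evaluate to 5.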

fiolj · 2020-02-08T14:11:10Z

Thanks @jvdp1, that was exactly what I suggested.
I think it looks good

fiolj · 2020-02-08T14:16:26Z

This PR is fine. Thanks.

In the long term, I wonder if fypp can be improved so that there is not as much repetition. There are still 8 different blocks to declare the signature of the function (real/integer and (x, mask)/(x, dim, mask) and mask = scalar/array, all combinations 2x2x2=8). And even more long term, the Fortran language itself should be improved so that fypp is not needed.

I agree with @certik here. I was thinking along the same lines. A possible solution for real/int interfaces does not seem too difficult. Something like (for instance for the function mean):


    #:for k1, t1 in RCI_KINDS_TYPES
      #:for rank in RANKS
        #:set RName = rname("mean_all",rank, t1, k1)
        #:set ret_type = 'real(dp)' if t1[0] == 'i' else t1
        module function ${RName}$ (x, mask) result(res)
          ${t1}$, intent(in) :: x${ranksuffix(rank)}$
          logical, intent(in), optional :: mask
          ${ret_type}$ :: res
        end function ${RName}$
      #:endfor
    #:endfor

which is valid for all types (real, complex, int). For some simple functions we could write the implementation in some way that would be mostly the same also.

For the function var, we could use something similar:

    #:for k1, t1 in RCI_KINDS_TYPES
      #:for rank in RANKS
        #:set RName = rname("var_all",rank, t1, k1)
        #:set ret_type = 'real(dp)' if t1[0] == 'i' else "real({})".format(k1)
        module function ${RName}$(x, mask) result(res)
          ${t1}$, intent(in) :: x${ranksuffix(rank)}$
          logical, intent(in), optional :: mask
          ${ret_type}$ :: res
        end function ${RName}$
      #:endfor
    #:endfor

This would cut the code in half, without addressing the remaining factor of four.

jvdp1 · 2020-02-09T13:47:40Z

Thank you for your review.

For the function var, we could use something similar:

    #:for k1, t1 in RCI_KINDS_TYPES
      #:for rank in RANKS
        #:set RName = rname("var_all",rank, t1, k1)
        #:set ret_type = 'real(dp)' if t1[0] == 'i' else "real({})".format(k1)
        module function ${RName}$(x, mask) result(res)
          ${t1}$, intent(in) :: x${ranksuffix(rank)}$
          logical, intent(in), optional :: mask
          ${ret_type}$ :: res
        end function ${RName}$
      #:endfor
    #:endfor

This would cut the code in half, without addressing the remaining factor of four.

That will reduce the number of blocks, but it might give a fypp file that is difficult to read and follow because of too many conditional statements (if/set), especially in the implementation, where fypp conditions (if) will be needed, as already used for complex. We could probably implement some functions in common.fypp, but I am still concerned about the complexity of the code.

For now, I would suggest that we merge this PR into master. Then we can open a PR to (try to) reduce the number of blocks in var and mean (but unfortunately, it will not reduce the number of generated functions).

fiolj · 2020-02-10T11:05:01Z

On 9/2/20 at 10:47, Jeremie Vandenplas wrote:

For now, I would suggest that we merge this PR with the master. Then we can open a PR to (try to) reduce the number of blocks in var and mean (but unfortunately, it will not reduce the number of generated functions).

Yes, I agree. It is not a completely satisfactory solution yet, and we should try to come up with something better. I put it forward mainly to start thinking about it.

aradi · 2020-02-10T13:04:57Z

src/common.fypp

+#! E.g., (:, :, :, i, :, :)
+#!
+
+#:def rankindice(varname, varname1, origrank, dim)


We should probably rename this macro to have a more descriptive name. If it is only used to select subarrays by reducing the dimension, we could have:

    #:def select_subarray(origrank, selectors)
      #:assert origrank > 0
      #:set seldict = dict(selectors)
      #:call join_lines(joinstr=", ", prefix="(", suffix=")")
        #:for i in range(1, origrank + 1)
          $:seldict.get(i, ":")
        #:endfor
      #:endcall
    #:enddef

and use it as

    #! -> x(:, i, :)
    x${select_subarray(3, [(2, 'i')])}$

It could also be used, if we need to reduce more than one rank, e.g.

    #! -> x(:, :, i, j)
    x${select_subarray(4, [(3, 'i'), (4, 'j')])}$

Also the description should be clarified a bit.

jvdp1 · 2020-02-10

Implemented as suggested. The proposed macro is more general and better suited to its purpose.
Could you review it again, please?

nshaffer

Looks good to me, too. I have only minor comments. It's a serviceable baseline implementation.

nshaffer · 2020-02-11T20:54:50Z

src/stdlib_experimental_stats.md

+### Return value
+
+If `array` is of type `real` or `complex`, the result is of the same type as `array`.
+If `array` is of type `integer`, the result is of type `double precision`.


I will raise this issue elsewhere, but I do not agree with this API for the return type when the input is integer data. I only bring it up here because it is not quite correct to say that the return type is double precision, when in fact the type is real(real64). I'm not suggesting any changes now.

jvdp1 · 2020-02-13

@nshaffer Thank you for your review.

I used double precision because they are declared as dp. But I agree they are actually real(real64). The issue with using real64 in the spec is that if the definition of dp in stdlib_experimental_kinds changes (there have already been discussions on that), then we will need to modify the spec too.

Would it be better to write "..... the result is of type dp."?

nshaffer · 2020-02-11T21:07:23Z

src/stdlib_experimental_stats_var.fypp

+        real(${k1}$) :: n
+        ${t1}$ :: mean
+
+        if (.not.optval(mask, .true.)) then


This is a weird idiom to me. Here, I'd prefer the more obvious

    if (present(mask)) then
      if (mask .eqv. .false.) then

But this is a matter of style rather than substance.

jvdp1 · 2020-02-13

Hopefully neither option will be needed in a future standard.
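
To make the two styles under discussion concrete, here is a self-contained sketch. The function optval_logical is a hypothetical stand-in for stdlib's optval restricted to default logicals, and uses_optval and uses_present are illustrative names; both return .true. only when mask is present and .false..

```fortran
program optional_mask_idiom
  implicit none

  print *, uses_optval(), uses_optval(.true.), uses_optval(.false.)
  print *, uses_present(), uses_present(.true.), uses_present(.false.)

contains

  ! Hypothetical stand-in for optval: return x if present, else the default.
  pure function optval_logical(x, default) result(y)
    logical, intent(in), optional :: x
    logical, intent(in) :: default
    logical :: y

    if (present(x)) then
      y = x
    else
      y = default
    end if
  end function optval_logical

  ! Style used in the PR: an absent mask defaults to .true..
  function uses_optval(mask) result(skip)
    logical, intent(in), optional :: mask
    logical :: skip

    skip = .not. optval_logical(mask, .true.)
  end function uses_optval

  ! Style preferred in the review: explicit present() test, then the value test.
  function uses_present(mask) result(skip)
    logical, intent(in), optional :: mask
    logical :: skip

    skip = .false.
    if (present(mask)) then
      if (.not. mask) skip = .true.
    end if
  end function uses_present

end program optional_mask_idiom
```

Both lines report false, false, true, so the two idioms are equivalent here; the nested form is needed in the second style because Fortran does not guarantee short-circuit evaluation, so mask must not be referenced in the same expression as present(mask).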

milancurcic · 2020-02-18T00:17:51Z

I just revisited this PR. It looks like everybody approved it. We can revisit any outstanding minor issues in a later PR. Merging, thanks @jvdp1!

jvdp1 added 3 commits February 5, 2020 23:36

addition of variance

e966e7b

varaince_dev: update var modules

044abc5

variance_dev: update spec var

d77b6e9

jvdp1 requested review from certik, aradi, milancurcic and nshaffer February 6, 2020 00:03

certik approved these changes Feb 6, 2020

milancurcic approved these changes Feb 7, 2020

jvdp1 added 4 commits February 7, 2020 21:51

variance_dev: changed ieee_value() as proposed

baabfc8

variance_dev: remove support of complex because it was wrong

9b19154

variance_dev: addition of variance of complex as (var(real(x)) + var(aimag(x))

da90a89

variance_dev: use fypp to avoid abs in real functions

2a0182a

aradi reviewed Feb 10, 2020

variance_dev:suggestions by @aradi

01e897c

aradi approved these changes Feb 10, 2020

nshaffer approved these changes Feb 11, 2020

milancurcic merged commit 7397e96 into fortran-lang:master Feb 18, 2020

jvdp1 deleted the variance_dev branch February 18, 2020 07:23
