Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Addition of variance function in stdlib_experimental_stats #144

Merged
merged 8 commits into from
Feb 18, 2020

Conversation

jvdp1
Copy link
Member

@jvdp1 jvdp1 commented Feb 6, 2020

Based on #137

With this PR, I propose to add a function for computing the variance of elements in arrays using the same API as stdlib::mean. The used algorithm is a two-pass algorithm (as discussed in #3).
Based on #3 and #114, I avoided to use the function mean (and to create a new function center for doing x - mean), to avoid loss in performance.

  • How can we avoid select case statement:
        select case(dim)
          #:for fi in range(1, rank+1)
          case(${fi}$)
            n = real(size(x, dim), ${k1}$)
            mean = sum(x, dim) / n
            do i = 1, size(x, dim)
              res = res + (x${rankindice(':', 'i', rank, fi )}$ - mean)**2
            end do
          #:endfor
          case default
            call error_stop("ERROR (mean): wrong dimension")
        end select

and

        select case(dim)
          #:for fi in range(1, rank+1)
          case(${fi}$)
            n = real(count(mask, dim), ${k1}$)
            mean = sum(x, dim, mask) / n
            do i = 1, size(x, dim)
              res = res + merge( (x${rankindice(':', 'i', rank, fi )}$ - mean)**2,&
              #:if t1[0] == 'r'
                                  0._${k1}$,&
              #:else
                                  cmplx(0._${k1}$, 0._${k1}$, ${k1}$),&
              #:endif
                                  mask${rankindice(':', 'i', rank, fi)}$)
            end do
          #:endfor
          case default
            call error_stop("ERROR (mean): wrong dimension")
        end select

I probably miss something that should be obvious!

Another issue is the compilation time needed with the Makefiles in the CI!

Note: Each new statistical function in stdlib_stats will potentially includes 600 additional functions. It really illustrates the issue of having no templates.

@jvdp1
Copy link
Member Author

jvdp1 commented Feb 6, 2020

@fiolj Could you check the implementation for complex numbers? Currently there is no tests implemented, to limit the number of tests (but I can implement some tests in a latter commit/PR).

@fiolj
Copy link
Contributor

fiolj commented Feb 6, 2020 via email

Copy link
Member

@certik certik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is fine. Thanks.

In the long term, I wonder if fypp can be improved so that there is not as much repetition. There are still 8 different blocks to declare the signature of the function (real/integer and (x, mask)/(x, dim, mask) and mask = scalar/array, all combinations 2x2x2=8). And even more long term, the Fortran language itself should be improved so that fypp is not needed.

@fiolj
Copy link
Contributor

fiolj commented Feb 7, 2020

I found it really good overall, but I think the variance for complex arrays is always a real number
(wikipedia link).
Thus, for instance in var_all, we should replace:
@@ -16,7 +16,7 @@ contains
module function ${RName}$(x, mask) result(res)
${t1}$, intent(in) :: x${ranksuffix(rank)}$
logical, intent(in), optional :: mask

  •    ${t1}$ :: res
    
  •    real(${k1}$) :: res
    

and
@@ -28,7 +28,7 @@ contains
n = real(size(x, kind = int64), ${k1}$)
mean = sum(x) / n

  •    res = sum((x - mean)**2) / (n - 1._${k1}$)
    
  •    res = sum(abs(x - mean)**2) / (n - 1._${k1}$)
    

and in rname("var",rank, t1, k1):

  •    ${t1}$ :: res${reduced_shape('x', rank, 'dim')}$
    
  •    real(${k1}$) :: res${reduced_shape('x', rank, 'dim')}$
    

and later:

  •          res = res + (x${rankindice(':', 'i', rank, fi )}$ - mean)**2
    
  •         res = res + abs(x${rankindice(':', 'i', rank, fi )}$ - mean)**2
    

Also, if res is real, we don't need the conversion to real to return ieee_nan

  •      res = ieee_value(real(res, kind=${k1}$), ieee_quiet_nan)
    
  •      res = ieee_value(1._${k1}$, ieee_quiet_nan)
    

Coincidentally, I was thinking that my solution to the ieee for complex was unnecessarily complex.
The above line: res = ieee_value(1._${k1}$, ieee_quiet_nan) should work also when res is a complex variable.

Finally, Tomorrow I can look into it in more detail, but I was thinking that may be, even for real numbers we don't have to match the kind of the input. Indeed if the input is a large quantity of real(sp) it may be desirable that the variance (and mean) be calculated in double precision. I don't know if I am missing something about availability of double precision for some machines but I would think that nowadays that should not be an issue

Copy link
Member

@milancurcic milancurcic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jvdp1, looks good to me, pending any revisions to address complex/real suggestions by @fiolj.

@jvdp1
Copy link
Member Author

jvdp1 commented Feb 7, 2020

Thanks @fiolj for your review.
I will look tomorrow on your propositions for the variance of complex.
Below are some other comments.

Coincidentally, I was thinking that my solution to the ieee for complex was unnecessarily complex.
The above line: res = ieee_value(1._${k1}$, ieee_quiet_nan) should work also when res is a complex variable.

This is indeed a good idea, much simpler than the previous implementation. I will implement it here and for mean too.

Finally, Tomorrow I can look into it in more detail, but I was thinking that may be, even for real numbers we don't have to match the kind of the input.

The idea was to get the same API as sum. So the result is of the same type as the real array, as in sum. This API also avoids internal conversion from, e.g., sp to dp, before calling sum (since it would have little sense to do it after sum operations IMO).
Also, having all results in dp, will not reduce the number of generated functions.

Indeed if the input is a large quantity of real(sp) it may be desirable that the variance (and mean) be calculated in double precision.

I think it is the responsability of the user to check that (as it is already the case with sum).

@jvdp1
Copy link
Member Author

jvdp1 commented Feb 7, 2020

From @fiolj comments, I push commits for:

  • simplifying the ieee_value call (as proposed by @fiolj)
  • implementing a corrected computation of variance of complex
  • implementing tests for complex (I tested them against Octave)

Regarding the implementation, I did e.g.,

      #:if t1[0] == 'r'
          res = sum((x - mean)**2) / (n - 1._${k1}$)
        #:else
          res = sum(abs(x - mean)**2) / (n - 1._${k1}$)
        #:endif

to avoid the abs function in the real var function (it is there useless and give some penalties to efficiency)

@fiolj could you confirm it was what you suggested, please?

@fiolj
Copy link
Contributor

fiolj commented Feb 8, 2020

Thanks @jvdp1, that was exactly what I suggested.
I think it looks good

@fiolj
Copy link
Contributor

fiolj commented Feb 8, 2020

This PR is fine. Thanks.

In the long term, I wonder if fypp can be improved so that there is not as much repetition. There are still 8 different blocks to declare the signature of the function (real/integer and (x, mask)/(x, dim, mask) and mask = scalar/array, all combinations 2x2x2=8). And even more long term, the Fortran language itself should be improved so that fypp is not needed.

I agree with @certik here. I was thinking along the same lines. A possible solution for real/int interfaces does not seem to difficult. Something like (for instance for the function mean):


    #:for k1, t1 in RCI_KINDS_TYPES
      #:for rank in RANKS
        #:set RName = rname("mean_all",rank, t1, k1)
        #:set ret_type = 'real(dp)' if t1[0] == 'i' else t1
        module function ${RName}$ (x, mask) result(res)
          ${t1}$, intent(in) :: x${ranksuffix(rank)}$
          logical, intent(in), optional :: mask
          ${ret_type}$ :: res
        end function ${RName}$
      #:endfor
    #:endfor

which is valid for all types (real, complex, int). For some simple functions we could write the implementation in some way that would be mostly the same also.

For functionvar, we could use something similar

    #:for k1, t1 in RCI_KINDS_TYPES
      #:for rank in RANKS
        #:set RName = rname("var_all",rank, t1, k1)
        #:set ret_type = 'real(dp)' if t1[0] == 'i' else "real({})".format(k1)
        module function ${RName}$(x, mask) result(res)
          ${t1}$, intent(in) :: x${ranksuffix(rank)}$
          logical, intent(in), optional :: mask
          ${ret_type}$ :: res
        end function ${RName}$
      #:endfor
    #:endfor

This would cut the code to a half, not solving the remaining four factor.

@jvdp1
Copy link
Member Author

jvdp1 commented Feb 9, 2020

Thank you for your review.

For functionvar, we could use something similar

    #:for k1, t1 in RCI_KINDS_TYPES
      #:for rank in RANKS
        #:set RName = rname("var_all",rank, t1, k1)
        #:set ret_type = 'real(dp)' if t1[0] == 'i' else "real({})".format(k1)
        module function ${RName}$(x, mask) result(res)
          ${t1}$, intent(in) :: x${ranksuffix(rank)}$
          logical, intent(in), optional :: mask
          ${ret_type}$ :: res
        end function ${RName}$
      #:endfor
    #:endfor

This would cut the code to a half, not solving the remaining four factor.

That will reduce the number of blocks, but it might give a fypp file that is difficult to read and to follow due to too many conditional statements (if/set; especially for the implementation where fypp conditions (if) will be needed, as already used for complex). We could probably implement some functions in common.fypp, but I am still afraid for the complexity of the code.

For now, I would suggest that we merge this PR with the master. Then we can open a PR to (try to) reduce the number of blocks in var and mean (but unfortunately, it will not reduce the number of generated functions).

@fiolj
Copy link
Contributor

fiolj commented Feb 10, 2020 via email

src/common.fypp Outdated Show resolved Hide resolved
src/common.fypp Outdated
#! E.g., (:, :, :, i, :, :)
#!

#:def rankindice(varname, varname1, origrank, dim)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably rename this macro to have a more descriptive name. If it is only used to select subarrays by reducing the dimension, we could have:

#:def select_subarray(origrank, selectors)
  #:assert origrank > 0
  #:set seldict = dict(selectors)
  #:call join_lines(joinstr=", ", prefix="(", suffix=")")
    #:for i in range(1, origrank + 1)
      $:seldict.get(i, ":")
    #:endfor
  #:endcall
#:enddef

and use it as

#!  -> x(:, i, :)
x${select_subarray(3, [(2, 'i')])}$

It could also be used, if we need to reduce more than one rank, e.g.

#!  -> x(:, :, i, j)
x${select_subarray(4, [(3, 'i'), (4, 'j')])}$

Also the description should be clarified a bit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented as suggested. The proposed macro is more general and better fit to its aim.
Could you have another review, please?

Copy link
Contributor

@nshaffer nshaffer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, too. I have only minor comments. It's a serviceable baseline implementation.

### Return value

If `array` is of type `real` or `complex`, the result is of the same type as `array`.
If `array` is of type `integer`, the result is of type `double precision`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will raise this issue elsewhere, but I do not agree with this API for the return type when the input is integer data. I only bring it up here because it is not quite correct to say that the return type is double precision, when in fact the type is real(real64). I'm not suggesting any changes now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nshaffer Thank you for your review.

I used double precision because they are declared as dp. But I agree they are actually real(real64). The issue with using real64 in the spec is that if the definition of dp
in stdlib_experimental_kinds changes (there has been already discussions on that), then we will need to modify the spec too.

Would it be better to write "..... the result is of type dp."?

real(${k1}$) :: n
${t1}$ :: mean

if (.not.optval(mask, .true.)) then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a weird idiom to me. Here, I'd prefer the more obvious

if (present(mask)) then
    if (mask .eqv. .false.) then

But this is a matter of style rather than substance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully none of both options will be needed in a future standard.

@milancurcic
Copy link
Member

I just revisited this PR. It looks like everybody approved it. We can revisit any outstanding minor issues in a later PR. Merging, thanks @jvdp1!.

@milancurcic milancurcic merged commit 7397e96 into fortran-lang:master Feb 18, 2020
@jvdp1 jvdp1 deleted the variance_dev branch February 18, 2020 07:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants