stanc3 checks data keywords and doesn't allow potential non-data to be passed in. #336

Closed
wds15 opened this issue Oct 14, 2019 · 52 comments

Comments

@wds15

wds15 commented Oct 14, 2019

Calling the integrate_ode functions inside other functions is prone to over-propagation from data to var for the data arguments. With the old compiler it was possible to sub-select smaller data structures from bigger ones, and these then stayed data (as they should). The new compiler does not do that. It does "stronger" over-propagation from data to var, which makes programs fail to compile with a semantic error.

Attached is a Stan program example which illustrates the problem.

The error message is:

Semantic error in '../stan_pkpd/oral_1cmt_mm/oral_1cmt_mm_run.stan', line 868, column 14 to column 99:
   -------------------------------------------------
   866:      theta_tilde[P+1] = lref[1];
   867:      lref_tilde[1] = lref[2];
   868:      int_sol = integrate_ode(pk_1cmt_mm_lode, lref_tilde, 0, to_array_1d(Dt), theta_tilde, x_r, x_i);//, 1e-4, 1e-4, 1000);
                       ^
   869:      ka = theta[1];
   870:      for(i in 1:num_elements(Dt)) {
   -------------------------------------------------

Ill-typed arguments supplied to function 'integrate_ode'. Available signatures:
((real, real[], real[], data real[], data int[]) => real[], real[], real, real[], real[], data real[], data int[]) => real[,]
Instead supplied arguments of incompatible type: (real, real[], real[], real[], int[]) => real[], real[], int, real[], real[], real[], int[].

I really think this should be fixed ... if possible before the 2.21 release.

stanc3-overpropagate-error.zip

@seantalts
Member

@VMatthijs any chance you can quickly see what's going on here? Do we need to turn on your optimization for optimal autodiff levels? We're trying to get the compiler released by Friday :)

@VMatthijs
Member

VMatthijs commented Oct 14, 2019 via email

@wds15
Author

wds15 commented Oct 14, 2019

So I should change my model and declare all data-only function arguments explicitly?

The code I shared worked all along. It was almost magic to have data being passed around and finally used in the integrate_ode calls, which live inside functions. A lot of sweat went into making this work, and it is unfortunate to see it broken now.

If it is now required to declare some specific arguments as data...then that is a solution, but it disqualifies at least my old programs.

@VMatthijs
Member

VMatthijs commented Oct 14, 2019 via email

@seantalts
Member

Yeah, now that I look at that model I'm not sure how Stan 2 knew that those arguments were data. I think the spec would say that you don't know it's data unless it's marked data, right? @bob-carpenter do you have a minute? Was stanc 2 doing some kind of fancy autodiff-level inference or was it just relying on C++ templates somehow?

@seantalts
Member

I'm going to close this for now as I think the behavior we have is adhering to the Stan spec. I'll add it to the list of changes since Stan 2.

@wds15
Author

wds15 commented Oct 14, 2019

I do disagree here... this is valid Stan code to my best knowledge. Could you please tell me how it is otherwise possible to pass data to function a which then subsets this and passes this to function b, please? In function b the subsetted data must still be data.

This has always been possible and it should by all means remain possible.

Let's have a take from @bob-carpenter on this, before we close.

@wds15 wds15 reopened this Oct 14, 2019
@wds15
Author

wds15 commented Oct 14, 2019

Ok... I can get stanc3 to accept the program, but now it generates invalid C++ code. I had to pour in a number of "data" qualifiers to get stanc3 to compile it to C++...that's OK.

However, I would like to get a working C++ obviously instead of a compiler error.

oral_1cmt_mm_run_stan.txt

EDIT: It appears to me that there is a major change here. In stan2 it was possible to pass data subsets to another function. That other function only needed to have a real[] argument and that was just fine. So the other function worked for passing in data OR vars. This is no longer possible. With stanc3 I have to declare the argument as data real[] in order to get the desired behavior for data subsets. Thus, the function can no longer be used with both data subsets and vars! That stan2 allowed data or vars was, to my knowledge, not by accident but by design. I think this should be allowed in stanc3, and to my understanding of the language it is supposed to be allowed.

@wds15
Author

wds15 commented Oct 14, 2019

Ok... so here is a minimal example which should give the key elements of the big example. So the following program is just fine under Stan 2 (I can compile to C++ and I can compile that C++ to binary):

functions {
  real foo(data real[] inputs, real num) {
    return sum(inputs) + num;
  }

  real bar(real[] inputs, vector more_inputs, real num, int day) {
    real res = 0;
    if (day == 1) {
      res = num + sum(inputs) + foo(to_array_1d(more_inputs[1:2]), num);
    } else {
      res = sum(inputs) + foo(to_array_1d(more_inputs[1:2]), num);
    }
    return res;
  }
}
data {
}
transformed data {
  real some_data[5] = {1., 2., 3., 4., 5.};
  vector[3] other_data = [1., 2., 3.]';
}
parameters {
  real theta;
}
model {
  real temp1 = foo(some_data, theta);
  real temp2 = bar(some_data, other_data, theta, 1);
  theta ~ normal(temp1, temp2);
}

This is legal Stan2 syntax, but the stanc3 compiler will propagate the to_array_1d(more_inputs[1:2]) expression to become a var, and thus it refuses to compile it and complains:


--- Translating Stan model to C++ code ---
bin/stanc  --o=../stanc3-examples/data-munging.hpp ../stanc3-examples/data-munging.stan

Semantic error in '../stanc3-examples/data-munging.stan', line 9, column 32 to column 71:
   -------------------------------------------------
     7:      real res = 0;
     8:      if (day == 1) {
     9:        res = num + sum(inputs) + foo(to_array_1d(more_inputs[1:2]), num);
                                         ^
    10:      } else {
    11:        res = sum(inputs) + foo(to_array_1d(more_inputs[1:2]), num);
   -------------------------------------------------

Ill-typed arguments supplied to function 'foo'. Available signatures:
(data real[], real) => real
Instead supplied arguments of incompatible type: real[], real.

However, it would be a huge limitation if this would be disallowed in the future. All I want is that sub-expressions formed from data-only things will stay data only. This was totally fine in the Stan2 world... I don't really see why this should change. I am also very sure that this is intended behavior.

@seantalts
Member

seantalts commented Oct 14, 2019

I don't understand - this still seems like a bug in Stan 2 if this worked there. Here you have an array filled with a parameter (theta gets passed in as num), so it seems pretty clear that it can't be marked data. If you fix the signature and pass in a transformed data variable for num instead of theta, it works fine.

@wds15
Author

wds15 commented Oct 15, 2019

Nope. This is clearly a bug in stanc3 not being able to deal with nested things... and here is the example which should illustrate the point:

functions {
  real foo(data real[] inputs, real num) {
    return sum(inputs) + num;
  }

  real bar(real[] inputs, vector more_inputs, real num, int day) {
    real res = 0;
    // comment out this line => and things work
    res = num + sum(inputs) + foo(to_array_1d(more_inputs[1:2]), num);
    return res;
  }
}
data {
}
transformed data {
  real some_data[5] = {1., 2., 3., 4., 5.};
  vector[3] other_data = [1., 2., 3.]';
  real derived_data_1 = foo(to_array_1d(other_data[1:2]), 3.0);
}
parameters {
  real theta;
}
transformed parameters {
  // here we pass to_array_1d(other_data[1:2]) to a data only
  // argument. So we form a subexpression which involves only data and
  // we can pass this to the data-only argument of foo.
  real derived_2 = foo(to_array_1d(other_data[1:2]), theta);
}
model {
  real temp1 = foo(some_data, theta);
  real temp2 = bar(some_data, other_data, theta, 1);
  theta ~ normal(temp1, temp2);
}

So in the transformed parameters it is legal to have a sub-expression which involves only data to be passed to a data only argument of foo. Thus, sub-expressions with only data should always output data.

But the same thing is NOT allowed at the moment inside bar. There the more_inputs argument is being instantiated with a data only argument. The sub-expression I am forming (same as in transformed parameters) is wrongly being cast into a var. This makes stanc3 reject the code as it cannot pass this into the first argument of foo.

This is a major problem for most ODE models in Stan. Whenever you call the ODE functions in Stan and want to act on subsetted data within functions, this is no longer possible.

The reason why this shows clearly that this is a stanc3 bug to me is that in transformed parameters things work as they should while the same is rejected if nested in a function.

@seantalts
Member

seantalts commented Oct 15, 2019

If there was a spec, I think it would say that you need to mark bar's more_inputs as data, right? In stanc3 we keep track of the autodiff level in the type system and check based on function labels. It would be much harder to perform this checking if we tried to keep track of all of the possible uses of the function instead of requiring the data annotation on bar more_inputs.

Are you saying there's some reason you can't use this signature for bar:

  real bar(real[] inputs, data vector more_inputs, real num, int day) {
    real res = 0;
    // comment out this line => and things work
    res = num + sum(inputs) + foo(to_array_1d(more_inputs[1:2]), num);
    return res;
  }

@wds15
Author

wds15 commented Oct 15, 2019

As it stands there is an inconsistency in stanc3, because this is allowed:

transformed parameters {
  // here we pass to_array_1d(other_data[1:2]) to a data only
  // argument. So we form a subexpression which involves only data and
  // we can pass this to the data-only argument of foo.
  real derived_2 = foo(to_array_1d(other_data[1:2]), theta);
}

so the sub-expression to_array_1d(other_data[1:2]) is data. The same thing does not work inside the function bar. (EDIT: ok, looks like the problem is the vector more_inputs argument declaration which turns it to var).

And no, I would not want to be forced to write bar with data vector more_inputs as argument. It's great to have data as a qualifier to say "this argument is only ever data, nothing else".... but in the past arguments were either data or var.

So take as example pharmacokinetic models. There I write models which track the dose given to patients over time. My functions will be used with dose being most of the time just data... However, sometimes I want to use the very same functions, but I scale the dose first with some parameter. Thus, the function should support to take in data or vars.
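
To make the use case concrete, here is a minimal sketch of such a function. The names total_dose, dose, and dose_scale are made up for illustration and are not taken from the attached model; the array and block syntax is the Stan 2 style used elsewhere in this thread. The point is only that one unqualified vector argument serves both a data call and a parameter-dependent call:

```stan
functions {
  // Hypothetical helper: no data qualifier on the argument, so it can be
  // called with data or with parameter-dependent values.
  real total_dose(vector dose) {
    return sum(dose);
  }
}
data {
  vector[3] dose;
}
parameters {
  real<lower=0> dose_scale;
}
model {
  // Argument is plain data.
  real fixed_total = total_dose(dose);
  // Argument depends on a parameter.
  real scaled_total = total_dose(dose_scale * dose);
  dose_scale ~ lognormal(0, 1);
  target += normal_lpdf(scaled_total | fixed_total, 1);
}
```

Marking the argument as data vector would make the second call illegal, which is the loss of flexibility described here.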

All I would like to see is that data stays as data as long as possible. Anything else would be bad for performance anyways (really bad).

I do not think that a spec would say that bar's more_input must be declared as data. Why? Data gets "degraded" to var if something is evaluated which involves a var, but as long as that is not the case why should I convert a data object into a var?

(BTW... changing the definition of bar to what you suggest leads to non-compilable C++ code)

So looking at this, it seems that we now have a data qualifier and everywhere where we do not have a data qualifier is qualified as NOT data - but that's not how it used to work. To my understanding the data qualifier is there to say this is data for sure, but the absence of data means make me data if you can and if not go with var.

@seantalts
Member

Well, it's not that it gets cast to var in the C++, it just gets typechecked as if that was possible, and only programs that we can (easily) prove don't pass var in will be allowed with the current checker.

I think for now the easiest thing is to try turning off this part of typechecking and see if that works for the 2.21 release. Maybe in future releases we add a more sophisticated version that actually tracks data as it flows around the program.

@bob-carpenter
Contributor

bob-carpenter commented Oct 15, 2019

Here's a minimal example:

functions {
  real foo(data real[] x) {
    return sum(x) + 1;
  }
}
data {
  vector[10] y;
}
model {
  real z = foo(to_array_1d(y[1:2]));
}

Because y is data, it's data. Because 1:2 involves only constants, y[1:2] is also data, and thus to_array_1d(y[1:2]) is also data.

What we have now walks over the expression and rejects if there are any parameter variables, transformed parameter variables, or local variables from the transformed parameter or model blocks.

In the future, we should be able to deal with this case of using a local variable in a parameter context:

model {
  real yslice[2] = to_array_1d(y[1:2]);
  real z = foo(yslice);
}

but this won't work now in the existing parser because yslice gets overpromoted to double.

@seantalts
Member

seantalts commented Oct 15, 2019

That all works fine in stanc3 - Sebastian doesn't like the case where, in your example, foo is defined without the data keyword, real foo(real[] x). Right now in Stan 2 that's allowed and it will compile if it is actually passed data, apparently.

@bob-carpenter
Contributor

Oh, sorry. Didn't mean to include that data declaration---it just jumped out of my fingers knowing the usage. This one still parses in the current compiler

functions {
  real foo(real[] x) {
    return sum(x) + 1;
  }
}
data {
  vector[10] y;
}
model {
  real y2[2] = to_array_1d(y[1:2]);
  real z = foo(y2);
}

@wds15
Author

wds15 commented Oct 15, 2019

@bob-carpenter just to make sure we are on the same page here. So the culprit is this code:

functions {
  real foo(data real[] inputs, real num) {
    return sum(inputs) + num;
  }

  real bar(real[] inputs, vector more_inputs, real num, int day) {
    real res = 0;
    // comment out this line => and things work
    res = num + sum(inputs) + foo(to_array_1d(more_inputs[1:2]), num);
    return res;
  }
}

With the stan2 this will be ok while stanc3 has trouble right now. The reason is that stan2 will allow me to pass into more_inputs a data argument and then the formed sub-expression to_array_1d(more_inputs[1:2]) will stay data such that foo will accept it.

The stanc3 compiler, on the other hand, treats more_inputs like a var for type checking, and as a result I am getting a semantic error.

So for stan2 we have:

  • everything with a data qualifier must be data and only data
  • everything without a data qualifier is data if that's possible and var otherwise

For stanc3 it's different:

  • everything with a data qualifier must be data and only data
  • everything without a data qualifier must not be data => it is treated as var immediately

(my toy example is constructed to nail the point, but basically we lose some flexibility with stanc3 in terms of data staying data in the parser as long as possible)

@bob-carpenter
Contributor

bob-carpenter commented Oct 15, 2019 via email

@seantalts
Member

Yeah, obviously there is no spec for this so it's easy to see why Matthijs interpreted the logic around the data keyword as being that anything passed in had to be marked as data. His system has the benefit that there is a Stan compiler error if something that can't be proven to be data is passed in. I actually thought that was the intended behavior as well behind the data keyword, because in this world I think it's a C++ compiler error if an autodiff variable is passed in, right?

In any case, it should hopefully be pretty simple to turn that feature off until it's sophisticated enough to deal with the expanded semantics. :) Seems like not too much is lost relying on the C++ error messages for errors of this type.

@wds15
Author

wds15 commented Oct 15, 2019

Uff... thanks @bob-carpenter ... I was relatively nervous when I saw this issue being closed as it was classified as "too lenient before".

It would not be so nice to have yet another data type, but we can discuss that another day.

@seantalts This is a super subtle thing which was likely not well documented, if at all. I am probably one of the people most likely to notice it, as I am running ODE integrators inside functions. This is not the bulk of models being run in Stan, but if we do not go with the stan2 behavior I would likely have to kill quite a few of my Stan programs. Thus, I am much relieved we agree that the stan2 behavior is correct and will be maintained going forward. I hope this is an easy fix for you.

@seantalts
Member

Bob, I actually think you may be misunderstanding here. Your example works in stanc3. We have implemented:

  • an expression is data-only if it does not contain any variables defined in the parameter or transformed parameter block or as a local variable in the model or transformed parameter block

in stanc3. I think you can pass arbitrary data expressions in to data-only functions.

What can't happen is nested behavior like in Sebastian's example.

I'm not convinced that this was intentional or optimal in Stan 2's implementation.

@wds15
Author

wds15 commented Oct 16, 2019

I would disagree - the Stan2 behavior treats user functions just like internal stan-math functions. In contrast, the current stanc3 behavior with overpropagation treats user-defined functions more restrictively than internal stan functions. I don't see how that makes sense.

Maybe I can restructure my Stan programs by pouring in a number of data qualifiers and then they work, but this is really a major change and most likely limits the utility of many of my functions. If we had overloading things could be better, not sure though.

In any case, we should not start with stanc3 breaking a lot of code. This would be a Stan3 type of change for me and I would actually suggest that we have a transition period where things like this get flagged loudly to the users so that they can adjust.

@seantalts
Member

We’re already fixing a number of other bugs from Stan 2 (and changing a few things): https://github.com/stan-dev/stanc3/wiki/changes-from-stanc2

I’m not sure if this one belongs on the list or not, but I am curious what Bob thinks about your example. I’m pretty sure he misunderstood the issue given the example he came up with to highlight the problem & how he described it.

@wds15
Author

wds15 commented Oct 16, 2019

No, I think @bob-carpenter understood. I see your point about his example not being on the spot - which is why I clarified things, and on that note he clearly confirmed this is a bug in stanc3.

@seantalts
Member

@bob-carpenter would you mind taking another look?

@bob-carpenter
Contributor

> Stan2 behavior treats user functions just like internal stan-math functions.

That's the key reasoning here. If you have an expression that only contains data, the result is data.

The way I thought about it, the keyword data forces the argument of a function to be data, and thus allows you to reason within a function body that a variable declared as data will be data only. That was necessary to allow integration calls within function bodies. Without the data keyword on the function argument declaration for x_r, the following will fail to compile:

real foo(real[] x_r) {
  ... integrate_ode( ... , x_r, ...) ...  // FAIL: can't infer `x_r` is data only
}

I missed a condition in the definition of what counted as data above---an expression is not data if it contains a function argument variable that is not marked with the data keyword.
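
A small sketch of how this rule plays out, assuming the stanc3 check behaves as described in this thread. The function names needs_data and f are hypothetical, and the rejected call is left commented out so the rest of the program still type checks:

```stan
functions {
  // Hypothetical function with a data-only argument.
  real needs_data(data real[] x) {
    return sum(x);
  }

  // d is marked data, v is not.
  real f(data vector d, vector v) {
    // Accepted: the expression only involves d, which is marked data.
    real a = needs_data(to_array_1d(d[1:2]));
    // Rejected under the rule above: v is a function argument without
    // the data keyword, so the expression does not count as data.
    // real b = needs_data(to_array_1d(v[1:2]));
    return a + sum(v);
  }
}
data {
  vector[3] y;
}
parameters {
  real theta;
}
model {
  // y comes from the data block, so y[1:2] counts as data here.
  real z = needs_data(to_array_1d(y[1:2]));
  // v may receive a parameter-dependent vector because it is unqualified.
  real w = f(y, rep_vector(theta, 3));
  theta ~ normal(z, 1);
}
```

The call to f shows why the commented-out line cannot be proven safe from the signature alone: nothing stops a caller from passing parameters as v.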

@seantalts
Member

Gotcha. So that's the behavior in stanc3 now - I think Stan 2 didn't check that condition, according to @wds15.

Do you think stanc3 should also not check for its initial release? Or is it better to change a bunch of this stuff all at once?

@bob-carpenter
Contributor

> ... His [@VMatthijs's] system has the benefit that there is a Stan compiler error if something that can't be proven to be data is passed in.

The existing system has that guarantee, too. It's easy to show by induction on the structure of an expression that if the only variables in an expression are primitives, then the result is a primitive.

That guarantee is based on the assumption that there are no Stan functions that automatically upconvert primitives to autodiff variables. I'm pinging @syclik because this is a general rule that math library functions should obey.

The trickier thing is map_rect and guarantees that data variables remain constant across iterations. Right now, we could do something like this:

parameters {
  real alpha;
}
model {
  int a = alpha > 0;  // converts parameter alpha
  int x_i[1] = { a };
  ... map_rect(...x_i...) ... // WARNING:  value of x_i changes by iteration!!!
}

The value of x_i is still primitive (it's int after all, so it has to be), but it can vary by iteration. This will require dataflow analysis to flag and is another argument for not branching on parameters or allowing primitives to depend on parameters as in the above assignment to an integer.

@bob-carpenter
Contributor

It should, but it doesn't. The following file translates to C++, but fails to compile:

functions {
  real[] ode(real time, real[] state, real[] theta,
           real[] x_r, int[] x_i) {
    return { 1.0 };
  }
  real[ , ] foo(real[] theta, real[] x_r) {
    return integrate_ode(ode, { 0.0 }, 0.0, { 1.0, 2.0 }, theta, x_r, { 1 });
  }
}
parameters {
  real theta[1];
  real alpha;
}
model {
  real y[1, 2] = foo(theta, { alpha });
}

@seantalts
Member

Right - whereas stanc3 throws this error:

Semantic error in 'bob.stan', line 7, column 11 to column 76:
   -------------------------------------------------
     5:    }
     6:    real[ , ] foo(real[] theta, real[] x_r) {
     7:      return integrate_ode(ode, { 0.0 }, 0.0, { 1.0, 2.0 }, theta, x_r, { 1 });
                    ^
     8:    }
     9:  }
   -------------------------------------------------

Ill-typed arguments supplied to function 'integrate_ode'. Available signatures: 
((real, real[], real[], data real[], data int[]) => real[], real[], real, real[], real[], data real[], data int[]) => real[,]
Instead supplied arguments of incompatible type: (real, real[], real[], real[], int[]) => real[], real[], real, real[], real[], real[], int[].

@wds15 's point is that enforcing that also breaks some of his programs where he relied on Stan 2 not making that check and would make him carefully track data annotations throughout his program, and that this would force users to create multiple functions if they wanted to use them both with parameters and with data-only arguments. I think a decent example here might be one of @wds15's slicing helper functions - would be useful in both contexts, and checking the data-only purity through function calls would force him to make two copies of that function, one for parameters and one for data (and it obviously gets worse with a larger number of arguments).
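
As a rough sketch of what that duplication could look like for the toy example from earlier in the thread: the names bar_data and bar_any are made up, bar_data uses the data-qualified signature suggested above, and bar_any is an assumed rewrite that avoids foo altogether so that it can accept parameters.

```stan
functions {
  real foo(data real[] inputs, real num) {
    return sum(inputs) + num;
  }

  // Data-only variant: more_inputs is marked data, so it may be
  // forwarded to foo's data-only argument.
  real bar_data(real[] inputs, data vector more_inputs, real num) {
    return num + sum(inputs) + foo(to_array_1d(more_inputs[1:2]), num);
  }

  // General variant: more_inputs may hold parameters, so the body
  // cannot call foo with it and repeats foo's arithmetic instead.
  real bar_any(real[] inputs, vector more_inputs, real num) {
    return num + sum(inputs) + sum(more_inputs[1:2]) + num;
  }
}
```

Both variants compute the same value. Only bar_data can route more_inputs into a data-only argument, and only bar_any can be called with a parameter-dependent vector.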

@bob-carpenter
Contributor

bob-carpenter commented Oct 16, 2019

I think that program should throw an error as stanc3 does because we don't want things not compiling in C++.

I hadn't realized anyone was relying on this bug and I see where it'll make some things more difficult for such users.

A compromise would be to throw a warning that there's an unguarded use of data that may throw C++ compiler errors if you don't call it correctly.

@seantalts
Member

@enetsee or @rybern would you have time to look at making this into a warning instead of an error before Friday afternoon? I'm guessing I won't be able to get to it given code gen bugs. Doesn't seem like a huge change but I'm not so familiar with semantic check.

@wds15
Author

wds15 commented Oct 16, 2019

This:

real foo(real[] x_r) {
  ... integrate_ode( ... , x_r, ...) ...  // FAIL: can't infer `x_r` is data only
}

has always been fine in Stan2 - and to my understanding rightfully so. It would otherwise never be possible to write Stan models where data is used during the ODE and those calls to the integrator are made within a function. By design we allowed for it, and now this is no longer allowed in stanc3 unless I use the data qualifier??

Even this:

real foo(vector xv) {
  ... integrate_ode( ... , to_array_1d(xv[1:2]), ...) ...  // FAIL: can't infer `x_r` is data only
}

was always just fine.

I do disagree with this being a bug all along in Stan2. It's a severe bug in stanc3 from my view. I am not even sure this should be a warning, really.

And as I say with "Stan2 behavior treats user functions just like internal stan-math functions." then this would now be changed with stanc3. The call to_array_1d(xv[1:2]) involves data only whenever xv is data and then it returns data.... but users would not be allowed to write such functions, because they have to nail down that the argument must only be data rather than allowing the argument to be either data or not - depending on what I call the function with.

Imagine you want to write a user function which transforms the input like

real foo(vector xv) {
  ... integrate_ode( ... , to_array_1d_and_trasform(xv[1:2]), ...) ...  // FAIL: can't infer `x_r` is data only
}

As stanc3 is designed, the user would have to write a to_array_1d_and_trasform function with a data qualifier. This would disallow using the exact same function with non-data. It doesn't make sense to me at all.

@seantalts
Member

> It's a severe bug in stanc3 from my view.

I don't think calling it a "severe bug" is helping :P It was obviously put in intentionally to adhere to a very reasonable spec, one that it sounds like was the one Bob intended when first adding the data keyword.

The problem here is that without checking the data keywords, we can create invalid C++ code in some cases, which is viewed as a bug in Stan any time it happens as far as I know.

@wds15
Author

wds15 commented Oct 16, 2019

Sorry for the "severe" here. It's legit Stan2 to me what I am looking for. (this definitely has a severe impact on a large codebase I have)

The stanc3 spec is inconsistent, as I keep saying, if this is really what you mean it to be. From my view that needs revision.

@bob-carpenter
Contributor

bob-carpenter commented Oct 16, 2019 via email

@bob-carpenter
Contributor

bob-carpenter commented Oct 16, 2019 via email

@wds15
Author

wds15 commented Oct 16, 2019

Maybe I will just have to rewrite my programs... ok, so be it. Still, where I am lost is how the primitive status is being propagated. There your logic doesn't make sense to me.

To get back to the example which is not allowed in stanc3, but ok in stan2:

functions {
  real foo(data real[] inputs, real num) {
    return sum(inputs) + num;
  }

  real bar(real[] inputs, vector more_inputs, real num, int day) {
    real res = 0;
    // comment out this line => and things work
    res = num + sum(inputs) + foo(to_array_1d(more_inputs[1:2]), num);
    return res;
  }
}
data {
}
transformed data {
  real some_data[5] = {1., 2., 3., 4., 5.};
  vector[3] other_data = [1., 2., 3.]';
  real derived_data_1 = foo(to_array_1d(other_data[1:2]), 3.0);
}
parameters {
  real theta;
}
transformed parameters {
  // here we pass to_array_1d(other_data[1:2]) to a data only
  // argument. So we form a subexpression which involves only data and
  // we can pass this to the data-only argument of foo.
  real derived_2 = foo(to_array_1d(other_data[1:2]), theta);
}
model {
  real temp1 = foo(some_data, theta);
  real temp2 = bar(some_data, other_data, theta, 1);
  theta ~ normal(temp1, temp2);
}

Here stanc3 complains about more_inputs not being data for the bar function - but the thing is that I am passing other_data into the function in the model block. The other_data is a primitive so everything should be fine, but it's not. Why is that not ok? other_data is known to be data and as such I should be able to pass it into bar as the more_inputs argument.

The validity check of which function calls which needs to wait until you actually see the users' calls and what they pass as arguments.

In case that is not possible, users should at least be able to declare overloads, I think.

... but ok... now I got two language gurus against me... so I will probably stay silent for the moment...

@seantalts
Member

> Still, where I am lost is how the primitive status is being propagated.

It's not being propagated through function boundaries - I think doing this perfectly in all cases is impossible, but I think we could eventually come up with a system that tracks it conservatively and would likely work in most of the cases you care about. That said, I'm not sure if that should be the spec or not... Like, the whole point of the data keyword is that you shouldn't be able to pass potential non-data in there. Otherwise it loses all meaning as far as I can tell and is equivalent to not annotating with data, right? Maybe I'm missing something. And from my perspective that could be an argument for removing the data keyword for Stan 3, if we figure out how to properly track autodiff level through function boundaries in most cases.

> Here stanc3 complains about more_inputs not being data for the bar function - but the thing is that I am passing other_data into the function in the model block. The other_data is a primitive so everything should be fine, but it's not. Why is that not ok? other_data is known to be data and as such I should be able to pass it into bar as the more_inputs argument.

I think this is covered by the above, but I think it's precisely because it's hard to track autodiff level through function boundaries that Bob introduced the data keyword.

Anyway, I would be happy to turn this into a warning for the initial release and long term either deprecate the entire data keyword (if I'm thinking about it right and it becomes useless once the tracking is implemented) or turn that warning into an error with Stan 3.

@seantalts
Member

seantalts commented Oct 16, 2019

(assuming someone has time for turning it into a warning before Friday and it's not a huge change).

@enetsee
Member

enetsee commented Oct 16, 2019

@seantalts I've had a quick look and this is happening in semantic check (specifically, semantic_check_fn_stan_math, I think).

The check against the stan math signature (stan_math_returntype) requires equality of ad-level and type but we could change this to accumulate warnings on mismatched ad-levels of arguments and return a custom sum type rather than an option e.g.

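(* Result of a semantic check that can also carry warnings *)
type ('a, 'err) check_result =
  | CheckOk of 'a                   (* check passed *)
  | CheckWarning of 'a * 'err list  (* passed, with accumulated warnings *)
  | CheckError of 'err list         (* check failed *)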

We would also have to modify the Validation applicative to issue warnings from semantic check and update the driver to handle the new return type.

I should be able to turn it around before Friday if you want me to take a run at it.

@bob-carpenter
Contributor

> Here stanc3 complains about more_inputs not being data for the bar function

I think that's correct. With just these:

real foo(data real[] inputs, real num) {
  return sum(inputs) + num;
}

real bar(real[] inputs, vector more_inputs, real num, int day) {
  real res = 0;
  // comment out this line => and things work
  res = num + sum(inputs) + foo(to_array_1d(more_inputs[1:2]), num);
  return res;
}

The problem is that as bar is written, it can be called with more_inputs being not data. I realize that's not the case in this program, but that's the basis of the type inference.
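The inference described above can be modelled with a toy checker. This is a sketch under simplifying assumptions, not stanc3 code: the `expr` type, the `level` function, and the rule that a call is autodiffable whenever any argument is are all illustrative. The key point is that the body of `bar` is checked only against its declared signature, never against its call sites.

```ocaml
(* Toy model of ad-level inference inside a function body (hypothetical). *)
type adlevel = DataOnly | AutoDiffable

type expr =
  | Lit of float
  | Var of string
  | Call of string * expr list (* e.g. to_array_1d(...) or a slice *)

(* env records the ad-level each argument was DECLARED with.
   Unknown variables are conservatively treated as autodiffable. *)
let rec level env = function
  | Lit _ -> DataOnly
  | Var x -> (
      match List.assoc_opt x env with Some l -> l | None -> AutoDiffable)
  | Call (_, args) ->
      if List.exists (fun a -> level env a = AutoDiffable) args then
        AutoDiffable
      else DataOnly

(* to_array_1d(more_inputs[1:2]), the first argument passed to foo *)
let foo_arg = Call ("to_array_1d", [ Call ("slice", [ Var "more_inputs" ]) ])

(* bar as written: more_inputs carries no data qualifier *)
let bar_env = [ ("inputs", AutoDiffable); ("more_inputs", AutoDiffable) ]

(* bar with more_inputs declared as `data vector more_inputs` *)
let bar_env_data = [ ("inputs", AutoDiffable); ("more_inputs", DataOnly) ]

let level_as_written = level bar_env foo_arg
let level_with_data = level bar_env_data foo_arg
```

Under this model `level_as_written` is `AutoDiffable`, so the data-only slot of `foo` rejects it, while `level_with_data` is `DataOnly` and passes. That suggests one workaround in the Stan program itself: add the data qualifier to `more_inputs` in the signature of `bar`.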

@bob-carpenter
Contributor

And from my perspective that could be an argument for removing the data keyword for Stan 3, if we figure out how to properly track autodiff level through function boundaries in most cases.

That would work. The only reason it was added was to allow data-only type inference within function bodies using function arguments. This may be a problem with exporting functions, making library functions, etc.

@seantalts
Member

@wds15 would changing it just for application to stan math functions work for you? I know all of your examples involved nested user-defined functions as well, but not sure if that was the important part.

@wds15
Author

wds15 commented Oct 16, 2019

@seantalts So you mean that these restrictions would only be applied to things like the x_r argument of integrate_ode, but not to user defined functions? Yes, that would work, of course.

@seantalts
Member

I think @enetsee was looking at the code to do the opposite of that: loosen the restrictions on Stan math functions like integrate_ode but keep the ones on UDFs. So this would essentially mean that if you have no data qualifiers anywhere between you and integrate_ode, it should just let it flow and let the C++ compiler sort it out.

@wds15
Author

wds15 commented Oct 16, 2019

No... the UDFs need to be loosened for now. A warning from stanc3 would probably be good... I am fine with letting C++ sort it out, but that is probably not a good permanent solution.

@enetsee
Member

enetsee commented Oct 16, 2019

I've put the machinery in place to support warnings in semantic check but haven't started on the concrete changes. Am I aiming at UDFs only? I think I can do both if that's what we want.

@wds15
Author

wds15 commented Oct 16, 2019

UDFs are sufficient, I think.

@enetsee
Member

enetsee commented Oct 17, 2019

Ok, I think I have it working for returning and side-effecting UDFs.

This is the diff on the test output. @seantalts we may want to add some more test cases?

File "test/integration/bad/stanc.expected", line 1316, characters 0-1:
| $ ../../../../install/default/bin/stanc functions-bad20.stan
|
|Semantic error in 'functions-bad20.stan', line 4, column 2 to line 6, column 3:
| -------------------------------------------------
| 2: real my_fun3(real x);
| 3:
| 4: real my_fun3(data real x) {
| ^
| 5: return 2 * x;
| 6: }
| -------------------------------------------------
|
|Function 'my_fun3' has already been declared to have type (real) => real
|
| $ ../../../../install/default/bin/stanc functions-bad21.stan
|
-|Semantic error in 'functions-bad21.stan', line 13, column 11 to column 21:
+|Warning: Semantic warning in 'functions-bad21.stan', line 13, column 11 to column 21:
| -------------------------------------------------
| 11: }
| 12: model {
| 13: real z = my_fun3(y);
| ^
| 14: y ~ normal(0,1);
| 15: }
| -------------------------------------------------
|
-|Ill-typed arguments supplied to function 'my_fun3'. Available signatures:
-|(data real) => real
-|Instead supplied arguments of incompatible type: real.
+|Warning: Argument to 'my_fun3' has an incompatible autodiff level.
|
| $ ../../../../install/default/bin/stanc functions-bad22-ode.stan
|
|Semantic error in 'functions-bad22-ode.stan', line 15, column 11 to column 82:
| -------------------------------------------------
| 13: real[,] do_integration_nested(real[] y0, real t0, data real[] ts, real[] theta, matrix xmat_r) {
| 14: int x_i[0];
| 15: return(integrate_ode_rk45(sho, y0, t0, ts, theta, to_array_1d(xmat_r[1]), x_i));
| ^
| 16: }
| 17:
| -------------------------------------------------
|
|Ill-typed arguments supplied to function 'integrate_ode_rk45'. Available signatures:
|((real, real[], real[], data real[], data int[]) => real[], real[], real, real[], real[], data real[], data int[]) => real[,]
|((real, real[], real[], data real[], data int[]) => real[], real[], real, real[], real[], data real[], data int[], data real, data real, data real) => real[,]

@bob-carpenter
Contributor

bob-carpenter commented Oct 17, 2019 via email

@seantalts seantalts changed the title stanc3 over propagates to var stanc3 checks data keywords and doesn't allow potential non-data to be passed in. Oct 18, 2019
@nhuurre nhuurre closed this as completed Jun 22, 2020

6 participants