stanc3 checks `data` keywords and doesn't allow potential non-data to be passed in. #336

wds15 · 2019-10-14T08:46:12Z

When calling the integrate_ode functions inside other functions, then this is prone to over-propagation from data to var for the data arguments. With the old compiler it was possible to sub-select from bigger data structures smaller ones and these then still stay as data (as it should be). The new compiler does not do that. The new compiler does "stronger" over-propagation from data to var which makes the programs fail to compile due to a syntax error.

Attached is a Stan program example which illustrates the problem.

The error message is:

Semantic error in '../stan_pkpd/oral_1cmt_mm/oral_1cmt_mm_run.stan', line 868, column 14 to column 99:
   -------------------------------------------------
   866:      theta_tilde[P+1] = lref[1];
   867:      lref_tilde[1] = lref[2];
   868:      int_sol = integrate_ode(pk_1cmt_mm_lode, lref_tilde, 0, to_array_1d(Dt), theta_tilde, x_r, x_i);//, 1e-4, 1e-4, 1000);
                       ^
   869:      ka = theta[1];
   870:      for(i in 1:num_elements(Dt)) {
   -------------------------------------------------

Ill-typed arguments supplied to function 'integrate_ode'. Available signatures:
((real, real[], real[], data real[], data int[]) => real[], real[], real, real[], real[], data real[], data int[]) => real[,]
Instead supplied arguments of incompatible type: (real, real[], real[], real[], int[]) => real[], real[], int, real[], real[], real[], int[].

I really think this should be fixed ... if possible before the 2.21 release.

stanc3-overpropagate-error.zip

The text was updated successfully, but these errors were encountered:

seantalts · 2019-10-14T12:12:12Z

@VMatthijs any chance you can quickly see what's going on here? Do we need to turn on your optimization for optimal autodiff levels? We're trying to get the compiler released by Friday :)

VMatthijs · 2019-10-14T12:16:57Z

I think what's going on may be that stanc2 was too lenient in my mind with enforcement of data-only arguments. In stanc3, you need to annotate arguments explicitly as data-only. Otherwise they will get promoted to autodiff. Sorry to not be more helpful at the moment! My teaching is very busy and I am submitting a paper in a week. Time will free up again at the start of November.

…

On Mon, 14 Oct 2019 at 14:12, seantalts ***@***.***> wrote: @VMatthijs <https://github.com/VMatthijs> any chance you can quickly see what's going on here? Do we need to turn on your optimization for optimal autodiff levels? We're trying to get the compiler released by Friday :) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#336?email_source=notifications&email_token=ADIVJE66KQ4I6NIDBYURVFDQOROZ5A5CNFSM4JAMHDZ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEBEMQ3I#issuecomment-541640813>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADIVJE7E5QNCQZCOM3BQ2DTQOROZ5ANCNFSM4JAMHDZQ> .

wds15 · 2019-10-14T12:27:06Z

So I should change my model and declare all data-only function arguments explicitly?

The code I shared worked all along. It was almost magic to have data being passed around and finally being called in the integrate_ode calls which lives inside functions. There is a lot of sweat to make this work and unfortunate to see this being broken now.

If it is now required to declare some specific arguments as data...then that is a solution, but it disqualifies at least my old programs.

VMatthijs · 2019-10-14T12:36:28Z

You can also try to get the compiler to infer whether something can be safely assumed be data without you writing it explicitly. I wrote a static analysis that does that, but it operates on the MIR rather than AST, while these programs are being rejected before they even reach the MIR. It's a basic information-flow/access control analysis so it's nothing fancy and it could also be achieved on the AST by implementing some form of type inference. But that's not going to be done by Friday. At that point, you should probably ask yourself whether you'd generally like full type inference in Stan or if there is some reason to only limit the type inference to the data vs autodiff levels.

…

On Mon, 14 Oct 2019 at 14:27, wds15 ***@***.***> wrote: So I should change my model and declare all data-only function arguments explicitly? The code I shared worked all along. It was almost magic to have data being passed around and finally being called in the integrate_ode calls which lives inside functions. There is a lot of sweat to make this work and unfortunate to see this being broken now. If it is now required to declare some specific arguments as data...then that is a solution, but it disqualifies at least my old programs. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#336?email_source=notifications&email_token=ADIVJE4OS2ERVGIO2CRRU3LQORQRXA5CNFSM4JAMHDZ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEBEOH6Q#issuecomment-541647866>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADIVJEZST7GJMI4AREN2YRLQORQRXANCNFSM4JAMHDZQ> .

seantalts · 2019-10-14T12:36:56Z

Yeah, now that I look at that model I'm not sure how Stan 2 knew that those arguments were data. I think the spec would say that you don't know it's data unless it's marked data, right? @bob-carpenter do you have a minute? Was stanc 2 doing some kind of fancy autodiff-level inference or was it just relying on C++ templates somehow?

seantalts · 2019-10-14T15:08:36Z

I'm going to close this for now as I think the behavior we have is adhering to the Stan spec. I'll add it to the list of changes since Stan 2.

wds15 · 2019-10-14T15:49:27Z

I do disagree here... this is valid Stan code to my best knowledge. Could you please tell me how it is otherwise possible to pass data to function a which then subsets this and passes this to function b, please? In function b the subsetted data must still be data.

This has always been possible and it should by all means remain possible.

Let's have a take from @bob-carpenter on this, before we close.

wds15 · 2019-10-14T16:01:19Z

Ok... I can get stanc3 to accept the program, but now it generates invalid C++ code. I had to pour in a number of "data" qualifiers to get stanc3 to compile it to C++...that's OK.

However, I would like to get a working C++ obviously instead of a compiler error.

oral_1cmt_mm_run_stan.txt

EDIT: It appears to me that there is a major change here. In stan2 it was possible to pass data subsets to another function. That other function then was only needed to have a real[] argument and that was just fine. So the other function then worked for passing in data OR vars. This is no longer possible now. With stanc3 I have to declare the argument as data real[] in order to get the desired behavior for data subsets. Thus, the function cannot be used any more with data subsets or vars! The possibility of stan2 to allow for data or vars was to my knowledge not by accident, but rather by design. I think that should be allowed in stanc3 and it is supposed to be allowed to my understanding of the language.

wds15 · 2019-10-14T19:51:11Z

Ok... so here is a minimal example which should give the key elements of the big example. So the following program is just fine under Stan 2 (I can compile to C++ and I can compile that C++ to binary):

functions {
  real foo(data real[] inputs, real num) {
    return sum(inputs) + num;
  }

  real bar(real[] inputs, vector more_inputs, real num, int day) {
    real res = 0;
    if (day == 1) {
      res = num + sum(inputs) + foo(to_array_1d(more_inputs[1:2]), num);
    } else {
      res = sum(inputs) + foo(to_array_1d(more_inputs[1:2]), num);
    }
    return res;
  }
}
data {
}
transformed data {
  real some_data[5] = {1., 2., 3., 4., 5.};
  vector[3] other_data = [1., 2., 3.]';
}
parameters {
  real theta;
}
model {
  real temp1 = foo(some_data, theta);
  real temp2 = bar(some_data, other_data, theta, 1);
  theta ~ normal(temp1, temp2);
}

This is legal Stan2 syntax, but the stanc3 compiler will propagate the to_array_1s(more_data[1:2]) expression to become a var and thus it will refuse to compile it and it complains as:


--- Translating Stan model to C++ code ---
bin/stanc  --o=../stanc3-examples/data-munging.hpp ../stanc3-examples/data-munging.stan

Semantic error in '../stanc3-examples/data-munging.stan', line 9, column 32 to column 71:
   -------------------------------------------------
     7:      real res = 0;
     8:      if (day == 1) {
     9:        res = num + sum(inputs) + foo(to_array_1d(more_inputs[1:2]), num);
                                         ^
    10:      } else {
    11:        res = sum(inputs) + foo(to_array_1d(more_inputs[1:2]), num);
   -------------------------------------------------

Ill-typed arguments supplied to function 'foo'. Available signatures:
(data real[], real) => real
Instead supplied arguments of incompatible type: real[], real.

However, it would be a huge limitation if this would be disallowed in the future. All I want is that sub-expressions formed from data-only things will stay data only. This was totally fine in the Stan2 world... I don't really see why this should change. I am also very sure that this is intended behavior.

seantalts · 2019-10-14T22:48:51Z

I don't understand - this still seems like a bug in Stan 2 if this worked there. Here you have an array filled with a parameter (theta gets passed in as num), so it seems pretty clear that it can't be marked data. If you fix the signature and pass in a transformed data variable for num instead of theta, it works fine.

wds15 · 2019-10-15T06:55:27Z

Nope. This is clearly a bug in stanc3 not being able to deal with nested things... and here is the example which should illustrate the point:

functions {
  real foo(data real[] inputs, real num) {
    return sum(inputs) + num;
  }

  real bar(real[] inputs, vector more_inputs, real num, int day) {
    real res = 0;
    // comment out this line => and things work
    res = num + sum(inputs) + foo(to_array_1d(more_inputs[1:2]), num);
    return res;
  }
}
data {
}
transformed data {
  real some_data[5] = {1., 2., 3., 4., 5.};
  vector[3] other_data = [1., 2., 3.]';
  real derived_data_1 = foo(to_array_1d(other_data[1:2]), 3.0);
}
parameters {
  real theta;
}
transformed parameters {
  // here we pass to_array_1d(other_data[1:2]) to a data only
  // argument. So we form a subexpression which involves only data and
  // we can pass this to the data-only argument of foo.
  real derived_2 = foo(to_array_1d(other_data[1:2]), theta);
}
model {
  real temp1 = foo(some_data, theta);
  real temp2 = bar(some_data, other_data, theta, 1);
  theta ~ normal(temp1, temp2);
}

So in the transformed parameters it is legal to have a sub-expression which involves only data to be passed to a data only argument of foo. Thus, sub-expressions with only data should always output data.

But the same thing is NOT allowed at the moment inside bar. There the more_inputs argument is being instantiated with a data only argument. The sub-expression I am forming (same as in transformed parameters) is wrongly being cast into a var. This makes stanc3 reject the code as it cannot pass this into the first argument of foo.

This is a major problem for most ODE models in Stan. Whenever you call the ODE functions in Stan and want to act on subsetted data within functions, then this is not anymore possible.

The reason why this shows clearly that this is a stanc3 bug to me is that in transformed parameters things work as they should while the same is rejected if nested in a function.

seantalts · 2019-10-15T12:34:02Z

If there was a spec, I think it would say that you need to mark bar's more_inputs as data, right? In stanc3 we keep track of the autodiff level in the type system and check based on function labels. It would be much harder to perform this checking if we tried to keep track of all of the possible uses of the function instead of requiring the data annotation on bar more_inputs.

Are you saying there's some reason you can't use this signature for bar:

  real bar(real[] inputs, data vector more_inputs, real num, int day) {
    real res = 0;
    // comment out this line => and things work
    res = num + sum(inputs) + foo(to_array_1d(more_inputs[1:2]), num);
    return res;
  }

wds15 · 2019-10-15T13:06:05Z

As it stands there is an inconsistency in stanc3, because this is allowed:

transformed parameters {
  // here we pass to_array_1d(other_data[1:2]) to a data only
  // argument. So we form a subexpression which involves only data and
  // we can pass this to the data-only argument of foo.
  real derived_2 = foo(to_array_1d(other_data[1:2]), theta);
}

so the sub-expression to_array_1d(other_data[1:2]) is data. The same thing does not work inside the function bar. (EDIT: ok, looks like the problem is the vector more_inputs argument declaration which turns it to var).

And no, I would not want to be forced to write bar with data vector more_inputs as argument. It's great to have data as a qualifier to say "this argument is only ever data, nothing else".... but in the past arguments were either data or var.

So take as example pharmacokinetic models. There I write models which track the dose given to patients over time. My functions will be used with dose being most of the time just data... However, sometimes I want to use the very same functions, but I scale the dose first with some parameter. Thus, the function should support to take in data or vars.

All I would like to see is that data stays as data as long as possible. Anything else would be bad for performance anyways (really bad).

I do not think that a spec would say that bar's more_input must be declared as data. Why? Data gets "degraded" to var if something is evaluated which involves a var, but as long as that is not the case why should I convert a data object into a var?

(BTW... changing the definition of bar to what you suggest leads to non-compilable C++ code)

So looking at this, it seems that we now have a data qualifier and everywhere where we do not have a data qualifier is qualified as NOT data - but that's not how it used to work. To my understanding the data qualifier is there to say this is data for sure, but the absence of data means make me data if you can and if not go with var.

seantalts · 2019-10-15T13:13:23Z

Well, it's not that it gets cast to var in the C++, it just gets typechecked as if that was possible, and only programs that we can (easily) prove don't pass var in will be allowed with the current checker.

I think for now the easiest thing is to try turning off this part of typechecking and see if that works for the 2.21 release. Maybe in future releases we add a more sophisticated version that actually tracks data as it flows around the program.

bob-carpenter · 2019-10-15T16:49:17Z

Here's a minimal example:

functions {
  real foo(data real[] x) {
    return sum(x) + 1;
  }
}
data {
  vector[10] y;
}
model {
  real z = foo(to_array_1d(y[1:2]));
}

Because y is data, it's data. Because 1:2 involves only constants, y[1:2] is also data, and thus to_array_1d(y[1:2]) is also data.

What we have now walks over the expression and rejects if there are any parameter variables, transformed parameter variables, or local variables from the transformed parameter or model blocks.

In the future, we should be able to deal with this case of using a local variable in a parameter context:

model {
  real yslice[2]  = to_array_1d(y[1:2]));
  real z = foo(yslice);
}

but this won't work now in the existing parser because yslice gets overpromoted to double.

seantalts · 2019-10-15T17:10:12Z

That all works fine in stanc3 - Sebastian doesn't like the case where, in your example, foo is defined without the data keyword, real foo(real[] x). Right now in Stan 2 that's allowed and it will compile if it is actually passed data, apparently.

bob-carpenter · 2019-10-15T17:32:32Z

Oh, sorry. Didn't mean to include that data declaration---it just jumped out of my fingers knowing the usage. This one still parses in the current compiler

functions {
  real foo(real[] x) {
    return sum(x) + 1;
  }
}
data {
  vector[10] y;
}
model {
  real y2[2] = to_array_1d(y[1:2]);
  real z = foo(y2);
}

wds15 · 2019-10-15T17:48:14Z

@bob-carpenter just to make sure we are on the same page here. So the culprit is this code:

functions {
  real foo(data real[] inputs, real num) {
    return sum(inputs) + num;
  }

  real bar(real[] inputs, vector more_inputs, real num, int day) {
    real res = 0;
    // comment out this line => and things work
    res = num + sum(inputs) + foo(to_array_1d(more_inputs[1:2]), num);
    return res;
  }
}

With the stan2 this will be ok while stanc3 has trouble right now. The reason is that stan2 will allow me to pass into more_inputs a data argument and then the formed sub-expression to_array_1d(more_inputs[1:2]) will stay data such that foo will accept it.

The stanc3 compiler on the other hand will cast for the type checking more_inputs to behave like a var and as a result I am getting a syntax error.

So for stan2 we have:

everything with a data qualifier must be data and only data
everything without a data qualifier is data if thats possible and var otherwise

For stanc3 its different:

everything with a data qualifier must be data and only data
everything without a data qualifier must not be data => it is treated as var immediately

(my toy example is constructed to nail the point, but basically we loose some flexibility with stanc3 in terms of data staying data in the parser as long as possible)

bob-carpenter · 2019-10-15T18:33:59Z

Exactly---it's a bug in stanc3.

On Oct 15, 2019, at 1:48 PM, wds15 ***@***.***> wrote: @bob-carpenter just to make sure we are on the same page here.

We are. See below.

So for stan2 we have: • everything with a data qualifier must be data and only data • everything without a data qualifier is data if thats possible and var otherwise

I'd formulate as: * a function argument that is marked data must have a data-only argument * an expression is data-only if it does not contain any variables defined in the parameter or transformed parameter block or as a local variable in the model or transformed parameter block This was designed for ODEs because we need to calculate sensitivities to non-data variables. It was not designed for map_rect, where we want instead a notion of the data not changing. But I think it may have the same consequence right now because of the way we promote. The case I worry about in the future is: parameter { real theta; model { real x = theta > 0 ? 1 : 0; // x is type double map_rect(... x ...) If we were more aggressive with type inference, x would be declared as a double locally. But in this case, even though x is a double value, it depends on the value of a parameter theta. So we wouldn't want x going into map_rect data because its value can change every iteration. Maybe we need a second kind of data argument for data that might change, but doesn't need derivatives.

For stanc3 its different: • everything with a data qualifier must be data and only data • everything without a data qualifier must not be data => it is treated as var immediately

That second bullet point is where the bug comes from. I think this is too big a backward-compatibility breaking issue not to fix.

seantalts · 2019-10-15T19:07:49Z

Yeah, obviously there is no spec for this so it's easy to see why Matthijs interpreted the logic around the data keyword as being that anything passed in had to be marked as data. His system has the benefit that there is a Stan compiler error if something that can't be proven to be data is passed in. I actually thought that was the intended behavior as well behind the data keyword, because in this world I think it's a C++ compiler error if an autodiff variable is passed in, right?

In any case, it should hopefully be pretty simple to turn that feature off until it's sophisticated enough to deal with the expanded semantics. :) Seems like not too much is lost relying on the C++ error messages for errors of this type.

wds15 · 2019-10-15T19:25:58Z

Uff... thanks @bob-carpenter ... I was relatively nervous when I saw this issue being closed as it was classified as "too lenient before".

It would not be so nice to have yet another data type, but we can discuss that another day.

@seantalts This is a super subtle thing which was likely not well documented if at all. I am probably one of the persons who notices it most likely as I am running ODE integrators inside functions. This is not the bulk of models being run in Stan, but if we do not go with the stan2 behavior I would likely have to kill quite a few of my Stan programs. Thus, I am much relieved we agree on that the stan2 behavior is correct and will be maintained going forward. I hope this is an easy fix for you.

seantalts · 2019-10-15T20:43:27Z

Bob, I actually think you may be misunderstanding here. Your example works in stanc3. We have implemented:

an expression is data-only if it does not contain any variables defined in the parameter or transformed parameter block or as a local variable in the model or transformed parameter block

in stanc3. I think you can pass arbitrary data expressions in to data-only functions.

What can't happen is nested behavior like in Sebastian's example.

I'm not convinced that this was intentional or optimal in Stan 2's implementation.

wds15 · 2019-10-16T06:04:11Z

I would disagree - the Stan2 behavior treats user functions just like internal stan-math functions. In contrast, the current stanc3 behavior with overpropagation treats user defined functions more restrictive than internal stan functions. I don't see how that makes sense.

Maybe I can restructure my Stan programs by pouring in a number of data qualifiers and then they work, but this is really a major change and most likely limits the utility of many of my functions. If we had overloading things could be better, not sure though.

In any case, we should not start with stanc3 breaking a lot of code. This would be a Stan3 type of change for me and I would actually suggest that we have a transition period where things like this get flagged loudly to the users so that they can adjust.

seantalts · 2019-10-16T11:33:16Z

We’re already fixing a number of other bugs from Stan 2 (and changing a few things ): https://github.com/stan-dev/stanc3/wiki/changes-from-stanc2

I’m not sure if this one belongs on the list or not, but I am curious what Bob thinks about your example. I’m pretty sure he misunderstood the issue given the example he came up with to highlight the problem & how he described it.

wds15 · 2019-10-16T11:39:33Z

No, I think @bob-carpenter understood. I see your point about his example not being on the spot - which is why I clarified things and on that note he clearly confirmed this is a bug in in stanc3.

seantalts · 2019-10-16T12:44:31Z

@bob-carpenter would you mind taking another look?

bob-carpenter · 2019-10-16T12:49:03Z

Stan2 behavior treats user functions just like internal stan-math functions.

That's the key reasoning here. If you have an expression that only contains data, the result is data.

The way I thought about it, the keyword data forces the argument of a function to be data, and thus allows you to reason within a function body that a variable declared as data will be data only. That was necessary to allow integration calls within function bodies. Without the data keyword on the function argument declaratio for x_r, the following will fail to compile:

real foo(real[] x_r) {
  ... integrate_ode( ... , x_r, ...) ...  // FAIL: can't infer `x_r` is data only
}

I missed a condition in the definition of what counted as data above---an expression is not data if it contains a function argument variable that is not marked with the data keyword.

seantalts · 2019-10-16T12:52:13Z

Gotcha. So that's the behavior in stanc3 now - I think Stan 2 didn't check that condition, according to @wds15.

Do you think stanc3 should also not check for its initial release? Or is it better to change a bunch of this stuff all at once?

bob-carpenter · 2019-10-16T13:00:45Z

... His [@VMatthijs's] system has the benefit that there is a Stan compiler error if something that can't be proven to be data is passed in.

The existing system has that guarantee, too. It's easy to show by induction on the structure of an expression that if the only variables in an expression are primitives, then the result is a primitive.

That guarantee is based on the assumption that there are no Stan functions that automaticaly upconvert primitives to autodiff variables. I'm pinging @syclik because this is a general rule that math library functions should obey.

The trickier thing is map_rect and guarantees that data variables remain constant across iterations. Right now, we could do something like this:

parameters {
  real alpha;
model {
  int a = alpha > 0;  // converts parameter theta
  int x_i[1] = { a };
  ... map_rect(...x_i...) ... // WARNING:  value of x_r changes by iteration!!!

The value of x_i is still primitive (it's int after all, so it has to be), but it can vary by iteration. This will require dataflow analysis to flag and is another argument for not branching on parameters or allowing primitives to depend on parameters as in the above assignment to an integer.

bob-carpenter · 2019-10-16T13:06:27Z

It should, but it doesn't. The following file translates to C++, but fails to compile:

functions {
  real[] ode(real time, real[] state, real[] theta,
           real[] x_r, int[] x_i) {
    return { 1.0 };
  }
  real[ , ] foo(real[] theta, real[] x_r) {
    return integrate_ode(ode, { 0.0 }, 0.0, { 1.0, 2.0 }, theta, x_r, { 1 });
  }
}
parameters {
  real theta[1];
  real alpha;
}
model {
  real y[1, 2] = foo(theta, { alpha });
}

seantalts · 2019-10-16T13:11:29Z

Right - whereas stanc3 throws this error:

Semantic error in 'bob.stan', line 7, column 11 to column 76:
   -------------------------------------------------
     5:    }
     6:    real[ , ] foo(real[] theta, real[] x_r) {
     7:      return integrate_ode(ode, { 0.0 }, 0.0, { 1.0, 2.0 }, theta, x_r, { 1 });
                    ^
     8:    }
     9:  }
   -------------------------------------------------

Ill-typed arguments supplied to function 'integrate_ode'. Available signatures: 
((real, real[], real[], data real[], data int[]) => real[], real[], real, real[], real[], data real[], data int[]) => real[,]
Instead supplied arguments of incompatible type: (real, real[], real[], real[], int[]) => real[], real[], real, real[], real[], real[], int[].

@wds15 's point is that enforcing that also breaks some of his programs where he relied on Stan 2 not making that check and would make him carefully track data annotations throughout his program, and that this would force users to create multiple functions if they wanted to use them both with parameters and with data-only arguments. I think a decent example here might be one of @wds15's slicing helper functions - would be useful in both contexts, and checking the data-only purity through function calls would force him to make two copies of that function, one for parameters and one for data (and it obviously gets worse with a larger number of arguments).

bob-carpenter · 2019-10-16T13:15:47Z

I think that program should throw an error as stanc3 does because we don't want things not compiling in C++.

I hadn't realized anyone was relying on this bug and I see where it'll make some things more difficult for such users.

A compromise would be to throw a warning that there's an unguarded use of data that may throw C++ compiler errors if you don't call it correctly.

seantalts · 2019-10-16T13:23:30Z

@enetsee or @rybern would you have time to look at making this into a warning instead of an error before Friday afternoon? I'm guessing I won't be able to get to it given code gen bugs. Doesn't seem like a huge change but I'm not so familiar with semantic check.

wds15 · 2019-10-16T13:35:15Z

This:

real foo(real[] x_r) {
  ... integrate_ode( ... , x_r, ...) ...  // FAIL: can't infer `x_r` is data only
}

has always been fine in Stan2 - and to my understanding rightfully so. It would otherwise never being possible to write Stan models where data is used during the ODE and those calls to the integrator are made within a function. By design we allowed for it and now this is not anymore allowed in stanc3 unless I use the data argument??

Even this:

real foo(vector xv) {
  ... integrate_ode( ... , to_array_1d(xv[1:2]), ...) ...  // FAIL: can't infer `x_r` is data only
}

was always just fine.

I do disagree with this being a bug all along in Stan2. It's a severe bug in stanc3 from my view. I am not even sure this should be a warning, really.

And as I say with "Stan2 behavior treats user functions just like internal stan-math functions." then this would now be changed with stanc3. The call to_array_1d(xv[1:2]) involves data only whenever xv is data and then it returns data.... but users would not be allowed to write such functions, because they have to nail down that the argument must only be data rather than allowing the argument to be either data or not - depending on what I call the function with.

Imagine you want to write a user function which transforms the input like

real foo(vector xv) {
  ... integrate_ode( ... , to_array_1d_and_trasform(xv[1:2]), ...) ...  // FAIL: can't infer `x_r` is data only
}

As stanc3 is designed, the user would have to write a to_array_1d_and_trasform function with a data qualifier. This would disallow using the exact same function with non-data. It doesn't make sense to me at all.

seantalts · 2019-10-16T13:42:34Z

It's a severe bug in stanc3 from my view.

I don't think calling it a "severe bug" is helping :P It was obviously put in intentionally to adhere to a very reasonable spec, one that it sounds like was the one Bob intended when first adding the data keyword.

The problem here is that without checking the data keywords, we can create invalid C++ code in some cases, which is viewed as a bug in Stan any time it happens as far as I know.

wds15 · 2019-10-16T13:47:11Z

Sorry for the "severe" here. It's legit Stan2 to me what I am looking for. (this definitely has a severe impact on a large codebase I have)

The stanc3 spec is in-consistent as I keep saying if this is really what you mean it is. From my view that needs revision.

bob-carpenter · 2019-10-16T13:57:03Z

On Oct 16, 2019, at 9:35 AM, wds15 ***@***.***> wrote: This: real foo(real[] x_r ) { ... integrate_ode( ... , x_r, ...) ... // FAIL: can't infer `x_r` is data only } has always been fine in Stan2 - and to my understanding rightfully so. It would otherwise never being possible to write Stan models where data is used during the ODE and those calls to the integrator are made within a function. By design we allowed for it

I think we need clearer designs. It was my intention all along that the above would indeed fail. I never wanted there to be a possibility of a Stan program transpiling then failing to compile in C++. I was surprised to see that it succeeded in the current parser.

and now this is not anymore allowed in stanc3 unless I use the data argument?? Even this: real foo(vector xv ) { ... integrate_ode( ... , to_array_1d(xv[1:2]), ...) ... // FAIL: can't infer `x_r` is data only } was always just fine.

Applying a function to a bunch of data-only variables should result in something data-only, so this case is no different than the earlier one.

I do disagree with this being a bug all along in Stan2.

It would've been clear if I'd written a clearer spec for data-only args.

It's a severe bug in stanc3 from my view. I am not even sure this should be a warning, really.

It needs to be a warning at the very least in order to clue users in to why their program might be failing to compile in C++ with no error messages (as it does in RStan).

And as I say with "Stan2 behavior treats user functions just like internal stan-math functions." then this would now be changed with stanc3.

No, they'd be the same.

The call to_array_1d(xv[1:2]) involves data only whenever xv is data and then it returns data.

That would stay the same.

... but users would not be allowed to write such functions, because they have to nail down that the argument must only be data rather than allowing the argument to be either data or not - depending on what I call the function with.

No, that's not the intention. The problem is that we want to only allow calls to integrate_ode where we can prove that the argument is data. We can do that if it's a function applied to only data arguments. That means flagging arguments to user-defined functions as data if we want to use them as data internally. That was my intention for that flag all along.

Imagine you want to write a user function which transforms the input like real foo(vector xv ) { ... integrate_ode( ... , to_array_1d_and_trasform(xv[1:2]), ...) ... // FAIL: can't infer `x_r` is data only } As stanc3 is designed, the user would have to write a to_array_1d_and_trasform function with a data qualifier.

No. Just vector xv would need the data qualifier. Then we are guaranteed that applying a function to xv is also data. Otherwise, we have no way to verify that xv will be data in a given instantiated Stan program.

This would disallow using the exact same function with non-data. It doesn't make sense to me at all.

Did that help clarify? To summarize, let me define what a primitive variable is more clearly. 1. integer variables have primitive values 2. real variables declared in the data, transformed data, or generated quantities blocks have primitive values 3. function argument variables with the keyword "data" have primitive values 4. if a sequence of arguments has primitive values, then the result of applying a function (user-defined or built-in) has a primitive value So to_array_1d_and_transform(xv[1:2]) is fine if xv is primitive. But we can't infer it's primitive because it's an argument to a function not marked as data.

bob-carpenter · 2019-10-16T13:58:30Z

The decision in stanc3 may need revision, but it was consistent. More so than what we implemented in Stan 2. I think the spec should be what I wrote in the last message.

wds15 · 2019-10-16T14:11:52Z

Maybe I will just have to rewrite my programs... ok, so be it. Still, where I am lost is how the primitive status is being propagated. There your logic doesn't make sense to me.

To get back to the example which is not allowed in stanc3, but ok in stan2:

functions {
  real foo(data real[] inputs, real num) {
    return sum(inputs) + num;
  }

  real bar(real[] inputs, vector more_inputs, real num, int day) {
    real res = 0;
    // comment out this line => and things work
    res = num + sum(inputs) + foo(to_array_1d(more_inputs[1:2]), num);
    return res;
  }
}
data {
}
transformed data {
  real some_data[5] = {1., 2., 3., 4., 5.};
  vector[3] other_data = [1., 2., 3.]';
  real derived_data_1 = foo(to_array_1d(other_data[1:2]), 3.0);
}
parameters {
  real theta;
}
transformed parameters {
  // here we pass to_array_1d(other_data[1:2]) to a data only
  // argument. So we form a subexpression which involves only data and
  // we can pass this to the data-only argument of foo.
  real derived_2 = foo(to_array_1d(other_data[1:2]), theta);
}
model {
  real temp1 = foo(some_data, theta);
  real temp2 = bar(some_data, other_data, theta, 1);
  theta ~ normal(temp1, temp2);
}

Here stanc3 complains about more_inputs not being data for the bar function - but the thing is that I am passing other_data into the function in the model block. The other_data is a primitive so everything should be fine, but it's not. Why is that not ok? other_data is known to be data and as such I should be able to pass it into bar as the more_data argument.

The validity checks of what function calls what needs to wait until you actually see the calls from the users as to what is used as arguments.

In case that is not possible, users should at least be able to declare overloads, I think.

... but ok... now I got two language gurus against me... so I will probably silence for the moment...

seantalts · 2019-10-16T14:41:51Z

Still, where I am lost is how the primitive status is being propagated.

It's not being propagated through function boundaries - I think doing this perfectly in all cases is impossible, but I think we could eventually come up with a system that tracks it conservatively and would likely work in most of the cases you care about. That said, I'm not sure if that should be the spec or not... Like, the whole point of the data keyword is that you shouldn't be able to pass potential non-data in there. Otherwise it loses all meaning as far as I can tell and is equivalent to not annotating with data, right? Maybe I'm missing something. And from my perspective that could be an argument for removing the data keyword for Stan 3, if we figure out how to properly track autodiff level through function boundaries in most cases.

Here stanc3 complains about more_inputs not being data for the bar function - but the thing is that I am passing other_data into the function in the model block. The other_data is a primitive so everything should be fine, but it's not. Why is that not ok? other_data is known to be data and as such I should be able to pass it into bar as the more_data argument.

I think this is covered by the above, but I think it's precisely because it's hard to track autodiff level through function boundaries that Bob introduced the data keyword.

Anyway, I would be happy to turn this into a warning for the initial release and long term either deprecate the entire data keyword (if I'm thinking about it right and it becomes useless once the tracking is implemented) or turn that warning into an error with Stan 3.

seantalts · 2019-10-16T14:42:40Z

(assuming someone has time for turning it into a warning before Friday and it's not a huge change).

enetsee · 2019-10-16T14:56:15Z

@seantalts I've had a quick look and this is happening in semantic check (specifically, semantic_check_fn_stan_math , I think).

The check against the stan math signature (stan_math_returntype) requires equality of ad-level and type but we could change this to accumulate warnings on mismatched ad-levels of arguments and return a custom sum type rather than an option e.g.

type ('a,'err) check_result = 
  | CheckOk of 'a
  | CheckWarning of 'a * 'err list
  | CheckError of 'err list

We would also have to modify the Validation applicative to issue warnings from semantic check and update the driver to handle the new return type.

I should be able to turn it around before Friday if you want me to take a run at it.

bob-carpenter · 2019-10-16T15:25:03Z

Here stanc3 complains about more_inputs not being data for the bar function

I think that's correct. With just these:

real foo(data real[] inputs, real num) {
    return sum(inputs) + num;
  }

  real bar(real[] inputs, vector more_inputs, real num, int day) {
    real res = 0;
    // comment out this line => and things work
    res = num + sum(inputs) + foo(to_array_1d(more_inputs[1:2]), num);
    return res;
  }

The problem is that as bar is written, it can be called with more_inputs being not data. I realize that's not the case in this program, but that's the basis of the type inference.

bob-carpenter · 2019-10-16T15:26:24Z

And from my perspective that could be an argument for removing the data keyword for Stan 3, if we figure out how to properly track autodiff level through function boundaries in most cases.

That would work. The only reason it was added was to allow data-only type inference within function bodies using function arguments. This may be a problem with exporting functions, making library functions, etc.

seantalts · 2019-10-16T16:46:01Z

@wds15 would changing it just for application to stan math functions work for you? I know all of your examples involved nested user-defined functions as well, but not sure if that was the important part.

wds15 · 2019-10-16T17:06:57Z

@seantalts So you mean that these restrictions would only be applied to things like the x_r argument of integrate_ode, but not to user defined functions? Yes, that would work, of course.

seantalts · 2019-10-16T17:09:26Z

I think @enetsee was looking at the code to do the opposite of that - loosen the restrictions on stan math functions like integrate_ode but keep the ones on UDFs. So this would essentially mean if you have no data qualifiers anywhere between you and integrate_ode it should just let it flow and let ~~God~~ the C++ compiler sort it out.

wds15 · 2019-10-16T17:19:11Z

No... the UDFs need to be loosened for now. A warning would from stanc3 would be good probably... I am fine with letting C++ sort it out, but that is probably not a good permanent solution.

enetsee · 2019-10-16T18:28:28Z

I've put the machinery in place to support warnings in semantic check but haven't started on the concrete changes. Am I aiming at UDFs only? I think I can do both if that's what we want.

wds15 · 2019-10-16T21:15:13Z

Udf are sufficient, I think.

enetsee · 2019-10-17T08:44:02Z

Ok, I think I have it working for returning and side-effecting UDFs.

This is the diff on the test output. @seantalts we may want to add some more test cases?

File "test/integration/bad/stanc.expected", line 1316, characters 0-1:
| $ ../../../../install/default/bin/stanc functions-bad20.stan
|
|Semantic error in 'functions-bad20.stan', line 4, column 2 to line 6, column 3:
| -------------------------------------------------
| 2: real my_fun3(real x);
| 3:
| 4: real my_fun3(data real x) {
| ^
| 5: return 2 * x;
| 6: }
| -------------------------------------------------
|
|Function 'my_fun3' has already been declared to have type (real) => real
|
| $ ../../../../install/default/bin/stanc functions-bad21.stan
|
-|Semantic error in 'functions-bad21.stan', line 13, column 11 to column 21:
+|Warning: Semantic warning in 'functions-bad21.stan', line 13, column 11 to column 21:
| -------------------------------------------------
| 11: }
| 12: model {
| 13: real z = my_fun3(y);
| ^
| 14: y ~ normal(0,1);
| 15: }
| -------------------------------------------------
|
-|Ill-typed arguments supplied to function 'my_fun3'. Available signatures:
-|(data real) => real
-|Instead supplied arguments of incompatible type: real.
+|Warning: Argument to 'my_fun3' has an incompatible autodiff level.
|
| $ ../../../../install/default/bin/stanc functions-bad22-ode.stan
|
|Semantic error in 'functions-bad22-ode.stan', line 15, column 11 to column 82:
| -------------------------------------------------
| 13: real[,] do_integration_nested(real[] y0, real t0, data real[] ts, real[] theta, matrix xmat_r) {
| 14: int x_i[0];
| 15: return(integrate_ode_rk45(sho, y0, t0, ts, theta, to_array_1d(xmat_r[1]), x_i));
| ^
| 16: }
| 17:
| -------------------------------------------------
|
|Ill-typed arguments supplied to function 'integrate_ode_rk45'. Available signatures:
|((real, real[], real[], data real[], data int[]) => real[], real[], real, real[], real[], data real[], data int[]) => real[,]
|((real, real[], real[], data real[], data int[]) => real[], real[], real, real[], real[], data real[], data int[], data real, data real, data real) => real[,]

bob-carpenter · 2019-10-17T16:49:03Z

On Oct 16, 2019, at 10:41 AM, seantalts ***@***.***> wrote: ... the whole point of the data keyword is that you shouldn't be able to pass potential non-data in there.

Right. It's what we need to declare types/behavior for the integrate_ode and mpi functions which require some of their arguments to be data-only.

Otherwise it loses all meaning as far as I can tell and is equivalent to not annotating with data, right? Maybe I'm missing something. And from my perspective that could be an argument for removing the data keyword for Stan 3, if we figure out how to properly track autodiff level through function boundaries in most cases.

I think we'd still want to use it to declare the ode and mpi functions.

Here stanc3 complains about more_inputs not being data for the bar function - but the thing is that I am passing other_data into the function in the model block. The other_data is a primitive so everything should be fine, but it's not. Why is that not ok? other_data is known to be data and as such I should be able to pass it into bar as the more_data argument.

Yes, that'd be great. But it's too hard to compute in general, so we offload the work onto the users. Standard languages like C++ are full of this kind of stuff, so I don't see it's such a big deal.

I think this is covered by the above, but I think it's precisely because it's hard to track autodiff level through function boundaries that Bob introduced the data keyword.

That's why I did it---so that you could infer in a function body that an expression is guaranteed to be data. Part of that's also that the original parser was a single pass without any dataflow analysis.

Anyway, I would be happy to turn this into a warning for the initial release and long term either deprecate the entire data keyword (if I'm thinking about it right and it becomes useless once the tracking is implemented) or turn that warning into an error with Stan 3.

We'll have to work more through what happens getting rid of it.

seantalts closed this as completed Oct 14, 2019

wds15 reopened this Oct 14, 2019

wds15 mentioned this issue Oct 15, 2019

generation of non-compilable C++ #343

Closed

enetsee mentioned this issue Oct 17, 2019

Semantic check warnings #336 #349

Closed

seantalts changed the title ~~stanc3 over propagates to var~~ stanc3 checks data keywords and doesn't allow potential non-data to be passed in. Oct 18, 2019

nhuurre closed this as completed Jun 22, 2020

stanc3 checks data keywords and doesn't allow potential non-data to be passed in. #336

stanc3 checks data keywords and doesn't allow potential non-data to be passed in. #336

Comments

wds15 commented Oct 14, 2019

seantalts commented Oct 14, 2019

VMatthijs commented Oct 14, 2019 via email

wds15 commented Oct 14, 2019

VMatthijs commented Oct 14, 2019 via email

seantalts commented Oct 14, 2019

seantalts commented Oct 14, 2019

wds15 commented Oct 14, 2019

wds15 commented Oct 14, 2019 • edited Loading

wds15 commented Oct 14, 2019

seantalts commented Oct 14, 2019 • edited Loading

wds15 commented Oct 15, 2019

seantalts commented Oct 15, 2019 • edited Loading

wds15 commented Oct 15, 2019

seantalts commented Oct 15, 2019

bob-carpenter commented Oct 15, 2019 • edited Loading

seantalts commented Oct 15, 2019 • edited Loading

bob-carpenter commented Oct 15, 2019

wds15 commented Oct 15, 2019

bob-carpenter commented Oct 15, 2019 via email

seantalts commented Oct 15, 2019

wds15 commented Oct 15, 2019

seantalts commented Oct 15, 2019

wds15 commented Oct 16, 2019

seantalts commented Oct 16, 2019

wds15 commented Oct 16, 2019

seantalts commented Oct 16, 2019

bob-carpenter commented Oct 16, 2019

seantalts commented Oct 16, 2019

bob-carpenter commented Oct 16, 2019

bob-carpenter commented Oct 16, 2019

seantalts commented Oct 16, 2019

bob-carpenter commented Oct 16, 2019 • edited Loading

seantalts commented Oct 16, 2019

wds15 commented Oct 16, 2019

seantalts commented Oct 16, 2019

wds15 commented Oct 16, 2019 • edited Loading

bob-carpenter commented Oct 16, 2019 via email

bob-carpenter commented Oct 16, 2019 via email

wds15 commented Oct 16, 2019

seantalts commented Oct 16, 2019

seantalts commented Oct 16, 2019 • edited Loading

enetsee commented Oct 16, 2019

bob-carpenter commented Oct 16, 2019

bob-carpenter commented Oct 16, 2019

seantalts commented Oct 16, 2019

wds15 commented Oct 16, 2019

seantalts commented Oct 16, 2019

wds15 commented Oct 16, 2019

enetsee commented Oct 16, 2019 • edited Loading

wds15 commented Oct 16, 2019

enetsee commented Oct 17, 2019 • edited Loading

bob-carpenter commented Oct 17, 2019 via email

stanc3 checks `data` keywords and doesn't allow potential non-data to be passed in. #336

stanc3 checks `data` keywords and doesn't allow potential non-data to be passed in. #336

wds15 commented Oct 14, 2019 •

edited

Loading

seantalts commented Oct 14, 2019 •

edited

Loading

seantalts commented Oct 15, 2019 •

edited

Loading

bob-carpenter commented Oct 15, 2019 •

edited

Loading

seantalts commented Oct 15, 2019 •

edited

Loading

bob-carpenter commented Oct 16, 2019 •

edited

Loading

wds15 commented Oct 16, 2019 •

edited

Loading

seantalts commented Oct 16, 2019 •

edited

Loading

enetsee commented Oct 16, 2019 •

edited

Loading

enetsee commented Oct 17, 2019 •

edited

Loading