NaNs introduced with 3 posteriordb models #94

Closed
sethaxen opened this issue Oct 6, 2022 · 9 comments · Fixed by #97

Comments

@sethaxen (Member) commented Oct 6, 2022

Nikolas Siccha on Slack shared:

> I've tried [Pathfinder.jl] for a few (3) models from posteriordb, and Pathfinder seems to run into NaNs relatively often, without being able to recover.
>
> It was the dogs-dogs, kilpisjarvi_mod-kilpisjarvi, and diamonds-diamonds posteriors from posteriordb, brought to Julia using BridgeStan.

@nsiccha commented Oct 6, 2022

That's me!

I'm just here to add that these three posteriors are the only ones that I tried with Pathfinder 😅

@sethaxen (Member, Author) commented Oct 6, 2022

This is the kind of consistency we'd rather not have. 😬

@nsiccha commented Oct 6, 2022

Well, the upside is that Pathfinder only "sometimes" gets stuck and takes forever because it can't recover.

I do think that Pathfinder should abort as soon as it encounters issues, just because you can't really hope for a good approximation anyway, and there's no point in spending a lot of computation (on a single Pathfinder run).

@sethaxen (Member, Author) commented Oct 6, 2022

> I do think that Pathfinder should abort as soon as it encounters issues, just because you can't really hope for a good approximation anyway, and there's no point in spending a lot of computation (on a single Pathfinder run).

We used to terminate on NaNs but removed that behavior because some optimizers can recover from NaNs (#25 (comment)). I'd be more inclined to add a terminate_on_non_finite=true kwarg so that users can opt out of termination if they know they are using such an optimizer. I don't think one can say for certain that an approximation produced by such a run is not useful. Even a poor approximation can be useful for initialization or diagnostic purposes, and the approximations after recovering from a NaN could be better than the ones from before the NaN.
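
If added, usage might look something like this sketch (terminate_on_non_finite here is hypothetical, not an existing option; the rest of the call matches the snippet later in this thread):

# Hypothetical sketch: terminate_on_non_finite is the kwarg proposed above,
# not an actual option. Setting it to false would let a NaN-tolerant optimizer
# continue past non-finite values instead of aborting the run.
result = pathfinder(logp, ∇logp; dim, ndraws, terminate_on_non_finite=false)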

@nsiccha commented Oct 6, 2022

I think the potential benefit of Pathfinder has been a bit overstated, and it's not worth spending a lot of computation on it. If the posterior is simple, then it doesn't matter that you can sample from it even more easily, and if the posterior is difficult, then the approximation is not worth much.

But all the decisions are of course yours. I'm happy with the way it is, especially after you told me how to make Pathfinder abort early. But a simple keyword argument would of course be quite nice to have as well.

@sethaxen (Member, Author) commented Oct 6, 2022

I was unable to reproduce the NaN behavior for dogs-dogs, but I was able to for the other two. Part of the problem here is that when Stan's log-density hits a NaN, it raises an error in Julia. Pathfinder has a retry mechanism (my R is not great, but the reference implementation of Pathfinder seems to have such a mechanism as well: https://github.com/LuZhangstat/Pathfinder/blob/master/utils/sim_pf.R): when it encounters no finite solutions in a trajectory, it resamples the initialization and runs again until it gets a solution with finite results (or exceeds the maximum number of retries). But to preserve useful debugging, it does not use a try...catch block, so if an error is raised, the retry mechanism is not used.

Here we can improve things quite a bit by putting a try...catch block in the logp and ∇logp implementations. This works, but as you noted, after encountering a NaN we continue to sample the trajectory instead of retrying, which is very wasteful. If we instead use a callback to terminate early when NaNs are encountered, we always get a non-NaN result:

POSTERIOR_DB_PATH = joinpath(ENV["HOME"], "projects", "posteriordb", "posterior_database")

using BridgeStan, Pathfinder, ZipFile, Transducers, Test

# Load a posteriordb model: unzip the JSON data to a temporary directory
# and compile the Stan model with BridgeStan.
function load_model(model_name, data_name; kwargs...)
    data_path = joinpath(POSTERIOR_DB_PATH, "data", "data", "$data_name.json.zip")
    stan_file = joinpath(POSTERIOR_DB_PATH, "models", "stan", "$model_name.stan")
    tmpdir = mktempdir(; cleanup=false)
    data = joinpath(tmpdir, "$data_name.json")
    write(data, ZipFile.Reader(data_path).files[1])
    return StanModel(; stan_file, data, kwargs...)
end

# Wrap BridgeStan's log-density and gradient so that evaluation errors
# become NaNs, which Pathfinder's retry mechanism can handle.
function make_pathfinder_inputs(model)
    function logp(x)
        return try
            log_density(model, convert(Vector{Float64}, x); propto=true, jacobian=true)
        catch e
            NaN
        end
    end
    function ∇logp(x)
        return try
            log_density_gradient(model, convert(Vector{Float64}, x); propto=true, jacobian=true)[2]
        catch e
            fill(NaN, length(x))
        end
    end
    dim = param_unc_num(model)
    return logp, ∇logp, Int(dim)
end

ndraws = 10
nruns = 5
executor = Transducers.SequentialEx()

model_data_names = [
    ("dogs", "dogs"),
    ("kilpisjarvi", "kilpisjarvi_mod"),
    ("diamonds", "diamonds"),
]

@testset "posteriordb" begin
    @testset "$model_name" for (model_name, data_name) in model_data_names
        model = load_model(model_name, data_name)
        logp, ∇logp, dim = make_pathfinder_inputs(model)
        # returning true terminates optimization early on a non-finite
        # log-density or point
        callback = (x, lp) -> !isfinite(lp) || any(!isfinite, x)
        for _ in 1:100
            result = pathfinder(logp, ∇logp; dim, ndraws, callback)
            @test all(isfinite, result.draws)
        end
    end
end
Test Summary: | Pass  Total   Time
posteriordb   |  300    300  36.4s

Just retrying until we succeed feels wrong, especially since with HMC the wisdom is to never ignore numerical issues, but IIUC this same approach is what was used for the paper, where the method in general outperformed ADVI. And we're not trying to get exact draws; we're primarily trying to get even just one point in the typical set and secondarily to get a reasonable estimate of the metric.

So the three action items I see here are:

  1. we should document the use of a try...catch block to benefit from the retry mechanism, at the risk of obscuring sources of error
  2. we should add the terminate_on_non_finite=true flag discussed above
  3. we should run on more posteriordb models to identify whether there are other models for which these two tweaks would not work.

@sethaxen (Member, Author) commented Oct 6, 2022

> I think the potential benefit of Pathfinder has been a bit overstated, and it's not worth spending a lot of computation on it. If the posterior is simple, then it doesn't matter that you can sample from it even more easily, and if the posterior is difficult, then the approximation is not worth much.

There are many ways to use Pathfinder. In general, I find Pathfinder is not that useful for a well-behaved model. That is, if you're able to write your model in such a way that the step size and metric are quickly learned during warm-up (roughly: all unconstrained parameters are on the same scale, zero-centered, and uncorrelated, with low curvature and no heavy tails), then Pathfinder may perform warm-up more quickly, but maybe not by much.

Where I think Pathfinder is useful is early in the modeling process, where the model may be poorly parametrized and the posterior may have poor geometry. Ideally you can tell this from looking at the model, but that's often not the case, especially for novices. A useful tool for diagnosing model problems is to look at draws, but for such models warm-up is often unbearably slow. Pathfinder can speed up these early stages by quickly (1-2 orders of magnitude faster than HMC warm-up) providing low-quality draws and metric estimates that may still be diagnostically useful for improving the model to the point where Pathfinder is no longer useful.

@nsiccha commented Oct 7, 2022

Oh right, I did wrap BridgeStan's lp/lpg calls in a try...catch block and returned -Inf (and a vector of Infs for the gradient) when an exception occurred. I actually hadn't tried Pathfinder without that (or had I 🤔).
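
For concreteness, a minimal sketch of that wrapper (same BridgeStan calls as in the snippet above, but returning -Inf instead of NaN; the gradient wrapper is analogous):

# Minimal sketch of the wrapper described above: evaluation errors are
# treated as zero posterior density rather than NaN.
logp(x) = try
    log_density(model, convert(Vector{Float64}, x); propto=true, jacobian=true)
catch
    -Inf
end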

BTW, would it make sense for pathfinder to allow the gradient function to return both the density and its gradient?

One of the selling points of multipathfinder is that it can (in expectation) handle posteriors with "irrelevant" minor modes, but to be able to do this, the assumption is that while some or many single-path Pathfinder runs can and will fail, we can still recover by only using the "good" runs (by resampling). So I think failing early and returning the last non-Inf point, or rerunning (or just having many runs in parallel in the first place), is in the spirit of the method.

@sethaxen (Member, Author) commented Oct 7, 2022

> BTW, would it make sense for pathfinder to allow the gradient function to return both the density and its gradient?

Perhaps in terms of API, but we would internally end up creating separate functions anyway. SciMLBase.OptimizationFunction, which we create internally for use with Optimization.jl, needs separate functions for the objective, its gradient, and its Hessian. AdvancedHMC also expects functions of this form, so it makes sense for the user to have already created them. Turing's optimization function API, which we use, also produces separate functions like this. So at the moment taking a single log-density-with-gradient function would seem to offer no benefit.
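
For illustration, a rough sketch of the shape Optimization.jl expects (this mirrors, rather than reproduces, Pathfinder's internals; logp and ∇logp are as in the snippet above):

using SciMLBase

# Pathfinder minimizes the negative log-density, so both pieces are negated.
# The objective and the (in-place) gradient are separate callables; there is
# no slot for a fused logp-and-gradient function.
f(u, p) = -logp(u)
grad!(G, u, p) = (G .= .-∇logp(u); G)
optfun = SciMLBase.OptimizationFunction(f; grad=grad!)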

> One of the selling points of multipathfinder is that it can (in expectation) handle posteriors with "irrelevant" minor modes, but to be able to do this, the assumption is that while some or many single-path Pathfinder runs can and will fail, we can still recover by only using the "good" runs (by resampling). So I think failing early and returning the last non-Inf point, or rerunning (or just having many runs in parallel in the first place), is in the spirit of the method.

Yeah, you've convinced me that this is more user-friendly. I wonder if it makes more sense to terminate early only on non-finite points, non-finite gradients, or a +Inf or NaN log-density. That is, a log-density of -Inf is fine and can easily happen if the posterior is much narrower than the [-2, 2] initialization range, and such a log-density can still have finite gradients that are useful for optimization.
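
As a minimal sketch (using the same (x, lp) callback convention as the snippet above; checking gradients would require them to be passed to the callback as well):

# Terminate on a NaN or +Inf log-density or a non-finite point, but
# tolerate lp == -Inf, which can still yield useful gradients.
callback = (x, lp) -> isnan(lp) || lp == Inf || any(!isfinite, x)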
