Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrame conversion from more complex posteriors fails in below MWE. Simple posteriors work. #37

Closed
goedman opened this issue Nov 21, 2022 · 11 comments · Fixed by #47
Closed
Labels
bug Something isn't working

Comments

@goedman
Copy link

goedman commented Nov 21, 2022

I wonder if this is something I'm doing wrong or if it is a bug. All goes well until the last call to convert to a DataFrame. MWE:

using CSV, DataFrames, NamedTupleTools
using StanSample
using InferenceObjects

stan_schools = """
data {
    int<lower=0> J;
    real y[J];
    real<lower=0> sigma[J];
}

parameters {
    real mu;
    real<lower=0> tau;
    real theta_tilde[J];
}

transformed parameters {
    real theta[J];
    for (j in 1:J)
        theta[j] = mu + tau * theta_tilde[j];
}

model {
    mu ~ normal(0, 5);
    tau ~ cauchy(0, 5);
    theta_tilde ~ normal(0, 1);
    y ~ normal(theta, sigma);
}

generated quantities {
    vector[J] log_lik;
    vector[J] y_hat;
    for (j in 1:J) {
        log_lik[j] = normal_lpdf(y[j] | theta[j], sigma[j]);
        y_hat[j] = normal_rng(theta[j], sigma[j]);
    }
}
""";

data = Dict(
    "J" => 8,
    "y" => [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0],
    "sigma" => [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]
)

m_schools = SampleModel("eight_schools", stan_schools)
rc = stan_sample(m_schools; data, save_warmup=true)

if success(rc)

    idata = inferencedata(m_schools)

    nt = namedtuple(data)

    # Until InferenceObjects issue #36 is merged
    ntu=(sigma=nt.sigma, J=[4], y=nt.y)
    
    idata = merge(idata, from_namedtuple(; observed_data = ntu))

else
    @warn "Sampling failed."
end

if :observed_data in propertynames(idata)
    idata.observed_data
end
DataFrame(idata.observed_data)

keys(idata.posterior)
post_schools = read_samples(m_schools, :dataframe)

# This fails:
DataFrame(idata.posterior)

I don't think it is caused by the draw indices, warmup_posterior and posterior_predictive also fail. They all have set variables, e.g.
like theta.1, theta.2, etc.

@sethaxen
Copy link
Member

It seems this error only occurs if StanSample (I assume the master branch) is loaded:

julia> using DimensionalData, DataFrames

julia> mu = DimArray(randn(1000, 4), (Dim{:draw}(1:1000), Dim{:chain}(1:4)); name=:mu);

julia> DataFrame(mu);

julia> using StanSample

julia> DataFrame(mu);
ERROR: ArgumentError: Some dims were not found in object
Stacktrace:
 [1] _errorextradims()
   @ DimensionalData.Dimensions ~/.julia/packages/DimensionalData/K9D4P/src/Dimensions/primitives.jl:646
 [2] dimnum
   @ ~/.julia/packages/DimensionalData/K9D4P/src/Dimensions/primitives.jl:201 [inlined]
 [3] getcolumn
   @ ~/.julia/packages/DimensionalData/K9D4P/src/tables.jl:196 [inlined]
 [4] fromcolumns(x::DimTable{(:draw, :chain, :mu), DimStack{NamedTuple{(:mu,), Tuple{Matrix{Float64}}}, Tuple{Dim{:draw, DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Dim{:chain, DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}}, Tuple{}, NamedTuple{(:mu,), Tuple{Tuple{Dim{:draw, Colon}, Dim{:chain, Colon}}}}, DimensionalData.Dimensions.LookupArrays.NoMetadata, NamedTuple{(:mu,), Tuple{DimensionalData.Dimensions.LookupArrays.NoMetadata}}}, Tuple{DimensionalData.DimColumn{Int64, Dim{:draw, DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}}, DimensionalData.DimColumn{Int64, Dim{:chain, DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}}}, NamedTuple{(:mu,), Tuple{DimensionalData.DimArrayColumn{Float64, DimArray{Float64, 2, Tuple{Dim{:draw, DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Dim{:chain, DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}}, Tuple{}, Matrix{Float64}, Symbol, DimensionalData.Dimensions.LookupArrays.NoMetadata}, Tuple{Int64, Int64}, Tuple{Int64, Int64}, Int64}}}}, names::Vector{Symbol}; copycols::Nothing)
   @ DataFrames ~/.julia/packages/DataFrames/KKiZW/src/other/tables.jl:36
 [5] DataFrame(x::DimArray{Float64, 2, Tuple{Dim{:draw, DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Dim{:chain, DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}}, Tuple{}, Matrix{Float64}, Symbol, DimensionalData.Dimensions.LookupArrays.NoMetadata}; copycols::Nothing)
   @ DataFrames ~/.julia/packages/DataFrames/KKiZW/src/other/tables.jl:59
 [6] DataFrame(x::DimArray{Float64, 2, Tuple{Dim{:draw, DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Dim{:chain, DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}}, Tuple{}, Matrix{Float64}, Symbol, DimensionalData.Dimensions.LookupArrays.NoMetadata})
   @ DataFrames ~/.julia/packages/DataFrames/KKiZW/src/other/tables.jl:48
 [7] top-level scope
   @ REPL[7]:1

(jl_vCqw12) pkg> st
Status `/tmp/jl_vCqw12/Project.toml`
  [a93c6f00] DataFrames v1.4.3
  [0703355e] DimensionalData v0.23.0
  [c1514b29] StanSample v6.12.0 `https://github.com/StanJulia/StanSample.jl.git#master`

I suspect the issue is caused by this line: https://github.com/StanJulia/StanSample.jl/blob/209a5bf4d0a6e15a316b46407c194a808389c506/src/utils/dimarray.jl#L7 . If we simply use a different chain key, all is well:

julia> mu2 = DimArray(randn(1000, 4), (Dim{:draw}(1:1000), Dim{:foo}(1:4)); name=:mu);

julia> DataFrame(mu2);

I suspect the issue is that by declaring these dimensions with @dim, :chain as a key is now reserved for your dimension type. This will actually interfere with InferenceObjects, where we've explicitly avoided declaring such a type to prevent problems. Note if we just specify the dimension name chain, StanSamples's dimension type is now used instead of ours.

julia> mu = DimArray(randn(1000, 4), (draw=1:1000, chain=1:4); name=:mu)
1000×4 DimArray{Float64,2} mu with dimensions: 
  Dim{:draw} Sampled{Int64} 1:1000 ForwardOrdered Regular Points,
  chain Sampled{Int64} 1:4 ForwardOrdered Regular Points
        1          2          3          4
    1   0.783298   0.0830577  0.597823   1.02076
    2   0.353957  -0.207669   1.06499    0.466934
                                       
  998   0.590198  -1.0169     0.660785  -0.0630021
  999  -0.44761   -2.01247    1.34981    1.58058
 1000  -0.273417  -0.769404   0.140136   0.174886

Since the original error occurs independently of InferenceObjects, I'd suggest opening an issue on DimensionalData. But I'd also suggest reconsidering whether you really need to define these dimensions using @dim.

@goedman
Copy link
Author

goedman commented Nov 22, 2022

Hi Seth, thanks a lot for looking into this. Removing the original :dimarray option in StanSample solves the issue.

As Inference[Object/Data] is a much better solution I'm leaning towards dropping dimarray altogether. I introduced it primarily because I liked the display of :dimarray based chains but that is even better supported by InferenceData.

I'll think about it a bit more over the next few days, make a decision and close this issue then.

Thanks again for your help!

@sethaxen
Copy link
Member

Hi @goedman I gave this a little more thought, and actually, I think this is an InferenceObjects issue after all. DimensionalData itself has the special dims X, Y, and Ti, and if our users use any of those as symbolic dim names, then we end up pulling those into our code, which is not great:

julia> DimArray(randn(2, 3), (X=1:2, Y=1:3); name=:x)
2×3 DimArray{Float64,2} x with dimensions: 
  X Sampled{Int64} 1:2 ForwardOrdered Regular Points,
  Y Sampled{Int64} 1:3 ForwardOrdered Regular Points
     1        2         3
 1  -1.40336  0.254703  0.598698
 2   1.61594  0.220872  1.39326

I think the problem is this line of code:

dim_names = Dimensions.name(Dimensions.basedims((dims..., default_dims...)))

basedims is not part of DimensionalData's API, and if it is passed a Symbol, it will assign your specialized chain type if defined. It's a fairly easy fix to not use basedims here, and then other packages won't be able to break our code like this in the future.

As Inference[Object/Data] is a much better solution I'm leaning towards dropping dimarray altogether. I introduced it primarily because I liked the display of :dimarray based chains but that is even better supported by InferenceData.

I'll think about it a bit more over the next few days, make a decision and close this issue then.

No problem! One thing your :dimarray approach has that we are currently missing is the ability to reshape and stack all data into a single multidimensional array with dims (param, draw, chain). We need this functionality also for using MCMCDiagnostics with Datasets, but it's also important for constructing any marginal plots using the Tables interface, so I'm working on a utility function for our API to do just that given a Dataset (#5). Then, yes, I don't see any advantages to having a :dimarray specific output format.

@goedman
Copy link
Author

goedman commented Nov 24, 2022

Hi @sethaxen , StanSample.jl v6.13.0 for now has disabled :dimarray[s] as an option in read_samples(). I left the code mostly in there so it would be simple to enable it again. If/once you've made your changes I'll give it another try.

As I tend to use DataFrames for further work with posterior values, including stacked values, I'll see what it takes to expand what is currently in ./util/dataframes.jl for that purpose.

Looking at the 8_schools Pluto notebook (e.g. in Stan.jl v9.10.1, ./Examples_Notebooks/inferencedata.jl) the inferencedata based posterior_schools has 256000 rows and post_schools (simple read_samples(..., :dataframe) has 8000 rows.

@sethaxen
Copy link
Member

sethaxen commented Dec 8, 2022

Looking at the 8_schools Pluto notebook (e.g. in Stan.jl v9.10.1, ./Examples_Notebooks/inferencedata.jl) the inferencedata based posterior_schools has 256000 rows and post_schools (simple read_samples(..., :dataframe) has 8000 rows.

This makes sense. There's more than one way one might flatten multidimensional arrays into a tabular structure. The "widest" possible way is to make each marginal parameter its own column, so one would have nchains*ndraws rows. This is what I think your read_samples(..., :dataframe) code does. In this case, each element of theta would have its own column.

The "tallest" possible way effectively vcats all parameters into a single array as a var column. The default tabular form of Dataset is a compromise between these two. Each parameter (e.g. theta or tau) gets its own column, but all elements of theta are included in that column. The dimensions of theta are in their own columns. Note this stacking will cause some values to be repeated (think of it like Iterators.product applied to the indices of all unique dimensions). If the user specifies that some parameters share a dimension, then less repetition is performed, so there will be fewer rows. e.g.

julia> DataFrame(inferencedata(m_schools).posterior) |> size
(256000, 8)

julia> DataFrame(inferencedata(m_schools; dims=(theta=[:school], theta_tilde=[:school]).posterior) |> size
32000

Since each of these Tables has its uses for downstream analysis and plotting, the end goal is to have convenience functions so the user can convert between them easily.

As I tend to use DataFrames for further work with posterior values, including stacked values

What kinds of downstream analyses do you do with the dataframes? I'm wondering how we should document this behavior.

@ahartikainen
Copy link

Just a small comment, do you use dataframes because the Stan output is given in a wide-format table?

In python we have had this wide vs long -format discussion before and that is why having nD structure (Dataset) is better.

Btw I think R tools (tidy-bayes?) use long-format.

@goedman
Copy link
Author

goedman commented Dec 8, 2022

Typically I've been trying "more advanced" formats (such as DimensionalData and AxisKeys) and I certainly like aspects of it. But they also come with a learning (and re-learning) curve. As an end-user I (personally) always seem to gravitate back to DataFrames for my own use (Statistical Rethinking and Regression and Other Stories related projects).

For Stan.jl this is a bit different. If possible I like to support important new work such as C++ threads, BridgeStan, ODE improvements, InferenceObjects and PosteriorDB. As an additional benefit, InferenceObjects show/display works very well in Pluto so I'm quite motivated to support conversion to that format.

For above mentioned projects plotting and summarizing are how I use chain-stacked, wide-format DataFrames. Seth's example above:

julia> DataFrame(inferencedata(m_schools; dims=(theta=[:school], theta_tilde=[:school])).posterior)

is a strong argument to learn this "dims" DSL!

@goedman
Copy link
Author

goedman commented Feb 18, 2023

Thanks Seth!

I've had some issues installing the last 2 version of InferenceObjects.j:

(@v1.10) pkg> add InferenceObjects@0.3.4
   Resolving package versions...
ERROR: Unsatisfiable requirements detected for package InferenceObjects [b5cf5a8d]:
 InferenceObjects [b5cf5a8d] log:
 ├─possible versions are: 0.1.0-0.3.2 or uninstalled
 └─restricted to versions 0.3.4 by an explicit requirement — no versions left

(@v1.10) pkg> add InferenceObjects@0.3.3
   Resolving package versions...
ERROR: Unsatisfiable requirements detected for package InferenceObjects [b5cf5a8d]:
 InferenceObjects [b5cf5a8d] log:
 ├─possible versions are: 0.1.0-0.3.2 or uninstalled
 └─restricted to versions 0.3.3 by an explicit requirement — no versions left

This is in a pretty empty environment:

julia> versioninfo()
Julia Version 1.10.0-DEV.632
Commit cbbfc68f81 (2023-02-17 10:33 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.3.0)
  CPU: 8 × Apple M2
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 4 on 4 virtual cores
Environment:
  LD_LIBRARY_PATH = /Users/rob/local/lib
  JULIA_NUM_THREADS = 4
  JULIA_EDITOR = subl
  JULIA_SVG_BROWSER = Google Chrome.app
  JULIA_SPECIALFUNCTIONS_BUILD_SOURCE = true
  JULIA_MPI_PATH = /usr/local/Cellar/open-mpi/4.0.2
  JULIA_PKG3_PRECOMPILE = true
  JULIA_PKG_SERVER = pkg.julialang.org
  JULIA_SR_HOME = /Users/rob/.julia/dev/StatisticalRethinking/src
  JULIA_ROS_HOME = /Users/rob/Projects/R/ROS-Examples
  JULIA_MIXTAPE_HOME = /Users/rob/Projects/R/mixtape
  JULIA_WOOLDRIDGE_HOME = /Users/rob/Projects/R/wooldridge

(@v1.10) pkg> st
Status `~/.julia/environments/v1.10/Project.toml`
  [634d3b9d] DrWatson v2.12.1
  [d254efa0] PkgSkeleton v1.0.0
  [c3e4b0f8] Pluto v0.19.22
  [295af30f] Revise v3.5.1
  [44cfe95a] Pkg v1.10.0

Same on J1.8 and J1.9-beta4.

InferenceObjects v0.3.4 is pretty fresh, sometimes (not very often though) it takes a bit longer before it is visible, but v0.3.3 should certainly visible.

Will keep on checking.

@sethaxen
Copy link
Member

That's really strange, I have no problem installing on v1.9 and v1.9-beta4. I wonder if there's a way to manually trigger your local copy of the registry to update.

@goedman
Copy link
Author

goedman commented Feb 18, 2023

Ok, this morning no problem upgrading to v0.3.4 anymore! I've occasionally seen these kinds of delays.

@goedman
Copy link
Author

goedman commented Feb 18, 2023

In this PR I am trying to revert back to have both InferenceObjects (v0.3.4) and DimensionalData(v0.28.2) as extensions to StanSample.jl.

The updates still create an issue::

Testing: test_inferencedata/test_inferencedata.jl.
[ Info: /var/folders/pf/2m__rnm54153mj3198b5xxn00000gn/T/jl_g2j5fD/eight_schools.stan updated.
InferenceData interface: Error During Test at /Users/rob/.julia/dev/StanSample/test/runtests.jl:89
  Got exception outside of a @test
  LoadError: ArgumentError: Some dims were not found in object
  Stacktrace:
    [1] _errorextradims()
      @ DimensionalData.Dimensions ~/.julia/packages/DimensionalData/6v7CY/src/Dimensions/primitives.jl:649
    [2] dimnum
      @ ~/.julia/packages/DimensionalData/6v7CY/src/Dimensions/primitives.jl:201 [inlined]
    [3] getcolumn
      @ ~/.julia/packages/DimensionalData/6v7CY/src/tables.jl:196 [inlined]
    [4] fromcolumns(x::DimTable{(:draw, :chain, :school, :theta_tilde, :mu, :tau, :theta), Dataset{NamedTuple{(:theta_tilde, :mu, :tau, :theta), Tuple{Array{Float64, 3}, Matrix{Float64}, Matrix{Float64}, Array{Float64, 3}}}, DimStack{NamedTuple{(:theta_tilde, :mu, :tau, :theta), Tuple{Array{Float64, 3}, Matrix{Float64}, Matrix{Float64}, Array{Float64, 3}}}, Tuple{Dim{:draw, DimensionalData.Dimensions.LookupArrays.NoLookup{UnitRange{Int64}}}, Dim{:chain, DimensionalData.Dimensions.LookupArrays.NoLookup{Base.OneTo{Int64}}}, Dim{:school, DimensionalData.Dimensions.LookupArrays.NoLookup{Base.OneTo{Int64}}}}, Tuple{}, NamedTuple{(:theta_tilde, :mu, :tau, :theta), Tuple{Tuple{Dim{:draw, Colon}, Dim{:chain, Colon}, Dim{:school, Colon}}, Tuple{Dim{:draw, Colon}, Dim{:chain, Colon}},
... <snipped>

There is more info in the StanSample CI action logs on Github (Inferencedata dimarray tests).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants