DataFrame conversion from more complex posteriors fails in below MWE. Simple posteriors work. #37

goedman · 2022-11-21T22:56:27Z

I wonder if this is something I'm doing wrong or if it is a bug. All goes well until the last call to convert to a DataFrame. MWE:

using CSV, DataFrames, NamedTupleTools
using StanSample
using InferenceObjects

stan_schools = """
data {
    int<lower=0> J;
    real y[J];
    real<lower=0> sigma[J];
}

parameters {
    real mu;
    real<lower=0> tau;
    real theta_tilde[J];
}

transformed parameters {
    real theta[J];
    for (j in 1:J)
        theta[j] = mu + tau * theta_tilde[j];
}

model {
    mu ~ normal(0, 5);
    tau ~ cauchy(0, 5);
    theta_tilde ~ normal(0, 1);
    y ~ normal(theta, sigma);
}

generated quantities {
    vector[J] log_lik;
    vector[J] y_hat;
    for (j in 1:J) {
        log_lik[j] = normal_lpdf(y[j] | theta[j], sigma[j]);
        y_hat[j] = normal_rng(theta[j], sigma[j]);
    }
}
""";

data = Dict(
    "J" => 8,
    "y" => [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0],
    "sigma" => [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]
)

m_schools = SampleModel("eight_schools", stan_schools)
rc = stan_sample(m_schools; data, save_warmup=true)

if success(rc)

    idata = inferencedata(m_schools)

    nt = namedtuple(data)

    # Until InferenceObjects issue #36 is merged
    ntu=(sigma=nt.sigma, J=[4], y=nt.y)
    
    idata = merge(idata, from_namedtuple(; observed_data = ntu))

else
    @warn "Sampling failed."
end

if :observed_data in propertynames(idata)
    idata.observed_data
end
DataFrame(idata.observed_data)

keys(idata.posterior)
post_schools = read_samples(m_schools, :dataframe)

# This fails:
DataFrame(idata.posterior)

I don't think it is caused by the draw indices, warmup_posterior and posterior_predictive also fail. They all have set variables, e.g.
like theta.1, theta.2, etc.

The text was updated successfully, but these errors were encountered:

sethaxen · 2022-11-22T07:33:46Z

It seems this error only occurs if StanSample (I assume the master branch) is loaded:

julia> using DimensionalData, DataFrames

julia> mu = DimArray(randn(1000, 4), (Dim{:draw}(1:1000), Dim{:chain}(1:4)); name=:mu);

julia> DataFrame(mu);

julia> using StanSample

julia> DataFrame(mu);
ERROR: ArgumentError: Some dims were not found in object
Stacktrace:
 [1] _errorextradims()
   @ DimensionalData.Dimensions ~/.julia/packages/DimensionalData/K9D4P/src/Dimensions/primitives.jl:646
 [2] dimnum
   @ ~/.julia/packages/DimensionalData/K9D4P/src/Dimensions/primitives.jl:201 [inlined]
 [3] getcolumn
   @ ~/.julia/packages/DimensionalData/K9D4P/src/tables.jl:196 [inlined]
 [4] fromcolumns(x::DimTable{(:draw, :chain, :mu), DimStack{NamedTuple{(:mu,), Tuple{Matrix{Float64}}}, Tuple{Dim{:draw, DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Dim{:chain, DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}}, Tuple{}, NamedTuple{(:mu,), Tuple{Tuple{Dim{:draw, Colon}, Dim{:chain, Colon}}}}, DimensionalData.Dimensions.LookupArrays.NoMetadata, NamedTuple{(:mu,), Tuple{DimensionalData.Dimensions.LookupArrays.NoMetadata}}}, Tuple{DimensionalData.DimColumn{Int64, Dim{:draw, DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}}, DimensionalData.DimColumn{Int64, Dim{:chain, DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}}}, NamedTuple{(:mu,), Tuple{DimensionalData.DimArrayColumn{Float64, DimArray{Float64, 2, Tuple{Dim{:draw, DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Dim{:chain, DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}}, Tuple{}, Matrix{Float64}, Symbol, DimensionalData.Dimensions.LookupArrays.NoMetadata}, Tuple{Int64, Int64}, Tuple{Int64, Int64}, Int64}}}}, names::Vector{Symbol}; copycols::Nothing)
   @ DataFrames ~/.julia/packages/DataFrames/KKiZW/src/other/tables.jl:36
 [5] DataFrame(x::DimArray{Float64, 2, Tuple{Dim{:draw, DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Dim{:chain, DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}}, Tuple{}, Matrix{Float64}, Symbol, DimensionalData.Dimensions.LookupArrays.NoMetadata}; copycols::Nothing)
   @ DataFrames ~/.julia/packages/DataFrames/KKiZW/src/other/tables.jl:59
 [6] DataFrame(x::DimArray{Float64, 2, Tuple{Dim{:draw, DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Dim{:chain, DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}}, Tuple{}, Matrix{Float64}, Symbol, DimensionalData.Dimensions.LookupArrays.NoMetadata})
   @ DataFrames ~/.julia/packages/DataFrames/KKiZW/src/other/tables.jl:48
 [7] top-level scope
   @ REPL[7]:1

(jl_vCqw12) pkg> st
Status `/tmp/jl_vCqw12/Project.toml`
  [a93c6f00] DataFrames v1.4.3
  [0703355e] DimensionalData v0.23.0
  [c1514b29] StanSample v6.12.0 `https://github.com/StanJulia/StanSample.jl.git#master`

I suspect the issue is caused by this line: https://github.com/StanJulia/StanSample.jl/blob/209a5bf4d0a6e15a316b46407c194a808389c506/src/utils/dimarray.jl#L7 . If we simply use a different chain key, all is well:

julia> mu2 = DimArray(randn(1000, 4), (Dim{:draw}(1:1000), Dim{:foo}(1:4)); name=:mu);

julia> DataFrame(mu2);

I suspect the issue is that by declaring these dimensions with @dim, :chain as a key is now reserved for your dimension type. This will actually interfere with InferenceObjects, where we've explicitly avoided declaring such a type to prevent problems. Note if we just specify the dimension name chain, StanSamples's dimension type is now used instead of ours.

julia> mu = DimArray(randn(1000, 4), (draw=1:1000, chain=1:4); name=:mu)
1000×4 DimArray{Float64,2} mu with dimensions: 
  Dim{:draw} Sampled{Int64} 1:1000 ForwardOrdered Regular Points,
  chain Sampled{Int64} 1:4 ForwardOrdered Regular Points
        1          2          3          4
    1   0.783298   0.0830577  0.597823   1.02076
    2   0.353957  -0.207669   1.06499    0.466934
    ⋮                                   
  998   0.590198  -1.0169     0.660785  -0.0630021
  999  -0.44761   -2.01247    1.34981    1.58058
 1000  -0.273417  -0.769404   0.140136   0.174886

Since the original error occurs independently of InferenceObjects, I'd suggest opening an issue on DimensionalData. But I'd also suggest reconsidering whether you really need to define these dimensions using @dim.

goedman · 2022-11-22T14:18:34Z

Hi Seth, thanks a lot for looking into this. Removing the original :dimarray option in StanSample solves the issue.

As Inference[Object/Data] is a much better solution I'm leaning towards dropping dimarray altogether. I introduced it primarily because I liked the display of :dimarray based chains but that is even better supported by InferenceData.

I'll think about it a bit more over the next few days, make a decision and close this issue then.

Thanks again for your help!

sethaxen · 2022-11-23T14:15:08Z

Hi @goedman I gave this a little more thought, and actually, I think this is an InferenceObjects issue after all. DimensionalData itself has the special dims X, Y, and Ti, and if our users use any of those as symbolic dim names, then we end up pulling those into our code, which is not great:

julia> DimArray(randn(2, 3), (X=1:2, Y=1:3); name=:x)
2×3 DimArray{Float64,2} x with dimensions: 
  X Sampled{Int64} 1:2 ForwardOrdered Regular Points,
  Y Sampled{Int64} 1:3 ForwardOrdered Regular Points
     1        2         3
 1  -1.40336  0.254703  0.598698
 2   1.61594  0.220872  1.39326

I think the problem is this line of code:

InferenceObjects.jl/src/dimensions.jl

Line 60 in 84a5aa8

dim_names = Dimensions.name(Dimensions.basedims((dims..., default_dims...)))

basedims is not part of DimensionalData's API, and if it is passed a Symbol, it will assign your specialized chain type if defined. It's a fairly easy fix to not use basedims here, and then other packages won't be able to break our code like this in the future.

As Inference[Object/Data] is a much better solution I'm leaning towards dropping dimarray altogether. I introduced it primarily because I liked the display of :dimarray based chains but that is even better supported by InferenceData.

I'll think about it a bit more over the next few days, make a decision and close this issue then.

No problem! One thing your :dimarray approach has that we are currently missing is the ability to reshape and stack all data into a single multidimensional array with dims (param, draw, chain). We need this functionality also for using MCMCDiagnostics with Datasets, but it's also important for constructing any marginal plots using the Tables interface, so I'm working on a utility function for our API to do just that given a Dataset (#5). Then, yes, I don't see any advantages to having a :dimarray specific output format.

goedman · 2022-11-24T17:33:47Z

Hi @sethaxen , StanSample.jl v6.13.0 for now has disabled :dimarray[s] as an option in read_samples(). I left the code mostly in there so it would be simple to enable it again. If/once you've made your changes I'll give it another try.

As I tend to use DataFrames for further work with posterior values, including stacked values, I'll see what it takes to expand what is currently in ./util/dataframes.jl for that purpose.

Looking at the 8_schools Pluto notebook (e.g. in Stan.jl v9.10.1, ./Examples_Notebooks/inferencedata.jl) the inferencedata based posterior_schools has 256000 rows and post_schools (simple read_samples(..., :dataframe) has 8000 rows.

sethaxen · 2022-12-08T15:09:15Z

Looking at the 8_schools Pluto notebook (e.g. in Stan.jl v9.10.1, ./Examples_Notebooks/inferencedata.jl) the inferencedata based posterior_schools has 256000 rows and post_schools (simple read_samples(..., :dataframe) has 8000 rows.

This makes sense. There's more than one way one might flatten multidimensional arrays into a tabular structure. The "widest" possible way is to make each marginal parameter its own column, so one would have nchains*ndraws rows. This is what I think your read_samples(..., :dataframe) code does. In this case, each element of theta would have its own column.

The "tallest" possible way effectively vcats all parameters into a single array as a var column. The default tabular form of Dataset is a compromise between these two. Each parameter (e.g. theta or tau) gets its own column, but all elements of theta are included in that column. The dimensions of theta are in their own columns. Note this stacking will cause some values to be repeated (think of it like Iterators.product applied to the indices of all unique dimensions). If the user specifies that some parameters share a dimension, then less repetition is performed, so there will be fewer rows. e.g.

julia> DataFrame(inferencedata(m_schools).posterior) |> size
(256000, 8)

julia> DataFrame(inferencedata(m_schools; dims=(theta=[:school], theta_tilde=[:school]).posterior) |> size
32000

Since each of these Tables has its uses for downstream analysis and plotting, the end goal is to have convenience functions so the user can convert between them easily.

As I tend to use DataFrames for further work with posterior values, including stacked values

What kinds of downstream analyses do you do with the dataframes? I'm wondering how we should document this behavior.

ahartikainen · 2022-12-08T15:31:29Z

Just a small comment, do you use dataframes because the Stan output is given in a wide-format table?

In python we have had this wide vs long -format discussion before and that is why having nD structure (Dataset) is better.

Btw I think R tools (tidy-bayes?) use long-format.

goedman · 2022-12-08T18:59:55Z

Typically I've been trying "more advanced" formats (such as DimensionalData and AxisKeys) and I certainly like aspects of it. But they also come with a learning (and re-learning) curve. As an end-user I (personally) always seem to gravitate back to DataFrames for my own use (Statistical Rethinking and Regression and Other Stories related projects).

For Stan.jl this is a bit different. If possible I like to support important new work such as C++ threads, BridgeStan, ODE improvements, InferenceObjects and PosteriorDB. As an additional benefit, InferenceObjects show/display works very well in Pluto so I'm quite motivated to support conversion to that format.

For above mentioned projects plotting and summarizing are how I use chain-stacked, wide-format DataFrames. Seth's example above:

julia> DataFrame(inferencedata(m_schools; dims=(theta=[:school], theta_tilde=[:school])).posterior)

is a strong argument to learn this "dims" DSL!

goedman · 2023-02-18T00:45:19Z

Thanks Seth!

I've had some issues installing the last 2 version of InferenceObjects.j:

(@v1.10) pkg> add InferenceObjects@0.3.4
   Resolving package versions...
ERROR: Unsatisfiable requirements detected for package InferenceObjects [b5cf5a8d]:
 InferenceObjects [b5cf5a8d] log:
 ├─possible versions are: 0.1.0-0.3.2 or uninstalled
 └─restricted to versions 0.3.4 by an explicit requirement — no versions left

(@v1.10) pkg> add InferenceObjects@0.3.3
   Resolving package versions...
ERROR: Unsatisfiable requirements detected for package InferenceObjects [b5cf5a8d]:
 InferenceObjects [b5cf5a8d] log:
 ├─possible versions are: 0.1.0-0.3.2 or uninstalled
 └─restricted to versions 0.3.3 by an explicit requirement — no versions left

This is in a pretty empty environment:

julia> versioninfo()
Julia Version 1.10.0-DEV.632
Commit cbbfc68f81 (2023-02-17 10:33 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.3.0)
  CPU: 8 × Apple M2
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 4 on 4 virtual cores
Environment:
  LD_LIBRARY_PATH = /Users/rob/local/lib
  JULIA_NUM_THREADS = 4
  JULIA_EDITOR = subl
  JULIA_SVG_BROWSER = Google Chrome.app
  JULIA_SPECIALFUNCTIONS_BUILD_SOURCE = true
  JULIA_MPI_PATH = /usr/local/Cellar/open-mpi/4.0.2
  JULIA_PKG3_PRECOMPILE = true
  JULIA_PKG_SERVER = pkg.julialang.org
  JULIA_SR_HOME = /Users/rob/.julia/dev/StatisticalRethinking/src
  JULIA_ROS_HOME = /Users/rob/Projects/R/ROS-Examples
  JULIA_MIXTAPE_HOME = /Users/rob/Projects/R/mixtape
  JULIA_WOOLDRIDGE_HOME = /Users/rob/Projects/R/wooldridge

(@v1.10) pkg> st
Status `~/.julia/environments/v1.10/Project.toml`
  [634d3b9d] DrWatson v2.12.1
  [d254efa0] PkgSkeleton v1.0.0
  [c3e4b0f8] Pluto v0.19.22
  [295af30f] Revise v3.5.1
  [44cfe95a] Pkg v1.10.0

Same on J1.8 and J1.9-beta4.

InferenceObjects v0.3.4 is pretty fresh, sometimes (not very often though) it takes a bit longer before it is visible, but v0.3.3 should certainly visible.

Will keep on checking.

sethaxen · 2023-02-18T00:48:54Z

That's really strange, I have no problem installing on v1.9 and v1.9-beta4. I wonder if there's a way to manually trigger your local copy of the registry to update.

goedman · 2023-02-18T13:57:27Z

Ok, this morning no problem upgrading to v0.3.4 anymore! I've occasionally seen these kinds of delays.

goedman · 2023-02-18T17:28:39Z

In this PR I am trying to revert back to have both InferenceObjects (v0.3.4) and DimensionalData(v0.28.2) as extensions to StanSample.jl.

The updates still create an issue::

Testing: test_inferencedata/test_inferencedata.jl.
[ Info: /var/folders/pf/2m__rnm54153mj3198b5xxn00000gn/T/jl_g2j5fD/eight_schools.stan updated.
InferenceData interface: Error During Test at /Users/rob/.julia/dev/StanSample/test/runtests.jl:89
  Got exception outside of a @test
  LoadError: ArgumentError: Some dims were not found in object
  Stacktrace:
    [1] _errorextradims()
      @ DimensionalData.Dimensions ~/.julia/packages/DimensionalData/6v7CY/src/Dimensions/primitives.jl:649
    [2] dimnum
      @ ~/.julia/packages/DimensionalData/6v7CY/src/Dimensions/primitives.jl:201 [inlined]
    [3] getcolumn
      @ ~/.julia/packages/DimensionalData/6v7CY/src/tables.jl:196 [inlined]
    [4] fromcolumns(x::DimTable{(:draw, :chain, :school, :theta_tilde, :mu, :tau, :theta), Dataset{NamedTuple{(:theta_tilde, :mu, :tau, :theta), Tuple{Array{Float64, 3}, Matrix{Float64}, Matrix{Float64}, Array{Float64, 3}}}, DimStack{NamedTuple{(:theta_tilde, :mu, :tau, :theta), Tuple{Array{Float64, 3}, Matrix{Float64}, Matrix{Float64}, Array{Float64, 3}}}, Tuple{Dim{:draw, DimensionalData.Dimensions.LookupArrays.NoLookup{UnitRange{Int64}}}, Dim{:chain, DimensionalData.Dimensions.LookupArrays.NoLookup{Base.OneTo{Int64}}}, Dim{:school, DimensionalData.Dimensions.LookupArrays.NoLookup{Base.OneTo{Int64}}}}, Tuple{}, NamedTuple{(:theta_tilde, :mu, :tau, :theta), Tuple{Tuple{Dim{:draw, Colon}, Dim{:chain, Colon}, Dim{:school, Colon}}, Tuple{Dim{:draw, Colon}, Dim{:chain, Colon}},
... <snipped>

There is more info in the StanSample CI action logs on Github (Inferencedata dimarray tests).

sethaxen added the bug Something isn't working label Feb 17, 2023

sethaxen mentioned this issue Feb 17, 2023

Avoid user-defined dimensions being injected into our code #47

Merged

sethaxen closed this as completed in #47 Feb 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DataFrame conversion from more complex posteriors fails in below MWE. Simple posteriors work. #37

DataFrame conversion from more complex posteriors fails in below MWE. Simple posteriors work. #37

goedman commented Nov 21, 2022

sethaxen commented Nov 22, 2022

goedman commented Nov 22, 2022

sethaxen commented Nov 23, 2022

goedman commented Nov 24, 2022

sethaxen commented Dec 8, 2022

ahartikainen commented Dec 8, 2022

goedman commented Dec 8, 2022 •

edited

Loading

goedman commented Feb 18, 2023

sethaxen commented Feb 18, 2023

goedman commented Feb 18, 2023

goedman commented Feb 18, 2023 •

edited

Loading

DataFrame conversion from more complex posteriors fails in below MWE. Simple posteriors work. #37

DataFrame conversion from more complex posteriors fails in below MWE. Simple posteriors work. #37

Comments

goedman commented Nov 21, 2022

sethaxen commented Nov 22, 2022

goedman commented Nov 22, 2022

sethaxen commented Nov 23, 2022

goedman commented Nov 24, 2022

sethaxen commented Dec 8, 2022

ahartikainen commented Dec 8, 2022

goedman commented Dec 8, 2022 • edited Loading

goedman commented Feb 18, 2023

sethaxen commented Feb 18, 2023

goedman commented Feb 18, 2023

goedman commented Feb 18, 2023 • edited Loading

goedman commented Dec 8, 2022 •

edited

Loading

goedman commented Feb 18, 2023 •

edited

Loading