Switch to absolute path #100
Conversation
Thank you @Byrth for reporting this and preparing the PR. I'll study it over the next few days to make sure I understand the problem. I rarely work with CmdStan anymore and have switched to StanSample, but I would like to see if this applies there as well. Recently I removed pmap from StanSample, as pmap can create problems on clusters, but the management of file paths is slightly different there. It would definitely be surprising (to me) if cd() works across threads, but your case is pretty strong!
A few questions. Are you just trying to get it to run using the threads model? I can definitely reproduce the error below:

Or is your application more targeting a few models, each with different observational data?

Rob
Just to make sure we're testing the same MWE, I see the following error:
using the following MWE:
This fails even with
Without your PR, on my system
works fine. I pass in the correct dir in the call to stan().
At this point, unfortunately, I am not really convinced about this approach. Duplicating Stanmodels and updating some of the paths might not fit well with how CmdStan.jl was originally designed (~8 years ago). I've merged your changes in a branch

Of course, I'm not quite sure what you are trying to achieve, what your models look like, or how you run julia. I believe pmap within pmap is currently not safe in Julia 1.5; I had to revert that in StanSample.jl if the total number of threads (jobs * chains) exceeded the number of cores/processors.

I do agree that maybe a redesign of the Julia

It is interesting though that over the last year or so Stan has officially released
Sorry for the delay. Use case: I am running CmdStan as part of a modeling process that measures out-of-sample predictive accuracy for multiple modeling strategies. To this end, I partition data and fit multiple copies of the same model, which is how I ended up using threads. An example of the cd() behavior:

```julia
println("hither : ", Threads.threadid(), " : ", pwd())                                             # hither : 1 : /tmp
Threads.@spawn cd(() -> (println("tither : ", Threads.threadid(), " : ", pwd()); sleep(3)), "..")  # tither : 2 : /
sleep(1); println("hither : ", Threads.threadid(), " : ", pwd()); sleep(3)                         # hither : 1 : /
println("hither : ", Threads.threadid(), " : ", pwd())                                             # hither : 1 : /tmp
```

So a cd() issued from one task changes the working directory for every thread in the process: the main thread sees / while the spawned task is inside its cd(f, "..") block, and only sees /tmp again once that block finishes and cd restores the original directory.
Using an adaptation of your example and JULIA_NUM_THREADS of 8, I get errors:

```julia
using CmdStan #, Distributed
using StatsBase: sample

ProjDir = "/root"

bernoullimodel = "
data {
  int<lower=1> N;
  int<lower=0,upper=1> y[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  theta ~ beta(1,1);
  y ~ bernoulli(theta);
}
";

n = 10;
observeddata = Dict("N" => n, "y" => sample([0,1], n))
sm = Stanmodel(name="bernoulli", model=bernoullimodel, output_format=:namedtuple);

println("\nThreads loop\n")

p1 = 15 # p1 is the number of models to fit
estimates = Vector(undef, p1)

Threads.@threads for i in 1:p1
    pdir = pwd()
    while ispath(pdir)
        pdir = tempname()
    end
    new_model = deepcopy(sm)
    new_model.pdir = pdir
    new_model.tmpdir = joinpath(splitpath(pdir)..., "tmp")
    mkpath(new_model.tmpdir)
    CmdStan.update_model_file(joinpath(new_model.tmpdir, "$(new_model.name).stan"), strip(new_model.model))
    rc, samples, cnames = stan(new_model, observeddata, new_model.pdir);
    if rc == 0
        estimates[i] = samples
    end
    rm(pdir; force=true, recursive=true)
end
```
I anticipated needing to add more data to see errors, which is why I included StatsBase, but it ended up being irrelevant (I hope). With the test_threads branch, it passes.
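One way to avoid even the small race in the tempname() loop above (two tasks could draw the same candidate path before either creates it) is Base's mktempdir(), which creates a unique directory atomically. A rough, untested sketch of that variant of the loop:

```julia
# Rough, untested sketch: mktempdir() both names and creates a fresh directory,
# so two tasks can never end up sharing pdir.
Threads.@threads for i in 1:p1
    pdir = mktempdir()                        # unique, already-created directory
    new_model = deepcopy(sm)
    new_model.pdir = pdir
    new_model.tmpdir = joinpath(pdir, "tmp")
    mkpath(new_model.tmpdir)
    CmdStan.update_model_file(joinpath(new_model.tmpdir, "$(new_model.name).stan"), strip(new_model.model))
    rc, samples, cnames = stan(new_model, observeddata, new_model.pdir)
    rc == 0 && (estimates[i] = samples)
    rm(pdir; force=true, recursive=true)
end
```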
Hmmm, on MacOS (9 cores) I see very different behavior using the test_threads branch. I see 3 problems:

-------- 1 ----------

With p1 == 1 all is fine. The first problem shows with p1 == 2; very often I get:

Parse errors like that I've often seen when one thread is reading and another is writing to the same file.

-------- 2 ----------

The second problem is that I (mostly) need to disable summary generation by cmdstan in the call to stan():

I think this is probably because the stansummary pipeline also needs fixed paths; likely a fourth point where cd-ing is happening. If I do happen to get the stansummary, the results look identical.

-------- 3 ----------

The third problem: in the few cases it works with very low p1 values, the samples.theta results are identical:

With Threads.nthreads() == 1 all works fine (sequential, of course). Running

In the end I started to play around with separating stanc compilation and sampling, to no avail:

If branch test_threads works on your machine, that is a possibility. I would adapt
I think that I accidentally deleted my branch or something when it was merged (darned cleaning habits). Those are all errors that I see without my changes. Perhaps it reverted to #master for you? For me it kept trying to use my Julia cache because the branch didn't exist anymore or something.

--1-- is caused by two processes logging to the same files, as you indicate.
--2-- is caused by the summary stuff, but I am fairly sure I got them all.
--3-- is caused by the directory switching putting two threads into the same directory before they start reading, which results in them reading the same samples. It should be the samples of the last-spawned process.

These are all cd collisions in the sampling cd. I can take a look at the example later tonight.
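As an aside, a blunt stopgap for that last collision (a sketch of a generic workaround, not something from this PR) would be to serialize every cd-and-read section behind a lock, at the cost of losing parallelism while files are read:

```julia
# Sketch of a stopgap, not the PR's approach: only one task may be "inside" a
# cd at a time, so no task can observe another task's working directory.
const CD_LOCK = ReentrantLock()

function read_in_dir(f, dir)
    lock(CD_LOCK) do
        cd(f, dir)    # cd(f, dir) restores the previous directory when f returns
    end
end
```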
Yes, when you merge, Github closes the PR. You'll find it under closed PRs. Or you can select the test_thread branch of CmdStan.jl on Github. I pretty much had the same thought, but I compared the test_thread branch I'm using with the PR. I'll check it again!
BREAKTHROUGH! I might have missed the
Lovely! I spent a solid hour this morning in a similar situation debugging #master. Not sure what happened.
Every time a question such as this comes up I feel like I'm chasing my tail. But you always learn something new.

On my system, the current test_threads branch on Github works fine with

One file that seems to pop up (or not?) at random spots is

Somehow, when it fails on reruns, the first tmpdir created never seems to contain data files etc. I can't pin it down; vaguely it reminds me of the problems I saw when Julia 1.3 was released with the new thread model. But I test on 1.5 and 1.6-DEV.

On your system, can you run either of these scripts multiple times without a Julia restart?
Running

I don't understand why that would be, but can look around for

It occurs to me that the only thing from the

I am doing this in a CentOS7-based docker container.
Yes, today I did spend several hours again testing and testing. The current version of threads_02.jl sometimes behaves ok, but not reliably so. I also think that in the test_threads branch models are always compiled. I think the choices I made about when to compile, sample, and read the .csv files etc. in CmdStan.jl were so-so. StanSample.jl behaves much better (the equivalent setup is in
Tonight I tried more and was able to replicate your finding (unreliable completion, more likely on iteration 2+ than on the first run). I found one thing that I overlooked initially, which is

However, even after changing that and scoping down the

which references this line in [5]:

```julia
isfile("$(path_prefix)_make.log") && rm("$(path_prefix)_make.log")
```
I am beginning to believe that this is just a limitation of Julia's filesystem integration using libuv. Even if I start the fit from CMDSTAN_HOME so there is no explicit directory movement at all, I still got the above error. libuv is generally not thread safe, but the Julia team seems to have overcome that at least to the extent that they avoid deadlocks. Perhaps they stopped short of making sure the behavior is consistent, though.
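Since the failing pair is an isfile check followed by an rm, one small hardening (my guess; I haven't verified it against this exact failure) is to let rm tolerate a missing file instead of checking first:

```julia
# rm(...; force=true) is a no-op when the file is already gone, so there is no
# window between the existence check and the removal for another task to race into.
rm("$(path_prefix)_make.log"; force=true)
```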
Ok, I cannot replicate this with base Julia, so I retract my accusation but still have no explanation for the above error. I replaced the

Then I replaced the

```julia
absolute_tempdir_path = if model.tmpdir[1] == '/'
    model.tmpdir
else
    joinpath(splitpath(pwd())..., splitpath(model.tmpdir)...)
end
```

No errors since then with >10
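For what it's worth, Base's abspath does the same normalization (an already-absolute path is returned untouched; a relative one is resolved against pwd()), so an equivalent one-liner would be the sketch below. Whether it drops cleanly into CmdStan.jl's internals is an assumption on my part:

```julia
# abspath(path) == path when path is already absolute; otherwise it is
# resolved against the current working directory, matching the branch above.
absolute_tempdir_path = abspath(model.tmpdir)
```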
But, unless this exercise is just for learning (always useful), your problem, if I understand it correctly, is not difficult to program without threads. Either the cluster example or switching to StanSample would work. From my point of view, CmdStan.jl is slowly falling behind; the CmdStan cluster example and projects don't integrate smoothly.
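A rough, untested sketch of the process-based route, reusing bernoullimodel and observeddata from the MWE above and assuming cmdstan is installed and CMDSTAN is set for every worker. Because each Distributed worker is its own OS process, a cd() on one worker cannot affect the others:

```julia
using Distributed
addprocs(4)                               # one worker per concurrent fit; pick to taste
@everywhere using CmdStan

@everywhere function fit_once(model_str, data)
    # Each worker is a separate OS process, so this cd() is isolated from the others.
    cd(mktempdir()) do
        sm = Stanmodel(name="bernoulli", model=model_str, output_format=:namedtuple)
        rc, samples, cnames = stan(sm, data, pwd())
        return rc == 0 ? samples : missing
    end
end

estimates = pmap(_ -> fit_once(bernoullimodel, observeddata), 1:15)
```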
Yes. I believe this might be more trouble than it is worth, particularly with CmdStan.jl on the path to deprecation. I am currently in a bit of a time crunch in life, but I will try to switch to the rest of the StanJulia ecosystem when I can make time for it, and until then I just won't do multithreaded model fits. Thanks for entertaining this idea!
Indeed, your last proposed change for

Over the next few weeks I'll look into what's happening with

Until then I'll leave the
I have had a rather pernicious problem with CmdStan that I was unable to consistently replicate. I've known it has something to do with threading, but was unable to figure out how it was thread-unsafe. When I was testing stuff today, I hit upon a way to replicate it and think I figured out the problem.

`cd()` changes the path for all threads, which means that running multiple CmdStan.jl models on different threads within a process results in somewhat random behavior, it seems.

The particular behavior I have is: with the current 6.0.8, this errors some of the time (p1 > 5 or so) when fitting an identical model, because it is `cd()`-ing around and ends up attempting to read files using a relative path that isn't accurate because another thread moved to its own directory.

This PR switches to using an absolute path. Line 99 of `stancode.jl` now grabs the absolute path of the passed `model.tmpdir` and uses it to make a good portion of the code (other than that line and the `make` section) working-directory independent. It isn't a full fix, really, but I haven't been able to make the error happen with these changes. I don't think a full fix would be possible until CmdStan itself becomes working-directory independent.

The long and the short of it is that I added another `_file` field to Stanmodel that is tmpdir + name, and it is now referenced in most of the places where the code used to use `name`.

Tests pass locally (on Linux).
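A hypothetical sketch of that idea (illustrative only; the file name below is made up and this is not the actual diff): resolve the temp directory to an absolute path once, then build every file reference from it instead of cd()-ing into it.

```julia
# Illustrative only: the sample-file name is a placeholder, not CmdStan.jl's
# actual naming scheme.
tmpdir = abspath(model.tmpdir)                        # resolve once, up front
sample_file = joinpath(tmpdir, "$(model.name)_samples_1.csv")
if isfile(sample_file)                                # no cd() needed; the path is absolute
    lines = readlines(sample_file)
end
```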