
Feature/1180 pathfinder single path csvs #1184

Merged
merged 20 commits into develop from feature/1180-pathfinder-single-path-csvs
Aug 21, 2023

Conversation

mitzimorris
Member

Submission Checklist

  • Run tests: ./runCmdStanTests.py src/test
  • Declare copyright holder and open-source license: see below

Summary:

Added flag save_single_paths to method pathfinder, per discussion in #1180

Intended Effect:

Better user experience.

How to Verify:

Unit tests

Side Effects:

N/A

Documentation:

see PR stan-dev/docs#651

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Columbia University

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

@mitzimorris requested a review from WardBrian on August 17, 2023 20:28
@WardBrian
Member

Do we feel like this should interact with the diagnostic files at all? I know that at the moment there are no multi-path-pathfinder-specific diagnostics, so it might be odd to require the save_single_paths argument to get any diagnostics out.

@mitzimorris
Member Author

it might be odd to require the save_single_paths argument to get any diagnostics out

but this isn't required - save_single_paths outputs CSV draws from the individual pathfinders.
the diagnostics files report on the Pathfinder trajectory via LBFGS.

@WardBrian
Member

I understand the current behavior; I'm asking whether we think it should have an impact.

Imagine one day we actually use the argument for multi-pathfinder to have diagnostics (independent of the single path diagnostics). Then it would seem natural that this argument also controlled whether the single-path diagnostics were saved.

@mitzimorris
Member Author

I see - so if you want the diagnostics file then you'd probably want the single-path pathfinders as well?
which means that we don't need "save_single_paths" - diagnostic_file arg controls everything?
and then you'd have both .json and .csv files which have the same basename?

very elegant!

@WardBrian
Member

I was thinking of more or less the flip of that - what should this argument do if there are multiple kinds of diagnostics (like there are multiple kinds of output CSVs at the moment)?

@mitzimorris
Member Author

add more arguments to the pathfinder method that are fine-grained enough to handle this, and forget about the output arg diagnostic_file?

@WardBrian
Member

I think I was unclear.

This PR was motivated because the current “output” argument controls both the output of the PSIS draws and the individual paths. My point is just that one day, if we actually use the diagnostic file argument to the multi-path algorithm, we will have the same thing with the “diagnostic file” argument - some for the PSIS, some for single paths. Whether we care is another question, and the answer can be “no”

@mitzimorris
Member Author

mitzimorris commented Aug 18, 2023

if I understand correctly, the proposal is this:

  • diagnostic_file - is an output sub-argument which is reserved for multi-path Pathfinder.
  • "save_single_paths" - for method pathfinder is a boolean arg which, if true, will create both the CSV file of draws and the json file of ELBO evaluations.

and for the user docs, we should explain when save_single_paths is useful - I think this is intended as a tool to debug Pathfinder behavior on challenging or ill-specified models.
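
For concreteness, that would look something like this (the model and data file names are just placeholders):

# multi-path run; the output sub-arg diagnostic_file stays reserved for multi-path Pathfinder
./mymodel pathfinder data file=mymodel.data.json \
    output file=output.csv diagnostic_file=diag.json

# save_single_paths=1 additionally writes per-path draws (CSV) and ELBO evaluations (JSON)
./mymodel pathfinder save_single_paths=1 data file=mymodel.data.json \
    output file=output.csv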

@WardBrian
Member

Currently, this PR implements:

|                     | output=foo                                                                      | diagnostic_file=bar                                                                    |
|---------------------|---------------------------------------------------------------------------------|----------------------------------------------------------------------------------------|
| save_single_paths=0 | One file, foo.csv, containing PSIS draws                                        | Several bar_N.json files, one for each path, containing single-pathfinder diagnostics  |
| save_single_paths=1 | The same foo.csv file, plus individual foo_path_N.csv files, one for each path  | Several bar_N.json files, one for each path, containing single-pathfinder diagnostics  |

My question was, should the matrix look instead like:

|                     | output=foo                                                                      | diagnostic_file=bar                                                                                              |
|---------------------|---------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
| save_single_paths=0 | One file, foo.csv, containing PSIS draws                                        | One file, bar.json, containing the diagnostics for multi-path pathfinder (currently, this is completely unused)   |
| save_single_paths=1 | The same foo.csv file, plus individual foo_path_N.csv files, one for each path  | The same bar.json file, plus individual bar_path_N.json files containing the diagnostics for each single path     |

As noted, there are currently no diagnostics for multi-path besides each individual path's diagnostics, but if we ever think there would be then maybe the second table makes more sense. Of course, if someone is asking for diagnostics, maybe it's fine just to give them more than they asked for. It's a much less common use case than output, at any rate

@mitzimorris
Member Author

two things:

  • creating an empty multi-path pathfinder diagnostics file is confusing
  • having to specify a diagnostic file basename could be skipped in favor of adding distinguishing tags to the output file basename.

@WardBrian mentioned this pull request Aug 18, 2023
@mitzimorris
Member Author

mitzimorris commented Aug 19, 2023

per discussion w/ @WardBrian and @SteveBronder:

  • output sub-arg diagnostic_file will be used for outputs from multi-path Pathfinder. Currently, a dummy JSON writer is passed in to the service method call.
  • pathfinder boolean arg save_single_paths, when true, will save both the single-path sample and the single-path diagnostics, which report on the ELBO evaluations. These file names will be constructed from the output_file basename, plus _path plus the path id.

Question: what should the single-path pathfinder output file be named if num_paths=1? What about num_paths=1 and save_single_paths=1? Choices for output filenames:

  • a. basename_path_1.csv and basename_path_1.json
  • b. basename.csv and basename.json

really, no preference here - option a) is perfectly reasonable and it's OK if non-default options result in complicated names. opinions? @bob-carpenter ?

@WardBrian
Member

I think if num paths is 1 it should just be “output.csv”, and the diagnostic should probably come from the diagnostic file argument

@mitzimorris
Member Author

how about this:

if num_paths=1 and save_single_paths=1 and arg diagnostic_file is unspecified, then single-path diagnostics are in file output.json.

if num_paths=1 and diagnostic_file="foo.json", then single-path diagnostics are in file foo.json.
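
In other words (model / data file names are placeholders):

# num_paths=1, save_single_paths=1, no diagnostic_file:
#   draws in output.csv, single-path ELBO diagnostics in output.json
./mymodel pathfinder num_paths=1 save_single_paths=1 data file=mymodel.data.json \
    output file=output.csv

# num_paths=1 with an explicit diagnostic_file:
#   draws in output.csv, single-path diagnostics in foo.json
./mymodel pathfinder num_paths=1 data file=mymodel.data.json \
    output file=output.csv diagnostic_file=foo.json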

@mitzimorris
Member Author

mitzimorris commented Aug 20, 2023

this is ready for re-review.

these are the relevant combinations of arguments / output files (a full example invocation follows the list):

  • mymodel pathfinder - single output file output.csv which contains PSIS draws over 4 single-path pathfinders

  • mymodel pathfinder num_paths=1 - single output file output.csv contains single-path pathfinder draws

  • mymodel pathfinder save_single_paths=1 - 9 output files:
    - output.csv, as above
    - output_path_1.csv ... output_path_4.csv - single-path pathfinder draws
    - output_path_1.json ... output_path_4.json - single-path ELBO iterations

  • mymodel pathfinder num_paths=1 save_single_paths=1 - 2 output files: output.csv, which contains single-path pathfinder draws, and output.json, which contains info from each ELBO iteration.

  • mymodel pathfinder num_paths=1 save_single_paths=1 output diagnostic_file="diag" - 2 output files: output.csv and diag.json.
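
For example, the default multi-path run with save_single_paths=1 would be invoked roughly as (model / data file names are placeholders):

./mymodel pathfinder save_single_paths=1 data file=mymodel.data.json \
    output file=output.csv
# -> output.csv (PSIS draws), output_path_1.csv ... output_path_4.csv (per-path draws),
#    output_path_1.json ... output_path_4.json (per-path ELBO iterations)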

@mitzimorris
Member Author

CI failed on the upstream/(downstream?) performance tests - one of the benchmark gold files is missing?
@WardBrian @serban-nicusor-toptal


Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: gp_exp_quad_cov: sigma is 0, but must be positive! (in '../stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan', line 14, column 4 to line 15, column 59)
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Iteration:  100 / 2000 [  5%]  (Warmup)
Iteration:  200 / 2000 [ 10%]  (Warmup)
Iteration:  300 / 2000 [ 15%]  (Warmup)
Iteration:  400 / 2000 [ 20%]  (Warmup)
Iteration:  500 / 2000 [ 25%]  (Warmup)
Iteration:  600 / 2000 [ 30%]  (Warmup)
Iteration:  700 / 2000 [ 35%]  (Warmup)
Iteration:  800 / 2000 [ 40%]  (Warmup)
Iteration:  900 / 2000 [ 45%]  (Warmup)
Iteration: 1000 / 2000 [ 50%]  (Warmup)
Iteration: 1001 / 2000 [ 50%]  (Sampling)
Iteration: 1100 / 2000 [ 55%]  (Sampling)
Iteration: 1200 / 2000 [ 60%]  (Sampling)
Iteration: 1300 / 2000 [ 65%]  (Sampling)
Iteration: 1400 / 2000 [ 70%]  (Sampling)
Iteration: 1500 / 2000 [ 75%]  (Sampling)
Iteration: 1600 / 2000 [ 80%]  (Sampling)
Iteration: 1700 / 2000 [ 85%]  (Sampling)
Iteration: 1800 / 2000 [ 90%]  (Sampling)
Iteration: 1900 / 2000 [ 95%]  (Sampling)
Iteration: 2000 / 2000 [100%]  (Sampling)

 Elapsed Time: 1.795 seconds (Warm-up)
               2.345 seconds (Sampling)
               4.14 seconds (Total)

Traceback (most recent call last):
  File "./runPerformanceTests.py", line 424, in <module>
    results = list(results)
  File "./runPerformanceTests.py", line 345, in process_test_wrapper
    time_, (fails, errors) = run(exe, data, overwrite, check_golds,
  File "./runPerformanceTests.py", line 275, in run
    summary = csv_summary(tmp)
  File "./runPerformanceTests.py", line 163, in csv_summary
    with open(csv_file, 'r', encoding = 'utf-8') as raw:
FileNotFoundError: [Errno 2] No such file or directory: 'golds/stat_comp_benchmarks_benchmarks_gp_pois_regr_gp_pois_regr.gold.tmp'
script returned exit code 1

@WardBrian
Member

It’s because the performance tests request output files that end in .tmp:
https://github.com/stan-dev/performance-tests-cmdstan/blob/ffff830fbd6d5e09dbee67dad336e957e9e19b2a/runPerformanceTests.py#L268

But the latest changes here (to make_filenames specifically) always overwrite that to be .csv
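
Roughly, the failing sequence is something like this (invocation reconstructed from the traceback above, so treat the details as approximate):

# the harness runs the benchmark and requests a gold output file ending in .tmp
# (the data file name here is a guess; the output path is taken from the traceback)
./gp_pois_regr sample data file=gp_pois_regr.data.R \
    output file=golds/stat_comp_benchmarks_benchmarks_gp_pois_regr_gp_pois_regr.gold.tmp
# make_filenames now rewrites the extension to .csv, so the requested .gold.tmp file
# is never written and csv_summary() fails with the FileNotFoundError above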

@WardBrian merged commit 763beb9 into develop Aug 21, 2023
@WardBrian deleted the feature/1180-pathfinder-single-path-csvs branch August 21, 2023 18:39