Adding data keyword to validphys #651

Merged · 167 commits merged into master from add-data-kw on Jul 16, 2020

Conversation

wilsonmr (Contributor)

Adding the data keyword to validphys; supersedes #515.

The idea is discussed at length in PR #515: instead of introducing the experiment hierarchy at the level of the runcard, the input is a flat list of datasets, and arbitrary groupings can then be done with metadata.

Defaults will be filled in; this is waiting on lockfiles.

We think the current implementation is fully backwards compatible: all actions will now depend on data, and all fits which do not have a data key are given one in as_input from their experiment specification. We then just need to be careful that all action dependencies are resolved correctly.

To do:

  • parse data from the runcard as a simple object (a list of dataset_inputs)
  • produce data as a richer object (currently abusing ExperimentSpec)
  • add backwards compatibility at the level of FitSpec
  • interface with the lockfile/defaults when available, so the user can input just a list of dataset names
  • find a sensible way to group data in various ways
  • check we didn't break anything
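As a sketch of what the flat runcard input could look like (the dataset names and options below are illustrative placeholders, not taken from this PR):

```yaml
# Hypothetical runcard fragment: a flat list of dataset inputs replaces
# the nested `experiments` hierarchy; grouping is left to metadata.
dataset_inputs:
  - {dataset: NMC}
  - {dataset: ATLASWZRAP36PB, cfac: [QCD]}
```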

"""Filter closure test data."""
total_data_points = 0
total_cut_data_points = 0
fakeset = fakepdfset.load()
# Load experiments
for experiment in experiments:
for dataset in data:
wilsonmr (Contributor Author)

I think this needs to be slightly changed. We want more like:

loaded_data = data.load.__wrapped__(data)  # I assume this bypasses the lru_cache?
loaded_data.MakeClosure(fakeset, fakenoise)
for j, dataset in enumerate(data.datasets):
    path = filter_path / dataset.name
    nfull, ncut = _write_ds_cut_data(path, dataset)
    total_data_points += nfull
    total_cut_data_points += ncut
    loaded_ds = loaded_data.GetSet(j)
    if errorsize != 1.0:
        loaded_ds.RescaleErrors(errorsize)
    loaded_ds.Export(str(path))

because we need the level 1 shift to account for correlated systematics
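To illustrate the point about correlated systematics with a plain NumPy toy (nothing here is validphys code): if the level-1 shift were drawn per dataset, the correlation between datasets would be lost, which is why the shift has to be generated for the loaded group as a whole.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy covariance: two datasets of one point each, strongly correlated
# through a shared systematic.
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])

# Correct: one draw from the joint covariance keeps the correlation.
joint_shifts = rng.multivariate_normal(np.zeros(2), cov, size=10000)
joint_corr = np.corrcoef(joint_shifts.T)[0, 1]

# Wrong: independent per-dataset draws lose the cross-correlation.
indep_shifts = np.stack(
    [rng.normal(0, 1, 10000), rng.normal(0, 1, 10000)], axis=1
)
indep_corr = np.corrcoef(indep_shifts.T)[0, 1]
```

With the joint draw the empirical correlation stays near 0.9; with independent draws it is near zero.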

Contributor

Ah OK, I thought you had to put it back inside an experiment again to get it to work

# keep backwards compatibility
old_experiments_input = d.get("experiments")
if old_experiments_input:
d["data"] = data_from_experiment(old_experiments_input)
wilsonmr (Contributor Author)

I need to change this so it adds data_input to the runcard, not data.

@wilsonmr (Contributor Author)

OK, so I think we have a minimal working example, including vp-setupfit. Provided there are no issues with this structure, we then need to address all the places in validphys which use these functions.

@tgiani (Contributor)

tgiani commented Feb 26, 2020

@wilsonmr @RosalynLP ok so if I've understood correctly, now we should

* go through any vp2 action which takes `experiments` and change it so that it can take `data_input` as well

* work on grouping datasets according to some chosen criterion, as started in #481

Correct?

@RosalynLP (Contributor)

@wilsonmr @RosalynLP ok so if I ve understood correctly now we should

* go through any vp2 action which takes `experiments` and change it so that it can take `data_input` as well

* work on grouping datasets according to some chosen criterion, as started in #481

Correct?

Yes that's what I thought, perhaps we can divvy up who's doing what tomorrow?

@tgiani (Contributor)

tgiani commented Mar 2, 2020

@wilsonmr have a look at whether we are doing the right thing.
@RosalynLP and I have done the following

  1. For produce_fitcontext and produce_fitinputcontext not much needs to change: we just changed experiments -> data_input, and the previous change in core.py should do the rest.

  2. Regarding produce_matched_datasets_from_dataspecs, the old runcard looks like
    https://vp.nnpdf.science/24VZ1rh6QEm79AkcuGjZkA==/input/ and we want to:

  • keep backward compatibility
  • put the keyword data_input in the runcard
  • change the tuple (datasets, dsinputs) to just dsinputs (datasets is no longer needed and no longer exists in runcards)
  • change the labelling from (dataset name, experiment name, process) to (process, dataset name) because we don't want the experiments keyword anymore
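The labelling change in the last bullet amounts to dropping the experiment element and reordering, something like this sketch (the label values are made up for illustration):

```python
# Old matched-dataspec labels: (dataset name, experiment name, process).
old_labels = [
    ("NMC", "NMC experiment", "DIS"),
    ("ATLASWZRAP36PB", "ATLAS", "DY"),
]

# New labels: (process, dataset name); the experiment level is dropped.
new_labels = [(process, dataset) for dataset, _experiment, process in old_labels]
```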

try:
    _, data_input = self.parse_from_(
        None, 'data_input', write=False)
except:
wilsonmr (Contributor Author)

probably want to except something specific here

Contributor

I guess the idea was to keep backward compatibility with old runcards using produce_matched_datasets_from_dataspecs

Contributor

By "except something specific", what do you mean exactly, sorry?

wilsonmr (Contributor Author)

so except the exact error you get when you should be parsing experiments instead.

A bare except catches all exceptions, which could be bad if, say, an IOError is raised.

wilsonmr (Contributor Author)

I guess it's probably a ConfigError but I'm not sure
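The suggested fix, sketched with a stand-in exception class (whether validphys actually raises a ConfigError here is, as the comment says, not certain):

```python
class ConfigError(Exception):
    """Stand-in for the error raised when a runcard key is missing."""

def parse_from_(key):
    # Simulates a parse that fails because `data_input` is absent.
    raise ConfigError(f"No such key: {key}")

try:
    data_input = parse_from_("data_input")
except ConfigError:
    # Catch only the missing-key error, not a bare `except:` which would
    # also swallow unrelated failures (IOError, typos, etc.).
    data_input = "from experiments"  # fall back to the old key
```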

@wilsonmr (Contributor Author)

wilsonmr commented Mar 2, 2020

I think it looks good so far

with self.set_context(ns=self._curr_ns.new_child({'theoryid': thid})):
    _, experiments = self.parse_from_('fit', 'experiments', write=False)
    _, data_input = self.parse_from_('fit', 'dataset_inputs', write=False)
wilsonmr (Contributor Author)

btw I don't think we need to set context anymore, since dataset_inputs doesn't require theoryid :)

@wilsonmr (Contributor Author)

wilsonmr commented Mar 4, 2020

ok so we now have the pieces required to get results.py moving. In particular we can do the following:

def group_results(data, ...):
    """Like results but for a group of data"""
    ...

# requires user to specify `metadata_group_key` in runcard
metadata_groups_results = collect(
    "group_results",
    ("group_data_by_metadata",)
)

# `metadata_group_key` is set to `experiments` by the production rule
fits_experiments_results = collect(
    "metadata_groups_results",
    ("fits", "fitcontext", "groupby_experiment")
)

which works for both new and old fits, and for both new and old runcards. The idea is to change all functions which take an experiment to take data, so that by default they act on all of the data, and then to use the grouping production rule(s) to make them act on subsets of the data instead.

We made the grouping function act on the simplest objects possible, namely dataset_inputs, so it doesn't explicitly require theoryid or cuts.
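A sketch of the kind of grouping function described, acting directly on plain dataset_inputs mappings so that neither theoryid nor cuts are needed (the key names and dataset names are illustrative, not validphys's actual API):

```python
from collections import defaultdict

def group_dataset_inputs(dataset_inputs, metadata_group_key):
    """Group plain dataset-input mappings by a metadata key (a sketch)."""
    groups = defaultdict(list)
    for ds in dataset_inputs:
        groups[ds.get(metadata_group_key, "unknown")].append(ds)
    return dict(groups)

# Illustrative inputs (dataset and process names are placeholders):
inputs = [
    {"dataset": "NMC", "process": "DIS"},
    {"dataset": "ATLASWZRAP36PB", "process": "DY"},
    {"dataset": "SLAC", "process": "DIS"},
]
by_process = group_dataset_inputs(inputs, "process")
```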

@Zaharid (Contributor)

Zaharid commented Mar 4, 2020

Seems like the right direction.

@RosalynLP (Contributor)

Right, I tested that and it gives the desired output.

@RosalynLP (Contributor)

I tried to do a rebase and totally messed it up though

@siranipour (Contributor)

siranipour commented Jul 9, 2020

Sometimes it's easier with --rebase-merges

@RosalynLP (Contributor)

Sometimes it's easier with --rebase-merges

Omg this is a gem, never came across this before thanks

@scarrazza (Member)

@RosalynLP could you please fix the conflict?

@RosalynLP (Contributor)

@scarrazza conflicts fixed

@RosalynLP RosalynLP dismissed stale reviews from siranipour and voisey via 5c7fc55 July 16, 2020 09:40
@voisey (Contributor) left a comment

I guess this can now be merged

@Zaharid Zaharid merged commit 6130eb7 into master Jul 16, 2020
@Zaharid Zaharid deleted the add-data-kw branch July 16, 2020 12:28
@siranipour (Contributor)

Should we tag this merge commit?

@Zaharid (Contributor)

Zaharid commented Jul 16, 2020 via email

@RosalynLP (Contributor)

I actually had a dream about merging this PR the other day so I'm happy about this 🎉. Let's hope it doesn't break too much 🙃


7 participants