Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Placeholders break soopervisor #116

Closed
edublancas opened this issue Oct 12, 2022 · 8 comments · Fixed by #120 or ploomber/ploomber#1042
Closed

Placeholders break soopervisor #116

edublancas opened this issue Oct 12, 2022 · 8 comments · Fixed by #120 or ploomber/ploomber#1042

Comments

@edublancas
Copy link
Contributor

edublancas commented Oct 12, 2022

In Ploomber, we have placeholders that users can put in their pipeline.yaml. For example, to index experiments by {{git_hash}} or timestamp with {{now}}. However, the value of such placeholders is computed when loading the pipeline.yaml.

This works fine in Ploomber since the pipelines execute in a single machine; however, when using soopervisor, this breaks. For example, the {{git_hash}} isn't available because we don't copy the git repository, only the existing files. And {{now}} breaks because it's evaluated for each task. Hence, each one sees a different timestamp.

We can solve this by resolving placeholders when executing soopervisor export and putting a "resolved" pipeline.yaml in the Docker container. Ploomber already has a feature that allows defining placeholders in a separate file (env.yaml), we can use this feature to facilitate the implementation.

This is a high level description of what I think it's the best way to do this:

docker.build is the function that builds the docker image, and it already receives the path to the pipeline.yaml in the entry_point argument. We can use this value and pass it to this function (note that this is defined in ploomber), which will return is what env.yaml file the entry_point will use, let's call this value path_to_env.

We can use this function to get the values of the placeholders (like {{now}} and {{git_hash}}).

from ploomber.env.envdict import EnvDict

env=EnvDict({})

env._data

Then, we apply some logic depending on the value of path_to_env, right before compressing the source code.

if path_to_env is None, we create an env.yaml in the same directory as the pipeline.yaml, and write the contents of the placeholders, if path_to_env is not None, then we load its contents and append the values of the placeholders. However, we should not override keys (only add the ones that do not exist).

@edublancas
Copy link
Contributor Author

note that this will also close #114

@edublancas
Copy link
Contributor Author

hey, @feribg. thoughts? wanna tackle this?

@feribg
Copy link
Contributor

feribg commented Oct 12, 2022

Yeah sounds fair, how about append or override null keys in env.yaml re the last step ?
@edublancas

@feribg
Copy link
Contributor

feribg commented Oct 12, 2022

Do you mean here we should actually add the values function
For example placeholders['git_hash'] = '{{git_hash}}' should be come placeholders['git_hash'] = repo.get_current_hash() or along those lines.

Should we not have code to remove those from evaluating in the runtime in the docker container then ? I guess it will still look for {{git_hash}} to replace at runtime.

Also I think we can use this generic mechanism and extend the scope here so that the env var substitution can be done here as well. IE referring env vars as placeholders in the pipeline config and then injecting them into this new .env file. This might require some convention for which im not sure env[] is the ideal syntax due to the common usage of those as special chars, but either way is fine.

@edublancas
Copy link
Contributor Author

For the env, I think this is where we can pass os.environ: https://github.com/ploomber/ploomber/blob/856369d600ddd6a11f2ae06f415ea819191bc353/src/ploomber/env/expand.py#L170 - but I'm not 100% sure

@feribg
Copy link
Contributor

feribg commented Nov 3, 2022

@edublancas Here are the preliminary PRs
#120
ploomber/ploomber#1042

I still have one question re {{now}} I dont think we can inject that in the env.yaml as the other vars as this is idempotent per run but not per build. IE i want all my steps to use the now as of starting the pipeline but not as of building it. How do you think we should tackle this, its the biggest blocker for now i think the rest is resolved there.

@edublancas
Copy link
Contributor Author

can you clarify what you mean by run and build?

I think {{now}} should have the same value across all tasks, so it should be computed once when submitting to AWS Batch, and then tasks should use that value. I think that's desirable since it'll allow users to index their runs (e.g., today's run, yesterday's run, etc.).

However, I believe this is not what you want in your use case, but you want the {{now}} timestamp to be computed whenever a task starts, correct?

If so, I think the simplest way to solve it is by adding a new placeholder automatically, e.g., {{start}} - which can behave as I described (one per run), and we can leave {{now}} untouched.

@feribg
Copy link
Contributor

feribg commented Nov 6, 2022

That is indeed correct I want {{now}} to be evaluated when the whole pipeline starts, but is not the case with our current solution, be cause currently now would be inserted in the env.yaml upon building the docker image. So if i run this image 10 times at different times, now will be the same (as of building the image).

Currently now is being evaluated when a task starts (this is why we have a problem with dependencies because we put {{now}} in the path for products and downstream tasks run later hence cant find the deps of upstream tasks.

In a sense we almost need {{now_as_of_pipeline_run}} and {{now_as_of_task_run}}, the latter being the current behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants