Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed batch related bug in convergence. #710

Merged
merged 1 commit into from
Sep 21, 2021

Conversation

valayDave
Copy link
Collaborator

There was a bug in batch.py; this small patch fixes it.

The following was the bug: the self.create_job function is given as task_spec when the function doesn't take a task_spec as an input.

@savingoyal savingoyal merged commit c460efd into Netflix:convergence Sep 21, 2021
romain-intel added a commit that referenced this pull request Sep 24, 2021
* Add external_field to parameter

* Fix timedate conversion from service to client

* New datastore implementation

This commit contains only the new datastore code and none of the backend implementations.
The datastore is now split into different files:
  - flow_datastore.py contains the top-level FlowDataStore implementation
  - task_datastore.py contains the task-level datastore implementation
  - content_addressed_store.py contains the underlying content addressed store used by
    both previous data-stores.

* Local backend for the datastore.

The local backend is used to save and load local files.

* Datatools performance improvements

Datatools will now cache the s3 client it uses for single operations
resulting in faster operation times.

Another optimization (non advertised) is that datatools can now take
an IOBase directly to avoid having an additional copy.

Finally, __del__ now performs a close so the S3 datatool can be used
as a regular object as opposed to just within a context.

* Added S3 datastore backend.

This backend allows the datastore to interface with S3.

* Pure non-semantic changes (name changes and comment changes).

One tiny semantic change in the way a Tar file is read (using the recommended
open method instead of TarFile object)

* Integrate the new datastore implementation into the current code

* Remove no longer needed files for the datastore.

* New features for the S3 datatools:
  - support for range queries (in get and get_many)
  - support for content-type and user metadata in put, put_many and put_files
    (metadata can also be retrieved using any of the get calls)
  - support for info and info_many to retrieve information about a file without
    fetching it.

* Fix metadata to allow for non-string arguments

* [WIP] Modify new datastore to use metadata in backend storage

Instead of encoding needed information directly in the file, we now
encode this information as file metadata leveraging the support for metadata
for the S3 datatools and implementing support for it in the local
filesystem (creating a separate file)

* Properly use download_fileobj when no range is specified

* Re-add intermediate compatibility mode for loading blobs from the CAS

* Remove print message; fix local_backend in some cases

* [WIP] Datatools: See if moving info to worker makes a performance difference

* Fix issue with s3 datastore backend for multiple files

* Fix issue with s3 datastore backend for multiple files

* Addressed comments

* Improve import of metaflow_custom to allow for top-level imports

* Fix small issue with sidecar message types

* Fix tags for attempt-done in task-datastore

* Fix small issue with sidecar message types

* Addressed comments -- minor tweaks and file refactoring

* Addressed comment: removed handling of ephemeral CAS encoding directly in file

* Addressed comments; refactored use of pipes

Addressed the minor style comments.
Removed the use of a pipe to communicate between workers and master.
Communication now occurs via a file.

* Forgot file in previous commit

* Typo in parentheses causing issues with put of multiple files

* Fix issue with S3PutObject

* Fix bugs due to int/string conversion issues with new file mechanism

* Fix issue with is_none and py2 vs py3

* Fix issue with metaflow listing all flows

* Handle non-integer task/run IDs in local metadata

* WIP: MFLog on convergence branch

This is still in progress but pushing for reference.

* Fixups to track master MFLog and specific convergence fixups

* Fix to Batch logging

* Add support for getting all artifacts from the filecache

* Improve MFLog:
  - Save logs later in runtime to get all logs (even messages printed on error)
  - Allow for escaping variables in bash commands

* Fix a bug when missing data is requested from the datastore

* Trigger test on convergence branch

* Fix failing tests

* Fix issue caused by merge_artifacts now having artifacts read from datastore in write mode (#577)

* Convergence master merge (#578)

* UBF Implementation (#554)

Unbounded Foreach Implementation

Co-authored-by: Savin <savingoyal@gmail.com>

* Check if R step names are valid Python identifiers (#573)

* Check if R step names are valid Python identifiers

* Accidentally changed a docstring step_name to validator. Reverting.

* get rid of  _METAFLOW_RESUMED_RUN (#575)

* Add Unbounded Foreach support

Co-authored-by: Savin <savingoyal@gmail.com>
Co-authored-by: David Neuzerling <mdneuzerling@users.noreply.github.com>
Co-authored-by: Oleg Avdeev <oleg.v.avdeev@gmail.com>

* Revert "Add external_field to parameter"

This reverts commit b118685.

* Small MFlog refactor

* Add a way to pass an external client to S3()

* Cache the S3 boto connection in the datastore (#570)

* The datastore now caches the S3 boto connection and passes it to the S3 client

This allows us to use S3() using with (and therefore properly clean up the temp
directory) but also benefit from the connection reuse.

As part of this change, we also had to change the interface to load_bytes. This
change is invisible to the user as it is 100% internal to the datastore.

Cleaned up the s3tail.py and s3util.py files to co-locate them with the
datatools instead of the datastore (since they were no longer used directly
in the datastore). This simplified the import story.

* Addressed comments and simplified the code a bit

- Use a single CloseAfterUse class.
- Remove the use of caching on write (in artifact and blob cache); the cache
  now only caches things that are read which is the way it is being used
- Simplified the types of caches used in datastore_set.py

* Remove no longer needed file

* Clean up usage of CloseAfterUse

* remove artifact_cache, simplify blob_cache

* get rid of redundant LazyFile

* fix FileBlobCache root

* fix paths in content_addressed_store

* Addressed comments

* Move check on file existence in CAS to backend for better efficiency

* Fix modes for datastore due to merge_artifacts functionality

* Ignore error when overwrite is False and file exists

Co-authored-by: Ville Tuulos <tuulos@gmail.com>

* Add an option to define your own AWS client provider (#611)

* Add an option to define your own AWS client provider

You can now specify a function that returns an AWS client. This is
useful if you want to use something other than boto3 and can be used
with the metaflow_custom mechanism to provide your own authentication

* Typo

* Do not remove _expected_extensions since it is used in a function

* Another typo

* Addressed comments

* Add inputs to task_pre_step (#616)

* Tweaks and optimizations to datastore

* Revert "Move check on file existence in CAS to backend for better efficiency"

This reverts commit d0ac6fe.

* Change encoding type to gzip+pickle to keep forward compatibility

* Reduce memory usage of artifact save and load

We ensure that we do not keep around temporaries and allow the
GC to remove objects that are no longer needed

* Make is_file parallel

We can now call is_file in parallel. It is unclear if this helps a lot given the
overhead of loading the s3op and starting them all. It may help in the case of
a very large number of artifacts

* Set overwrite=True when uploading metadata

* Improve fetching info and file at the same time

* reduce duplicate copies in datastore by using generators

* minimize memory consumption by handling artifacts one by one

* fix docstrings and documentation

* fix @Batch to work with the new datastore

* do not implicitly assume a correct key order in flow_datastore.load_data

* Other minor consistency tweaks

Co-authored-by: Romain Cledat <rcledat@netflix.com>

* Merge master into convergence

* Add inputs to task_pre_step (#615)

* Add inputs to task_pre_step

* Addressed comments

* Forgot to remove an import

* Refactor @resources decorator (#617)

* Refactor @resources decorator

@resources decorator is shared by all compute related decorators -
@Batch, @lambda, @K8s, @titus. This patch moves it out of
batch_decorator.py so that other decorators can cleanly reference
it.

* Update __init__.py

* Add an option to define your own AWS client provider (#620)

You can now specify a function that returns an AWS client. This is
useful if you want to use something other than boto3 and can be used
with the metaflow_custom mechanism to provide your own authentication

* Add S3 tests (#613)

* Add S3 tests

* Addressed comments

* silence tar timestamp warnings (#627)

* Handle None as default parameters properly in AWS Step Functions (#630)

A parameter specification like -
```
Parameter(name="test_param", type=int, default=None)
```
will result in an error even though the default has been specified
```
Flow failed:
    The value of parameter test_param is ambiguous. It does not have a default and it is not required.
```

This PR fixes this issue.

* IncludeFile now returns the included file in the client (#607)

* DRAFT: IncludeFile now returns the included file in the client and CLI

THIS IS NOT FINISHED; DO NOT MERGE AS IS.

* Fix the tests

* Forgot to update type check for multiple encoding

* Add resource tags to AWS Batch jobs (#631)

* Add resource tags to AWS Batch jobs

This PR assumes that Batch:TagResource is whitelisted for the
role which submits the AWS Batch job.

* propagate tags

* add env var

* better commens

* add production token

* New patch release (#633)

* Bug fix with s3tail exception (#635)

* Revert "IncludeFile now returns the included file in the client (#607)" (#637)

This reverts commit 0575f8d.

* patch release (#639)

* Update Unbounded Foreach Test in the case of a Conda environment (#626)

Co-authored-by: Savin <savingoyal@gmail.com>
Co-authored-by: Oleg Avdeev <oleg.v.avdeev@gmail.com>

* Fix Py2 compatibility

* Fix typo in s3op.py

* Pin test executions to master branch

* Fix names of metadata file to include  ending

* Renamed metaflow_custom to metaflow_extensions

* Remove duplicate code

* remove .json suffix from metadata

* test commit

* remove spurious commits

* Fix merge

* remove new line

* Fixed batch related bug in convergence. (#710)

* fix merge conflicts

* Batch fixes for convergence branch

* fixup aws batch

* fix whitespace

* fix imports

* fix newline

* fix merge

* fix local metadata for aws batch

* merge mflog

* stylistic fixes

* Added preliminary documentation on the datastore (#593)

* Added preliminary documentation on the datastore

* Update datastore.md

Update with new name for the backend.

* Final set of patches to convergence (#714)

* convergence fixes

* fixes

* gzip ts changes

* fix logs subcommand

* typo

* Address comments; add support for var_transform in bash_capture_logs

* Fix local metadata sync

* Forgot to remove duplicate sync_metadata

Co-authored-by: Romain Cledat <rcledat@netflix.com>
Co-authored-by: Romain <romain-intel@users.noreply.github.com>

* Typo in error message

Co-authored-by: Savin <savingoyal@gmail.com>
Co-authored-by: David Neuzerling <mdneuzerling@users.noreply.github.com>
Co-authored-by: Oleg Avdeev <oleg.v.avdeev@gmail.com>
Co-authored-by: Ville Tuulos <tuulos@gmail.com>
Co-authored-by: Valay Dave <valaygaurang@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants