Store artifacts on s3 #15
TL;DR: replace

```yaml
my_dataset_to_version:
  type: pandas.CSVDataSet # or any valid kedro DataSet
  filepath: s3://path/to/file.csv
```

with

```yaml
my_dataset_to_version:
  type: kedro_mlflow.io.MlflowDataSet
  data_set:
    type: pandas.CSVDataSet # or any valid kedro DataSet
    filepath: /path/to/a/local/destination/file.csv # must be local!
```

Hello @akruszewski, nice to hear that you are trying the plugin out. To give you a more complete and accurate answer I would need a bit more detail on how you set up your mlflow, but the short answer is yes: you can store the artifacts basically anywhere with the plugin, and it should be completely straightforward.

A bit of context: mlflow under the hood

Warning: I don't want to be pedantic; if you are perfectly aware of how to configure mlflow, skip this section.

Mlflow separates HOW you store the artifacts from WHERE you store them.
How to set up WHERE artifacts are recorded (your S3 bucket)

With this setup in mind, it should be clear that WHERE the artifacts are recorded does not depend on HOW you log them. It is a configuration you must do first, outside the logging part. Here I have to make some hypotheses about how your mlflow is configured, but I can imagine that you either:
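To make the WHERE part concrete, here is a minimal sketch of the client side (the bucket, host, and credential values are placeholders, not values from this thread):

```python
# Minimal sketch of configuring WHERE mlflow stores artifacts.
# Assumption: a tracking server was started separately with an S3 artifact
# root, e.g.:
#   mlflow server --backend-store-uri sqlite:///mlflow.db \
#                 --default-artifact-root s3://my-bucket/mlflow-artifacts
import os

import mlflow

# Credentials for the S3 artifact store are picked up by mlflow/boto3 from
# the standard AWS environment variables (or ~/.aws/credentials).
os.environ["AWS_ACCESS_KEY_ID"] = "..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."

# Point the client at the tracking server; artifacts logged in any run are
# then sent to the server's configured artifact root (the S3 bucket).
mlflow.set_tracking_uri("http://localhost:5000")

with mlflow.start_run():
    mlflow.log_artifact("local_file.csv")  # uploaded to s3://my-bucket/...
```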
How to use with the plugin

Setup WHERE mlflow records artifacts with the plugin

The plugin reads the `mlflow_tracking_uri` declared in its `mlflow.yml` configuration file, so pointing it at a tracking server whose artifact store is your S3 bucket is all the WHERE configuration you need.
HOW to log with the plugin

The plugin object for HOW to store artifacts is the `MlflowDataSet`. If your catalog contains:

```yaml
my_dataset_to_version:
  type: pandas.CSVDataSet # or any valid kedro DataSet
  filepath: /path/to/a/local/destination/file.csv
```

you just have to replace it with:

```yaml
my_dataset_to_version:
  type: kedro_mlflow.io.MlflowDataSet
  data_set:
    type: pandas.CSVDataSet # or any valid kedro DataSet
    filepath: /path/to/a/local/destination/file.csv
```

When the file is saved (at the end of a node), it will automatically be uploaded WHERE your mlflow is configured to send it, whatever that is (S3 or other). If you manage to log your dataset to your S3 bucket with this setup, let me know.
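For intuition, the saving step is roughly equivalent to the following raw mlflow calls (a sketch, not the plugin's actual code; the path and data are placeholders):

```python
# Rough sketch of what MlflowDataSet does at save time: the wrapped kedro
# dataset writes the file to the local filepath, then the file is logged as
# an artifact, so mlflow uploads it WHERE the artifact store is configured.
import mlflow
import pandas as pd

local_path = "/path/to/a/local/destination/file.csv"

df = pd.DataFrame({"a": [1, 2, 3]})
df.to_csv(local_path, index=False)  # the underlying pandas.CSVDataSet save

with mlflow.start_run():
    mlflow.log_artifact(local_path)  # upload to the configured artifact store
```
|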
Hi @Galileo-Galilei. First of all, sorry for the lame introduction. I'm working with @kaemo and I'm planning to help with the development of this plugin. Unfortunately, he is off for the next two weeks, so you will hear mostly from me.

Thanks for your detailed answer, which is really insightful. I wasn't really clear with my question, so let me put more context on it. In more detail: we are developing an example pipeline (using the Titanic dataset) and trying to create an idiomatic kedro pipeline. I was hoping that I would be able to log artifacts which are stored on S3 with the help of the data catalog, something like:

```yaml
my_dataset_to_version:
  type: kedro_mlflow.io.MlflowDataSet
  data_set:
    type: pandas.CSVDataSet # or any valid kedro DataSet
    filepath: s3://path/to/a/destination/file.csv
```

That was my misconception, as you explained.
In my opinion it would be more convenient to use it that way. Right now I'm playing with it. I'm also happy to contribute to this project; if you have any propositions, let me know. |
Nice to hear that you want to get involved! Regarding the different points you address:
|
I'm closing this issue since detailed documentation is now available on readthedocs. Feel free to reopen if needed. The above answer is still valid, but many improvements have been made since:
|
What a great thread! From what I understand, in your first comment you laid out how kedro-mlflow handles Scenario 4. I'm currently exploring ways to use kedro-mlflow, but for Scenario 5. From the client's perspective, the main difference is that all artifact paths should start with the `mlflow-artifacts:/` prefix.

Edit: Never mind, I've just tested Scenario 5 and it turns out it works. |
Hi @foxale, sorry for the delayed response. There is no reason kedro-mlflow will not work for saving artifacts (it just calls log_artifact under the hood), while we may have issues loading artifacts without the prefix. Actually, the MlflowArtifactDataSet does not load from the server but from the local path, except if you specify the run id explicitly (which is a very uncommon way to use kedro-mlflow; usually you just let the plugin open a new run_id for a new kedro run). I'd be glad to get feedback if you have any issues with this modern way to set up mlflow, which I have never tried myself yet!
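For reference, loading from the server by an explicit run id amounts to something like this at the raw mlflow level (a sketch with placeholder values, using the `mlflow.artifacts` API):

```python
# Sketch of fetching an artifact from the tracking server by run id, which
# is what the "explicit run_id" mode boils down to under the hood.
# The run id and artifact path below are placeholders.
import mlflow

local_copy = mlflow.artifacts.download_artifacts(
    run_id="0123456789abcdef",
    artifact_path="file.csv",
)
print(local_copy)  # local filesystem path of the downloaded artifact
```
|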
Hi @Galileo-Galilei, I am hitting this limitation with a catalog entry like:

```yaml
predictions:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
  dataset:
    type: pandas.ParquetDataset
    filepath: s3://my-bucket/data/09_results/predictions.parquet
    credentials: s3_creds
```

Any guidance or workaround to handle this scenario would be greatly appreciated! |
Hi @OlegInsait, as described in the thread above, limiting artifact logging to local filepaths rather than S3 filepaths is a limitation of mlflow itself, not of kedro-mlflow. You need to configure your mlflow server to have an S3 backend; then all calls to log_artifact will log to this remote storage. If you can't make it work, can you elaborate on your setup so I can help? |
This is exactly the issue. The source of the files to be logged is an S3 bucket. Setting up an S3 backend only controls where artifacts are logged to, not where they are read from. |
You can perfectly do that, and you can implement your own custom kedro dataset to do it, but I won't support it in the plugin because:
EDIT: The key idea would be to modify this section of the code (`kedro_mlflow/io/artifacts/mlflow_artifact_dataset.py`, lines 67 to 78 at 64b8e94) to store the data in a temp folder and then log it in mlflow, using the underlying dataset by copying it and modifying its path location in place. A sketch of this idea follows.
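A minimal sketch of that idea, assuming recent kedro / kedro-datasets APIs (the class name is hypothetical and `CSVDataset` is hard-coded for brevity; the plugin's real implementation wraps any dataset):

```python
# Sketch of a custom dataset implementing the idea above: save through the
# underlying dataset as usual (possibly to S3), additionally save a local
# copy in a temp folder, and log that copy as an mlflow artifact.
import os
import tempfile

import mlflow
from kedro.io import AbstractDataset  # AbstractDataSet in older kedro versions
from kedro_datasets.pandas import CSVDataset  # CSVDataSet in older versions


class S3ArtifactDataset(AbstractDataset):
    """Hypothetical dataset: saves remotely AND logs a copy to mlflow."""

    def __init__(self, filepath: str, artifact_path: str = None):
        self._filepath = filepath  # may be a remote path, e.g. s3://...
        self._artifact_path = artifact_path

    def _save(self, data) -> None:
        # 1. Normal save to the (possibly remote) destination
        CSVDataset(filepath=self._filepath).save(data)
        # 2. Save a local copy in a temp folder and log it, so mlflow
        #    uploads it to the configured artifact store
        with tempfile.TemporaryDirectory() as tmp_dir:
            local_path = os.path.join(tmp_dir, os.path.basename(self._filepath))
            CSVDataset(filepath=local_path).save(data)
            mlflow.log_artifact(local_path, self._artifact_path)

    def _load(self):
        return CSVDataset(filepath=self._filepath).load()

    def _describe(self) -> dict:
        return {"filepath": self._filepath, "artifact_path": self._artifact_path}
```
|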
Thank you @Galileo-Galilei! |
I'm just testing this plugin and trying to send artifacts to an S3 bucket. After reading the code of this project, I figured out that in its current state this is not possible. I just want to make sure that this is the case before I implement this functionality. @Galileo-Galilei, could you confirm?