Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto ML assets #25466

Merged
merged 10 commits into from
Jan 23, 2023
Merged

Auto ML assets #25466

merged 10 commits into from
Jan 23, 2023

Conversation

MaksYermak
Copy link
Contributor

I have created links and updated system tests for Auto ML operators.

Co-authored-by: Wojciech Januszek januszek@google.com
Co-authored-by: Lukasz Wyszomirski wyszomirski@google.com
Co-authored-by: Maksim Yermakou maksimy@google.com


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@potiuk
Copy link
Member

potiuk commented Aug 2, 2022

Errors :(

@potiuk
Copy link
Member

potiuk commented Aug 4, 2022

Rebased to acount for Flask 2.2 errors fixed yesterday.

@raphaelauv
Copy link
Contributor

raphaelauv commented Aug 9, 2022

there is a csv file tests/system/providers/google/cloud/automl/resources/bank-marketing.csv

of 45K char , is that normal ?

@potiuk
Copy link
Member

potiuk commented Aug 10, 2022

Yeah 45K lines of .csv file is NOT something we want. Few options:

  1. what happens when you zip the file ? how big it is going to get
  2. Do we REALLY need as big of a file?
  3. We could easily place it it in our Amazon S3 bucket to download it for the test when needed, we could make it publicly available

@MaksYermak
Copy link
Contributor Author

Yeah 45K lines of .csv file is NOT something we want. Few options:

  1. what happens when you zip the file ? how big it is going to get
  2. Do we REALLY need as big of a file?
  3. We could easily place it it in our Amazon S3 bucket to download it for the test when needed, we could make it publicly available

This .csv is needed for training an AutoML model, in order to start the training .csv should consist more then 1000 rows. For our test I can reduce the file to 2100 rows. @potiuk what do you think about reducing the file size?

@bhirsz
Copy link
Contributor

bhirsz commented Aug 25, 2022

Yeah 45K lines of .csv file is NOT something we want. Few options:

  1. what happens when you zip the file ? how big it is going to get
  2. Do we REALLY need as big of a file?
  3. We could easily place it it in our Amazon S3 bucket to download it for the test when needed, we could make it publicly available

This .csv is needed for training an AutoML model, in order to start the training .csv should consist more then 1000 rows. For our test I can reduce the file to 2100 rows. @potiuk what do you think about reducing the file size?

@potiuk Catching attention :) I think 2100 is okayish (not the best but certainly better than 50k). Please comment if you still think it should be stored in the external storage.

@potiuk
Copy link
Member

potiuk commented Aug 25, 2022

Can we compress it (and dynamically decompress during test?). Just zipping it is 20K instead of 160K. This file is unlikely to ever change and it is cimpletely uninteresting to see what's in when you review the cod, so there is no particular reason to keep text file in Git.

It's not only the size that matters in this case. Keeping it plain text has this really nasty effect that it when you search something in the source code in your IDE, you will find some matching words here likely, so keeping the file uncompressed make it very prone to falling search&replace victim,

@MaksYermak
Copy link
Contributor Author

Can we compress it (and dynamically decompress during test?). Just zipping it is 20K instead of 160K. This file is unlikely to ever change and it is cimpletely uninteresting to see what's in when you review the cod, so there is no particular reason to keep text file in Git.

It's not only the size that matters in this case. Keeping it plain text has this really nasty effect that it when you search something in the source code in your IDE, you will find some matching words here likely, so keeping the file uncompressed make it very prone to falling search&replace victim,

@potiuk I have done it

@potiuk
Copy link
Member

potiuk commented Sep 9, 2022

Sorry for delay - been a bit busy.

No, It's not compressed - it's just bundled in .tar now not .zipped (.tar-ing single file kinda make no sense) . Stil takes 170 instead of 20K (and this PR needs rebase anyway).

@MrGeorgeOwl MrGeorgeOwl force-pushed the automl-assets branch 3 times, most recently from 6adb962 to 076d91a Compare October 14, 2022 09:03
@potiuk
Copy link
Member

potiuk commented Oct 24, 2022

conflicts need to be resolved after string normalisation

@potiuk
Copy link
Member

potiuk commented Oct 31, 2022

Rebased to rebuild.

@potiuk potiuk force-pushed the automl-assets branch 2 times, most recently from cfb9f8b to 9b863c6 Compare November 2, 2022 05:11
@potiuk
Copy link
Member

potiuk commented Dec 4, 2022

Tests failing.

@MrGeorgeOwl MrGeorgeOwl force-pushed the automl-assets branch 2 times, most recently from c1bc806 to 0c94ca8 Compare December 19, 2022 15:40
@potiuk
Copy link
Member

potiuk commented Jan 17, 2023

static check failures.

@potiuk
Copy link
Member

potiuk commented Jan 18, 2023

REbased - static checks fixed in main (mysql python connector release breaking mypy)

@MrGeorgeOwl
Copy link
Contributor

@potiuk I think that PR can be merged. I can't do that because I am not the author of PR and I don't have write access

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants