Skip to content
This repository has been archived by the owner on Jan 27, 2024. It is now read-only.

add --zip option to zip the submission file prior to uploading it #52

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

queirozfcom
Copy link

@queirozfcom queirozfcom commented Oct 21, 2017

If option is passed, a zip file is created (using python zipfile module) with a random name (cache busting) and the zipped file is uploaded instead, to use less bandwidth. The zip file is deleted afterwards to avoid using up too much disk.

Copy link
Owner

@floydwch floydwch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!

And there are some feedbacks to the implementation, before the merge, let's discuss and make some updates on it. Thank you.

@@ -25,7 +25,7 @@ def get_parser(self, prog_name):
parser.add_argument('-u', '--username', help='username')
parser.add_argument('-p', '--password', help='password')
parser.add_argument('-z', '--zip', type=self._str2bool, nargs='?', const=True, default=False,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't you use the same argument setting as config.py?

i.e. parser.add_argument('-z', '--zip', help='zip the submission file before uploading?', action='store_true')

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree

@@ -98,7 +98,7 @@ def take_action(self, parsed_args):
parsed_arg_dict = vars(parsed_args)

if DATA_OPTIONS & set(
filter(lambda x: parsed_arg_dict[x], parsed_arg_dict)
filter(lambda x: parsed_arg_dict[x], parsed_arg_dict)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it an accident duplicated indent?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDE issues I guess

@@ -35,7 +35,12 @@ def take_action(self, parsed_args):
username = config.get('username', '')
password = config.get('password', '')
competition = config.get('competition', '')
zip = config.get('zip', False)
zip_flag = config.get('zip', 'no')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use getboolean method instead.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@floydwch but the config object here is just a python dict, it's not a ConfigParser object

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue should be resolved at the config provider level, I'll address it.

@@ -46,7 +51,7 @@ def take_action(self, parsed_args):
entry = parsed_args.entry
message = parsed_args.message

archive_name = Submit._rand_str(10)+'.zip'
archive_name = Submit._rand_str(10) + '.zip'
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the uploaded file's name will also show on the Kaggle's dashboard. A related name might be better. i.e. append the uuid or timestamp to the original submission file's name.

And another concern there is, for myself habit, the submission files are already suffixed a timestamp to distinguish the trial. Adding additional suffix is duplicated.

Copy link
Author

@queirozfcom queirozfcom Oct 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. What if I prepend a short random string instead?

@queirozfcom
Copy link
Author

Thanks for the feedback. I'll edit some things and remake the request

@queirozfcom
Copy link
Author

I've applied the changes you suggested (except for using config.getboolean(), because the object is not a ConfigParser, but a normal python dict)

Copy link
Owner

@floydwch floydwch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New feedbacks here. And please also update the README.md to explain the new function.

And I'll update the master branch to provide the coerced bool field. You can rebase to the master after I updated it.

@@ -35,7 +35,12 @@ def take_action(self, parsed_args):
username = config.get('username', '')
password = config.get('password', '')
competition = config.get('competition', '')
zip = config.get('zip', False)
zip_flag = config.get('zip', 'no')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue should be resolved at the config provider level, I'll address it.

zip = True
else:
zip = False
zip = Submit._str2bool(zip_flag)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overriding the built-in function name is discouraged. You can just apply the _str2bool to the zip_flag = config.get('zip', 'no') , i.e. zip_flag = Submit._str2bool(config.get('zip', 'no')).

But I'll address the boolean coercion issue at config provider level, you can get a boolean by config.get('zip') after I updated it.

Copy link
Owner

@floydwch floydwch Oct 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type coercion enhancement has been done, see 9158c36 and 551a1b2 .

You can now config.get('zip').

archive_name = Submit._rand_str(10) + '.zip'
archive_name = Submit._make_archive_name(entry)

# print(archive_name)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incident comments here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops!

@@ -129,29 +130,38 @@ def take_action(self, parsed_args):
os.remove(target_name)

@staticmethod
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Static method is not the best choice in such case. And an underscore prefix method indicates this method is private, which means it'll access the object's private fields, but it's not the case here. I suggest you extract the method to the module scope, i.e. move the _make_archive_name outside the Submit class but still in the submit.py.

else:
archive_name = original_basename+".zip"

# this is used to prevent caching issues
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still can't understand why it has caching issues. Can you explain the scenario?

Copy link
Author

@queirozfcom queirozfcom Oct 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this but I think that when we upload two submissions (different files) with the same same (say, "mysubmission.csv"), this may trigger some form of caching in Kaggle

(They may only check the file name and, if the file name is something they have seen before, they may not even read the file).

Again, I'm not sure about this but since caching is done by many people on many websites, I though it would be a safe thing to do at little cost (just add a couple characters to the file name).

But I agree it's not essential.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caching usually applied on read operation, caching a file upload would be considered a bug.

But I just tested if Kaggle caches the file? The answer is no.

I uploaded the benchmark model for the titanic competition and updated the file's content manually and uploaded again to see if it's cached. The scores returned by Kaggle were different in the two submissions. So basically, it's not an issue.

@queirozfcom
Copy link
Author

@floydwch the changes have been ready for a couple of days now... anything you want to add?

@floydwch
Copy link
Owner

floydwch commented Nov 5, 2017

@queirozfcom sorry, I'm busy now. I'll reply to you ASAP.

@floydwch
Copy link
Owner

floydwch commented Nov 6, 2017

@queirozfcom I added new feedback on the caching issue.

And another concern is, please rebase the commits to the master branch for better commit logs, don't use merge.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants