-
-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
1714-2 migrate 2020-2021 data #1747
Conversation
PR review ETA by 06-07-2024 |
for data in tqdm(response.iter_content()): | ||
file.write(data) | ||
for chunk in tqdm(response.iter_content(chunk_size=chunk_size), desc="Downloading data"): | ||
if chunk: # filter out keep-alive new chunks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
were you encountering keep-alive chunks while using this script? just curious how this was affecting the result csv
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is to handle HTTP's keep-alive mechanism which may send empty chunks to keep the connection open especially for downloading larger files. i didnt check to see if there were empty chunks downloaded prior, but just implement check if the current chunk is not empty as a precaution, so that it only downloads non-empty chunks.
@@ -0,0 +1,7 @@ | |||
def get_correct_column_count(file_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not exactly requesting a change here since these are clearly helper methods. But would be nice to generalize these for use in other scripts, or make them command-line friendly (specifying input files, etc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
requested change impemented
@@ -0,0 +1,8 @@ | |||
def find_problematic_line(file_path, num_lines=5): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
requested change impemented
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving with a few comments, and a question
Fixes #1714
Note
Added 2020 to 2021 data. Fixed corrupted 2021 data.
Changes
migrateOldHfDataset
with chunk size download to handle large input file with efficiencyPre-Merge Checklist
main
branch