
1714-2 migrate 2020-2021 data #1747

Merged — 14 commits, Jun 8, 2024
Conversation

@Skydodle Skydodle (Member) commented Jun 5, 2024

Fixes #1714

Note

Added 2020 to 2021 data. Fixed corrupted 2021 data.

Changes

  • Updated the migrateOldHfDataset script to download in chunks, so large input files are handled efficiently
  • Added debugging Python scripts to fix the corrupted 2021 data
  • Added dataset registration for the new years
  • Enabled calendar navigation for the new years
  • Updated tooltip text

Pre-Merge Checklist

  • Up to date with main branch
  • Branch name follows guidelines
  • All PR Status checks are successful
  • Peer reviewed and approved

@Skydodle Skydodle requested a review from ryanfchase June 5, 2024 23:43
@ryanfchase (Member)

PR review ETA by 06-07-2024

```diff
-for data in tqdm(response.iter_content()):
-    file.write(data)
+for chunk in tqdm(response.iter_content(chunk_size=chunk_size), desc="Downloading data"):
+    if chunk:  # filter out keep-alive new chunks
```
ryanfchase (Member):

were you encountering keep-alive chunks while using this script? Just curious how this was affecting the resulting CSV.

Skydodle (Member, Author):

This handles HTTP's keep-alive mechanism, which may send empty chunks to keep the connection open, especially when downloading larger files. I didn't check whether empty chunks were actually present before, but added the check as a precaution so that only non-empty chunks are written.
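The chunked-download pattern discussed above can be sketched as a minimal standalone function. This is an illustration, not the exact script: the function name `save_response_in_chunks` is hypothetical, tqdm is treated as optional, and the response object is assumed to follow the requests-style `iter_content` interface.

```python
# Minimal sketch of the chunked-download loop discussed above.
# Assumes a requests-style response object; the function name is hypothetical.
try:
    from tqdm import tqdm
except ImportError:
    # Fall back to a no-op wrapper if tqdm is not installed
    def tqdm(iterable, **kwargs):
        return iterable

def save_response_in_chunks(response, dest_path, chunk_size=8192):
    """Write a streamed HTTP response to disk one chunk at a time."""
    with open(dest_path, "wb") as file:
        for chunk in tqdm(response.iter_content(chunk_size=chunk_size),
                          desc="Downloading data"):
            if chunk:  # skip empty keep-alive chunks
                file.write(chunk)
```

With requests, this would be called as `save_response_in_chunks(requests.get(url, stream=True), "data.csv")`; streaming in fixed-size chunks keeps memory use bounded regardless of file size.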

@@ -0,0 +1,7 @@
def get_correct_column_count(file_path):
ryanfchase (Member):

not exactly requesting a change here since these are clearly helper methods. But it would be nice to generalize these for use in other scripts, or make them command-line friendly (specifying input files, etc.)
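The diff shows only the helper's signature; a command-line-friendly version along the lines the reviewer suggests might look like the sketch below. Only the name and signature come from the diff — the body (taking the majority field count across rows as the "correct" column count) is an illustrative guess.

```python
import csv
import sys
from collections import Counter

def get_correct_column_count(file_path):
    """Return the most common field count across rows of a CSV file,
    treating that majority count as the 'correct' column count."""
    with open(file_path, newline="") as f:
        counts = Counter(len(row) for row in csv.reader(f))
    return counts.most_common(1)[0][0]

if __name__ == "__main__":
    # Command-line friendly: python get_correct_column_count.py <input.csv>
    print(get_correct_column_count(sys.argv[1]))
```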

Skydodle (Member, Author):

Requested change implemented.

@@ -0,0 +1,8 @@
def find_problematic_line(file_path, num_lines=5):
ryanfchase (Member):

Similar comment as above
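As with the other helper, only the signature `find_problematic_line(file_path, num_lines=5)` appears in the diff. One plausible implementation, assuming "problematic" means a row whose field count deviates from the file's majority field count:

```python
import csv
from collections import Counter

def find_problematic_line(file_path, num_lines=5):
    """Return up to num_lines (line_number, field_count) pairs for rows
    whose field count differs from the file's majority field count.

    Illustrative sketch only: the body is a guess, not the PR's code.
    """
    with open(file_path, newline="") as f:
        rows = list(csv.reader(f))
    expected = Counter(len(row) for row in rows).most_common(1)[0][0]
    problems = [(i + 1, len(row))
                for i, row in enumerate(rows) if len(row) != expected]
    return problems[:num_lines]
```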

Skydodle (Member, Author):

Requested change implemented.

@ryanfchase ryanfchase (Member) left a comment

Approving with a few comments, and a question

@Skydodle Skydodle merged commit 7877a77 into hackforla:main Jun 8, 2024
@Skydodle Skydodle deleted the 1714-migrate-old-data branch June 8, 2024 00:55
Successfully merging this pull request may close these issues.

Migrate external Huggingface data to 311 Data Huggingface repo