Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flattener #235

Merged
merged 27 commits into from
Sep 21, 2022
Merged

Flattener #235

merged 27 commits into from
Sep 21, 2022

Conversation

yolile
Copy link
Member

@yolile yolile commented Sep 20, 2022

No description provided.

@yolile yolile marked this pull request as draft September 20, 2022 04:04
@yolile
Copy link
Member Author

yolile commented Sep 20, 2022

@jpmckinney, this is what I have so far. Some additional things I needed to do this time:

  • As I added the flatten step as an additional job, I have to edit the Exporter class to have different lock files depending on whether the export generates the JSON files or CSV/Excels.
  • As this step is now separate from the JSON export, I have to decompress the jsonl.gz files before flattening them and then delete the jsonl files.
  • I'm creating CSV files for all the existing jsonl files. Do we always want that? Or should I check for size limits? Especially for "all time" files.

@yolile yolile marked this pull request as ready for review September 20, 2022 18:42
@yolile yolile requested a review from jpmckinney September 20, 2022 18:42
views:
- Guard against path traversal by malicious input
- Set export_format query string parameter to full suffix to avoid extra view logic
- Abbreviate export_format to format in query string, and export_formats to formats in template

template:
- Sort years in consistent order (as before)
- Add condition for all full files
- Add {% empty %} if files not yet available

util:
- Ensure the template always receives expected keys
- Don't process unknown suffixes
- Fix "jsonl" vs "json" typo for export_format
- Fix "csv.gz" -> "csv.tar.gz" typo

worker:
- Do not compress Excel file (an Excel file is already a ZIP file)
- Take advantage of Export().directory returning a pathlib.Path (temp file had  ".jsonl.jsonl" suffix)
- Do "var = condition" not "if condition: var = True; else: var = False"
- Merge flatten_and_package_file() into callback()

perf:
- Use tempfile module to clean up files (temp files are written to the SSD, which is faster)
- Use scandir() (faster) instead of listdir()
@jpmckinney
Copy link
Member

I made some edits and fixes. Have you tested it manually?

After we deploy, to process existing data, we can run publish({"job_id": self.job.id}, "flattener_init") for every relevant job.

The other option is to create a Task (in the database) for every Job, and then change the Job status accordingly, but this would be a lot more effort. I think the first option will work – we just won't have any job management features.

@jpmckinney
Copy link
Member

Also, we need to translate "Files not yet available."

@yolile
Copy link
Member Author

yolile commented Sep 21, 2022

I made some edits and fixes. Have you tested it manually?

Yes, I've tested it so I can test it again tomorrow before merging. I tried using a Temp directory instead of creating and removing the directory and file and I had some issues, so I want to test this before merging.

exporter/views.py Outdated Show resolved Hide resolved
@yolile yolile merged commit 88c2b0e into main Sep 21, 2022
@yolile yolile deleted the flattener branch September 21, 2022 21:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants