8524 adding mechanism for storing tab. files with variable headers #10282
Merged: stevenwinship merged 17 commits into develop from 8524-store-tabular-files-with-varheaders on Feb 7, 2024
…nit tests should be passing. # 8524
…larSubsetGenerator, for clarity etc. #8524
landreev changed the title from "8524 adding mechanism for store tab. files with variable headers" to "8524 adding mechanism for storing tab. files with variable headers" on Jan 31, 2024
landreev added the "Size: 30" label (A percentage of a sprint. 21 hours. Formerly size:33) on Jan 31, 2024
📦 Pushed preview images as
🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name. |
sekmiller approved these changes on Feb 6, 2024
What this PR does / why we need it:
See the opening comment in the issue for the back story and the rationale for implementing this feature.
Note that the original title of the issue was "Stop storing ingested tabular data files WITHOUT the variable name header". I changed it to the current "Add mechanism for storing ..." etc., because this is not something we can simply stop doing, the way we rearrange database tables when deploying a new version of the application. For example, the IQSS production instance has more than a TB of ingested tabular files. Adding these headers to all of these files in S3 storage would be a time-consuming effort, to put it mildly. Even for a site with 1/10 of that amount, and with the files stored on the filesystem, this would be too serious a task to make a required upgrade step.
So what we want is an option to have tab. files stored with that header, alongside any legacy files stored without it.
This PR adds the following:
What's NOT part of this PR:
It's important to note that for an instance that deploys this new code but chooses not to enable the new setting, nothing changes in the way ingest and everything else function on their system.
Which issue(s) this PR closes:
Closes #8524
Special notes for your reviewer:
You will notice that my implementation makes changes in all the individual ingest plugins for all the supported file formats. Intuitively, it would take less code, with fewer places for things to go wrong, to add this header in one place, somewhere in IngestServiceBean, once the tab-delimited file and the list of variables have been produced by a specific plugin. It just felt wrong to implement it like that, because there is really no efficient way to "insert" an extra line at the beginning of a file. In practice, the only way to do this is to rewrite the entire file on disk: first write the extra header line into a new empty file, then append the rest of the content from the source file. That amounts to temporarily doubling the size of this temp. data. So I chose instead to modify the plugins to add the headers right away, as the tab-delimited files are first written to disk. This does result in more places to potentially introduce an error. But, as you will see, the actual changes in each plugin are really trivial.
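To illustrate the trade-off described above, here is a minimal sketch, not the code from this PR. The class name `TabHeaderSketch` and both method names are hypothetical, and the sketch assumes variable names can be joined with tabs as-is (any quoting or escaping the real plugins may apply is not shown). The first method writes the header as the tab-delimited file is first produced; the second shows why adding it afterwards means copying the whole file into a second one.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical illustration only; names are not taken from the Dataverse codebase.
public class TabHeaderSketch {

    // Header written up front: one pass, one file on disk.
    static void writeTabFileWithHeader(Path target,
                                       List<String> variableNames,
                                       List<String[]> rows) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            // The variable name header is the first line of the file.
            out.write(String.join("\t", variableNames));
            out.write("\n");
            // The data rows follow, exactly as they would without the header.
            for (String[] row : rows) {
                out.write(String.join("\t", row));
                out.write("\n");
            }
        }
    }

    // Header added after the fact: the source file must be copied in full,
    // so both the source and the target exist on disk until the source is removed.
    static void prependHeader(Path source,
                              Path target,
                              List<String> variableNames) throws IOException {
        try (OutputStream out = Files.newOutputStream(target)) {
            String header = String.join("\t", variableNames) + "\n";
            out.write(header.getBytes(StandardCharsets.UTF_8));
            // Copies every byte of the already-written tab-delimited file.
            Files.copy(source, out);
        }
    }
}
```

Both methods produce the same bytes; the difference is that the second one needs the complete headerless file to exist first and then duplicates it.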
Suggestions on how to test this:
Ingest a few files, with and without the new setting enabled. Compare both the end result for the user and the physical files as stored on the filesystem.
The integration test added in the PR is an example of a simple test scenario.
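When comparing the physical files by hand, a small helper along these lines could save some squinting. This is a sketch under assumptions: the class `TabHeaderCheck` and its method are hypothetical, and it assumes the header line is simply the variable names joined by tabs, which may not match the stored format exactly (for example, if names are quoted).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper for manual testing; not part of the PR.
public class TabHeaderCheck {

    // Returns true when the first line of the stored file equals the
    // tab-joined variable names (assumed header format, see the note above).
    static boolean startsWithHeader(Path tabFile, List<String> variableNames) throws IOException {
        String expected = String.join("\t", variableNames);
        try (BufferedReader in = Files.newBufferedReader(tabFile, StandardCharsets.UTF_8)) {
            // readLine() returns null for an empty file, which yields false here.
            String firstLine = in.readLine();
            return expected.equals(firstLine);
        }
    }
}
```

A file ingested with the new setting enabled would be expected to return true, and a legacy file stored without the header would be expected to return false.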
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?:
Additional documentation: