Dependency documention #844
Merged
maxinelasp
merged 4 commits into
IMAP-Science-Operations-Center:dev
from
maxinelasp:dep_doc
Sep 16, 2024
@@ -0,0 +1,87 @@ | ||
Data Dependency Management | ||
========================== | ||
|
||
The IMAP science data center utilizes an event-based processing system that allows for | ||
processing as soon as data is available. This system is designed to be flexible to | ||
accommodate the various requirements and inter-dependencies for all 10 instruments. | ||
|
||
As part of our requirements, we need some way to explicitly describe the dependencies | ||
for each file. We also need to be able to flexibly update the dependencies on a regular | ||
basis, to accommodate changing requirements. | ||
|
||
Overview | ||
-------- | ||
|
||
When a file lands in the SDC, it is added to our data bucket (also called S3 or the S3 bucket). This bucket, as the name implies, is a simple collection that contains all the files in the SDC, organized
like a file system.
|
||
Each data file is put into a specific subfolder depending on the file name. For example, a file named ``imap_swe_l0_sci_20240105_20240105_v00-01.pkts`` would be placed in the ``imap/swe/l0/2024/01`` folder. | ||
More information about the naming conventions can be found in :ref:`naming-conventions`. | ||
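
The folder placement described above can be sketched in a few lines. This is a minimal illustration, not the SDC's actual code: the function name ``bucket_folder`` is hypothetical, and the field order (mission, instrument, level, descriptor, then a date whose year and month select the folder) is an assumption read off the single example file name.

```python
def bucket_folder(filename: str) -> str:
    """Return the bucket subfolder for a science file name.

    Assumed layout (inferred from the example file name only):
    mission_instrument_level_descriptor_date_..._version.extension
    """
    # Drop the extension, then split the remaining name into its fields.
    stem = filename.split(".")[0]
    mission, instrument, level, _descriptor, date = stem.split("_")[:5]
    # The year and month of the first date field select the folder.
    year, month = date[:4], date[4:6]
    return f"{mission}/{instrument}/{level}/{year}/{month}"


print(bucket_folder("imap_swe_l0_sci_20240105_20240105_v00-01.pkts"))
# imap/swe/l0/2024/01
```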
|
||
When a file of any level arrives in the bucket, it triggers the rest of processing. This is how we manage file processing within the SDC, rather than waiting until all files have arrived | ||
or running at particular times of day. This allows us to quickly process data as soon as all the required pieces are available to us, and create a flexible system which can easily be updated | ||
to add exceptions or new requirements on a per-instrument or per-level basis. | ||
|
||
.. note:: | ||
This document, and our tooling, uses the terms "upstream dependencies" and "downstream dependencies" to describe the relationships between files. A "downstream dependency" for a given file | ||
means that the current file is required for processing of the downstream files - so, for example, an L2 file is a downstream dependency of an L1 file. An "upstream dependency" is the opposite, | ||
describing a file which is required to begin processing the current file. For example, an L1 file is an upstream dependency of an L2 file. | ||
|
||
Detailed Description of File Processing | ||
--------------------------------------- | ||
|
||
For explicit descriptions of the tools and technical choices of the IMAP SDC, please refer to `this Galaxy page <https://lasp.colorado.edu/galaxy/display/IMAP/SDC+Architecture>`_. | ||
This section is intended to act as a high level overview for the data processing architecture of the IMAP SDC, in less technical terms. | ||
|
||
.. image:: ../_static/architecture_overview.png | ||
|
||
`Up to date overview chart in Galaxy <https://lasp.colorado.edu/galaxy/display/IMAP/SDC+Processing+Architecture+Overview>`_ | ||
|
||
Each science file that arrives is treated the same, regardless of level or instrument. When a file is placed in the file storage system, it triggers a step to index the file ("indexer lambda"). | ||
This step adds the file to the database and triggers the next step in processing ("batch starter lambda").
|
||
The batch starter step determines whether an instrument and level are ready for processing by checking dependencies. For each file that arrives, the system checks what the downstream dependencies are,
meaning which future files need this file in order to complete processing. For example, if a MAG L1A file arrived, this step would determine that the MAG L1B mago and magi files depend on
the L1A file, and therefore MAG L1B may be ready to begin processing.
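
As a rough sketch of the downstream lookup, the dependencies can be pictured as a mapping from an arriving file to the jobs that may now be ready. The dictionary and the function name ``downstream_jobs`` are hypothetical stand-ins for the real database query. The descriptors are taken from the dependency table later in this document.

```python
# Hypothetical in-memory stand-in for the dependency records.
# Keys and values are (instrument, level, descriptor) tuples.
DOWNSTREAM = {
    ("mag", "l1a", "norm-mago"): [("mag", "l1b", "norm-mago")],
    ("mag", "l1a", "norm-magi"): [("mag", "l1b", "norm-magi")],
}


def downstream_jobs(arrived):
    """Return the jobs that list the arriving file as a dependency."""
    return DOWNSTREAM.get(arrived, [])


print(downstream_jobs(("mag", "l1a", "norm-mago")))
# [('mag', 'l1b', 'norm-mago')]
```

A file with no recorded downstream dependencies returns an empty list, so nothing further is triggered for it.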
|
||
Then, for each anticipated job, the batch starter process checks to see if all the upstream dependencies are met. Although we know we have one of the upstream dependencies for an expected job, | ||
it's possible that there are other required dependencies that have not yet arrived. If we are missing required dependencies, then the system stops processing and waits for the missing files to arrive. | ||
Once the missing file has arrived, it triggers the same process. This time all the required dependencies are ready, so processing can continue.
|
||
For example, SWAPI L3 requires both SWAPI L2 files and MAG L1D (previously called L2pre) files. The SWAPI L2 job and the MAG L1D job are run independently, so there is no guarantee that they will finish | ||
at the same time. Let's assume that the MAG L1D job finishes first, since it is the lower level. When that file arrives, one of the downstream dependencies is going to be the SWAPI L3 processing. | ||
However, when batch starter checks the upstream dependencies for SWAPI L3, it will find that SWAPI L2 is missing. Therefore, processing won't start. Once the SWAPI L2 processing finishes, | ||
and the SWAPI L2 file arrives, the batch starter is triggered with that file. Once again, SWAPI L3 is a downstream dependency, but this time, both upstream dependencies for SWAPI L3 are present.
Therefore, processing for SWAPI L3 can begin. | ||
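
The SWAPI L3 walkthrough can be replayed with a small simulation. This is an illustrative sketch only: ``UPSTREAM`` and ``on_file_arrival`` are hypothetical names, and files are reduced to (instrument, level) pairs for brevity.

```python
# Upstream requirements for each job, keyed by (instrument, level).
UPSTREAM = {
    ("swapi", "l3"): {("swapi", "l2"), ("mag", "l1d")},
}


def on_file_arrival(arrived, available):
    """Record an arrival and return the jobs that can now start."""
    available.add(arrived)
    ready = []
    for job, required in UPSTREAM.items():
        # A job is a candidate only if the arriving file is one of its
        # dependencies; it starts only if every dependency is present.
        if arrived in required and required <= available:
            ready.append(job)
    return ready


available = set()
first = on_file_arrival(("mag", "l1d"), available)
second = on_file_arrival(("swapi", "l2"), available)
print(first)   # []
print(second)  # [('swapi', 'l3')]
```

The first arrival finds SWAPI L2 missing, so nothing starts. The second arrival completes the set and SWAPI L3 becomes ready, regardless of which of the two files lands first.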
|
||
The status of each file is recorded in the status tracking table. This table records the status of each anticipated output file as "in progress", "complete", or "failed." Through this,
we can track processing for specific files and quickly determine whether a file exists.
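
A minimal sketch of the idea, using a dictionary in place of the real table. The names ``Status``, ``set_status``, and ``file_exists`` are hypothetical, as is the example file name, and treating "complete" as "the file exists" is an assumption rather than documented behavior.

```python
from enum import Enum


class Status(Enum):
    IN_PROGRESS = "in progress"
    COMPLETE = "complete"
    FAILED = "failed"


# Hypothetical stand-in for the status tracking table:
# one entry per anticipated output file.
status_table = {}


def set_status(output_file, status):
    status_table[output_file] = status


def file_exists(output_file):
    # Assumption: an output file exists once its job is marked complete.
    return status_table.get(output_file) is Status.COMPLETE


name = "imap_swapi_l3_sci_20240105_20240105_v00-01.cdf"  # hypothetical file
set_status(name, Status.IN_PROGRESS)
print(file_exists(name))  # False
set_status(name, Status.COMPLETE)
print(file_exists(name))  # True
```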
|
||
Data Dependency Database Table | ||
------------------------------ | ||
|
||
How does the SDC track which files depend on others? To decide what the downstream or upstream dependencies of a file are, and what the nature of those dependencies is, we
need some way to look up the upstream or downstream dependencies of a given file. The current dependencies between instruments are recorded in `Galaxy <https://lasp.colorado.edu/galaxy/display/IMAP/Dependency+Between+Instrument+-+Diagram>`_.
|
||
We handle this using a SQL database. All of our databases are described in detail `here <https://lasp.colorado.edu/galaxy/display/IMAP/SDC+Database+Tables>`_. | ||
|
||
The database has the following structure: | ||
|
||
========== ===== ========== ===================== ================ ===================== ======================== ======================
instrument level descriptor dependency_instrument dependency_level dependency_descriptor relationship_description DOWNSTREAM or UPSTREAM
========== ===== ========== ===================== ================ ===================== ======================== ======================
mag        l1a   norm-mago  mag                   l1b              norm-mago             HARD                     DOWNSTREAM
mag        l1a   norm-magi  mag                   l1b              norm-magi             HARD                     DOWNSTREAM
mag        l1d   norm       swapi                 l3               sci                   HARD                     DOWNSTREAM
swapi      l2    sci        swapi                 l3               sci                   HARD                     DOWNSTREAM
========== ===== ========== ===================== ================ ===================== ======================== ======================
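
To show how such a table can answer both questions, here is a small sketch using an in-memory SQLite database loaded with the four rows above. The table name ``dependency`` and the column name ``direction`` (for the "DOWNSTREAM or UPSTREAM" column) are assumptions made for this example; the real SDC schema and database engine may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dependency (
        instrument TEXT, level TEXT, descriptor TEXT,
        dependency_instrument TEXT, dependency_level TEXT,
        dependency_descriptor TEXT,
        relationship_description TEXT, direction TEXT
    )
    """
)
rows = [
    ("mag", "l1a", "norm-mago", "mag", "l1b", "norm-mago", "HARD", "DOWNSTREAM"),
    ("mag", "l1a", "norm-magi", "mag", "l1b", "norm-magi", "HARD", "DOWNSTREAM"),
    ("mag", "l1d", "norm", "swapi", "l3", "sci", "HARD", "DOWNSTREAM"),
    ("swapi", "l2", "sci", "swapi", "l3", "sci", "HARD", "DOWNSTREAM"),
]
conn.executemany("INSERT INTO dependency VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)


def downstream_of(instrument, level, descriptor):
    """Files that need the given file in order to be processed."""
    cur = conn.execute(
        "SELECT dependency_instrument, dependency_level, dependency_descriptor "
        "FROM dependency "
        "WHERE instrument = ? AND level = ? AND descriptor = ? "
        "AND direction = 'DOWNSTREAM'",
        (instrument, level, descriptor),
    )
    return cur.fetchall()


def upstream_of(instrument, level, descriptor):
    """Files the given file needs, found by reading DOWNSTREAM rows in reverse."""
    cur = conn.execute(
        "SELECT instrument, level, descriptor FROM dependency "
        "WHERE dependency_instrument = ? AND dependency_level = ? "
        "AND dependency_descriptor = ? AND direction = 'DOWNSTREAM' "
        "ORDER BY instrument",
        (instrument, level, descriptor),
    )
    return cur.fetchall()


print(downstream_of("mag", "l1d", "norm"))  # [('swapi', 'l3', 'sci')]
print(upstream_of("swapi", "l3", "sci"))
# [('mag', 'l1d', 'norm'), ('swapi', 'l2', 'sci')]
```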
|
||
Dependency Types | ||
---------------- | ||
|
||
Right now, we only have HARD dependencies. This means that processing of a file is blocked until all of its upstream dependencies exist. However, we also have requirements for SOFT dependencies, where
processing should use the file if it exists, but can continue without it. This is a future feature. The relationship description column exists so we can define specific relationships between
files.
|
||
This can also include information on reprocessing rules. | ||
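
Since SOFT dependencies are a future feature, the following is only a sketch of how the two types might differ when a job's inputs are gathered. The function ``resolve_inputs`` and the file keys ``science`` and ``ancillary`` are hypothetical and do not describe any existing SDC behavior.

```python
def resolve_inputs(dependencies, available):
    """Decide whether a job can run and which inputs it would use.

    dependencies: list of (file_key, kind) pairs, kind is "HARD" or "SOFT".
    available: set of file keys that have already arrived.
    Returns (can_run, inputs).
    """
    inputs = []
    for file_key, kind in dependencies:
        if file_key in available:
            # Present files are always used, whatever their type.
            inputs.append(file_key)
        elif kind == "HARD":
            # A missing HARD dependency blocks the job entirely.
            return False, []
        # A missing SOFT dependency is skipped.
    return True, inputs


deps = [("science", "HARD"), ("ancillary", "SOFT")]
print(resolve_inputs(deps, {"science"}))               # (True, ['science'])
print(resolve_inputs(deps, {"ancillary"}))             # (False, [])
print(resolve_inputs(deps, {"science", "ancillary"}))  # (True, ['science', 'ancillary'])
```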
|
@@ -97,4 +97,5 @@ Other pages | |
.. toctree:: | ||
:maxdepth: 1 | ||
|
||
calibration-files | ||
calibration-files | ||
data-dependency |
Really clean and clear explanation of the processing system and dependency. Nice work! and Thank you!