Implement prepbufr raobs obs builder 185 #395
base: main
Conversation
…appropriate data for the report type
I recommend using 3.12 because I have done all my prepbufr testing on 3.12.
I know that version works with xarray, NCEPLIBS-bufr, and netcdf all of
which are needed for the ingest container.
…On Fri, Aug 2, 2024 at 9:40 AM Ian McGinnis wrote:
------------------------------
In pyproject.toml
<#395 (comment)>:
> -python = "^3.11"
+python = "^3.12"
If we're updating to Python 3.12, we should update our dockerfile & CI
too. Unless 3.12 is required for the RAOBS, I could see an argument for
making that a separate change/PR to keep the number of changes down. That
being said, I'd be in favor of updating to 3.12 if all of our dependencies
work with it.
------------------------------
In pyproject.toml
<#395 (comment)>:
> @@ -25,6 +26,7 @@ pytest = "^8.1.1"
types-pyyaml = "^6.0.12.20240311"
ruff = "^0.3.5"
coverage = "^7.4.4"
+ipykernel = "^6.29.4"
Do we need the jupyter python kernel for this project? I know sometimes
VSCode insists on including it but if it's not required, I'd rather remove
it so we don't have to worry about updating it for the security scanners.
------------------------------
In pyproject.toml
<#395 (comment)>:
> pyyaml = "^6.0.1"
xarray = "^2024.3.0"
+pyarrow = "^16.1.0"
I wasn't seeing pyarrow imported anywhere, did I miss a use case? I know
it can be useful for working with Parquet files & doing columnar operations
on data in memory. So it could be interesting to evaluate at some point.
------------------------------
In src/vxingest/grib2_to_cb/grib_builder.py
<#395 (comment)>:
> - spfh = []
+ specific_humidity = []
This is a great naming improvement. Thanks!
------------------------------
In src/vxingest/grib2_to_cb/grib_builder.py
<#395 (comment)>:
> for station in self.domain_stations:
geo_index = get_geo_index(
self.ds_translate_item_variables_map["fcst_valid_epoch"], station["geo"]
)
x_gridpoint = station["geo"][geo_index]["x_gridpoint"]
y_gridpoint = station["geo"][geo_index]["y_gridpoint"]
- spfh.append((float)(self.interp_grid_box(values, y_gridpoint, x_gridpoint)))
- return spfh
+ specific_humidity.append(
+ (float)(self.interp_grid_box(values, y_gridpoint, x_gridpoint))
We could lose an extra set of parentheses here to improve readability -
the ones around float don't do anything.
⬇️ Suggested change
- (float)(self.interp_grid_box(values, y_gridpoint, x_gridpoint))
+ float(self.interp_grid_box(values, y_gridpoint, x_gridpoint))
------------------------------
In src/vxingest/grib2_to_cb/grib_builder.py
<#395 (comment)>:
> +class RaobModelNativeBuilderV01(GribModelBuilderV01):
+ """This is the builder for model data that is ingested from grib2 NATIVE levels files.
+ It is a concrete builder specifically for the model raob data that are organized based
+ on the models preset vertical levels. This varies quite a bit from model to model
+ and is dependent on the configuration set up before the model runs.
+ This builder is a subclass of the GribModelBuilderV01 class.
+ The primary differences in these two classes are the handlers that derive the pressure level.
+ The pressure level needs to be interpolated according to a specific algorithm.
+
+ Args:
+ load_spec (Object): The load spec used to init the parent
+ ingest_document (Object): the ingest document
+ number_stations (int, optional): the maximum number of stations to process (for debugging). Defaults to sys.maxsize.
More so I can follow along - do we ingest RAOBs from GRIB? I thought it
was all sourced from PREPBUFR right now.
------------------------------
In src/vxingest/prepbufr_to_cb/README.md
<#395 (comment)>:
> +```text
+ To begin with, a PREPBUFR file does not always contain, within each single data subset, the data for an entire report! Instead, for reports which contain mass (i.e. temperature, moisture, etc.) as well as wind (i.e. direction and speed, U and V component, etc.) data values, such data values are stored within two separate but adjacent (within the overall file) data subsets, where each related subset, quite obviously, contains the same report time, location, station identification, etc. information as the other, but where the "mass" subset contains the pressures and/or height levels at which "mass" data values occur, while the corresponding "wind" subset contains the levels at which "wind" data values occur. While it is true that this may, in some cases, cause the same pressure and/or height level to appear in both subsets, this separation is nonetheless maintained for historical reasons peculiar to NCEP. At any rate, the below program will actually merge all of the data from both subsets into a single, unified report in such cases, so that the final decoded output is clearer and more intuitive.
+```
Using a code block for this confused me for a minute as to whether you were
quoting a source document or adding an aside. If it's a quote, you can use
Markdown's "quote" syntax (>) and do:
⬇️ Suggested change
-```text
- To begin with, a PREPBUFR file does not always contain, within each single data subset, the data for an entire report! Instead, for reports which contain mass (i.e. temperature, moisture, etc.) as well as wind (i.e. direction and speed, U and V component, etc.) data values, such data values are stored within two separate but adjacent (within the overall file) data subsets, where each related subset, quite obviously, contains the same report time, location, station identification, etc. information as the other, but where the "mass" subset contains the pressures and/or height levels at which "mass" data values occur, while the corresponding "wind" subset contains the levels at which "wind" data values occur. While it is true that this may, in some cases, cause the same pressure and/or height level to appear in both subsets, this separation is nonetheless maintained for historical reasons peculiar to NCEP. At any rate, the below program will actually merge all of the data from both subsets into a single, unified report in such cases, so that the final decoded output is clearer and more intuitive.
-```
+> To begin with, a PREPBUFR file does not always contain, within each single data subset, the data for an entire report! Instead, for reports which contain mass (i.e. temperature, moisture, etc.) as well as wind (i.e. direction and speed, U and V component, etc.) data values, such data values are stored within two separate but adjacent (within the overall file) data subsets, where each related subset, quite obviously, contains the same report time, location, station identification, etc. information as the other, but where the "mass" subset contains the pressures and/or height levels at which "mass" data values occur, while the corresponding "wind" subset contains the levels at which "wind" data values occur. While it is true that this may, in some cases, cause the same pressure and/or height level to appear in both subsets, this separation is nonetheless maintained for historical reasons peculiar to NCEP.
I also removed the last sentence since I originally took it to refer to our
ingest, when it actually refers to Fortran code displayed on the website.
------------------------------
In src/vxingest/prepbufr_to_cb/README.md
<#395 (comment)>:
> +I'm only putting this here temporarily so that I don't lose it before it gets implemented.
+RUC domain
+RRFS North American domain
+Great Lakes
+Global (all lat/lon)
+Tropics (-20 <= lat <= +20)
+Southern Hemisphere (-80 <= lat < -20)
+Northern Hemisphere (+20 < lat <= +80)
+Arctic (lat >= +70) -- Might want to change this to lat >= 60N to match EMC?
+Antarctic (lat <= -70) -- Might want to change this to lat <= 60S to match EMC?
+Alaska
+Hawaii
+HRRR domain
+Eastern HRRR domain
+Western HRRR domain
+CONUS
+Eastern CONUS (lon <= 100W)
+Western CONUS (lon <= 100W)
+Northeastern CONUS
+Southeastern CONUS
+Central CONUS
+Southern CONUS
+Northwest CONUS
+Southern Plain
Markdown treats single newlines as part of a paragraph so this renders
weirdly. If this is still needed, the following will render better:
⬇️ Suggested change
-I'm only putting this here temporarily so that I don't lose it before it gets implemented.
-RUC domain
-RRFS North American domain
-Great Lakes
-Global (all lat/lon)
-Tropics (-20 <= lat <= +20)
-Southern Hemisphere (-80 <= lat < -20)
-Northern Hemisphere (+20 < lat <= +80)
-Arctic (lat >= +70) -- Might want to change this to lat >= 60N to match EMC?
-Antarctic (lat <= -70) -- Might want to change this to lat <= 60S to match EMC?
-Alaska
-Hawaii
-HRRR domain
-Eastern HRRR domain
-Western HRRR domain
-CONUS
-Eastern CONUS (lon <= 100W)
-Western CONUS (lon <= 100W)
-Northeastern CONUS
-Southeastern CONUS
-Central CONUS
-Southern CONUS
-Northwest CONUS
-Southern Plain
+I'm only putting this here temporarily so that I don't lose it before it gets implemented.
+
+* RUC domain
+* RRFS North American domain
+* Great Lakes
+* Global (all lat/lon)
+* Tropics (-20 <= lat <= +20)
+* Southern Hemisphere (-80 <= lat < -20)
+* Northern Hemisphere (+20 < lat <= +80)
+* Arctic (lat >= +70) -- Might want to change this to lat >= 60N to match EMC?
+* Antarctic (lat <= -70) -- Might want to change this to lat <= 60S to match EMC?
+* Alaska
+* Hawaii
+* HRRR domain
+* Eastern HRRR domain
+* Western HRRR domain
+* CONUS
+* Eastern CONUS (lon <= 100W)
+* Western CONUS (lon <= 100W)
+* Northeastern CONUS
+* Southeastern CONUS
+* Central CONUS
+* Southern CONUS
+* Northwest CONUS
+* Southern Plain
------------------------------
In src/vxingest/prepbufr_to_cb/README.md
<#395 (comment)>:
> +HRRR domain
+Eastern HRRR domain
+Western HRRR domain
+CONUS
+Eastern CONUS (lon <= 100W)
+Western CONUS (lon <= 100W)
+Northeastern CONUS
+Southeastern CONUS
+Central CONUS
+Southern CONUS
+Northwest CONUS
+Southern Plain
+
+## Ingest template
+The ingest template for prepbufr RAOBS is "MD:V01:RAOB:obs:ingest:prepbufr".
+It follows the same small Domain Specific Language (DSL) that all ingest templates follow. This is the template portion...
For my knowledge - do we have the DSL documented somewhere? You've
explained the syntax with *, &, and | but I keep forgetting it, so it'd
be handy to have a reference. I was thinking it'd be nice to have a link to
that reference here.
If not, I'll make an issue to document it in something like a
docs/ingest-dsl.md file and then link to it where appropriate.
------------------------------
In src/vxingest/prepbufr_to_cb/README.md
<#395 (comment)>:
> +There are four sections of mappings.
+1 header basic header data like lat, lon, and station name
+2 q_marker quality data
+3 obs_err observation error data
+4 obs_data_120 observation MASS data
+5 obs_data_220 observation WIND data
This also renders weirdly due to how Markdown handles newlines. If it's an
ordered list, I'd recommend:
⬇️ Suggested change
-There are four sections of mappings.
-1 header basic header data like lat, lon, and station name
-2 q_marker quality data
-3 obs_err observation error data
-4 obs_data_120 observation MASS data
-5 obs_data_220 observation WIND data
+There are five sections of mappings.
+
+1. `header` basic header data like lat, lon, and station name
+2. `q_marker` quality data
+3. `obs_err` observation error data
+4. `obs_data_120` observation MASS data
+5. `obs_data_220` observation WIND data
Otherwise, you could do a code block or a table to preserve the formatting:
⬇️ Suggested change
-There are four sections of mappings.
-1 header basic header data like lat, lon, and station name
-2 q_marker quality data
-3 obs_err observation error data
-4 obs_data_120 observation MASS data
-5 obs_data_220 observation WIND data
+```text
+There are five sections of mappings.
+1 header basic header data like lat, lon, and station name
+2 q_marker quality data
+3 obs_err observation error data
+4 obs_data_120 observation MASS data
+5 obs_data_220 observation WIND data
+```
------------------------------
On src/vxingest/prepbufr_to_cb/run_ingest_threads.py
<#395 (comment)>:
A general note, we'll need to incorporate this into main.py.
We'll import the prepbufr_to_cb module here with an identifiable name
like PrepbufrIngest if this is general to PREPBUFR files, or
PrepbufrRaobIngest if we'll need to add a new module for other PREPBUFR
data types:
https://github.com/NOAA-GSL/VxIngest/blob/b43bb43838716d365eb19d76189d9ae40a4a395b/src/vxingest/main.py#L26-L35
And then, it's not my favorite approach, but we'll need to add an entry to
our large switch statement, here:
https://github.com/NOAA-GSL/VxIngest/blob/b43bb43838716d365eb19d76189d9ae40a4a395b/src/vxingest/main.py#L404-L467
We also may need to update the CLI flags if we have custom flags in the
code here.
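To sketch what I mean (a rough illustration only - the `create_builder` helper and the `"prepbufr"` dispatch key are assumptions, and the import just mirrors how main.py aliases the other ingest modules):

```python
# Hypothetical sketch of wiring the new builder into main.py.
# The function and key names are assumptions for illustration.
from vxingest.prepbufr_to_cb.run_ingest_threads import VXIngest as PrepbufrRaobIngest


def create_builder(ingest_type: str):
    # ...existing branches for the other ingest types would live here...
    if ingest_type == "prepbufr":
        return PrepbufrRaobIngest()
    raise ValueError(f"Unknown ingest type: {ingest_type}")
```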
------------------------------
On src/vxingest/prepbufr_to_cb/prepbufr_builder.py
<#395 (comment)>:
Generally, I'm concerned about how deeply nested parts of the
prepbufr_to_cb module are (e.g. interpolate_data and handle_document).
Deeply nested code has been shown to be difficult to reason about and debug
and is mentioned in the Zen of Python - "Flat is better than nested."
<https://peps.python.org/pep-0020/> Typically, folks recommend keeping
nesting to 2-3 levels deep. They deal with deeper nesting via guard clauses
and extracting functions (see the sketch after the links below). If you'd like to refactor this, I'd find it
really useful to pair with you on it. Selfishly, I'd love to better
understand the ingest code & PREPBUFR data.
Some resources:
- Google's Code Health Blog: Nesting guidance
<https://testing.googleblog.com/2017/06/code-health-reduce-nesting-reduce.html?m=1>
- Google's Code Health Blog: Simplifying Control flow
<https://testing.googleblog.com/2023/10/simplify-your-control-flows.html?m=1>
- Jeff Atwood on "Flattening arrow code"
<https://blog.codinghorror.com/flattening-arrow-code/>
- Two methods for refactoring deeply nested code
<https://shuhanmirza.medium.com/two-simple-methods-to-refactor-deeply-nested-code-78eb302bb0b4>
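For illustration, here's a generic before/after with stand-in names (not code from this PR) showing how a guard clause flattens one level of nesting:

```python
def process(level):
    # Stand-in for the real per-level work
    print(level)


# Nested version: each condition adds a level of indentation.
def handle(report):
    if report is not None:
        if report.levels:
            for level in report.levels:
                process(level)


# Flattened version: the guard clause exits early,
# so the main logic stays at a single indentation level.
def handle_flat(report):
    if report is None:
        return
    for level in report.levels:
        process(level)
```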
------------------------------
In src/vxingest/prepbufr_to_cb/prepbufr_builder.py
<#395 (comment)>:
> + # I cannot process this station - there is no array of pressure data
+ del interpolated_data[station]
Do we want to remove the station from the data if it can't be processed, or
do we want to set it to None? del has its uses, but it's rare enough to see
in Python that this caught my eye.
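For comparison, a tiny sketch of the two options (the dictionary contents are made up; station IDs borrowed from the test data):

```python
# Made-up stand-in for the builder's interpolated data.
interpolated_data = {"72393": {"pressure": []}, "70026": {"pressure": [1000.0]}}
station = "72393"

# Current approach: drop the station entirely; later iteration never sees it.
del interpolated_data[station]

# Alternative: keep the key with a sentinel so consumers can tell
# "station skipped" apart from "station never present".
# interpolated_data[station] = None
```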
------------------------------
In src/vxingest/prepbufr_to_cb/prepbufr_builder.py
<#395 (comment)>:
> + :return: the document_map
+ """
+ try:
+ if len(self.same_time_rows) != 0:
+ self.handle_document()
+ return self.document_map
+ except Exception as _e:
+ logger.exception(
+ "%s get_document_map: Exception in get_document_map: %s",
+ self.__class__.__name__,
+ str(_e),
+ )
+ return None
+
+ # named functions
+ def meterspersecond_to_milesperhour(self, params_dict):
Two thoughts:
1. Can we utilize Pint for these conversions?
2. If not, should we move these "data transformation" functions into a
separate module so we can use them across the various ingest types?
Bigger picture, and more as a discussion point, I wonder if we should make
more use of Pint within the new Ingest to help wrangle units. If we did
want to do that, it'd be handled in a separate issue.
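For reference, a minimal sketch of what option 1 could look like (assumes adding pint as a dependency, which is not currently in pyproject.toml):

```python
# Minimal Pint sketch - illustrative, not the PR's implementation.
import pint

ureg = pint.UnitRegistry()


def meterspersecond_to_milesperhour(value_ms: float) -> float:
    """Convert a wind speed from m/s to mph via Pint's unit registry."""
    return (value_ms * ureg("m/s")).to("mph").magnitude


print(meterspersecond_to_milesperhour(10.0))  # ~22.37
```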
------------------------------
In tests/vxingest/README.md
<#395 (comment)>:
> @@ -40,9 +40,9 @@ Note that this currently (as of 1/2024) disables most of the tests.
## Test data
-For now, you'll need test resources from: https://drive.google.com/drive/folders/18YY74S8w2S0knKQRN-QxZdnfRjKxDN69?usp=drive_link unpacked to `/opt/data` in order to run the test suite.
+For now, you'll need test resources from: [opt_data.tar.gz](https://drive.google.com/drive/folders/18YY74S8w2S0knKQRN-QxZdnfRjKxDN69?usp=drive_linkunpacked) to '/opt/data' in order to run the test suite.
Looks like "unpacked" was added to the URL here and that made it invalid:
⬇️ Suggested change
-For now, you'll need test resources from: [opt_data.tar.gz](https://drive.google.com/drive/folders/18YY74S8w2S0knKQRN-QxZdnfRjKxDN69?usp=drive_linkunpacked) to '/opt/data' in order to run the test suite.
+For now, you'll need test resources from: [opt_data.tar.gz](https://drive.google.com/drive/folders/18YY74S8w2S0knKQRN-QxZdnfRjKxDN69) to '/opt/data' in order to run the test suite.
------------------------------
In tests/vxingest/prepbufr_to_cb/test_unit_prepbufr_builder.py
<#395 (comment)>:
> +def test_read_header(mock_header_bufr):
+ # Create an instance of PrepbufrBuilder
+ builder = PrepbufrRaobsObsBuilderV01(
+ None,
+ {
+ "template": {"subset": "RAOB"},
+ "ingest_document_ids": {},
+ "file_type": "PREPBUFR",
+ "origin_type": "GDAS",
+ "mnemonic_mapping": hdr_template,
+ },
+ )
+
+ # Call the read_header method with the mock bufr object
+ header_data = builder.read_data_from_bufr(mock_header_bufr, hdr_template)
+
+ # Assert the expected values
+ assert header_data["station_id"] == "SID123"
+ assert header_data["lon"] == 45.679
+ assert header_data["lat"] == -123.457
+ assert header_data["obs-cycle_time"] == 0.5
+ assert header_data["station_type"] == 1
+ assert header_data["elevation"] == 100.0
+ assert header_data["report_type"] == 2
I like the way the data is mocked out here!
------------------------------
In tests/vxingest/prepbufr_to_cb/test_int_read_data_from_file.py
<#395 (comment)>:
> +def test_read_header():
+ queue_element = (
+ "/opt/data/prepbufr_to_cb/input_files/241011200.gdas.t12z.prepbufr.nr"
+ )
+ vx_ingest = setup_connection()
+ ingest_doc = vx_ingest.collection.get("MD:V01:RAOB:obs:ingest:prepbufr").content_as[
+ dict
+ ]
+ template = ingest_doc["mnemonic_mapping"]
+ builder = PrepbufrRaobsObsBuilderV01(
+ None,
+ ingest_doc,
+ )
+
+ bufr = ncepbufr.open(queue_element)
+ bufr.advance()
+ assert bufr.msg_type == template["bufr_msg_type"], "Expected ADPUPA message type"
+ bufr.load_subset()
+ header = builder.read_data_from_bufr(bufr, template["header"])
+ bufr.close()
+ assert header is not None
+ assert header["station_id"] == "89571"
+ assert header["lon"] == 77.97
+ assert header["lat"] == -68.58
+ assert header["obs-cycle_time"] == -0.5
+ assert header["elevation"] == 18.0
+ assert header["data_dump_report_type"] == 11.0
+ assert header["report_type"] == 120
We may need to mark these as integration tests for CI with
@pytest.mark.integration().
@pytest.mark.integration()
def test_read_header():
    ...
------------------------------
In src/vxingest/grib2_to_cb/grib_builder.py
<#395 (comment)>:
> +class RaobModelNativeBuilderV01(GribModelBuilderV01):
+ """This is the builder for model data that is ingested from grib2 NATIVE levels files.
+ It is a concrete builder specifically for the model raob data that are organized based
+ on the models preset vertical levels. This varies quite a bit from model to model
+ and is dependent on the configuration set up before the model runs.
+ This builder is a subclass of the GribModelBuilderV01 class.
+ The primary differences in these two classes are the handlers that derive the pressure level.
+ The pressure level needs to be interpolated according to a specific algorithm.
+
+ Args:
+ load_spec (Object): The load spec used to init the parent
+ ingest_document (Object): the ingest document
+ number_stations (int, optional): the maximum number of stations to process (for debugging). Defaults to sys.maxsize.
We discussed this during an in-person meeting - these are placeholder
classes for extracting model data to compare with the RAOB obs data.
Co-authored-by: Ian McGinnis <67600557+ian-noaa@users.noreply.github.com>
good catch
…ub.com/NOAA-GSL/VxIngest into implement_prepbufr_RaobsObsBuilder_185
I added a couple notes based on your emails from Friday. It looks like you've resolved most of the lint issues but let me know if you have questions about the linter!
I have resolved the lint issues, but the tests are failing because CI cannot
find the ncepbufr module. I realize that I need to log onto AWS and build
the x86 Linux version, but I don't have access to an x86 Mac (mine is ARM)
to build that version. Even when I build the new module, I don't remember
how CI finds the ncepbufr module - that still has me slightly confused.
I think I will also add an integration test that has main.py as its entry
point so no one will make the mistake of not integrating a new builder into
main.py again.
randy
…On Mon, Aug 5, 2024 at 9:27 AM Ian McGinnis wrote:
------------------------------
In docker/ingest/Dockerfile
<#395 (comment)>:
> -FROM python:3.11-slim-bookworm AS builder
+FROM python:3.12-slim-bookworm AS builder
We have a number of layers in this Dockerfile so this will also need to be
done on Line 6 & Line 72.
------------------------------
In src/vxingest/prepbufr_to_cb/vx_ingest_manager.py
<#395 (comment)>:
> + stmnt_mysql = f'select wmoid,press,z,t,dp,rh,wd,ws from ruc_ua_pb.RAOB where date = "{date}" and press = {level} and wmoid = "{station}";'
+ _mysql_db = mysql.connector.connect(
+ host=self.load_spec["_mysql_host"],
+ user=self.load_spec["_mysql_user"],
+ password=self.load_spec["_mysql_pwd"],
+ )
+ my_cursor = _mysql_db.cursor()
+ my_cursor.execute(stmnt_mysql)
+ my_result_final = my_cursor.fetchall()
I missed this on my initial walkthrough - do we need to connect to a MySQL
database? I thought VxIngest was Couchbase-only so that we could move away
from our old MySQL DB.
And - if we do need that data, is there any other possible way to get it?
I'd really like to avoid making the new ingest dependent on the old ingest.
…raded numpy (don't have a platform to rebuild it.)
Thanks! I'll try this out and let you know. I thought Poetry followed the
Python markers. Where, BTW, did you find the reference to PEP 508 markers?
randy
…On Tue, Aug 6, 2024 at 10:38 AM Ian McGinnis wrote:
------------------------------
In pyproject.toml
<#395 (comment)>:
> +ncepbufr = [
+ { platform = "linux_x86_64", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.0.1-py312-none-linux_x86_64.whl" },
+ { platform = "macosx_14_0_arm64", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.0.1-py312-none-macosx_14_0_arm64.whl" }
+]
It looks like Poetry's platform field takes one of darwin, linux, or win32.
To get at the architecture, Poetry supports PEP 508 markers
<https://peps.python.org/pep-0508/#environment-markers>. You'd do markers
= "platform_machine == 'arm64'" for ARM, for example.
I also bumped the version number to match the new nceplibs version in the
recommendation below.
⬇️ Suggested change
-ncepbufr = [
- { platform = "linux_x86_64", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.0.1-py312-none-linux_x86_64.whl" },
- { platform = "macosx_14_0_arm64", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.0.1-py312-none-macosx_14_0_arm64.whl" }
-]
+ncepbufr = [
+ { platform = "linux", markers = "platform_machine == 'x86_64'", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.1.0-py312-none-linux_x86_64.whl" },
+ { platform = "darwin", markers = "platform_machine == 'arm64'", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.1.0-py312-none-macosx_14_0_arm64.whl" }
+]
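Incidentally, if you want to sanity-check a marker string locally, the packaging library can evaluate it against the current environment (a small sketch; assumes packaging is installed, which Poetry itself depends on):

```python
from packaging.markers import Marker

marker = Marker("platform_machine == 'arm64'")
# Evaluates against the running interpreter's own environment:
print(marker.evaluate())  # True on Apple Silicon macOS, False on x86_64
```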
Certainly! Poetry references them in their Dependency Specification guide.
Now that I'm back from leave - what's the status on this PR? Last I recall, we had discovered we needed to install nceplibs-bufr.
It'd be good to get this finished and merged before we get pulled away onto other issues.
It looks like I had a few pending PR comments around the Dockerfile & nceplibs-bufr version so I'm adding those here to make sure they don't get forgotten.
I have discovered two additional discrepancies in the data results compared
to the legacy system, which is based on Ming's data assimilation code. One
had to do with q-markers, and the handling for that has now been
implemented. The other is more subtle: it involves multiple entries of type
120 and type 220 reports within a single GDAS prepbufr file. This must be
understood before we consider this complete.
…On Wed, Aug 21, 2024 at 10:58 AM Ian McGinnis wrote:
ncepbufr = [
    { platform = "linux", markers = "platform_machine == 'x86_64'", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.1.0-py312-none-linux_x86_64.whl" },
    { platform = "darwin", markers = "platform_machine == 'arm64'", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.1.0-py312-none-macosx_14_0_arm64.whl" }
]
We should do ARM Linux as well - it's an important platform in AWS:
⬇️ Suggested change
-ncepbufr = [
-    { platform = "linux", markers = "platform_machine == 'x86_64'", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.1.0-py312-none-linux_x86_64.whl" },
-    { platform = "darwin", markers = "platform_machine == 'arm64'", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.1.0-py312-none-macosx_14_0_arm64.whl" }
-]
+ncepbufr = [
+    { platform = "linux", markers = "platform_machine == 'x86_64'", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.1.0-py312-none-linux_x86_64.whl" },
+    { platform = "linux", markers = "platform_machine == 'aarch64'", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.1.0-py312-none-linux_aarch64.whl" },
+    { platform = "darwin", markers = "platform_machine == 'arm64'", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.1.0-py312-none-macosx_14_0_arm64.whl" }
+]
-json_data = json.dumps(list(document_map.values()))
+json_data = json.dumps(list(document_map.values()), indent=2)
Do we want to pretty-print the JSON when writing it out? It'll take up extra space compared to the minified version.
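As a rough illustration of the size difference (standard-library json only; the sample data is made up, with field values borrowed from the test assertions above):

```python
import json

data = [{"station_id": "89571", "lat": -68.58, "lon": 77.97}] * 100

minified = json.dumps(data)
pretty = json.dumps(data, indent=2)
print(len(minified), len(pretty))  # the pretty-printed output is noticeably larger
```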
def parse_args(self, args):
    """This method is intended to be overridden"""
    """This method is intended to be overridden"""
Looks like the docstring was accidentally duplicated
(pathlib.PurePath(filename).name).split(".")[0],
file_mask,
Pathlib has a handy .stem method for extracting the filename without the extension:
⬇️ Suggested change
-(pathlib.PurePath(filename).name).split(".")[0],
-file_mask,
+pathlib.Path(filename).stem,
+file_mask,
I believe we still need the new 12.1.0 libraries built & added to third_party.
def test_one_thread_specify_file_pattern(tmp_path: Path):
    """Note: this test takes a long time to run (few minutes)"""
    try:
        log_queue = Queue()
        vx_ingest = setup_connection()
        # # stations = [
        # #     "70026",
        # #     "72393",
        # #     "74794",
        # #     "71119",
        # #     "76225",
        # #     "76256",
        # #     "76458",
        # #     "76526",
        # #     "76595",
        # #     "76612",
        # #     "76644",
        # #     "76654",
        # #     "76679",
        # #     "76692",
        # #     "76743",
        # #     "76903",
        # #     "78384",
        # #     "78397",
        # #     "78486",
        # #     "78526",
        # #     "78583",
        # #     "78954",
        # #     "78970",
        # #     "82022",
        # #     "82026",
        # #     "82099",
        # #     "82107",
        # #     "82193",
        # #     "82244",
        # #     "82332",
        # #     "82411",
        # #     "82532",
        # #     "82599",
        # #     "82705",
        # # ]
        # print("Testing stations: ", stations)
        # print(f"output path is: {tmp_path}")
        # vx_ingest.write_data_for_station_list = stations
        # vx_ingest.write_data_for_levels = [200, 300, 500, 700, 900]
        try:
            vx_ingest.runit(
                {
                    "job_id": "JOB-TEST:V01:RAOB:PREPBUFR:OBS",
                    "credentials_file": os.environ["CREDENTIALS"],
                    "file_name_mask": "%y%j%H%M",  # only tests the first part of the file name i.e. 241011200.gdas.t12z.prepbufr.nr -> 241011200
                    "output_dir": f"{tmp_path}",
                    "threads": 1,
                    "file_pattern": "242130000*",  # specifically /opt/data/prepbufr_to_cb/input_files/242130000.gdas.t00z.prepbufr.nr,
                    # "file_pattern": "242131200*",  # specifically /opt/data/prepbufr_to_cb/input_files/242131200.gdas.t00z.prepbufr.nr,
                    # "file_pattern": "242121800*",  # specifically /opt/data/prepbufr_to_cb/input_files/242121800.gdas.t00z.prepbufr.nr,
                    # "file_pattern": "241570000*",  # specifically /opt/data/prepbufr_to_cb/input_files/241570000.gdas.t00z.prepbufr.nr,
                },
                log_queue,
                stub_worker_log_configurer,
            )
        except Exception as e:
            raise AssertionError(f"Exception: {e}") from e
        # Test that we have one or more output files
        output_file_list = list(
            tmp_path.glob(
                "[0123456789]????????.gdas.t[0123456789][0123456789]z.prepbufr.nr.json"
            )
        )

        # Test that we have one "load job" ("LJ") document
        lj_doc_regex = (
            "LJ:RAOB:vxingest.prepbufr_to_cb.run_ingest_threads:VXIngest:*.json"
        )
        num_load_job_files = len(list(tmp_path.glob(lj_doc_regex)))
        assert (
            num_load_job_files >= 1
        ), f"Number of load job files is incorrect {num_load_job_files} is not >= 1"

        # Test that we have one output file per input file
        input_path = Path("/opt/data/prepbufr_to_cb/input_files")
        num_input_files = len(list(input_path.glob("242130000*")))
        # num_input_files = len(list(input_path.glob("242131200*")))
        # num_input_files = len(list(input_path.glob("242121800*")))
        # num_input_files = len(list(input_path.glob("241011200*")))
        num_output_files = len(output_file_list)
I suggest cleaning up the commented-out code here.
We should check and make sure all FROM lines are using the same version of
Python - the "builder" image is still using 3.11 here.
def write_data_for_debug(self, builder, document_map):
    """
    write the raw data and interpolated for a specific set of stations for debugging purposes
    """
Do we still need the debug function? If not, we can eliminate the tabulate
& mysql.connector dependencies, which would be good from a security
standpoint.
Otherwise, if we do want to keep this functionality around, it'd be good to
move it to a test suite or utility script so that we can move tabulate &
mysql.connector from production to dev dependencies in our pyproject.toml
file.
_val = (
    round(float(line.split()[1]))
    if line.split()[1] != "MISSING"
    else None
)
This pattern is repeated a lot throughout this script - it should be extracted into a function so it's easier to update & maintain.
def process_line(line: str) -> float | None:
"""
Could use a better name & docstring
Assumes a line that looks like: <some example input>
"""
parts = line.split()
value = parts[1]
return round(float(value)) if value != "MISSING" else None
parser.add_argument(
    "--station_list",
    type=list,
    required=False,
    default=[],
    help="The list of station ids for a RAOB prepbufr diagnostic report. Default is [].",
)
parser.add_argument(
    "--levels_list",
    type=list,
    required=False,
    default=[],
    help="The list of levels for a RAOB prepbufr diagnostic report. Default is [].",
)
These two CLI flags appear to be unused. We should remove them, if so.
This is the implementation of the prepbufr builder.