Implement prepbufr raobs obs builder 185 #395
base: main
Conversation
…appropriate data for the report type
I recommend using 3.12 because I have done all my prepbufr testing on 3.12.
I know that version works with xarray, NCEPLIBS-bufr, and netcdf all of
which are needed for the ingest container.
…On Fri, Aug 2, 2024 at 9:40 AM Ian McGinnis wrote:
------------------------------
In pyproject.toml
<#395 (comment)>:
> -python = "^3.11"
+python = "^3.12"
If we're updating to Python 3.12, we should update our dockerfile & CI
too. Unless 3.12 is required for the RAOBS, I could see an argument for
making that a separate change/PR to keep the number of changes down. That
being said, I'd be in favor of updating to 3.12 if all of our dependencies
work with it.
------------------------------
In pyproject.toml
<#395 (comment)>:
> @@ -25,6 +26,7 @@ pytest = "^8.1.1"
types-pyyaml = "^6.0.12.20240311"
ruff = "^0.3.5"
coverage = "^7.4.4"
+ipykernel = "^6.29.4"
Do we need the jupyter python kernel for this project? I know sometimes
VSCode insists on including it but if it's not required, I'd rather remove
it so we don't have to worry about updating it for the security scanners.
------------------------------
In pyproject.toml
<#395 (comment)>:
> pyyaml = "^6.0.1"
xarray = "^2024.3.0"
+pyarrow = "^16.1.0"
I wasn't seeing pyarrow imported anywhere, did I miss a use case? I know
it can be useful for working with Parquet files & doing columnar operations
on data in memory. So it could be interesting to evaluate at some point.
------------------------------
In src/vxingest/grib2_to_cb/grib_builder.py
<#395 (comment)>:
> - spfh = []
+ specific_humidity = []
This is a great naming improvement. Thanks!
------------------------------
In src/vxingest/grib2_to_cb/grib_builder.py
<#395 (comment)>:
> for station in self.domain_stations:
geo_index = get_geo_index(
self.ds_translate_item_variables_map["fcst_valid_epoch"], station["geo"]
)
x_gridpoint = station["geo"][geo_index]["x_gridpoint"]
y_gridpoint = station["geo"][geo_index]["y_gridpoint"]
- spfh.append((float)(self.interp_grid_box(values, y_gridpoint, x_gridpoint)))
- return spfh
+ specific_humidity.append(
+ (float)(self.interp_grid_box(values, y_gridpoint, x_gridpoint))
We could lose an extra set of parentheses here to improve readability -
the ones around float don't do anything.
⬇️ Suggested change
- (float)(self.interp_grid_box(values, y_gridpoint, x_gridpoint))
+ float(self.interp_grid_box(values, y_gridpoint, x_gridpoint))
------------------------------
In src/vxingest/grib2_to_cb/grib_builder.py
<#395 (comment)>:
> +class RaobModelNativeBuilderV01(GribModelBuilderV01):
+ """This is the builder for model data that is ingested from grib2 NATIVE levels files.
+ It is a concrete builder specifically for the model raob data that are organized based
+ on the models preset vertical levels. This varies quite a bit from model to model
+ and is dependent on the configuration set up before the model runs.
+ This builder is a subclass of the GribModelBuilderV01 class.
+ The primary differences in these two classes are the handlers that derive the pressure level.
+ The pressure level needs to be interpolated according to a specific algorithm.
+
+ Args:
+ load_spec (Object): The load spec used to init the parent
+ ingest_document (Object): the ingest document
+ number_stations (int, optional): the maximum number of stations to process (for debugging). Defaults to sys.maxsize.
More so I can follow along - do we ingest RAOBs from GRIB? I thought it
was all sourced from PREPBUFR right now.
------------------------------
In src/vxingest/prepbufr_to_cb/README.md
<#395 (comment)>:
> +```text
+ To begin with, a PREPBUFR file does not always contain, within each single data subset, the data for an entire report! Instead, for reports which contain mass (i.e. temperature, moisture, etc.) as well as wind (i.e. direction and speed, U and V component, etc.) data values, such data values are stored within two separate but adjacent (within the overall file) data subsets, where each related subset, quite obviously, contains the same report time, location, station identification, etc. information as the other, but where the "mass" subset contains the pressures and/or height levels at which "mass" data values occur, while the corresponding "wind" subset contains the levels at which "wind" data values occur. While it is true that this may, in some cases, cause the same pressure and/or height level to appear in both subsets, this separation is nonetheless maintained for historical reasons peculiar to NCEP. At any rate, the below program will actually merge all of the data from both subsets into a single, unified report in such cases, so that the final decoded output is clearer and more intuitive.
+```
Using a code block for this confused me for a minute as to whether you were
quoting a source document or adding an aside. If it's a quote, you can use
Markdown's "quote" syntax (>) and do:
⬇️ Suggested change
-```text
- To begin with, a PREPBUFR file does not always contain, within each single data subset, the data for an entire report! Instead, for reports which contain mass (i.e. temperature, moisture, etc.) as well as wind (i.e. direction and speed, U and V component, etc.) data values, such data values are stored within two separate but adjacent (within the overall file) data subsets, where each related subset, quite obviously, contains the same report time, location, station identification, etc. information as the other, but where the "mass" subset contains the pressures and/or height levels at which "mass" data values occur, while the corresponding "wind" subset contains the levels at which "wind" data values occur. While it is true that this may, in some cases, cause the same pressure and/or height level to appear in both subsets, this separation is nonetheless maintained for historical reasons peculiar to NCEP. At any rate, the below program will actually merge all of the data from both subsets into a single, unified report in such cases, so that the final decoded output is clearer and more intuitive.
-```
+> To begin with, a PREPBUFR file does not always contain, within each single data subset, the data for an entire report! Instead, for reports which contain mass (i.e. temperature, moisture, etc.) as well as wind (i.e. direction and speed, U and V component, etc.) data values, such data values are stored within two separate but adjacent (within the overall file) data subsets, where each related subset, quite obviously, contains the same report time, location, station identification, etc. information as the other, but where the "mass" subset contains the pressures and/or height levels at which "mass" data values occur, while the corresponding "wind" subset contains the levels at which "wind" data values occur. While it is true that this may, in some cases, cause the same pressure and/or height level to appear in both subsets, this separation is nonetheless maintained for historical reasons peculiar to NCEP.
I also removed the last sentence since I originally took it to refer to our
ingest, when it actually refers to Fortran code displayed on the website.
------------------------------
In src/vxingest/prepbufr_to_cb/README.md
<#395 (comment)>:
> +I'm only putting this here temporarily so that I don't lose it before it gets implemented.
+RUC domain
+RRFS North American domain
+Great Lakes
+Global (all lat/lon)
+Tropics (-20 <= lat <= +20)
+Southern Hemisphere (-80 <= lat < -20)
+Northern Hemisphere (+20 < lat <= +80)
+Arctic (lat >= +70) -- Might want to change this to lat >= 60N to match EMC?
+Antarctic (lat <= -70) -- Might want to change this to lat <= 60S to match EMC?
+Alaska
+Hawaii
+HRRR domain
+Eastern HRRR domain
+Western HRRR domain
+CONUS
+Eastern CONUS (lon <= 100W)
+Western CONUS (lon <= 100W)
+Northeastern CONUS
+Southeastern CONUS
+Central CONUS
+Southern CONUS
+Northwest CONUS
+Southern Plain
Markdown treats single newlines as part of a paragraph so this renders
weirdly. If this is still needed, the following will render better:
⬇️ Suggested change
-I'm only putting this here temporarily so that I don't lose it before it gets implemented.
-RUC domain
-RRFS North American domain
-Great Lakes
-Global (all lat/lon)
-Tropics (-20 <= lat <= +20)
-Southern Hemisphere (-80 <= lat < -20)
-Northern Hemisphere (+20 < lat <= +80)
-Arctic (lat >= +70) -- Might want to change this to lat >= 60N to match EMC?
-Antarctic (lat <= -70) -- Might want to change this to lat <= 60S to match EMC?
-Alaska
-Hawaii
-HRRR domain
-Eastern HRRR domain
-Western HRRR domain
-CONUS
-Eastern CONUS (lon <= 100W)
-Western CONUS (lon <= 100W)
-Northeastern CONUS
-Southeastern CONUS
-Central CONUS
-Southern CONUS
-Northwest CONUS
-Southern Plain
+I'm only putting this here temporarily so that I don't lose it before it gets implemented.
+
+* RUC domain
+* RRFS North American domain
+* Great Lakes
+* Global (all lat/lon)
+* Tropics (-20 <= lat <= +20)
+* Southern Hemisphere (-80 <= lat < -20)
+* Northern Hemisphere (+20 < lat <= +80)
+* Arctic (lat >= +70) -- Might want to change this to lat >= 60N to match EMC?
+* Antarctic (lat <= -70) -- Might want to change this to lat <= 60S to match EMC?
+* Alaska
+* Hawaii
+* HRRR domain
+* Eastern HRRR domain
+* Western HRRR domain
+* CONUS
+* Eastern CONUS (lon <= 100W)
+* Western CONUS (lon <= 100W)
+* Northeastern CONUS
+* Southeastern CONUS
+* Central CONUS
+* Southern CONUS
+* Northwest CONUS
+* Southern Plain
------------------------------
In src/vxingest/prepbufr_to_cb/README.md
<#395 (comment)>:
> +HRRR domain
+Eastern HRRR domain
+Western HRRR domain
+CONUS
+Eastern CONUS (lon <= 100W)
+Western CONUS (lon <= 100W)
+Northeastern CONUS
+Southeastern CONUS
+Central CONUS
+Southern CONUS
+Northwest CONUS
+Southern Plain
+
+## Ingest template
+The ingest template for prepbufr RAOBS is "MD:V01:RAOB:obs:ingest:prepbufr".
+It follows the same small Domain Specific Language (DSL) that all ingest templates follow. This is the template portion...
For my knowledge - do we have the DSL documented somewhere? You've
explained the syntax with *, &, and | but I keep forgetting it, so it'd
be handy to have a reference. I was thinking it'd be nice to have a link to
that reference here.
If not, I'll make an issue to document it in something like a
docs/ingest-dsl.md file and then link to it where appropriate.
------------------------------
In src/vxingest/prepbufr_to_cb/README.md
<#395 (comment)>:
> +There are four sections of mappings.
+1 header basic header data like lat, lon, and station name
+2 q_marker quality data
+3 obs_err observation error data
+4 obs_data_120 observation MASS data
+5 obs_data_220 observation WIND data
This also renders weirdly due to how Markdown handles newlines. If it's an
ordered list, I'd recommend:
⬇️ Suggested change
-There are four sections of mappings.
-1 header basic header data like lat, lon, and station name
-2 q_marker quality data
-3 obs_err observation error data
-4 obs_data_120 observation MASS data
-5 obs_data_220 observation WIND data
+There are five sections of mappings.
+
+1. `header` basic header data like lat, lon, and station name
+2. `q_marker` quality data
+3. `obs_err` observation error data
+4. `obs_data_120` observation MASS data
+5. `obs_data_220` observation WIND data
Otherwise, you could do a code block or a table to preserve the formatting:
⬇️ Suggested change
-There are four sections of mappings.
-1 header basic header data like lat, lon, and station name
-2 q_marker quality data
-3 obs_err observation error data
-4 obs_data_120 observation MASS data
-5 obs_data_220 observation WIND data
+```text
+There are five sections of mappings.
+1 header basic header data like lat, lon, and station name
+2 q_marker quality data
+3 obs_err observation error data
+4 obs_data_120 observation MASS data
+5 obs_data_220 observation WIND data
+```
------------------------------
On src/vxingest/prepbufr_to_cb/run_ingest_threads.py
<#395 (comment)>:
A general note, we'll need to incorporate this into main.py.
We'll import the prepbufr_to_cb module here with an identifiable name
like PrepbufrIngest if this is general to PREPBUFR files, or
PrepbufrRaobIngest if we'll need to add a new module for other PREPBUFR
data types:
https://github.com/NOAA-GSL/VxIngest/blob/b43bb43838716d365eb19d76189d9ae40a4a395b/src/vxingest/main.py#L26-L35
And then, it's not my favorite approach, but we'll need to add an entry to
our large switch statement, here:
https://github.com/NOAA-GSL/VxIngest/blob/b43bb43838716d365eb19d76189d9ae40a4a395b/src/vxingest/main.py#L404-L467
We also may need to update the CLI flags if we have custom flags in the
code here.
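To sketch what I mean (a rough illustration only - the `create_builder` helper and the `"prepbufr"` dispatch key are assumptions, and the import just mirrors how main.py aliases the other ingest modules):

```python
# Hypothetical sketch of wiring the new builder into main.py.
# The function and key names are assumptions for illustration.
from vxingest.prepbufr_to_cb.run_ingest_threads import VXIngest as PrepbufrRaobIngest


def create_builder(ingest_type: str):
    # ...existing branches for the other ingest types would live here...
    if ingest_type == "prepbufr":
        return PrepbufrRaobIngest()
    raise ValueError(f"Unknown ingest type: {ingest_type}")
```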
------------------------------
On src/vxingest/prepbufr_to_cb/prepbufr_builder.py
<#395 (comment)>:
Generally, I'm concerned about how deeply nested parts of the
prepbufr_to_cb module are (e.g. interpolate_data and handle_document).
Deeply nested code has been shown to be difficult to reason about and debug
and is mentioned in the Zen of Python - "Flat is better than nested."
<https://peps.python.org/pep-0020/> Typically, folks recommend keeping
nesting to 2-3 levels deep. They deal with deeper nesting via guard clauses
and extracting functions (see the sketch after the links below). If you'd like to refactor this, I'd find it
really useful to pair with you on it. Selfishly, I'd love to better
understand the ingest code & PREPBUFR data.
Some resources:
- Google's Code Health Blog: Nesting guidance
<https://testing.googleblog.com/2017/06/code-health-reduce-nesting-reduce.html?m=1>
- Google's Code Health Blog: Simplifying Control flow
<https://testing.googleblog.com/2023/10/simplify-your-control-flows.html?m=1>
- Jeff Atwood on "Flattening arrow code"
<https://blog.codinghorror.com/flattening-arrow-code/>
- Two methods for refactoring deeply nested code
<https://shuhanmirza.medium.com/two-simple-methods-to-refactor-deeply-nested-code-78eb302bb0b4>
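For illustration, here's a generic before/after with stand-in names (not code from this PR) showing how a guard clause flattens one level of nesting:

```python
def process(level):
    # Stand-in for the real per-level work
    print(level)


# Nested version: each condition adds a level of indentation.
def handle(report):
    if report is not None:
        if report.levels:
            for level in report.levels:
                process(level)


# Flattened version: the guard clause exits early,
# so the main logic stays at a single indentation level.
def handle_flat(report):
    if report is None:
        return
    for level in report.levels:
        process(level)
```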
------------------------------
In src/vxingest/prepbufr_to_cb/prepbufr_builder.py
<#395 (comment)>:
> + # I cannot process this station - there is no array of pressure data
+ del interpolated_data[station]
Do we want to remove the station from the data if it can't be processed, or
do we want to set it to None? del has its uses, but it's rare enough to see
in Python that this caught my eye.
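For comparison, a tiny sketch of the two options (the dictionary contents are made up; station IDs borrowed from the test data):

```python
# Made-up stand-in for the builder's interpolated data.
interpolated_data = {"72393": {"pressure": []}, "70026": {"pressure": [1000.0]}}
station = "72393"

# Current approach: drop the station entirely; later iteration never sees it.
del interpolated_data[station]

# Alternative: keep the key with a sentinel so consumers can tell
# "station skipped" apart from "station never present".
# interpolated_data[station] = None
```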
------------------------------
In src/vxingest/prepbufr_to_cb/prepbufr_builder.py
<#395 (comment)>:
> + :return: the document_map
+ """
+ try:
+ if len(self.same_time_rows) != 0:
+ self.handle_document()
+ return self.document_map
+ except Exception as _e:
+ logger.exception(
+ "%s get_document_map: Exception in get_document_map: %s",
+ self.__class__.__name__,
+ str(_e),
+ )
+ return None
+
+ # named functions
+ def meterspersecond_to_milesperhour(self, params_dict):
Two thoughts:
1. Can we utilize Pint for these conversions?
2. If not, should we move these "data transformation" functions into a
separate module so we can use them across the various ingest types?
Bigger picture, and more as a discussion point, I wonder if we should make
more use of Pint within the new Ingest to help wrangle units. If we did
want to do that, it'd be handled in a separate issue.
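For reference, a minimal sketch of what option 1 could look like (assumes adding pint as a dependency, which is not currently in pyproject.toml):

```python
# Minimal Pint sketch - illustrative, not the PR's implementation.
import pint

ureg = pint.UnitRegistry()


def meterspersecond_to_milesperhour(value_ms: float) -> float:
    """Convert a wind speed from m/s to mph via Pint's unit registry."""
    return (value_ms * ureg("m/s")).to("mph").magnitude


print(meterspersecond_to_milesperhour(10.0))  # ~22.37
```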
------------------------------
In tests/vxingest/README.md
<#395 (comment)>:
> @@ -40,9 +40,9 @@ Note that this currently (as of 1/2024) disables most of the tests.
## Test data
-For now, you'll need test resources from: https://drive.google.com/drive/folders/18YY74S8w2S0knKQRN-QxZdnfRjKxDN69?usp=drive_link unpacked to `/opt/data` in order to run the test suite.
+For now, you'll need test resources from: [opt_data.tar.gz](https://drive.google.com/drive/folders/18YY74S8w2S0knKQRN-QxZdnfRjKxDN69?usp=drive_linkunpacked) to '/opt/data' in order to run the test suite.
Looks like "unpacked" was added to the URL here and that made it invalid:
⬇️ Suggested change
-For now, you'll need test resources from: [opt_data.tar.gz](https://drive.google.com/drive/folders/18YY74S8w2S0knKQRN-QxZdnfRjKxDN69?usp=drive_linkunpacked) to '/opt/data' in order to run the test suite.
+For now, you'll need test resources from: [opt_data.tar.gz](https://drive.google.com/drive/folders/18YY74S8w2S0knKQRN-QxZdnfRjKxDN69) to '/opt/data' in order to run the test suite.
------------------------------
In tests/vxingest/prepbufr_to_cb/test_unit_prepbufr_builder.py
<#395 (comment)>:
> +def test_read_header(mock_header_bufr):
+ # Create an instance of PrepbufrBuilder
+ builder = PrepbufrRaobsObsBuilderV01(
+ None,
+ {
+ "template": {"subset": "RAOB"},
+ "ingest_document_ids": {},
+ "file_type": "PREPBUFR",
+ "origin_type": "GDAS",
+ "mnemonic_mapping": hdr_template,
+ },
+ )
+
+ # Call the read_header method with the mock bufr object
+ header_data = builder.read_data_from_bufr(mock_header_bufr, hdr_template)
+
+ # Assert the expected values
+ assert header_data["station_id"] == "SID123"
+ assert header_data["lon"] == 45.679
+ assert header_data["lat"] == -123.457
+ assert header_data["obs-cycle_time"] == 0.5
+ assert header_data["station_type"] == 1
+ assert header_data["elevation"] == 100.0
+ assert header_data["report_type"] == 2
I like the way the data is mocked out here!
------------------------------
In tests/vxingest/prepbufr_to_cb/test_int_read_data_from_file.py
<#395 (comment)>:
> +def test_read_header():
+ queue_element = (
+ "/opt/data/prepbufr_to_cb/input_files/241011200.gdas.t12z.prepbufr.nr"
+ )
+ vx_ingest = setup_connection()
+ ingest_doc = vx_ingest.collection.get("MD:V01:RAOB:obs:ingest:prepbufr").content_as[
+ dict
+ ]
+ template = ingest_doc["mnemonic_mapping"]
+ builder = PrepbufrRaobsObsBuilderV01(
+ None,
+ ingest_doc,
+ )
+
+ bufr = ncepbufr.open(queue_element)
+ bufr.advance()
+ assert bufr.msg_type == template["bufr_msg_type"], "Expected ADPUPA message type"
+ bufr.load_subset()
+ header = builder.read_data_from_bufr(bufr, template["header"])
+ bufr.close()
+ assert header is not None
+ assert header["station_id"] == "89571"
+ assert header["lon"] == 77.97
+ assert header["lat"] == -68.58
+ assert header["obs-cycle_time"] == -0.5
+ assert header["elevation"] == 18.0
+ assert header["data_dump_report_type"] == 11.0
+ assert header["report_type"] == 120
We may need to mark these as integration tests for CI with
@pytest.mark.integration().
@pytest.mark.integration()
def test_read_header():
    ...
------------------------------
In src/vxingest/grib2_to_cb/grib_builder.py
<#395 (comment)>:
> +class RaobModelNativeBuilderV01(GribModelBuilderV01):
+ """This is the builder for model data that is ingested from grib2 NATIVE levels files.
+ It is a concrete builder specifically for the model raob data that are organized based
+ on the models preset vertical levels. This varies quite a bit from model to model
+ and is dependent on the configuration set up before the model runs.
+ This builder is a subclass of the GribModelBuilderV01 class.
+ The primary differences in these two classes are the handlers that derive the pressure level.
+ The pressure level needs to be interpolated according to a specific algorithm.
+
+ Args:
+ load_spec (Object): The load spec used to init the parent
+ ingest_document (Object): the ingest document
+ number_stations (int, optional): the maximum number of stations to process (for debugging). Defaults to sys.maxsize.
We discussed this during an in-person meeting - these are placeholder
classes for extracting model data to compare with the RAOB obs data.
Co-authored-by: Ian McGinnis <67600557+ian-noaa@users.noreply.github.com>
good catch
…ub.com/NOAA-GSL/VxIngest into implement_prepbufr_RaobsObsBuilder_185
I added a couple notes based on your emails from Friday. It looks like you've resolved most of the lint issues but let me know if you have questions about the linter!
I have resolved the lint issues, but the tests are failing because CI cannot
find the ncepbufr module. I realize that I need to log onto AWS and build
the x86 Linux version, but I don't have access to an x86 Mac (mine is ARM)
to build that version. Even when I build the new module, I don't remember
how CI finds the ncepbufr module - that still has me slightly confused.
I think I will also add an integration test that has main.py as its entry
point so no one will make the mistake of not integrating a new builder into
main.py again.
randy
…On Mon, Aug 5, 2024 at 9:27 AM Ian McGinnis wrote:
------------------------------
In docker/ingest/Dockerfile
<#395 (comment)>:
> -FROM python:3.11-slim-bookworm AS builder
+FROM python:3.12-slim-bookworm AS builder
We have a number of layers in this Dockerfile so this will also need to be
done on Line 6 & Line 72.
------------------------------
In src/vxingest/prepbufr_to_cb/vx_ingest_manager.py
<#395 (comment)>:
> + stmnt_mysql = f'select wmoid,press,z,t,dp,rh,wd,ws from ruc_ua_pb.RAOB where date = "{date}" and press = {level} and wmoid = "{station}";'
+ _mysql_db = mysql.connector.connect(
+ host=self.load_spec["_mysql_host"],
+ user=self.load_spec["_mysql_user"],
+ password=self.load_spec["_mysql_pwd"],
+ )
+ my_cursor = _mysql_db.cursor()
+ my_cursor.execute(stmnt_mysql)
+ my_result_final = my_cursor.fetchall()
I missed this on my initial walkthrough - do we need to connect to a MySQL
database? I thought VxIngest was Couchbase-only so that we could move away
from our old MySQL DB.
And - if we do need that data, is there any other possible way to get it?
I'd really like to avoid making the new ingest dependent on the old ingest.
…raded numpy (don't have a platform to rebuild it.)
Thanks! I'll try this out and let you know. I thought Poetry followed the
Python markers. Where, BTW, did you find the reference to PEP 508 markers?
randy
…On Tue, Aug 6, 2024 at 10:38 AM Ian McGinnis wrote:
------------------------------
In pyproject.toml
<#395 (comment)>:
> +ncepbufr = [
+ { platform = "linux_x86_64", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.0.1-py312-none-linux_x86_64.whl" },
+ { platform = "macosx_14_0_arm64", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.0.1-py312-none-macosx_14_0_arm64.whl" }
+]
It looks like Poetry's platform field takes one of darwin, linux, or win32.
To get at the architecture, Poetry supports PEP 508 markers
<https://peps.python.org/pep-0508/#environment-markers>. You'd do markers
= "platform_machine == 'arm64'" for ARM, for example.
I also bumped the version number to match the new nceplibs version in the
recommendation below.
⬇️ Suggested change
-ncepbufr = [
- { platform = "linux_x86_64", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.0.1-py312-none-linux_x86_64.whl" },
- { platform = "macosx_14_0_arm64", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.0.1-py312-none-macosx_14_0_arm64.whl" }
-]
+ncepbufr = [
+ { platform = "linux", markers = "platform_machine == 'x86_64'", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.1.0-py312-none-linux_x86_64.whl" },
+ { platform = "darwin", markers = "platform_machine == 'arm64'", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.1.0-py312-none-macosx_14_0_arm64.whl" }
+]
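Incidentally, if you want to sanity-check a marker string locally, the packaging library can evaluate it against the current environment (a small sketch; assumes packaging is installed, which Poetry itself depends on):

```python
from packaging.markers import Marker

marker = Marker("platform_machine == 'arm64'")
# Evaluates against the running interpreter's own environment:
print(marker.evaluate())  # True on Apple Silicon macOS, False on x86_64
```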
Certainly! Poetry references them in their Dependency Specification guide.
Now that I'm back from leave - what's the status on this PR? Last I recall, we had discovered we needed to install nceplibs-bufr.
It'd be good to get this finished and merged before we get pulled away onto other issues.
It looks like I had a few pending PR comments around the Dockerfile & nceplibs-bufr version so I'm adding those here to make sure they don't get forgotten.
I have discovered two additional discrepancies in the data results compared
to the legacy system, which is based on Ming's data assimilation code. One
had to do with q-markers, and the handling for that has now been
implemented. The other is more subtle: it involves multiple entries of type
120 and type 220 reports within a single GDAS prepbufr file. This must be
understood before we consider this complete.
…On Wed, Aug 21, 2024 at 10:58 AM Ian McGinnis wrote:
ncepbufr = [
    { platform = "linux", markers = "platform_machine == 'x86_64'", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.1.0-py312-none-linux_x86_64.whl" },
    { platform = "darwin", markers = "platform_machine == 'arm64'", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.1.0-py312-none-macosx_14_0_arm64.whl" }
]
We should do ARM Linux as well - it's an important platform in AWS:
⬇️ Suggested change
-ncepbufr = [
-    { platform = "linux", markers = "platform_machine == 'x86_64'", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.1.0-py312-none-linux_x86_64.whl" },
-    { platform = "darwin", markers = "platform_machine == 'arm64'", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.1.0-py312-none-macosx_14_0_arm64.whl" }
-]
+ncepbufr = [
+    { platform = "linux", markers = "platform_machine == 'x86_64'", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.1.0-py312-none-linux_x86_64.whl" },
+    { platform = "linux", markers = "platform_machine == 'aarch64'", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.1.0-py312-none-linux_aarch64.whl" },
+    { platform = "darwin", markers = "platform_machine == 'arm64'", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.1.0-py312-none-macosx_14_0_arm64.whl" }
+]
-json_data = json.dumps(list(document_map.values()))
+json_data = json.dumps(list(document_map.values()), indent=2)
Do we want to pretty-print the JSON when writing it out? It'll take up extra space compared to the minified version.
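As a rough illustration of the size difference (standard-library json only; the sample data is made up, with field values borrowed from the test assertions above):

```python
import json

data = [{"station_id": "89571", "lat": -68.58, "lon": 77.97}] * 100

minified = json.dumps(data)
pretty = json.dumps(data, indent=2)
print(len(minified), len(pretty))  # the pretty-printed output is noticeably larger
```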
def parse_args(self, args):
    """This method is intended to be overridden"""
    """This method is intended to be overridden"""
Looks like the docstring was accidentally duplicated
(pathlib.PurePath(filename).name).split(".")[0],
file_mask,
Pathlib has a handy .stem method for extracting the filename without the extension:
⬇️ Suggested change
-(pathlib.PurePath(filename).name).split(".")[0],
-file_mask,
+pathlib.Path(filename).stem,
+file_mask,
I believe we still need the new 12.1.0 libraries built & added to third_party.
def test_one_thread_specify_file_pattern(tmp_path: Path):
    """Note: this test takes a long time to run (few minutes)"""
    try:
        log_queue = Queue()
        vx_ingest = setup_connection()
        # # stations = [
        # #     "70026",
        # #     "72393",
        # #     "74794",
        # #     "71119",
        # #     "76225",
        # #     "76256",
        # #     "76458",
        # #     "76526",
        # #     "76595",
        # #     "76612",
        # #     "76644",
        # #     "76654",
        # #     "76679",
        # #     "76692",
        # #     "76743",
        # #     "76903",
        # #     "78384",
        # #     "78397",
        # #     "78486",
        # #     "78526",
        # #     "78583",
        # #     "78954",
        # #     "78970",
        # #     "82022",
        # #     "82026",
        # #     "82099",
        # #     "82107",
        # #     "82193",
        # #     "82244",
        # #     "82332",
        # #     "82411",
        # #     "82532",
        # #     "82599",
        # #     "82705",
        # # ]
        # print("Testing stations: ", stations)
        # print(f"output path is: {tmp_path}")
        # vx_ingest.write_data_for_station_list = stations
        # vx_ingest.write_data_for_levels = [200, 300, 500, 700, 900]
        try:
            vx_ingest.runit(
                {
                    "job_id": "JOB-TEST:V01:RAOB:PREPBUFR:OBS",
                    "credentials_file": os.environ["CREDENTIALS"],
                    "file_name_mask": "%y%j%H%M",  # only tests the first part of the file name i.e. 241011200.gdas.t12z.prepbufr.nr -> 241011200
                    "output_dir": f"{tmp_path}",
                    "threads": 1,
                    "file_pattern": "242130000*",  # specifically /opt/data/prepbufr_to_cb/input_files/242130000.gdas.t00z.prepbufr.nr,
                    # "file_pattern": "242131200*",  # specifically /opt/data/prepbufr_to_cb/input_files/242131200.gdas.t00z.prepbufr.nr,
                    # "file_pattern": "242121800*",  # specifically /opt/data/prepbufr_to_cb/input_files/242121800.gdas.t00z.prepbufr.nr,
                    # "file_pattern": "241570000*",  # specifically /opt/data/prepbufr_to_cb/input_files/241570000.gdas.t00z.prepbufr.nr,
                },
                log_queue,
                stub_worker_log_configurer,
            )
        except Exception as e:
            raise AssertionError(f"Exception: {e}") from e
        # Test that we have one or more output files
        output_file_list = list(
            tmp_path.glob(
                "[0123456789]????????.gdas.t[0123456789][0123456789]z.prepbufr.nr.json"
            )
        )

        # Test that we have one "load job" ("LJ") document
        lj_doc_regex = (
            "LJ:RAOB:vxingest.prepbufr_to_cb.run_ingest_threads:VXIngest:*.json"
        )
        num_load_job_files = len(list(tmp_path.glob(lj_doc_regex)))
        assert (
            num_load_job_files >= 1
        ), f"Number of load job files is incorrect {num_load_job_files} is not >= 1"

        # Test that we have one output file per input file
        input_path = Path("/opt/data/prepbufr_to_cb/input_files")
        num_input_files = len(list(input_path.glob("242130000*")))
        # num_input_files = len(list(input_path.glob("242131200*")))
        # num_input_files = len(list(input_path.glob("242121800*")))
        # num_input_files = len(list(input_path.glob("241011200*")))
        num_output_files = len(output_file_list)
I suggest cleaning up the commented-out code here.
We should check and make sure all FROM lines are using the same version of
Python - the "builder" image is still using 3.11 here.
def write_data_for_debug(self, builder, document_map):
    """
    write the raw data and interpolated for a specific set of stations for debugging purposes
    """
Do we still need the debug function? If not, we can eliminate the tabulate
& mysql.connector dependencies, which would be good from a security
standpoint.
Otherwise, if we do want to keep this functionality around, it'd be good to
move it to a test suite or utility script so that we can move tabulate &
mysql.connector from production to dev dependencies in our pyproject.toml
file.
_val = (
    round(float(line.split()[1]))
    if line.split()[1] != "MISSING"
    else None
)
This pattern is repeated a lot throughout this script - it should be extracted into a function so it's easier to update & maintain.
def process_line(line: str) -> float | None:
"""
Could use a better name & docstring
Assumes a line that looks like: <some example input>
"""
parts = line.split()
value = parts[1]
return round(float(value)) if value != "MISSING" else None
parser.add_argument(
    "--station_list",
    type=list,
    required=False,
    default=[],
    help="The list of station ids for a RAOB prepbufr diagnostic report. Default is [].",
)
parser.add_argument(
    "--levels_list",
    type=list,
    required=False,
    default=[],
    help="The list of levels for a RAOB prepbufr diagnostic report. Default is [].",
)
These two CLI flags appear to be unused. We should remove them, if so.
This is the implementation of the prepbufr builder.