
Fix(yaml): Handle missing optional fields in JSON parsing #35288


Merged: 5 commits into apache:master on Jun 24, 2025


@liferoad (Contributor) commented Jun 14, 2025

Fixes #35179

When using ReadFromPubSub with a schema in Beam YAML, the pipeline would fail with a KeyError if a field specified in the schema was missing from the incoming JSON message.

This commit fixes the issue by modifying the json_to_row function in apache_beam/yaml/json_utils.py. The direct dictionary access value[name] is replaced with value.get(name) to safely handle missing keys, returning None instead of raising an error.

The converters for array, map, and row types have also been made robust to handle None values, which can occur for missing optional fields of these complex types.
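A minimal, self-contained sketch of the behavior change described above. The function names (convert_strict, convert_lenient, convert_array) are hypothetical stand-ins written for illustration; they are not the actual code in apache_beam/yaml/json_utils.py. Only the contrast between value[name] and value.get(name), and the idea of passing None through the complex-type converters, come from the PR description.

```python
# Hypothetical stand-ins for illustration; not Beam's actual json_to_row.

def convert_strict(value, field_names):
    # Direct indexing raises KeyError when a field is absent from the message.
    return {name: value[name] for name in field_names}


def convert_lenient(value, field_names):
    # dict.get returns None for an absent key instead of raising.
    return {name: value.get(name) for name in field_names}


def convert_array(values, element_converter):
    # A missing optional array arrives as None, so pass it through unchanged.
    if values is None:
        return None
    return [element_converter(v) for v in values]


message = {"id": 1}          # incoming JSON with no "label" field
fields = ["id", "label"]     # schema declares both fields

try:
    convert_strict(message, fields)
    strict_failed = False
except KeyError:
    strict_failed = True

lenient_row = convert_lenient(message, fields)
```

In this sketch the strict variant fails on the incomplete message, while the lenient variant yields a row whose missing field is None.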



@liferoad
Contributor Author

@jonathaningram possible to validate this PR from your side? Feel free to review it as well.

Contributor

Assigning reviewers:

R: @claudevdm for label python.


@jonathaningram jonathaningram left a comment

The code looks right to me and I tested a Dataflow Beam YAML pipeline without this change and with it. I can confirm the key error goes away with this change. I tested a missing Pub/Sub message field but not a missing attribute. I assume it works for both though.

Maybe there could be a corresponding docs update to go with this PR, e.g., tell users what happens if they have missing fields, but leave that with you to decide on.

@liferoad
Contributor Author

> Maybe there could be a corresponding docs update to go with this PR, e.g., tell users what happens if they have missing fields, but leave that with you to decide on.

Good idea. Added this to CHANGES.md.

Contributor

@robertwb robertwb left a comment

Drive-by comment: it'd be nice if there was a test ensuring we still fail for non-optional fields.

@jonathaningram

@robertwb is it even possible to define an optional or non-optional field? As described in the original issue #35179, I couldn't work out how to specify "required-ness" on my schema.

I did notice in the tests in this PR that nullable was being used and I was going to comment on whether that's something that external pipeline authors are meant to be able to configure, but I removed my comment because I decided that maybe the nullable was just to help set up a schema for the tests (and I could prove the fix worked e2e in Dataflow).

@robertwb
Contributor

By default, all properties in a JSON Schema are optional; to declare them otherwise, one uses the required keyword: https://json-schema.org/understanding-json-schema/reference/object#required, which we respect in Beam: https://github.com/apache/beam/blob/release-2.65/sdks/python/apache_beam/yaml/json_utils.py#L67.

This function takes a schema_pb2.FieldType as input and should respect whether the types in question are optional (though I'm not saying it isn't too strict right now).
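To make the required keyword concrete, here is a small sketch that reads a JSON Schema object and reports which properties are optional. The helper name nullable_by_field and the example schema are assumptions made for illustration; this is not how Beam's json_utils.py is implemented, only a plain-Python reading of the JSON Schema rule that properties are optional unless listed under required.

```python
def nullable_by_field(json_schema):
    """Map each property name to True if it is optional (not listed in "required").

    Hypothetical helper for illustration; not part of Apache Beam.
    """
    required = set(json_schema.get("required", []))
    properties = json_schema.get("properties", {})
    return {name: name not in required for name in properties}


# Example schema (an assumption): "id" is required, "label" is optional.
example_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "label": {"type": "string"},
    },
    "required": ["id"],
}

optional_flags = nullable_by_field(example_schema)
```

With this example schema, only label would be treated as optional; omitting the required list entirely would make every property optional.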

@liferoad
Contributor Author

> Drive-by comment: it'd be nice if there was a test ensuring we still fail for non-optional fields.

Good point. The original PR indeed did not enforce this requirement. I updated the code to check for required fields.
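One way the required-field check could look, as a hedged sketch only. The make_row_converter helper, the (name, nullable) pair representation, and the choice of KeyError are all assumptions for illustration; the merged code in json_utils.py works on schema_pb2.FieldType and may differ in structure and in the exception it raises.

```python
def make_row_converter(fields):
    """Build a converter from a list of (name, nullable) pairs.

    Hypothetical simplified schema representation; not Beam's API.
    """
    def convert(value):
        row = {}
        for name, nullable in fields:
            if name in value:
                row[name] = value[name]
            elif nullable:
                # Optional field missing from the message: fill with None.
                row[name] = None
            else:
                # Required field missing: still fail, as the review asked.
                raise KeyError(f"Missing required field: {name}")
        return row
    return convert


converter = make_row_converter([("id", False), ("label", True)])

# Optional field absent: conversion succeeds.
ok_row = converter({"id": 7})

# Required field absent: conversion fails.
try:
    converter({"label": "x"})
    required_enforced = False
except KeyError:
    required_enforced = True
```

A test along these lines covers both halves of the review request: missing optional fields are tolerated, and missing required fields still raise.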

@liferoad liferoad requested review from claudevdm and robertwb June 19, 2025 20:08

codecov bot commented Jun 19, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 54.51%. Comparing base (cecfa61) to head (c2f2553).
Report is 61 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##             master   #35288    +/-   ##
==========================================
  Coverage     54.50%   54.51%            
  Complexity     1559     1559            
==========================================
  Files          1035     1036     +1     
  Lines        161595   161782   +187     
  Branches       1139     1139            
==========================================
+ Hits          88084    88189   +105     
- Misses        71380    71462    +82     
  Partials       2131     2131            
Flag Coverage Δ
python 80.82% <100.00%> (-0.07%) ⬇️


@liferoad liferoad merged commit 4bfa0c9 into apache:master Jun 24, 2025
95 checks passed
shunping pushed a commit to shunping/beam that referenced this pull request Jun 27, 2025
* Fix(yaml): Handle missing optional fields in JSON parsing

* updated the release doc

* check the required fields

* check the nullable at the beginning

* fixed the pickle error
Successfully merging this pull request may close these issues.

[Feature Request]: Allow schema fields that might be null or missing