Support Non-Federal Data Harvesting #4870

Open
2 tasks
Jin-Sun-tts opened this issue Aug 29, 2024 · 1 comment
Labels
bug (Software defect or bug), H2.0/Harvest-Runner (Harvest Source Processing for Harvesting 2.0)

Comments

@Jin-Sun-tts
Contributor

We need to enhance the current data harvesting system to support non-federal data sources. This means either adding fields to the harvest_source table that distinguish federal from non-federal sources, or using the existing schema_type field to make that distinction.

How to reproduce

When harvesting a non-federal data source, such as NYC Data.json, every record fails validation, so no records are harvested.

Expected behavior

The bureauCode field should not be validated when processing non-federal data.

Actual behavior

validation error: <ValidationError: "'bureauCode' is a required property">

Sketch

  • Modify the harvest_source table to include a field indicating whether the source is federal or non-federal.
  • Update the harvester process to support the processing of non-federal data.
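The two steps above can be sketched as follows. This is a minimal, stdlib-only illustration, not the harvester's actual code: the `HarvestSource` class, the `"federal"` / `"non-federal"` values for `schema_type`, the trimmed-down required-field lists, and the `missing_required` helper are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical schema_type values; the real harvest_source values may differ.
FEDERAL = "federal"
NON_FEDERAL = "non-federal"

# Trimmed-down stand-ins for the "required" lists of the two schemas.
# Only bureauCode differs here, matching the error reported in this issue.
REQUIRED_FIELDS = {
    FEDERAL: ["title", "description", "identifier", "bureauCode"],
    NON_FEDERAL: ["title", "description", "identifier"],
}


@dataclass
class HarvestSource:
    """Hypothetical stand-in for a row of the harvest_source table."""
    name: str
    schema_type: str


def missing_required(source: HarvestSource, record: dict) -> list:
    """Return the required fields absent from record for this source's type."""
    required = REQUIRED_FIELDS[source.schema_type]
    return [field for field in required if field not in record]


# A record without bureauCode, as a non-federal publisher might provide it.
record = {"title": "Example", "description": "Example dataset", "identifier": "abc-1"}

nyc = HarvestSource(name="NYC Data.json", schema_type=NON_FEDERAL)
fed = HarvestSource(name="Some federal source", schema_type=FEDERAL)

print(missing_required(nyc, record))  # no missing fields for a non-federal source
print(missing_required(fed, record))  # bureauCode is still enforced for federal
```

Keying the check on `schema_type` avoids a table migration; a dedicated federal/non-federal column would work the same way, with the lookup key changed.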
@Jin-Sun-tts added the bug and H2.0/Harvest-Runner labels on Aug 29, 2024
@jbrown-xentity
Contributor

We have a non-federal schema, but its JSON Schema version is very old. @rshewitt went through the process of upgrading the federal-v1.1 version, and it is in the datagov-harvester here. I compared the old federal and non-federal schemas; the differences mostly consist of allowing REDACTED in the federal version, plus some namespace convention changes that are mostly unnecessary.
I propose starting from the updated federal version and making fields optional, using best judgement based on the DCAT-US spec documentation. For example, bureauCode is listed as "required" in the summary area, but the details say it is required "Yes, for United States Federal Government agencies". So for non-federal validation, it should be optional.
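One way to read this proposal is as a small transformation of the federal schema. The sketch below assumes the schema is a plain dict with a top-level `required` list; the `relax_for_non_federal` helper and the tiny `federal_schema` dict are hypothetical, and the real federal-v1.1 schema is much larger and may nest its `required` lists differently.

```python
import copy

# Fields to make optional for non-federal sources. Only bureauCode is named in
# this issue; any further fields would be a judgement call against the spec.
FEDERAL_ONLY_FIELDS = ("bureauCode",)


def relax_for_non_federal(federal_schema: dict, optional_fields=FEDERAL_ONLY_FIELDS) -> dict:
    """Return a copy of the schema with the given fields dropped from 'required'.

    Property definitions are kept, so a non-federal record that does supply
    bureauCode is still checked against the same rules.
    """
    schema = copy.deepcopy(federal_schema)
    schema["required"] = [
        field for field in schema.get("required", []) if field not in optional_fields
    ]
    return schema


# Tiny illustrative stand-in for the upgraded federal schema.
federal_schema = {
    "type": "object",
    "required": ["title", "description", "identifier", "bureauCode"],
    "properties": {
        "title": {"type": "string"},
        "bureauCode": {"type": "array", "items": {"type": "string"}},
    },
}

non_federal_schema = relax_for_non_federal(federal_schema)
print(non_federal_schema["required"])
```

Deriving the non-federal schema at load time, rather than maintaining a second file by hand, would keep the two from drifting apart, though a checked-in non-federal schema file may be easier to review.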

Projects
Status: H2.0 Backlog
Development

No branches or pull requests

2 participants