
8085 decoupling tsv #8086

Closed
poikilotherm wants to merge 12 commits into IQSS:develop from poikilotherm:8085-decoupling-tsv
Conversation

@poikilotherm (Contributor) commented Sep 3, 2021

What this PR does / why we need it:
Testing, reuse in tests, and future features built on the TSV-based metadata schema files all need abstraction and parsing outside the API code, where the conversion currently happens.

This PR changes the underlying infrastructure for this, adds extensive testing, and implements actual checks on restrictions such as no circular dependencies between fields, a maximum nesting depth for compound fields, and so on.
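To make those checks concrete, here is a rough, hypothetical illustration (not code from this PR) of a circular-dependency and maximum-depth check over parent/child field definitions; the map-based representation of the field hierarchy is an assumption.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public final class FieldRestrictionChecks {

    /**
     * Walks the parent chain of a field definition and returns its nesting depth,
     * failing on circular references or when the maximum compound depth is exceeded.
     * parentOf maps a field name to its parent field name (absent for top-level fields).
     */
    static int checkDepth(String field, Map<String, String> parentOf, int maxDepth) {
        Set<String> seen = new HashSet<>();
        int depth = 0;
        String current = field;
        while (current != null) {
            if (!seen.add(current)) {
                throw new IllegalStateException("Circular dependency involving field: " + current);
            }
            if (++depth > maxDepth) {
                throw new IllegalStateException("Field '" + field + "' exceeds max compound depth of " + maxDepth);
            }
            current = parentOf.get(current);
        }
        return depth;
    }
}
```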

This is a blocker ⚡ for #5989

Which issue(s) this PR closes:

Closes #8085

Special notes for your reviewer:
None yet.

Suggestions on how to test this:
Extensive unit and integration testing will be provided.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Nope.

Is there a release notes update needed for this change?:
I don't think so, as it's part of the code infrastructure.

Additional documentation:
None yet. (Might make the docs on metadata blocks more precise in a few places)

@poikilotherm self-assigned this Sep 3, 2021
@coveralls commented Sep 3, 2021

Coverage Status: Coverage increased (+0.1%) to 19.28% when pulling cd6f1e7 on poikilotherm:8085-decoupling-tsv into 8127559 on IQSS:develop.

- Use the Univocity annotations on the model
- Add proper validation restrictions
- Make the column headers (or future mappings) part of the model class
  in a reusable fashion by using an enum plus constants; the enum also
  represents the (TSV column) order and the key values for the parser
- Add Univocity parsing annotations to the setter methods of the model class
- Add validation annotations where necessary
- Add tests for everything
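A minimal sketch of what these bullets describe, with hypothetical class and column names (this is not the PR's actual code): a model class whose column headers live in an enum backed by string constants, with Univocity `@Parsed` bindings on the setters and Bean Validation constraints on the fields.

```java
import com.univocity.parsers.annotations.Parsed;

import javax.validation.constraints.NotBlank;
import javax.validation.constraints.Pattern;

public class DatasetFieldTypeModel {

    /** String constants so the keys can also be used inside annotations (annotation values must be compile-time constants). */
    public static final class Keys {
        public static final String NAME = "name";
        public static final String TITLE = "title";
        public static final String FIELD_TYPE = "fieldType";
        private Keys() {}
    }

    /** The enum carries the TSV column order (its ordinal) and the key value of each column for the parser. */
    public enum Header {
        NAME(Keys.NAME), TITLE(Keys.TITLE), FIELD_TYPE(Keys.FIELD_TYPE);

        private final String key;
        Header(String key) { this.key = key; }
        public String key() { return key; }
    }

    @NotBlank
    @Pattern(regexp = "^[a-zA-Z0-9_]+$")
    private String name;

    @NotBlank
    private String title;

    private String fieldType;

    // As described above, the parsing annotations go on the setter methods, not the fields.
    @Parsed(field = Keys.NAME)
    public void setName(String name) { this.name = name; }

    @Parsed(field = Keys.TITLE)
    public void setTitle(String title) { this.title = title; }

    @Parsed(field = Keys.FIELD_TYPE)
    public void setFieldType(String fieldType) { this.fieldType = fieldType; }

    public String getName() { return name; }
    public String getTitle() { return title; }
    public String getFieldType() { return fieldType; }
}
```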
…QSS#8085

- Add a headers enum, as with MetadataBlock and DatasetFieldType
- Add a placeholder for alternate values that need references to
  the dataset field this controlled vocabulary value (CVV) is part of
- Make the alternatives come from a column with a header
  (this behaviour is currently undocumented!)
- Proper validation, as with the other data bindings
- Includes lots of tests to make sure we notice when it breaks.
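For context, this is roughly how such annotated models are read with Univocity's TSV parser; the helper below is a hedged sketch with made-up names, not code from this branch.

```java
import com.univocity.parsers.common.processor.BeanListProcessor;
import com.univocity.parsers.tsv.TsvParser;
import com.univocity.parsers.tsv.TsvParserSettings;

import java.io.Reader;
import java.util.List;

public final class TsvBeanReader {

    /** Reads every row of a TSV stream (first row = column headers) into instances of the annotated bean class. */
    public static <T> List<T> read(Reader tsv, Class<T> beanClass) {
        BeanListProcessor<T> processor = new BeanListProcessor<>(beanClass);
        TsvParserSettings settings = new TsvParserSettings();
        settings.setHeaderExtractionEnabled(true); // match columns by header name, i.e. @Parsed(field = ...)
        settings.setProcessor(processor);
        new TsvParser(settings).parse(tsv);        // runs the bean processor over every data row
        return processor.getBeans();
    }
}
```

Usage would then be along the lines of `List<DatasetFieldTypeModel> rows = TsvBeanReader.read(reader, DatasetFieldTypeModel.class);`.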
@bencomp (Contributor) commented Oct 12, 2022

I am amazed by this work. (For Hacktoberfest I was playing with creating unit tests for the DatasetField and -Type classes, but they were not nearly as extensive.)

If I may ask, are you not creating a strong coupling between the "TSV"s and the core data model for fields? What if fields and field types should become editable/creatable individually via the API, or controlled vocabularies should be updated using other formats like SKOS in Turtle/JSON-LD/...? Would the suggested approach prevent those possibilities?
Would a service bean in between the API and the entities be a better spot for parsing the "TSV" files?
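Purely to illustrate that thought (hypothetical names, not existing Dataverse code), such a bean could own the parsing, so the API resource only deals with HTTP and other input formats could be added behind the same interface later:

```java
import javax.ejb.Stateless;

import java.io.Reader;
import java.util.List;

import com.univocity.parsers.tsv.TsvParser;
import com.univocity.parsers.tsv.TsvParserSettings;

@Stateless
public class MetadataBlockImportServiceBean {

    /** API endpoints would call this instead of parsing the TSV inline; other formats could get their own methods. */
    public List<String[]> parseTsvRows(Reader tsv) {
        TsvParserSettings settings = new TsvParserSettings();
        settings.setHeaderExtractionEnabled(true);
        return new TsvParser(settings).parseAll(tsv);
    }
}
```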

I understand that this is not currently worked on, so I don't expect my comment to influence anything soon. I was just wondering :)

@poikilotherm (Contributor, Author) commented

Hey @bencomp, thanks for the 🌹 🌷 💮!

I really had to look at what I did there. And I have to say: I kind of gave up on this, as I also saw (some of) the problems you mentioned. The biggest obstacle, however, was the inability to reuse the TSV parser library with our custom metadata block format.

I initially created this to iron out the problems with the TSVs and the Solr configuration. My other approach, which I hope to continue at some point, can be found in #8320: I started to write a validating TSV-format parser there, as I also found that there is no good and complete spec of our format.
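Not the parser from #8320, but roughly the kind of validation I mean, e.g. failing fast on unknown column headers instead of silently ignoring them (illustrative names only):

```java
import com.univocity.parsers.tsv.TsvParser;
import com.univocity.parsers.tsv.TsvParserSettings;

import java.io.Reader;
import java.util.List;
import java.util.Set;

public final class StrictTsvCheck {

    // Assumed set of allowed headers; in practice this would come from a header enum.
    private static final Set<String> KNOWN_KEYS = Set.of("name", "title", "fieldType");

    /** Parses a TSV stream and rejects it if the header row contains an unknown column. */
    public static List<String[]> parseStrict(Reader tsv) {
        List<String[]> rows = new TsvParser(new TsvParserSettings()).parseAll(tsv);
        if (rows.isEmpty()) {
            throw new IllegalArgumentException("Empty TSV input");
        }
        for (String header : rows.get(0)) {
            if (header == null || !KNOWN_KEYS.contains(header)) {
                throw new IllegalArgumentException("Unknown column header: " + header);
            }
        }
        return rows;
    }
}
```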
