Filter out triples with whitespace object #8

Merged: 2 commits, Apr 13, 2022

@benjay10 benjay10 commented Apr 11, 2022

A more complete backstory:

While editing an agenda point in Gelinkt Notuleren, you can set and change the type of the besluit. Changing this type several times in a row leaves extra spaces in the RDFa data (which is still valid RDFa). For example:

<div ... typeof="besluit:Besluit ext:BesluitNieuweStijl     besluittype:849c66c2-ba33-4ac1-a693-be48d8ac7bc7" ...>

This caused the RDFa parser to find the types:

[ "besluit:Besluit", "ext:BesluitNieuweStijl", " ", " ", " ", "besluittype:849c66c2-ba33-4ac1-a693-be48d8ac7bc7" ]
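
As a rough illustration of how repeated spaces can turn into stray type entries, the Python sketch below compares splitting the `typeof` value on a single space with splitting on runs of whitespace. It is not the RDFa parser used by the service, and the exact stray tokens a real parser emits may differ from the ones produced here (the list above is what was actually observed).

```python
# Hypothetical illustration only; this is not the RDFa parser used by the service.
typeof_value = (
    "besluit:Besluit ext:BesluitNieuweStijl     "
    "besluittype:849c66c2-ba33-4ac1-a693-be48d8ac7bc7"
)

# Splitting on exactly one space produces an empty token for every extra space.
naive_tokens = typeof_value.split(" ")

# Splitting on runs of whitespace keeps only the real types, which matches
# how RDFa treats `typeof`: as a whitespace-separated list.
clean_tokens = typeof_value.split()

# Blank tokens are the ones that end up as the empty IRI further down the line.
blank_tokens = [token for token in naive_tokens if not token.strip()]
```

Any tokenizer that behaves like `naive_tokens` needs a later cleanup step, which is what this PR adds at the triple level.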

and eventually created SPARQL queries like:

INSERT DATA {
  GRAPH <http://mu.semte.ch/graphs/public> {
    besluiten:0ae32940-b70d-11ec-ac2b-ed7850dc94ca a besluit:Besluit ;
      a ext:BesluitNieuweStijl ;
      a <> ;
      a besluittype:849c66c2-ba33-4ac1-a693-be48d8ac7bc7 .
    ...
  }
}

where the empty IRI <> caused the query to fail.

In some situations, the service retries publishing a resource up to 10 times before it permanently fails. Every resource is published using multiple queries, and only one of them would fail, so the data inserted by the successful queries stays in the database and is not rolled back. This could cause 'phantom' uittreksels to show up in the publicatie frontend, 10 to be exact.

This PR hopefully fixes this by filtering out triples with empty (or whitespace-only) objects. A more complete fix would be to use a different RDFa parser that properly handles multiple spaces.
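
The filtering idea can be sketched roughly as follows. This is a minimal Python sketch and not the service's actual code: the `(subject, predicate, object)` tuple shape and the name `drop_blank_objects` are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

# Hypothetical triple shape: (subject, predicate, object), all plain strings.
Triple = Tuple[str, str, str]


def drop_blank_objects(triples: Iterable[Triple]) -> List[Triple]:
    """Keep only triples whose object has at least one non-whitespace character."""
    return [triple for triple in triples if triple[2].strip()]


subject = "besluiten:0ae32940-b70d-11ec-ac2b-ed7850dc94ca"
parsed = [
    (subject, "a", "besluit:Besluit"),
    (subject, "a", "ext:BesluitNieuweStijl"),
    (subject, "a", " "),  # stray type produced by the extra spaces
    (subject, "a", "besluittype:849c66c2-ba33-4ac1-a693-be48d8ac7bc7"),
]

publishable = drop_blank_objects(parsed)
```

With the blank object removed before the query is built, no `a <>` statement is generated.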

After parsing RDFa, some triples had objects consisting of only
whitespace, which would lead to incorrect SPARQL queries, blocking the
publishing of certain besluiten.
@benjay10 benjay10 added the bug Something isn't working label Apr 11, 2022
@benjay10 benjay10 requested a review from nvdk April 11, 2022 16:17
@benjay10 benjay10 requested a review from Asergey91 April 11, 2022 23:01

@nvdk nvdk left a comment

this will also drop empty literals, which aren't really an issue... but I guess don't really add value either.

You could consider filtering on datatype to only remove empty resources

Thanks to Niels for pointing out that other values can also be empty or
whitespace and that those should be preserved.
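
One way to picture the narrower filter: drop a triple only when its object is a resource (an IRI) that is empty or whitespace, and leave literals untouched. The Python sketch below uses a hypothetical `Triple` type with an `object_is_iri` flag standing in for whatever datatype information the parser provides. The predicate `ext:example` is also made up for the example, and the merged code may look quite different.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Triple:
    """Hypothetical triple shape; the flag stands in for the parser's datatype info."""
    subject: str
    predicate: str
    obj: str
    object_is_iri: bool


def drop_blank_resources(triples: Iterable[Triple]) -> List[Triple]:
    """Remove triples whose object is a blank IRI; keep every literal as-is."""
    return [
        triple
        for triple in triples
        if not (triple.object_is_iri and not triple.obj.strip())
    ]


subject = "besluiten:0ae32940-b70d-11ec-ac2b-ed7850dc94ca"
parsed = [
    Triple(subject, "a", "besluit:Besluit", object_is_iri=True),
    Triple(subject, "a", " ", object_is_iri=True),             # would become <>
    Triple(subject, "ext:example", "", object_is_iri=False),   # empty literal
]

kept = drop_blank_resources(parsed)
```

Here the blank resource is dropped while the empty literal survives, which is the behaviour the review asked for.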
@benjay10 benjay10 requested a review from nvdk April 12, 2022 16:01
@nvdk nvdk merged commit 69c03a7 into master Apr 13, 2022
@nvdk nvdk deleted the bugfix/filter-triples-with-empty-object branch April 13, 2022 07:10
nvdk pushed a commit to lblod/app-gn-publicatie that referenced this pull request May 31, 2022
lblod/besluit-publicatie-publish-service#8

filter out triples with whitespace object, causing queries to fail