Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Erratic behaviour with nested entity and choose in combination with flushWith #366

Open
TobiasNx opened this issue May 18, 2021 · 6 comments

Comments

@TobiasNx
Copy link
Contributor

TobiasNx commented May 18, 2021

Working with lobid-resources I found a bug.

https://github.com/hbz/lobid-resources/blob/ccc34fc8590340c6fb72d4d9c884746090b7fbf8/src/main/resources/alma/common/subjects.xml#L4-L60

To add a type specific to the second indicator of a marc field in 610 and 650 I have introduced a choose collector with sameEntity="true" reset="true" flushWith="650??.a" (or 610??.a). The wrapping mother entity also has flushWith conditions. Both flushWith as well al the source data refrence are using wild cards.

    <entity name="" flushWith="650??" sameEntity="true" reset="true">
      <entity name="type[]" sameEntity="true" reset="true">
        <choose sameEntity="true" reset="true" flushWith="650??.a">
          <data source="650?0.a">
            <constant value="Concept"/>
          </data>
          <data source="650?4.a">
            <constant value="Keyword"/>
          </data>
        </choose>
      </entity>
...

When I run the lobid morph to test with:
mvn failsafe:integration-test -Dit.test=AlmaMarc21XmlToLobidJsonTest

Results for this specific transformation process seem to be erratic.
They are inconsistent internally as well as between different test runs. But is does not always create these inconsistencies between runs.
hbz/lobid-resources@ccc34fc

@dr0i saw it live. When we changed the flushWith-attribute of choose to the entity level (610?? instead of 610??.a). The bug does not appear.

@TobiasNx TobiasNx added the Bug label May 18, 2021
TobiasNx added a commit to hbz/lobid-resources that referenced this issue May 18, 2021
@blackwinter
Copy link
Member

I don't think I can confirm.

First of all, AlmaMarc21XmlToLobidJsonTest.setup() sets GENERATE_TESTDATA = true, so the test doesn't actually test anything (AFAIUI).

But even after removing that line, executing mvn clean test-compile failsafe:integration-test -Dit.test=AlmaMarc21XmlToLobidJsonTest always yields the exact same comparison failure (see below). No erratic behaviour observed here.

Can you be more specific as to what should be the expected outcome?

expected:<...subject" : [ {
    "[label" : "Berlin",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "label" : "Bremen",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "label" : "Hamburg",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "label" : "Nordrhein-Westfalen",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "label" : "Niedersachsen",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "label" : "Rheinland-Pfalz",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "label" : "Saarland",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {]
    "label" : "Schl...> but was:<...subject" : [ {
    "[type" : [ "Keyword" ],
    "label" : "Berlin",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Bremen",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Hamburg",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Nordrhein-Westfalen",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Niedersachsen",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Rheinland-Pfalz",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Saarland",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],]
    "label" : "Schl...>
        at org.lobid.resources.AlmaMarc21XmlToLobidJsonTest.lambda$transformFiles$1(AlmaMarc21XmlToLobidJsonTest.java:174)
        at org.lobid.resources.AlmaMarc21XmlToLobidJsonTest.transformFiles(AlmaMarc21XmlToLobidJsonTest.java:137)

@blackwinter
Copy link
Member

There's almost definitely something fishy going on... When shuffling the file list, catching the AssertionError and then just printing the failing file names, it becomes apparent that the failures are order-dependent:

(DE-605)HT003176544
(DE-605)TT050421649
(DE-605)HT015671602
(DE-605)HT017411546
(DE-605)HT017411546
(DE-605)HT003176544
(DE-605)HT003176544
(DE-605)HT020202475
(DE-605)HT015671602
(DE-605)HT017411546
(DE-605)HT017411546
(DE-605)HT020202475
(DE-605)HT003176544
(DE-605)TT050421649
(DE-605)TT050421649
(DE-605)HT017411546
(DE-605)TT050421649
(DE-605)HT020202475
(DE-605)HT017411546

While they're stable when iterating in fixed (directory) order:

(DE-605)HT003176544
(DE-605)HT017411546
(DE-605)HT003176544
(DE-605)HT017411546
(DE-605)HT003176544
(DE-605)HT017411546
(DE-605)HT003176544
(DE-605)HT017411546

It's also not limited to the <choose> elements in subjects.xml:

(DE-605)HT017411546 expected:<..." : "Münster",
    "[]publicationHistory" ...> but was:<..." : "Münster",
    "[location" : "Münster",
    "]publicationHistory" ...>

Which originates from a - seemingly - inconspicuous <data> element (via macro regex-del-punctuation-end) in titleRelatedFields.xml.

@TobiasNx
Copy link
Contributor Author

TobiasNx commented May 20, 2021

Can you be more specific as to what should be the expected outcome?

It is expected that 610 and 650 get a "type" : [ "Keyword" ] or "type" : [ "Concept" ] based on the specific second indicator of that field.

the internal inconsistency is shown in this two cases:

(DE-605)HT003176544.json:

    "label" : "Schleswig-Holstein",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Hotel- und Gaststättengewerbe",
    "source" : {
      "label" : "Freie Verschlagwortung",

The 610 and the 650 even if they are the same workflow 610 in this example provides the type and 650 does not.

We only have one 610 testfile.

Unfortunately not shown in the the catched example 650 also does not work always and even can give different results in different testfiles . It then can happen that HT003176544 and HT020202475 have different results with this workflow. We saw the case that: While HT003176544: "label" : "Hotel- und Gaststättengewerbe" has no type HT020202475: "label" : "Critical theory" and "label" : "Electronic books" have or the other way around.

(DE-605)HT020202475.json

  "subject" : [ {
    "type" : [ "Keyword" ],
    "label" : "Critical theory",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Electronic books",
    "source" : {
      "label" : "Freie Verschlagwortung",

Your second case:

It's also not limited to the elements in subjects.xml:

(DE-605)HT017411546 expected:<..." : "Münster",
    "[]publicationHistory" ...> but was:<..." : "Münster",
    "[location" : "Münster",
    "]publicationHistory" ...>

Which originates from a - seemingly - inconspicuous element (via macro regex-del-punctuation-end) in titleRelatedFields.xml.

Could this due to the fact that the source data has duplicates?

  <datafield tag="264" ind1="3" ind2="1">
    <subfield code="a">M&#252;nster</subfield>
    <subfield code="b">Regensberg</subfield>
    <subfield code="a">M&#252;nster</subfield>
    <subfield code="b">Historische Kommission, Landschaftsverband Westfalen-Lippe</subfield>
    <subfield code="c">1951-2004</subfield>
    <subfield code="3">1951-2004</subfield>
  </datafield>
  <datafield tag="264" ind1="2" ind2="1">
    <subfield code="a">M&#252;nster</subfield>
    <subfield code="b">Regensberg</subfield>
    <subfield code="3">1951-1977</subfield>
  </datafield>

Münster appears multiple times.

@dr0i
Copy link
Member

dr0i commented May 20, 2021

Sorry for intervening late: the tests are not working and we have an issue for that. For the rest of the problem I can't say because I haven't checked that on my own yet.

@blackwinter
Copy link
Member

the tests are not working

I suspect both issues have exactly the same root cause. Your tests don't give reliable results (seems to be the location issue mentioned above), because of the inconsistency between generating the test data and testing them (as well as between different test runs).

I was able to reproduce the erratic behaviour with a reduced test case (HT020391499 and HT003176544, in this order):

Possible outcomes for HT003176544 (as of hbz/lobid-resources@ccc34fc):

  1. All 610 fields gain type.
  2. The 650 field loses type.
  3. Both 1 and 2.
  4. No changes.

Test scenarios:

  1. Generating test data while also extracting from the tarball: 20/20 outcome 1.
  2. Generating test data without extracting from the tarball: 20/20 outcome 3.
  3. Testing pristine test data: 19/30 outcome 1, 11/30 outcome 2.

Control test with only HT003176544:

  1. Scenario 1: 19/20 outcome 2, 1/20 outcome 3.
  2. Scenario 2: 18/20 outcome 3, 2/20 outcome 1.
  3. Scenario 3: 30/30 outcome 4.

Which leads to the following questions:

  1. Why does extracting from the tarball affect the outcome of generate runs? (1 vs. 3 and 2 + 3 vs. 1 + 3)
  2. Why do generate runs oscillate between different outcomes? (2 + 3 vs. 1 + 3)
  3. Why does one extra test file affect the outcome of test runs? (1 + 2 vs. 4)
  4. Why do test runs oscillate between different outcomes? (1 + 2)

blackwinter added a commit to blackwinter/metafacture-core that referenced this issue May 21, 2021
@blackwinter
Copy link
Member

blackwinter commented May 21, 2021

So here's a heavily reduced test case that hopefully still catches the essence of this issue (at least w.r.t. questions 2 and 4, maybe 3). You can run it with the following Gradle invocation from the metamorph subdirectory:

 ../gradlew --rerun-tasks test --tests org.metafacture.metamorph.Issue366Test -Dissue366.verbose=true -Dissue366.iter=100

It shows that option 0 (the issue at hand) fails 50% of the time (misses k) and that options 1-3 (small modifications to the morph) always pass (while option 1 only differs from options 2-3 in that it yields a different order in the output - as expected).

Input:

{"10012":{"a":"V"}}

Option 0:

<metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
  <rules>
    <entity name="r" flushWith="100??">
      <choose flushWith="100??.a">
        <data name="k" source="100?2.a">
          <constant value="K"/>
        </data>
      </choose>
      <data name="v" source="100??.a" />
    </entity>
  </rules>
</metamorph>

Output 0:

{"r":{"v":"V"}}         // 47/100
{"r":{"k":"K","v":"V"}} // 53/100

Option 1:

<metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
  <rules>
    <entity name="r" flushWith="100??">
      <choose flushWith="100??">
        <data name="k" source="100?2.a">
          <constant value="K"/>
        </data>
      </choose>
      <data name="v" source="100??.a" />
    </entity>
  </rules>
</metamorph>

Diff 1:

--- option0.xml
+++ option1.xml
@@ -1,7 +1,7 @@
 <metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
   <rules>
     <entity name="r" flushWith="100??">
-      <choose flushWith="100??.a">
+      <choose flushWith="100??">
         <data name="k" source="100?2.a">
           <constant value="K"/>
         </data>

Output 1:

{"r":{"v":"V","k":"K"}}

Option 2:

<metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
  <rules>
    <entity name="r" flushWith="100??">
      <choose flushWith="100??.a">
        <data name="k" source="100??.a">
          <constant value="K"/>
        </data>
      </choose>
      <data name="v" source="100??.a" />
    </entity>
  </rules>
</metamorph>

Diff 2:

--- option0.xml
+++ option2.xml
@@ -2,7 +2,7 @@
   <rules>
     <entity name="r" flushWith="100??">
       <choose flushWith="100??.a">
-        <data name="k" source="100?2.a">
+        <data name="k" source="100??.a">
           <constant value="K"/>
         </data>
       </choose>

Output 2:

{"r":{"k":"K","v":"V"}}

Option 3:

<metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
  <rules>
    <entity name="r" flushWith="100??">
      <choose flushWith="100?2.a">
        <data name="k" source="100?2.a">
          <constant value="K"/>
        </data>
      </choose>
      <data name="v" source="100?2.a" />
    </entity>
  </rules>
</metamorph>

Diff 3:

--- option0.xml
+++ option3.xml
@@ -1,12 +1,12 @@
 <metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
   <rules>
     <entity name="r" flushWith="100??">
-      <choose flushWith="100??.a">
+      <choose flushWith="100?2.a">
         <data name="k" source="100?2.a">
           <constant value="K"/>
         </data>
       </choose>
-      <data name="v" source="100??.a" />
+      <data name="v" source="100?2.a" />
     </entity>
   </rules>
 </metamorph>

Output 3:

{"r":{"k":"K","v":"V"}}

blackwinter added a commit to blackwinter/metafacture-core that referenced this issue May 21, 2021
blackwinter added a commit to blackwinter/metafacture-core that referenced this issue May 21, 2021
@github-project-automation github-project-automation bot moved this to Backlog in Metafacture Mar 27, 2023
@katauber katauber removed this from Metafacture Apr 24, 2023
blackwinter added a commit that referenced this issue Dec 13, 2024
* Disallow unsupported levels.
* Add tests.
blackwinter pushed a commit that referenced this issue Dec 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants