Erratic behaviour with nested entity and choose in combination with flushWith #366

TobiasNx · 2021-05-18T14:08:02Z

Working with lobid-resources I found a bug.

https://github.com/hbz/lobid-resources/blob/ccc34fc8590340c6fb72d4d9c884746090b7fbf8/src/main/resources/alma/common/subjects.xml#L4-L60

To add a type specific to the second indicator of a marc field in 610 and 650 I have introduced a choose collector with sameEntity="true" reset="true" flushWith="650??.a" (or 610??.a). The wrapping mother entity also has flushWith conditions. Both flushWith as well al the source data refrence are using wild cards.

    <entity name="" flushWith="650??" sameEntity="true" reset="true">
      <entity name="type[]" sameEntity="true" reset="true">
        <choose sameEntity="true" reset="true" flushWith="650??.a">
          <data source="650?0.a">
            <constant value="Concept"/>
          </data>
          <data source="650?4.a">
            <constant value="Keyword"/>
          </data>
        </choose>
      </entity>
...

When I run the lobid morph to test with:
mvn failsafe:integration-test -Dit.test=AlmaMarc21XmlToLobidJsonTest

Results for this specific transformation process seem to be erratic.
They are inconsistent internally as well as between different test runs. But is does not always create these inconsistencies between runs.
hbz/lobid-resources@ccc34fc

@dr0i saw it live. When we changed the flushWith-attribute of choose to the entity level (610?? instead of 610??.a). The bug does not appear.

The text was updated successfully, but these errors were encountered:

See metafacture/metafacture-core#366

blackwinter · 2021-05-19T07:20:33Z

I don't think I can confirm.

First of all, AlmaMarc21XmlToLobidJsonTest.setup() sets GENERATE_TESTDATA = true, so the test doesn't actually test anything (AFAIUI).

But even after removing that line, executing mvn clean test-compile failsafe:integration-test -Dit.test=AlmaMarc21XmlToLobidJsonTest always yields the exact same comparison failure (see below). No erratic behaviour observed here.

Can you be more specific as to what should be the expected outcome?

expected:<...subject" : [ {
    "[label" : "Berlin",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "label" : "Bremen",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "label" : "Hamburg",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "label" : "Nordrhein-Westfalen",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "label" : "Niedersachsen",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "label" : "Rheinland-Pfalz",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "label" : "Saarland",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {]
    "label" : "Schl...> but was:<...subject" : [ {
    "[type" : [ "Keyword" ],
    "label" : "Berlin",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Bremen",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Hamburg",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Nordrhein-Westfalen",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Niedersachsen",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Rheinland-Pfalz",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Saarland",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],]
    "label" : "Schl...>
        at org.lobid.resources.AlmaMarc21XmlToLobidJsonTest.lambda$transformFiles$1(AlmaMarc21XmlToLobidJsonTest.java:174)
        at org.lobid.resources.AlmaMarc21XmlToLobidJsonTest.transformFiles(AlmaMarc21XmlToLobidJsonTest.java:137)

blackwinter · 2021-05-19T14:44:20Z

There's almost definitely something fishy going on... When shuffling the file list, catching the AssertionError and then just printing the failing file names, it becomes apparent that the failures are order-dependent:

(DE-605)HT003176544
(DE-605)TT050421649
(DE-605)HT015671602
(DE-605)HT017411546

(DE-605)HT017411546
(DE-605)HT003176544

(DE-605)HT003176544
(DE-605)HT020202475
(DE-605)HT015671602
(DE-605)HT017411546

(DE-605)HT017411546
(DE-605)HT020202475
(DE-605)HT003176544
(DE-605)TT050421649

(DE-605)TT050421649
(DE-605)HT017411546

(DE-605)TT050421649
(DE-605)HT020202475
(DE-605)HT017411546

While they're stable when iterating in fixed (directory) order:

(DE-605)HT003176544
(DE-605)HT017411546

(DE-605)HT003176544
(DE-605)HT017411546

(DE-605)HT003176544
(DE-605)HT017411546

(DE-605)HT003176544
(DE-605)HT017411546

It's also not limited to the <choose> elements in subjects.xml:

(DE-605)HT017411546 expected:<..." : "Münster",
    "[]publicationHistory" ...> but was:<..." : "Münster",
    "[location" : "Münster",
    "]publicationHistory" ...>

Which originates from a - seemingly - inconspicuous <data> element (via macro regex-del-punctuation-end) in titleRelatedFields.xml.

TobiasNx · 2021-05-20T05:15:38Z

Can you be more specific as to what should be the expected outcome?

It is expected that 610 and 650 get a "type" : [ "Keyword" ] or "type" : [ "Concept" ] based on the specific second indicator of that field.

the internal inconsistency is shown in this two cases:

(DE-605)HT003176544.json:

    "label" : "Schleswig-Holstein",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Hotel- und Gaststättengewerbe",
    "source" : {
      "label" : "Freie Verschlagwortung",

The 610 and the 650 even if they are the same workflow 610 in this example provides the type and 650 does not.

We only have one 610 testfile.

Unfortunately not shown in the the catched example 650 also does not work always and even can give different results in different testfiles . It then can happen that HT003176544 and HT020202475 have different results with this workflow. We saw the case that: While HT003176544: "label" : "Hotel- und Gaststättengewerbe" has no type HT020202475: "label" : "Critical theory" and "label" : "Electronic books" have or the other way around.

(DE-605)HT020202475.json

  "subject" : [ {
    "type" : [ "Keyword" ],
    "label" : "Critical theory",
    "source" : {
      "label" : "Freie Verschlagwortung",
      "id" : "https://www.wikidata.org/wiki/Q47524318"
    }
  }, {
    "type" : [ "Keyword" ],
    "label" : "Electronic books",
    "source" : {
      "label" : "Freie Verschlagwortung",

Your second case:

It's also not limited to the elements in subjects.xml:
(DE-605)HT017411546 expected:<..." : "Münster",
    "[]publicationHistory" ...> but was:<..." : "Münster",
    "[location" : "Münster",
    "]publicationHistory" ...>
Which originates from a - seemingly - inconspicuous element (via macro regex-del-punctuation-end) in titleRelatedFields.xml.

Could this due to the fact that the source data has duplicates?

  <datafield tag="264" ind1="3" ind2="1">
    <subfield code="a">M&#252;nster</subfield>
    <subfield code="b">Regensberg</subfield>
    <subfield code="a">M&#252;nster</subfield>
    <subfield code="b">Historische Kommission, Landschaftsverband Westfalen-Lippe</subfield>
    <subfield code="c">1951-2004</subfield>
    <subfield code="3">1951-2004</subfield>
  </datafield>
  <datafield tag="264" ind1="2" ind2="1">
    <subfield code="a">M&#252;nster</subfield>
    <subfield code="b">Regensberg</subfield>
    <subfield code="3">1951-1977</subfield>
  </datafield>

Münster appears multiple times.

dr0i · 2021-05-20T07:33:58Z

Sorry for intervening late: the tests are not working and we have an issue for that. For the rest of the problem I can't say because I haven't checked that on my own yet.

blackwinter · 2021-05-20T09:45:33Z

the tests are not working

I suspect both issues have exactly the same root cause. Your tests don't give reliable results (seems to be the location issue mentioned above), because of the inconsistency between generating the test data and testing them (as well as between different test runs).

I was able to reproduce the erratic behaviour with a reduced test case (HT020391499 and HT003176544, in this order):

Possible outcomes for HT003176544 (as of hbz/lobid-resources@ccc34fc):

All 610 fields gain type.
The 650 field loses type.
Both 1 and 2.
No changes.

Test scenarios:

Generating test data while also extracting from the tarball: 20/20 outcome 1.
Generating test data without extracting from the tarball: 20/20 outcome 3.
Testing pristine test data: 19/30 outcome 1, 11/30 outcome 2.

Control test with only HT003176544:

Scenario 1: 19/20 outcome 2, 1/20 outcome 3.
Scenario 2: 18/20 outcome 3, 2/20 outcome 1.
Scenario 3: 30/30 outcome 4.

Which leads to the following questions:

Why does extracting from the tarball affect the outcome of generate runs? (1 vs. 3 and 2 + 3 vs. 1 + 3)
Why do generate runs oscillate between different outcomes? (2 + 3 vs. 1 + 3)
Why does one extra test file affect the outcome of test runs? (1 + 2 vs. 4)
Why do test runs oscillate between different outcomes? (1 + 2)

blackwinter · 2021-05-21T09:15:47Z

So here's a heavily reduced test case that hopefully still catches the essence of this issue (at least w.r.t. questions 2 and 4, maybe 3). You can run it with the following Gradle invocation from the metamorph subdirectory:

 ../gradlew --rerun-tasks test --tests org.metafacture.metamorph.Issue366Test -Dissue366.verbose=true -Dissue366.iter=100

It shows that option 0 (the issue at hand) fails 50% of the time (misses k) and that options 1-3 (small modifications to the morph) always pass (while option 1 only differs from options 2-3 in that it yields a different order in the output - as expected).

Input:

{"10012":{"a":"V"}}

Option 0:

<metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
  <rules>
    <entity name="r" flushWith="100??">
      <choose flushWith="100??.a">
        <data name="k" source="100?2.a">
          <constant value="K"/>
        </data>
      </choose>
      <data name="v" source="100??.a" />
    </entity>
  </rules>
</metamorph>

Output 0:

{"r":{"v":"V"}}         // 47/100
{"r":{"k":"K","v":"V"}} // 53/100

Option 1:

<metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
  <rules>
    <entity name="r" flushWith="100??">
      <choose flushWith="100??">
        <data name="k" source="100?2.a">
          <constant value="K"/>
        </data>
      </choose>
      <data name="v" source="100??.a" />
    </entity>
  </rules>
</metamorph>

Diff 1:

--- option0.xml
+++ option1.xml
@@ -1,7 +1,7 @@
 <metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
   <rules>
     <entity name="r" flushWith="100??">
-      <choose flushWith="100??.a">
+      <choose flushWith="100??">
         <data name="k" source="100?2.a">
           <constant value="K"/>
         </data>

Output 1:

{"r":{"v":"V","k":"K"}}

Option 2:

<metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
  <rules>
    <entity name="r" flushWith="100??">
      <choose flushWith="100??.a">
        <data name="k" source="100??.a">
          <constant value="K"/>
        </data>
      </choose>
      <data name="v" source="100??.a" />
    </entity>
  </rules>
</metamorph>

Diff 2:

--- option0.xml
+++ option2.xml
@@ -2,7 +2,7 @@
   <rules>
     <entity name="r" flushWith="100??">
       <choose flushWith="100??.a">
-        <data name="k" source="100?2.a">
+        <data name="k" source="100??.a">
           <constant value="K"/>
         </data>
       </choose>

Output 2:

{"r":{"k":"K","v":"V"}}

Option 3:

<metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
  <rules>
    <entity name="r" flushWith="100??">
      <choose flushWith="100?2.a">
        <data name="k" source="100?2.a">
          <constant value="K"/>
        </data>
      </choose>
      <data name="v" source="100?2.a" />
    </entity>
  </rules>
</metamorph>

Diff 3:

--- option0.xml
+++ option3.xml
@@ -1,12 +1,12 @@
 <metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
   <rules>
     <entity name="r" flushWith="100??">
-      <choose flushWith="100??.a">
+      <choose flushWith="100?2.a">
         <data name="k" source="100?2.a">
           <constant value="K"/>
         </data>
       </choose>
-      <data name="v" source="100??.a" />
+      <data name="v" source="100?2.a" />
     </entity>
   </rules>
 </metamorph>

Output 3:

{"r":{"k":"K","v":"V"}}

* Disallow unsupported levels. * Add tests.

Add log function

TobiasNx added the Bug label May 18, 2021

TobiasNx added a commit to hbz/lobid-resources that referenced this issue May 18, 2021

Erratic choose construct in entity

ccc34fc

See metafacture/metafacture-core#366

blackwinter added a commit to blackwinter/metafacture-core that referenced this issue May 21, 2021

Reduced test case for metafacture#366.

5d02bbe

blackwinter added a commit to blackwinter/metafacture-core that referenced this issue May 21, 2021

Reduced test case for metafacture#366.

a7a6fbe

blackwinter added a commit to blackwinter/metafacture-core that referenced this issue May 21, 2021

Reduced test case for metafacture#366.

92a035c

TobiasNx added the Metamorph label Mar 13, 2023

TobiasNx added this to Metafacture Mar 27, 2023

github-project-automation bot moved this to Backlog in Metafacture Mar 27, 2023

katauber removed this from Metafacture Apr 24, 2023

blackwinter added a commit that referenced this issue Dec 13, 2024

Allow more log levels in log() Fix function. (#366)

5d15e0e

* Disallow unsupported levels. * Add tests.

blackwinter added a commit that referenced this issue Dec 13, 2024

Update documentation for log() Fix function. (#366)

ad73aa2

blackwinter pushed a commit that referenced this issue Dec 13, 2024

Merge pull request #366 from metafacture/319-log

81a88bc

Add log function

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Erratic behaviour with nested entity and choose in combination with flushWith #366

Erratic behaviour with nested entity and choose in combination with flushWith #366

TobiasNx commented May 18, 2021 •

edited

Loading

blackwinter commented May 19, 2021

blackwinter commented May 19, 2021

TobiasNx commented May 20, 2021 •

edited

Loading

dr0i commented May 20, 2021

blackwinter commented May 20, 2021

blackwinter commented May 21, 2021 •

edited

Loading

Erratic behaviour with nested entity and choose in combination with flushWith #366

Erratic behaviour with nested entity and choose in combination with flushWith #366

Comments

TobiasNx commented May 18, 2021 • edited Loading

blackwinter commented May 19, 2021

blackwinter commented May 19, 2021

TobiasNx commented May 20, 2021 • edited Loading

dr0i commented May 20, 2021

blackwinter commented May 20, 2021

blackwinter commented May 21, 2021 • edited Loading

TobiasNx commented May 18, 2021 •

edited

Loading

TobiasNx commented May 20, 2021 •

edited

Loading

blackwinter commented May 21, 2021 •

edited

Loading