Pass data through Metamorph #107

Closed
cboehme opened this issue Jul 4, 2013 · 13 comments · Fixed by #333

Comments

@cboehme
Member

cboehme commented Jul 4, 2013

Metamorph should be able to pass entities and literals through even if they are not processed.

cboehme added a commit to cboehme/metafacture-core that referenced this issue Jul 14, 2013
The old PicaDecoder used regular expressions to parse PICA+ records.
This led to two problems:

 * Errors in the data resulted in exceptions which did not refer to the    
   portion of the data that caused the problem (e.g. a character index)
 * Due to the use of String.substring() for extracting data from the  
   record the full record was kept in memory (see issue metafacture#51)

The new PicaDecoder was written to solve these problems. The first one
was addressed by constructing the parser so that it only fails in two
clearly defined situations (missing id field and unexpected end of
record). The second one was solved by copying the parsed data portions
into new strings. 

In addition to the problems listed above, the following issues were
addressed:
 
 * metafacture#109 -- removed support for static usages of the encoder
 * metafacture#112 -- removed support for appendControlSubField. If Metamorph is  
   extended to pass data through (issue metafacture#107), this functionality can 
   easily be implemented in a script. It is also not clear how widely it 
   is used at all.
 
While support for control subfields has been removed, the new decoder
introduces a range of new options:

 * ignore missing id -- do not fail on missing ids but use an empty
   string as record id
 * skip empty fields -- do not output fields that have no subfields or
   only empty subfields (i.e. subfields without name and value)
 * fix unexpected end of record -- if a record does not end with a field
   delimiter, one is added automatically
 * normalize UTF8 -- automatically perform UTF8 normalization of values
 
The unit tests have been rewritten to match the new options and to be
more useful for debugging.
@cboehme cboehme removed this from the Version 2 milestone Feb 19, 2014
@dr0i
Member

dr0i commented Oct 25, 2019

There is metamorph's _else, but I guess you mean "something else" ;)

@cboehme
Member Author

cboehme commented Nov 3, 2019

Yes, I meant something that could also pass whole entities and not only literals. Currently, it is not easy to write a script that changes only some literals or entities in a record while passing the remainder of the record through untouched. This makes it quite difficult to write a script that filters some data in a record without completely rewriting it.

@dr0i
Member

dr0i commented Nov 19, 2019

Seems related to an email from February 2015, which also provides more details.

dr0i added a commit that referenced this issue Jun 2, 2020
With this version of Metamorph, entities (well, entity events) can be passed through.
It is slightly incompatible with the default Metamorph, where two tests would fail:

- org.metafacture.metamorph.collectors.EntityTest > shouldEmitEntityOnEachFlushEvent
- org.metafacture.metamorph.functions.UniqueTest > shouldAllowSelectingTheUniqueScope

So this introduces a "<version>" element under the "<meta>" element in morph.

The data is flattened, as with Metamorph 1, but the entity's "start"
and "end" events are passed through so that the receiver can handle the flattened
data structure, unflatten it etc.

By preserving the entity events it's now also possible, without any workarounds,
to handle reiterations of entities having the same name.

- improve MarcXmlEncoder to work with both Metamorph versions
- add "version" element to metamorph.xsd

See #107.
See also https://github.com/hagbeck/metafacture-sandbox/tree/master/enrich_marcxml.
dr0i added a commit that referenced this issue Jun 2, 2020
With this version of Metamorph, entities (well, entity events) can be passed through.
It is slightly incompatible with the default Metamorph, where two tests would fail:

- org.metafacture.metamorph.collectors.EntityTest > shouldEmitEntityOnEachFlushEvent
- org.metafacture.metamorph.functions.UniqueTest > shouldAllowSelectingTheUniqueScope

So this introduces a "<version>" element under the "<meta>" element in morph.

The data is flattened, as with Metamorph 1, but the entity's "start"
and "end" events are passed through so that the receiver can handle the flattened
data structure, unflatten it etc.

By preserving the entity events it's now also possible, without any workarounds,
to handle reiterations of entities having the same name.

- add "version" element to metamorph.xsd

See #107.
See also https://github.com/hagbeck/metafacture-sandbox/tree/master/enrich_marcxml.
@blackwinter
Member

blackwinter commented Jun 3, 2020

Just an FYI in reference to PR #328 (not really sure where to discuss): You could achieve (almost) the same result with a filter (org.metafacture.metamorph.Filter) and _else in combination with an unnamed entity [CORRECTION: entity is not required, we're only using it for the (optional) if] (cf. hbz.limetrans.filter.LibraryMetadataFilter).

@Test
public void metamorph1_passthrough() {
    final Filter metamorph = new Filter(InlineMorph.in(this) //
            .with("<rules>")//
            .with("    <data source='_else'/>")//
            .with("</rules>")//
            .create());
    metamorph.setReceiver(receiver);

    metamorph.startRecord("1");
    metamorph.startEntity("clone");
    metamorph.literal("id", "0");
    metamorph.endEntity();
    metamorph.startEntity("clone");
    metamorph.literal("id", "1");
    metamorph.endEntity();
    metamorph.endRecord();

    final InOrder ordered = inOrder(receiver);
    ordered.verify(receiver).startRecord("1");
    ordered.verify(receiver).startEntity("clone");
    ordered.verify(receiver).literal("id", "0");
    ordered.verify(receiver).endEntity();
    ordered.verify(receiver).startEntity("clone");
    ordered.verify(receiver).literal("id", "1");
    ordered.verify(receiver).endEntity();
    ordered.verify(receiver).endRecord();
}

The only difference is that literals are emitted with their base name only, not with the full entity path (id instead of clone.id), which seems more in line with the usual (pre-1.1) mappings.
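
To make the naming difference concrete, here is a toy sketch that does not use Metafacture at all. The event encoding (`start:NAME`, `end`, `lit:NAME=VALUE`) and the names `FlattenDemo` and `literalNames` are assumptions made up for this illustration; the sketch only shows how a literal's name changes depending on whether the enclosing entity path is prepended.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class FlattenDemo {

    // Hypothetical event encoding, used only in this sketch:
    // "start:NAME" opens an entity, "end" closes it, "lit:NAME=VALUE" is a literal.
    static List<String> literalNames(final List<String> events, final boolean fullPath) {
        final Deque<String> path = new ArrayDeque<>();
        final List<String> names = new ArrayList<>();
        for (final String event : events) {
            if (event.startsWith("start:")) {
                path.addLast(event.substring("start:".length()));
            } else if (event.equals("end")) {
                path.removeLast();
            } else if (event.startsWith("lit:")) {
                final String name = event.substring("lit:".length(), event.indexOf('='));
                if (fullPath && !path.isEmpty()) {
                    // Flattened naming: entity path and literal name joined by dots
                    names.add(String.join(".", path) + "." + name);
                } else {
                    // Base-name naming: the literal name only
                    names.add(name);
                }
            }
        }
        return names;
    }

    public static void main(final String[] args) {
        final List<String> events = Arrays.asList(
                "start:clone", "lit:id=0", "end",
                "start:clone", "lit:id=1", "end");
        System.out.println(literalNames(events, true));  // prints [clone.id, clone.id]
        System.out.println(literalNames(events, false)); // prints [id, id]
    }
}
```

With base names, a receiver can only tell the two `id` literals apart if the surrounding entity events are passed on as well, which is what the test above verifies.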

@hagbeck
Contributor

hagbeck commented Jun 3, 2020

Sorry, but this isn't working anymore! I've pulled the branch and the resulting dist produces:

<marc:controlfield tag="leader">01339nmm a2200024 c 4500</marc:controlfield>
<marc:controlfield tag="001">1635091</marc:controlfield>
<marc:controlfield tag="005">20180914</marc:controlfield>
<marc:controlfield tag="007">cr |||||||||||</marc:controlfield>
<marc:controlfield tag="008">171117s2017uuuu|||||| |o|||||||||||ger||</marc:controlfield>
<marc:controlfield tag="020 .a">9783662532607</marc:controlfield>
<marc:controlfield tag="0247.2">doi</marc:controlfield>
<marc:controlfield tag="0247.a">10.1007/978-3-662-53260-7</marc:controlfield>
<marc:controlfield tag="035 .a">(UNION_SEAL)HT019079275</marc:controlfield>
<marc:controlfield tag="035 .a">(DE-599)HBZHT019079275</marc:controlfield>
<marc:controlfield tag="040 .e">rda</marc:controlfield>
<marc:controlfield tag="24500a">Weißbuch Gelenkersatz</marc:controlfield>
<marc:controlfield tag="24500b">Versorgungssituation bei endoprothetischen Hüft- und Knieoperationen in Deutschland</marc:controlfield>
<marc:controlfield tag="24500c">herausgegeben von H.-H. Bleß, M. Kip</marc:controlfield>
<marc:controlfield tag="260 .a">Berlin, Heidelberg</marc:controlfield>
<marc:controlfield tag="260 .b">Springer Berlin Heidelberg</marc:controlfield>
<marc:controlfield tag="260 .c">2017</marc:controlfield>
<marc:controlfield tag="260 .b">Imprint: Springer</marc:controlfield>
<marc:controlfield tag="2641.a">Berlin, Heidelberg</marc:controlfield>
<marc:controlfield tag="2641.b">Springer Berlin Heidelberg</marc:controlfield>
<marc:controlfield tag="2641.c">2017</marc:controlfield>

The morph I've used:

<?xml version="1.0" encoding="UTF-8"?>
<metamorph xmlns="http://www.culturegraph.org/metamorph"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.culturegraph.org/metamorph metamorph.xsd"
           version="1">

     <!-- Metadata -->
    <meta />

    <!-- Macro definitions -->
    <macros />

    <!-- Transformation rules-->
    <rules>
        <data source="_else"/>
    </rules>

    <maps />

</metamorph>

And the flux:

"infile.xml"|
open-file|
decode-xml|
handle-marcxml|
morph(FLUX_DIR + "morph1.xml", *)|
encode-marcxml|
write( "output.xml");

@blackwinter
Member

@hagbeck: It's caused by this particular change in the branch:

-            writeRaw(String.format(CONTROLFIELD_OPEN_TEMPLATE, name));
+            writeRaw(String.format(CONTROLFIELD_OPEN_TEMPLATE, name.replaceFirst("\\W","")));
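
As a rough illustration of what that expression does to field names, here is a standalone sketch. It is not Metafacture code: `TagNameDemo` and `stripFirstNonWord` are made-up names, and only the `replaceFirst("\\W", "")` call is taken from the diff. The inputs other than the 020 field are inferred from the output posted above, so treat them as assumptions.

```java
public class TagNameDemo {

    // Mirrors only the expression from the diff:
    // remove the first non-word character (anything outside [a-zA-Z_0-9]), if there is one.
    static String stripFirstNonWord(final String name) {
        return name.replaceFirst("\\W", "");
    }

    public static void main(final String[] args) {
        // Assumed flattened names: tag + two indicator characters + "." + subfield code
        final String[] names = {"001", "020  .a", "0247 .2", "24500.a"};
        for (final String name : names) {
            System.out.println("[" + name + "] -> [" + stripFirstNonWord(name) + "]");
        }
        // prints:
        // [001] -> [001]
        // [020  .a] -> [020 .a]
        // [0247 .2] -> [0247.2]
        // [24500.a] -> [24500a]
    }
}
```

Which character is dropped depends on the indicators: a blank indicator is removed first, and if there is none, the dot separating the subfield code goes instead.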

dr0i added a commit to dr0i/metafacture-sandbox that referenced this issue Jun 4, 2020
@dr0i
Member

dr0i commented Jun 4, 2020

@hagbeck for 100% backward compatibility, a meta.version was introduced.
The default is to behave like with the original Metamorph - this is what your results reflect.

To use Metamorph 1.1, you have to put a meta.version with the value "1.1" into the morph. I opened a PR (hagbeck/metafacture-sandbox#4) so you can see how it's done.

@dr0i
Member

dr0i commented Jun 4, 2020

@blackwinter oh, thx, wasn't aware of this "Filter" ... Wonder if this is the better solution and wonder why this was not discussed here before. Will try it and see how it behaves, but it looks very promising!

@blackwinter
Member

> The default is to behave like with the original Metamorph - this is what your results reflect.

The difference is tag="020␣␣.a" (old) vs. tag="020␣.a" (new).

@blackwinter
Member

blackwinter commented Jun 4, 2020

Ok, I see, the comment "this isn't working anymore" refers to metafacture-sandbox, not the original behaviour. I still think it's an incompatibility, though. This change is not guarded by the version parameter.

@dr0i
Member

dr0i commented Jun 4, 2020

@blackwinter yes, it's indeed an issue. Wished to have put 7ca5615 in another branch so that it could be discussed there.

@blackwinter
Member

> wonder why this was not discussed here before.

I guess that's because it's not selective and doesn't allow "enriching" the output. It only allows for converting from one format into another and/or filtering out unwanted records. The e-mail thread you mentioned above sounds like pass-through should be applicable to selected elements, not necessarily the stream as a whole.

> Wished to have put 7ca5615 in another branch so that it could be discussed there.

Well, why did you make those changes in the first place? I assume they were intended to address a shortcoming of this particular pass-through implementation: namely, that it passes the full entity path instead of just the individual entity/literal names (see the difference between metamorph1_1() and metamorph1_passthrough()). I would contend that any implementation of this feature should get by without requiring modifications in downstream consumers.

Both these points seem to indicate that we should maybe approach this issue from a different angle: API-wise, isn't this rather a property of the data element instead of the Metamorph definition? It seems what we want is a data element that acts as an entity/data hybrid that would be able to "unflatten" the stream.

To illustrate (untested):

@Test
public void metamorph1_unflatten() {
    metamorph = InlineMorph.in(this) //
            .with("<rules>")//
            .with("    <entity name='flattened' flushWith='record'>")//
            .with("        <data source='_else'/>")//
            .with("    </entity>")//
            .with("    <entity name='unflattened' flushWith='record'>")//
            .with("        <data source='_else' unflatten='true'/>")//
            .with("    </entity>")//
            .with("</rules>")//
            .createConnectedTo(receiver);

    metamorph.startRecord("1");
    metamorph.startEntity("clone");
    metamorph.literal("id", "0");
    metamorph.endEntity();
    metamorph.startEntity("clone");
    metamorph.literal("id", "1");
    metamorph.endEntity();
    metamorph.endRecord();

    final InOrder ordered = inOrder(receiver);
    ordered.verify(receiver).startRecord("1");
    ordered.verify(receiver).startEntity("flattened");
    ordered.verify(receiver).literal("clone.id", "0");
    ordered.verify(receiver).literal("clone.id", "1");
    ordered.verify(receiver).endEntity();
    ordered.verify(receiver).startEntity("unflattened");
    ordered.verify(receiver).startEntity("clone");
    ordered.verify(receiver).literal("id", "0");
    ordered.verify(receiver).endEntity();
    ordered.verify(receiver).startEntity("clone");
    ordered.verify(receiver).literal("id", "1");
    ordered.verify(receiver).endEntity();
    ordered.verify(receiver).endEntity();
    ordered.verify(receiver).endRecord();
}

WDYT?

@dr0i dr0i self-assigned this Jul 28, 2020
dr0i added a commit that referenced this issue Oct 8, 2020
With the new keyword "_elseAndPassEntityEvents" (set with
<data source="_elseAndPassEntityEvents" />), the known "_else"
is triggered AND entity events for these "_else" sources are fired.
With this, data can be passed through Metamorph. This "_else" data
is handled in receivers like all other data handled by morph rules.

Data which is handled by metamorph rules will NOT be passed through
(hence the aptly named "_else"). If you want to use data in the morph
AND pass it through, you have to add an explicit rule for this, as usual.

See #107.
dr0i added a commit that referenced this issue Oct 9, 2020
With the new keyword "_elseAndPassEntityEvents" (set with
<data source="_elseAndPassEntityEvents" />), the known "_else"
is triggered AND entity events for these "_else" sources are fired.
With this, data can be passed through Metamorph. All "_else" data
are handled in receivers like all the other data handled by morph rules.

Data which is handled by metamorph rules will NOT be passed through
(hence the aptly named "_else"). If you want to use data in the morph
AND pass it through, you have to add an explicit rule for this, as usual.

See #107.
This was linked to pull requests Oct 9, 2020
dr0i added a commit that referenced this issue Oct 9, 2020
@dr0i dr0i closed this as completed in #333 Oct 13, 2020
@dr0i
Member

dr0i commented Oct 13, 2020

Added to the wiki.
