Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Fix a bug in xml stream parsing where a previously unmatched node cau…
…sing all subsequent valid matches fail. Recall that for streaming mode, we have two xpaths: one for matching the element, the other (optionally) for adding additional filtering. Imagine the following example, where the xml doc is: ``` <ROOT> <AAA> <CCC>c1</CCC> <BBB>b1</BBB> <DDD>d1</DDD> <BBB>b2<ZZZ z="1">z1</ZZZ></BBB> <BBB>b3</BBB> </AAA> <ZZZ> <BBB>b4</BBB> <BBB>b5</BBB> <CCC>c3</CCC> </ZZZ> </ROOT> ``` The stream parser is created as: ``` CreateStreamParser(strings.NewReader(s), "/ROOT/*/BBB", "/ROOT/*/BBB[. != 'b3']") ``` Basically we want the stream parser to return all the `BBB` nodes whose text aren't `b3`. By looking at the sample XML, we know it should return: the `<BBB>` nodes whose texts are `b1`, `b2`, `b4`, and `b5`. However, the current code only returns `b1` and `b2`. The problem lies in the stream element matching inside `case xml.StartElement`. Currently the code does this: ``` case xml.StartElement: ... ... if p.streamElementXPath != nil { if p.streamNode == nil { if QuerySelector(p.doc, p.streamElementXPath) != nil { // Now we assume this node is the stream node candidate. } ``` We originally under the assumption that if the `streamElementXPath` query returns anything, it must be this node itself; thus if it returns, this node is the stream node candidate. But it's clearly wrong in this `b3` example above. For the node `<BBB>b3</BBB>` it is first considered as the stream candidate, but later filtering (`[. != 'b3']`) removes its stream node status, and treats it just like any other non-stream nodes, and keeps it in the node tree. But the problem is, by keeping it in the tree, any future XML element start will **always** "matches" `streamElementXPath`. So in the example above, the node `<ZZZ>` is now erroneously considered stream node, and any child nodes are not even tested for streaming anymore. There are two fixes: 1) In `xml.StartElement` stream match, instead of just doing `QuerySelector(...) != nil` check, we need to issue a `QuerySelectorAll(...)` call and examine all the returned nodes, if the current node is one of them, then this current node is considered stream candidate. 2) Simpler: if a stream candidate is later filtered out inside `case xml.EndElement` handling, then simply remove it from the node tree, thus preventing future erroneous matches. 1) seems an overkill: if a stream candidate gets filtered out, it's hard to imagine caller would like to interact with it in any capacity. Thus chose fix 2).
- Loading branch information