Not a bug: Default namespace doesn't seem to be supported. #137

EricForgy · 2020-04-16T12:44:22Z

Note: I'm on Windows 10, Julia v1.4.0, EzXML v1.1.0

Hi 👋

Thank you for this package 🙏

I am working on a fairly large XML file (about 45K lines, some of them VERY long) and can't seem to get a basic findall to work.

I load my document:

julia> doc = parsexml(filename)

It seems fine. Then I grab its root:

julia> xbrl = doc.root
EzXML.Node(<ELEMENT_NODE[xbrl]@0x0000000040ed6a80>)

but then

julia> findall("/xbrl", doc)
0-element Array{EzXML.Node,1}

I am expecting this to give me the root node.

Any idea what I'm doing wrong?

Edit:

This seems to work:

julia> test = parsexml("""
       <?xml version="1.0" encoding="UTF-8"?>
       <primates>
           <genus name="Homo">
               <species name="sapiens">Human</species>
                   </genus>
           <genus name="Pan">
               <species name="paniscus">Bonobo</species>
               <species name="troglodytes">Chimpanzee</species>
                   </genus>
       </primates>""")
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x0000000040b48120>))

julia> findall("/primates", test)
1-element Array{EzXML.Node,1}:
 EzXML.Node(<ELEMENT_NODE[primates]@0x0000000042f5c090>)

julia> test.root
EzXML.Node(<ELEMENT_NODE[primates]@0x0000000042f5c090>)

Edit^2: If it helps, here is the XML file:

https://www.sec.gov/Archives/edgar/data/937834/000093783420000005/mlic-12312019x10kdocum_htm.xml


EricForgy · 2020-04-16T19:27:49Z

Hi 👋

Over on Slack, @kescobo was helping me and came up with a good MWE:

julia> test = parsexml("""
       <?xml version="1.0" encoding="utf-8"?>
       <xbrl
         xmlns="http://www.xbrl.org/2003/instance"
         xmlns:country="http://xbrl.sec.gov/country/2017-01-31"
           xmlns:dei="http://xbrl.sec.gov/dei/2019-01-31"
             xmlns:iso4217="http://www.xbrl.org/2003/iso4217"
         xmlns:link="http://www.xbrl.org/2003/linkbase"
         xmlns:mlic="http://www.metlife.com/20191231"
         xmlns:srt="http://fasb.org/srt/2019-01-31"
         xmlns:us-gaap="http://fasb.org/us-gaap/2019-01-31"
         xmlns:xbrldi="http://xbrl.org/2006/xbrldi"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
         </xbrl>""")
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x0000000040b48a20>))

julia> findall("/xbrl", test)
0-element Array{EzXML.Node,1}

julia> test2 = parsexml("""
       <?xml version="1.0" encoding="utf-8"?>
       <xbrl></xbrl>""")
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x0000000040b48360>))

julia> findall("/xbrl", test2)
1-element Array{EzXML.Node,1}:
 EzXML.Node(<ELEMENT_NODE[xbrl]@0x0000000042f60090>)

As far as I can tell, the former should be a valid XML document.

Any idea what is going on?

EricForgy · 2020-04-16T19:47:57Z

~~Is it possible that this https://github.com/bicycle1885/EzXML.jl/blob/master/src/xpath.jl#L43~~

function Base.findall(xpath::AbstractString, doc::Document)
    return findall(xpath, doc.node)
end

~~should be~~

function Base.findall(xpath::AbstractString, doc::Document)
    return findall(xpath, doc.root) # i.e. doc.root instead of doc.node
end

?

Edit: Never mind.

kescobo · 2020-04-16T20:09:26Z

A more minimal MWE:

julia> test = parsexml("""
              <?xml version="1.0" encoding="utf-8"?>
              <xbrl xmlns="http://www.xbrl.org/2003/instance">
              </xbrl>""")
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x00007f9d9993d700>))

julia> findall("/xbrl", test)
0-element Array{EzXML.Node,1}

EricForgy · 2020-04-16T20:13:16Z

Thanks @kescobo 🙌

I think this might be part of the problem:

julia> test = parsexml("""
                     <?xml version="1.0" encoding="utf-8"?>
                     <xbrl xmlns="http://www.xbrl.org/2003/instance">
                     </xbrl>""")
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x0000000008298810>))

julia> namespaces(root(test))
1-element Array{Pair{String,String},1}:
 "" => "http://www.xbrl.org/2003/instance"

It seems EzXML does not like the empty key 🤔

kescobo · 2020-04-16T20:17:15Z

Just for fun:

julia> using EzXML

julia> test = parsexml("""
              <?xml version="1.0" encoding="utf-8"?>
              <primates xmlns="http://www.xbrl.org/2003/instance">
              </primates>""")
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x00007fa367862de0>))

julia> findall("/primates", test)
0-element Array{EzXML.Node,1}

julia> test2 = parsexml("""
              <?xml version="1.0" encoding="utf-8"?>
              <primates test="test">
              </primates>""")
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x00007fa36b074c90>))

julia> findall("/primates", test2)
1-element Array{EzXML.Node,1}:
 EzXML.Node(<ELEMENT_NODE[primates]@0x00007fa36b1137f0>)

julia> test3 = parsexml("""
              <?xml version="1.0" encoding="utf-8"?>
              <xmlns test="test">
              </xmlns>""")
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x00007fa36b04d610>))

julia> findall("/xmlns", test3)
1-element Array{EzXML.Node,1}:
 EzXML.Node(<ELEMENT_NODE[xmlns]@0x00007fa3665fed60>)

julia> test4 = parsexml("""
                     <?xml version="1.0" encoding="utf-8"?>
                     <xbrl xmlns="http">
                     </xbrl>""")
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x00007fa36789e170>))

julia> findall("/xbrl", test4)
0-element Array{EzXML.Node,1}

EricForgy · 2020-04-16T20:18:35Z

I tried changing findall(xpath, doc), which just calls findall(xpath, doc.node, namespaces(doc.node)), to

function Base.findall(xpath::AbstractString, doc::Document, ns=namespaces(doc.node))
    return findall(xpath, doc.node, ns)
end

and then tried

julia> findall("/xbrl", test, namespaces(root(test)))
┌ Warning: ignored the empty prefix for 'http://www.xbrl.org/2003/instance'; expected to be non-empty
└ @ EzXML C:\Users\ericf\.julia\dev\EzXML\src\xpath.jl:85
0-element Array{EzXML.Node,1}

Because the prefix was empty, it gets ignored. That seems to be why we get zero elements from findall (maybe) 🤔
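
For readers following along, here is a minimal sketch of one way around the ignored empty prefix: copy the namespace list from the root and swap the empty prefix for a non-empty one before passing it to findall. The placeholder prefix "x" and the small dei:DocumentType child are assumptions made up for illustration, not part of the original file.

```julia
using EzXML

# Trimmed-down stand-in for the XBRL file (the child element is illustrative).
doc = parsexml("""
<xbrl xmlns="http://www.xbrl.org/2003/instance"
      xmlns:dei="http://xbrl.sec.gov/dei/2019-01-31">
  <dei:DocumentType>10-K</dei:DocumentType>
</xbrl>""")

r = root(doc)

# Keep every declared prefix, but give the default namespace (empty prefix)
# an arbitrary non-empty name so that XPath can refer to it.
ns = [(isempty(prefix) ? "x" : prefix) => uri for (prefix, uri) in namespaces(r)]

# Both the default-namespaced root and the dei-prefixed child are now reachable.
hits = findall("/x:xbrl/dei:DocumentType", r, ns)
println(length(hits))
```

The prefix used in the query only has to match the one in the list passed to findall; it does not need to appear anywhere in the document.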

kescobo · 2020-04-16T20:23:11Z

I think there's something about xml being in the name of the tag...

julia> test5 = parsexml("""
                     <?xml version="1.0" encoding="utf-8"?>
                     <xbrl test="http">
                     </xbrl>""")
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x00007fd996dfe450>))

julia> findall("/xbrl", test5)
1-element Array{EzXML.Node,1}:
 EzXML.Node(<ELEMENT_NODE[xbrl]@0x00007fd996dcf980>)

julia> test6 = parsexml("""
                     <?xml version="1.0" encoding="utf-8"?>
                     <xbrl xmlns="http">
                     </xbrl>""")
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x00007fd996fc2aa0>))

julia> findall("/xbrl", test6)
0-element Array{EzXML.Node,1}

It also seems to break parsing: every time I do that, all subsequent calls to parsexml give me

ERROR: AssertionError: isempty(XML_GLOBAL_ERROR_STACK)
Stacktrace:

EricForgy · 2020-04-16T20:24:56Z

I think xmlns is a special attribute for namespacing, so those nodes get treated specially somehow...

kescobo · 2020-04-16T20:25:51Z

Oh, I see...

EricForgy · 2020-04-16T20:30:31Z

Your MWE is namespaced and findall(xpath, doc) calls findall(xpath, doc.node, namespaces(doc.node)), but

julia> namespaces(test.node)
0-element Array{Pair{String,String},1}

so no namespaces are being registered. I think that is why findall is not working: the root element is namespaced, but no namespaces are registered for the query (maybe) 🤔
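
A small sketch of the difference described above, using only the calls that already appear in this thread plus namespace (singular), which returns the namespace URI of a node. The variable names are just for illustration.

```julia
using EzXML

doc = parsexml("""
<xbrl xmlns="http://www.xbrl.org/2003/instance">
</xbrl>""")

# The document node carries no namespace declarations of its own ...
doc_ns = namespaces(doc.node)

# ... while the root element declares the default namespace under the empty prefix.
root_ns = namespaces(root(doc))

# The URI the root element actually belongs to.
root_uri = namespace(root(doc))

println(doc_ns)
println(root_ns)
println(root_uri)
```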

EricForgy · 2020-04-16T20:34:45Z

From Wikipedia: https://en.wikipedia.org/wiki/XML_namespace

Namespace declaration

An XML namespace is declared using the reserved XML attribute xmlns or xmlns:prefix, the value of which must be a valid namespace name.

For example, the following declaration maps the "xhtml:" prefix to the XHTML namespace:
xmlns:xhtml="http://www.w3.org/1999/xhtml"
Any element or attribute whose name starts with the prefix "xhtml:" is considered to be in the XHTML namespace, if it or an ancestor has the above namespace declaration.

It is also possible to declare a default namespace. For example:
xmlns="http://www.w3.org/1999/xhtml"
In this case, any element without a namespace prefix is considered to be in the XHTML namespace, if it or an ancestor has the above default namespace declaration.

If there is no default namespace declaration in scope, the namespace name has no value. In that case, an element without an explicit namespace prefix is considered not to be in any namespace.

Attributes are never subject to the default namespace. An attribute without an explicit namespace prefix is considered not to be in any namespace.
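
To make the inheritance rule concrete, here is a hedged sketch in EzXML: a child element written without any prefix still reports the default namespace declared on its ancestor. The tiny XHTML snippet is made up for illustration.

```julia
using EzXML

doc = parsexml("""
<html xmlns="http://www.w3.org/1999/xhtml">
  <body><p>hello</p></body>
</html>""")

# First element child of the root; it has no prefix and no xmlns attribute itself.
body = first(elements(root(doc)))

# It is nevertheless in the XHTML namespace, inherited from <html>.
body_name = nodename(body)
body_uri = namespace(body)

println(body_name, " is in ", body_uri)
```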

EricForgy · 2020-04-16T20:39:03Z

It seems like an issue dealing with default namespaces 🤔

EricForgy · 2020-04-16T20:52:33Z

EzXML apparently uses libxml2 and according to this:

http://xmlsoft.org/namespaces.html

default namespaces should be supported. I am probably confused 🤔

EricForgy · 2020-04-16T20:54:30Z

It works if I remove the default namespace:

julia> test = parsexml("""
       <?xml version="1.0" encoding="utf-8"?>
       <xbrl
         xmlns:country="http://xbrl.sec.gov/country/2017-01-31"
           xmlns:dei="http://xbrl.sec.gov/dei/2019-01-31"
             xmlns:iso4217="http://www.xbrl.org/2003/iso4217"
         xmlns:link="http://www.xbrl.org/2003/linkbase"
         xmlns:mlic="http://www.metlife.com/20191231"
         xmlns:srt="http://fasb.org/srt/2019-01-31"
         xmlns:us-gaap="http://fasb.org/us-gaap/2019-01-31"
         xmlns:xbrldi="http://xbrl.org/2006/xbrldi"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
         </xbrl>""")
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x0000000008349bf0>))

julia> findall("/xbrl", test)
1-element Array{EzXML.Node,1}:
 EzXML.Node(<ELEMENT_NODE[xbrl]@0x000000003192c7e0>)

EricForgy · 2020-04-16T21:15:45Z

Ok. I am slowly learning about namespaces. We need a way to register default namespaces. This package is currently ignoring them 🤔

EricForgy · 2020-04-16T21:21:08Z

This is C#, but the discussion looks relevant: https://docs.microsoft.com/en-us/dotnet/standard/data/xml/xpath-queries-and-namespaces#the-default-namespace

The Default Namespace

In the XML document that follows, the default namespace with an empty prefix is used to declare the http://www.contoso.com/books namespace.
<books xmlns="http://www.contoso.com/books">  
    <book>  
        <title>Title</title>  
        <author>Author Name</author>  
        <price>5.50</price>  
    </book>  
</books>  
XPath treats the empty prefix as the null namespace. In other words, only prefixes mapped to namespaces can be used in XPath queries. This means that if you want to query against a namespace in an XML document, even if it is the default namespace, you need to define a prefix for it.

For example, without defining a prefix for the XML document above, the XPath query /books/book would not return any results.

A prefix must be bound to prevent ambiguity when querying documents with some nodes not in a namespace, and some in a default namespace.
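
The same idea seems to carry over to EzXML. Below is a minimal sketch that runs the quoted books document through findall, first without and then with a bound prefix; the prefix "b" is an arbitrary choice for this example.

```julia
using EzXML

doc = parsexml("""
<books xmlns="http://www.contoso.com/books">
    <book>
        <title>Title</title>
        <author>Author Name</author>
        <price>5.50</price>
    </book>
</books>""")

# Unprefixed names in XPath mean "no namespace", so nothing matches.
unprefixed = findall("/books/book", doc)

# Bind a prefix to the default namespace URI and use it on every step.
ns = ["b" => namespace(root(doc))]
titles = findall("/b:books/b:book/b:title", root(doc), ns)

println(length(unprefixed))
println(nodecontent(titles[1]))
```

Note that every step of the path needs the prefix, because the child elements inherit the default namespace from books.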

EricForgy · 2020-04-16T22:39:41Z

RTFM. Sorry for the noise 😔
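
For anyone landing here who only needs a quick query and does not want to bind a prefix, a possible alternative is to match on the local name with the standard XPath 1.0 function local-name(). This is a sketch rather than a recommendation: it ignores namespaces entirely, so it would also match an xbrl element from any other namespace.

```julia
using EzXML

doc = parsexml("""
<xbrl xmlns="http://www.xbrl.org/2003/instance">
</xbrl>""")

# Match any root element whose local name is "xbrl", whatever its namespace.
hits = findall("/*[local-name()='xbrl']", doc)

println(length(hits))
```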

For discoverability. Related: JuliaIO#137 (which is the only google result for `Warning: ignored the empty prefix for 'http://www.w3.org/2000/svg'; expected to be non-empty`)

EricForgy changed the title ~~noob: Having trouble with Xpath~~ Possible bug with many attributes in the root node. Apr 16, 2020

EricForgy changed the title ~~Possible bug with many attributes in the root node.~~ Likely bug: Default namespace doesn't seem to be supported. Apr 16, 2020

EricForgy changed the title ~~Likely bug: Default namespace doesn't seem to be supported.~~ Not a bug: Default namespace doesn't seem to be supported. Apr 16, 2020

EricForgy closed this as completed Apr 16, 2020

tfiers mentioned this issue Jan 5, 2023

Readme: link docs for caveat on xpath<>namespaces combination #178

Merged
