Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KeyError in .reader.xml.v21 with reference to non-existent Concept (INSEE) #205

Closed
nicolas-graves opened this issue Nov 4, 2024 · 13 comments · Fixed by #207
Closed

KeyError in .reader.xml.v21 with reference to non-existent Concept (INSEE) #205

nicolas-graves opened this issue Nov 4, 2024 · 13 comments · Fixed by #207
Assignees
Labels
data-source Issues related to specific web services/data source(s) reader Read file formats defined by the SDMX standards xml SDMX-ML format

Comments

@nicolas-graves
Copy link

nicolas-graves commented Nov 4, 2024

Using the latest SDMX version :

import sdmx
insee = sdmx.Client("INSEE")
flow_msg = insee.dataflow()
insee.dataflow("CNA-2014-PIB", params={"references": "all"})

leads to the following error:

Ignore:
 {140305929812416, 140302908035648}
<str:TimeDimension xmlns:str="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure" xmlns:mes="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message" xmlns:com="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/common" id="TIME_PERIOD" urn="urn:sdmx:org.sdmx.infomodel.datastructure.TimeDimension=FR1:CNA-2014-PIB(1.0).TIME_PERIOD" position="1">
              <str:ConceptIdentity/><str:LocalRepresentation/></str:TimeDimension>


Traceback (most recent call last):
  File "/gnu/store/ln9xp7p9qar8j8fznas35wf7s1ypjn7y-profile/lib/python3.10/site-packages/sdmx/reader/xml/common.py", line 290, in wrapped
    result = reader.Reference(
  File "/gnu/store/ln9xp7p9qar8j8fznas35wf7s1ypjn7y-profile/lib/python3.10/site-packages/sdmx/reader/xml/common.py", line 69, in __init__
    info = self.info_from_element(elem)
  File "/gnu/store/ln9xp7p9qar8j8fznas35wf7s1ypjn7y-profile/lib/python3.10/site-packages/sdmx/reader/xml/v21.py", line 77, in info_from_element
    raise NotReference
sdmx.reader.xml.common.NotReference

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/gnu/store/ln9xp7p9qar8j8fznas35wf7s1ypjn7y-profile/lib/python3.10/site-packages/sdmx/reader/xml/common.py", line 498, in resolve
    return parent[ref.target_id]
  File "/gnu/store/ln9xp7p9qar8j8fznas35wf7s1ypjn7y-profile/lib/python3.10/site-packages/sdmx/model/common.py", line 641, in __getitem__
    return self.__dict__["items"][name]
KeyError: 'TIME_PERIOD'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/gnu/store/ln9xp7p9qar8j8fznas35wf7s1ypjn7y-profile/lib/python3.10/site-packages/sdmx/reader/xml/common.py", line 204, in read_message
    result = func(self, element)
  File "/gnu/store/ln9xp7p9qar8j8fznas35wf7s1ypjn7y-profile/lib/python3.10/site-packages/sdmx/reader/xml/common.py", line 297, in wrapped
    result = func(reader, elem)
  File "/gnu/store/ln9xp7p9qar8j8fznas35wf7s1ypjn7y-profile/lib/python3.10/site-packages/sdmx/reader/xml/v21.py", line 596, in _component_end
    concept_identity=reader.pop_resolved_ref("ConceptIdentity"),
  File "/gnu/store/ln9xp7p9qar8j8fznas35wf7s1ypjn7y-profile/lib/python3.10/site-packages/sdmx/reader/xml/common.py", line 459, in pop_resolved_ref
    return self.resolve(self.pop_single(cls_or_name))
  File "/gnu/store/ln9xp7p9qar8j8fznas35wf7s1ypjn7y-profile/lib/python3.10/site-packages/sdmx/reader/xml/common.py", line 501, in resolve
    return parent.get_hierarchical(ref.target_id)
  File "/gnu/store/ln9xp7p9qar8j8fznas35wf7s1ypjn7y-profile/lib/python3.10/site-packages/sdmx/model/common.py", line 646, in get_hierarchical
    return self.items[id]
KeyError: 'TIME_PERIOD'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/graves/projects/src/sdmx/sdmx/scratch.py", line 8, in <module>
    insee.dataflow("CNA-2014-PIB")
  File "/gnu/store/ln9xp7p9qar8j8fznas35wf7s1ypjn7y-profile/lib/python3.10/site-packages/sdmx/client.py", line 496, in get
    msg = reader.read_message(response_content, structure=kwargs.get("dsd", None))
  File "/gnu/store/ln9xp7p9qar8j8fznas35wf7s1ypjn7y-profile/lib/python3.10/site-packages/sdmx/reader/xml/__init__.py", line 45, in read_message
    import_module(f"sdmx.reader.xml.{version}")
  File "/gnu/store/ln9xp7p9qar8j8fznas35wf7s1ypjn7y-profile/lib/python3.10/site-packages/sdmx/reader/xml/common.py", line 221, in read_message
    raise XMLParseError from exc
sdmx.exceptions.XMLParseError: KeyError: 'TIME_PERIOD'
>>>

despite that I've verified that the schema is valid for this data. I've tried to debug this but I'm not familiar enough with the codebase to do that properly. From what I tried:
the culprit lines in the xml are

            <str:TimeDimension id="TIME_PERIOD" urn="urn:sdmx:org.sdmx.infomodel.datastructure.TimeDimension=FR1:CNA-2014-PIB(1.0).TIME_PERIOD" position="1">
              <str:ConceptIdentity>
                <Ref id="TIME_PERIOD" maintainableParentID="CONCEPTS_INSEE" maintainableParentVersion="1.0" agencyID="FR1" package="conceptscheme" class="Concept"/>
              </str:ConceptIdentity>
              <str:LocalRepresentation>
                <str:TextFormat textType="ObservationalTimePeriod"/>
              </str:LocalRepresentation>
            </str:TimeDimension>

For some reason, during the parsing in Reference, when elem.tag is suffixed with ConceptIdentity, there seems to be no Ref info. On the previous element, TimeDimension is parsed and has a Ref. That doesn't seem to make much sense.

@khaeru
Copy link
Owner

khaeru commented Nov 4, 2024

Hi! Thanks for the report.

I can't reproduce it with the code you've given. I see the following locally:

import sdmx
insee = sdmx.Client("INSEE")
flow_msg = insee.dataflow()
insee.dataflow("CNA-PIB-2014", params={"references": "all"})
Traceback (most recent call last)
File ~/vc/sdmx/build/__editable__.sdmx1-2.19.1-py3-none-any/sdmx/client.py:463, in Client.get(self, resource_type, resource_id, tofile, use_cache, dry_run, **kwargs)
    460 try:
    461     # Send the request
    462     response = self.session.send(req, **self._send_kwargs)
--> 463     response.raise_for_status()
    464 except requests.exceptions.ConnectionError as e:
    465     raise e from None

File ~/.venv/3.12/lib/python3.12/site-packages/requests/models.py:1021, in Response.raise_for_status(self)
   1016     http_error_msg = (
   1017         f"{self.status_code} Server Error: {reason} for url: {self.url}"
   1018     )
   1020 if http_error_msg:
-> 1021     raise HTTPError(http_error_msg, response=self)

HTTPError: 404 Client Error:  for url: https://www.bdm.insee.fr/series/sdmx/dataflow/ALL/CNA-PIB-2014/latest?references=all

I also get a 404 / SDMX ErrorMessage if I open this URL in my web browser.

Can you please mention the version of Python you're using, and anything else particular in your environment that may cause this difference?

Also, aside from that, if you have the actual query URL or a local or cached copy of the XML response, please share that. That way I can try to reproduce the parsing error, even if the example query doesn't work.

@khaeru khaeru added xml SDMX-ML format reader Read file formats defined by the SDMX standards data-source Issues related to specific web services/data source(s) needs-info Needs more info from the issuer to proceed labels Nov 4, 2024
@nicolas-graves
Copy link
Author

nicolas-graves commented Nov 4, 2024

Sorry, first message by hand, it is CNA-2014-PIB as in the error messages (rather than the previous erroneous CNA-PIB-2014).

@khaeru
Copy link
Owner

khaeru commented Nov 4, 2024

Thanks! I can now reproduce.

So the direct URL is https://www.bdm.insee.fr/series/sdmx/dataflow/ALL/CNA-2014-PIB/latest?references=all

Skimming this message, I don't see the referenced concept. For example, I see:

<str:ConceptScheme id="CONCEPTS_INSEE" urn="urn:sdmx:org.sdmx.infomodel.conceptscheme.ConceptScheme=FR1:CONCEPTS_INSEE(1.0)" agencyID="FR1" version="1.0">
  <com:Name xml:lang="fr">Concepts Insee</com:Name>
  <com:Name xml:lang="en">Insee concepts</com:Name>
  <str:Concept id="FREQ" urn="urn:sdmx:org.sdmx.infomodel.conceptscheme.Concept=FR1:CONCEPTS_INSEE(1.0).FREQ">
    <com:Name xml:lang="fr">Périodicité</com:Name>
    <com:Name xml:lang="en">Frequency</com:Name>
  </str:Concept>
  …
  <str:Concept id="SATISFACTION_VIE" urn="urn:sdmx:org.sdmx.infomodel.conceptscheme.Concept=FR1:CONCEPTS_INSEE(1.0).SATISFACTION_VIE">
    <com:Name xml:lang="fr">Satisfaction dans la vie</com:Name>
    <com:Name xml:lang="en">Life satisfaction</com:Name>
  </str:Concept>
</str:ConceptScheme>

This is indeed the same "maintainable parent" referred to by:

<Ref id="TIME_PERIOD" maintainableParentID="CONCEPTS_INSEE" maintainableParentVersion="1.0" agencyID="FR1" package="conceptscheme" class="Concept"/>

…but I don't think it contains the referenced item.

Am I missing something?

@khaeru khaeru removed the needs-info Needs more info from the issuer to proceed label Nov 4, 2024
@khaeru khaeru self-assigned this Nov 4, 2024
@nicolas-graves
Copy link
Author

OK, IIUC you're saying that upstream INSEE is missing this referenced item, and that's why it isn't parsed correctly, is that right? Should we patch this ourselves in handle_requests then? And (or just) report upstream so that they fix it? If the concept is missing, add it, then pass to super()?

@khaeru
Copy link
Owner

khaeru commented Nov 4, 2024

OK, IIUC you're saying that upstream INSEE is missing this referenced item, and that's why it isn't parsed correctly, is that right?

Yes.

And (or just) report upstream so that they fix it?

I would suggest to do that, yes. It's probably not intentional.

Should we patch this ourselves in handle_requests then? […] If the concept is missing, add it, then pass to super()?

I'm not 100% sure what I would prefer as a fix. I would say definitely not code that specifically looks for a certain URN from a certain data provider and makes a very targeted correction for a particular, possibly temporary upstream error. I think that is out of scope for a package like this; we can't possibly track every single such case, so it's better not to pretend to try.

“If the concept is missing…”—I think something like this is possible.

  • sdmx1 already does things like ‘infer’ an entire DSD if asked to parse a DataMessage without being given an explicit DSD. So this would be in the same spirit of behaviour and follow POLA.
  • If we do it, there should definitely be a logged error/warning so that the user is aware this is happening.
  • I am not sure what the logic should be if an artefact (like this DSD) contains a mix of references that are and are not valid. For instance, the DSD has another Dimension with id INDICATEUR that references urn:sdmx:org.sdmx.infomodel.conceptscheme.Concept=FR1:CONCEPTS_INSEE(1.0).INDICATEUR which does exist. This Dimension will get a .concept_identity that maps to the concept scheme.
  • Thinking out loud, maybe the code in this instance should not modify an actual, given ConceptScheme by adding items to it. That would be a weird side-effect that could lead to user confusion. But it could just add the TimeDimension with .concept_identity = None; log a message (as mentioned above); and maybe add an annotation to the TimeDimension that explains what is missing (e.g. with id=sdmx1.reader.xml-parse-note and text="SDMX-ML file contained a reference non-existent [URN].").
  • At least with the latter behaviour the user gets back a parsed StructureMessage and can then go in, if they like, and make a post hoc correction for upstream errors.

Does that sound reasonable?

@khaeru khaeru changed the title INSEE results have a valid schema that fails to parse. KeyError in .reader.xml.v21 with reference to non-existent Concept (INSEE) Nov 4, 2024
@nicolas-graves
Copy link
Author

I'm going to report the error upstream. Hopefully they will fix it rapidly. I'm still curious about why the XML data is valid if that's an error (valid in the sense of https://sdmx1.readthedocs.io/en/latest/howto/validate.html).

Yes I like a big warning much better than an unusable xml.

@nicolas-graves
Copy link
Author

I've also opened an issue through the website, will keep you informed.

@khaeru
Copy link
Owner

khaeru commented Nov 5, 2024

I'm still curious about why the XML data is valid if that's an error (valid in the sense of https://sdmx1.readthedocs.io/en/latest/howto/validate.html).

Essentially, XSD schemas say things like:

  • You can only use XML tags X, Y, Z.
  • Tag X can only contain 1 or more of Y or Z, but not A, B, C.
  • Tag X should/must/cannot have attributes m, n, o, p.
  • These attributes are strings, or match such-and-such regular expression.

In other words, it's largely structural; it isn't really meant to express the logical validity of the contents or meaning of those tags and attributes.

So the message we're looking at is structurally fine, but it happens that its contents do not make sense as SDMX-ML.

By analogy, imagine a SDMX data set about the average height of some humans—structure is fine, valid SDMX-ML, but one of the observations says "1.56 cm". That would be a clear error or invalidity, but not a structural one that could be identified with XSD.

@khaeru
Copy link
Owner

khaeru commented Nov 5, 2024

Over in #207 I discovered that the concept with ID OBS_VALUE, referenced by the PrimaryMeasure of this DSD, is also missing from the referenced concept scheme.

@nicolas-graves
Copy link
Author

Nice to know, I'll add that in the report as soon as I have an answer.

@nicolas-graves
Copy link
Author

nicolas-graves commented Nov 7, 2024

FYI, INSEE confirmed the receipt of the report and transmitted it to the right team.

@nicolas-graves
Copy link
Author

nicolas-graves commented Nov 8, 2024

@khaeru This is the answer of the INSEE team:

Dear user,

For what our team seems to understand from your ticket reporting in gitHub you have trouble with the missing of the
concepts TIME_PERIOD and OBS_VALUE.
Those are special concepts in the DSD, as described in the part : « Additionally the DSD defines » in
https://ec.europa.eu/eurostat/web/user-guides/data-browser/api-data-access/api-getting-started/sdmx3.0 .
Being specific, it doesn’t seem to put in question the validity of the XML of the datastructure of our sdmx response,
except if you can bring to us elements that justify of this invalidity.
Our opinion is that you should take into account the specificity of those dimension/attribute of the SDMX.
We do thank you for your concerns for our API.

We are at your disposal for further information.
Best Regards,

@khaeru
Copy link
Owner

khaeru commented Nov 11, 2024

They don't seem to have understood the issue…maybe they didn't come here and read our discussion?

The portion of documentation that they point to says that the DSD has a dimension with id=TIME_PERIOD, and a primary measure with id=OBS_VALUE. That's all good. It is not the problem.

The actual problem is that those components (the dimension and the primary measure) explicitly reference, as concept identities, these artefacts:

urn:sdmx:org.sdmx.infomodel.conceptscheme.Concept=FR1:CONCEPTS_INSEE(1.0).TIME_PERIOD
urn:sdmx:org.sdmx.infomodel.conceptscheme.Concept=FR1:CONCEPTS_INSEE(1.0).OBS_VALUE

and these artefacts do not exist. The references are broken.

It would be weird and non-standard, but maybe they intend something like the following:

  1. “We explicitly reference urn…Concept=FR1:CONCEPTS_INSEE(1.0).TIME_PERIOD, but we don't actually publish this as part of our urn…ConceptScheme=FR1:CONCEPTS_INSEE(1.0). Users should should manually insert an extra concept with this ID into the concept scheme.”
  2. “We explicitly reference urn…Concept=FR1:CONCEPTS_INSEE(1.0).TIME_PERIOD, but we actually mean a different artefact: urn…Concept=SDMX:CROSS_DOMAIN_CONCEPTS(2.0).TIME_PERIOD. Users should make this substitution themselves.”

Because either of those would not be in line with the SDMX standards, they should say so precisely in their documentation. If they don't make such declarations it's impossible to guess what they mean.

khaeru added a commit that referenced this issue Nov 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data-source Issues related to specific web services/data source(s) reader Read file formats defined by the SDMX standards xml SDMX-ML format
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants