Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate machine-readable legacy format #23

Closed
efucile opened this issue May 7, 2020 · 28 comments
Closed

Validate machine-readable legacy format #23

efucile opened this issue May 7, 2020 · 28 comments

Comments

@efucile
Copy link
Member

efucile commented May 7, 2020

Machine-readable legacy format has been included in the master branch. We need to verify that the format satisfies users requirements.

@efucile efucile pinned this issue May 7, 2020
@efucile
Copy link
Member Author

efucile commented May 7, 2020

legacy format here

@erget
Copy link
Member

erget commented May 7, 2020

@efucile I encountered some errors. You can browse what I could import here. I'm not sure what's causing the errors - I've never had trouble parsing these files for previous versions before, so it seems something's changed in the formatting. I'll try to narrow it down and provide some feedback.

XML

Table B - 111 elements were imported, 1619 generated errors.
Code/Flag - No elements could be imported. 493 entries generated errors.
Table D - 558 sequences imported, 31 errors generated.

CSV

Table B - 1698 elements inserted. 32 errors.
Code/Flag - 492 elements inserted. 1 error.
Table D - 588 elements inserted. 1 error.

@erget
Copy link
Member

erget commented May 7, 2020

Error sources we've identified:

  1. Empty elements, e.g. <Note_en/>
  2. Dummy entries such as
  <BUFRCREX_TableB_en>
    <ClassNo>ClassNo</ClassNo>
    <ClassName_en>ClassName_en</ClassName_en>
    <FXY>FXY</FXY>
    <ElementName_en>ElementName_en</ElementName_en>
    <Note_en>Note_en</Note_en>
    <BUFR_Unit>BUFR_Unit</BUFR_Unit>
    <BUFR_Scale>BUFR_Scale</BUFR_Scale>
    <BUFR_ReferenceValue>BUFR_ReferenceValue</BUFR_ReferenceValue>
    <BUFR_DataWidth_Bits>BUFR_DataWidth_Bits</BUFR_DataWidth_Bits>
    <CREX_Unit>CREX_Unit</CREX_Unit>
    <CREX_Scale>CREX_Scale</CREX_Scale>
    <CREX_DataWidth_Char>CREX_DataWidth_Char</CREX_DataWidth_Char>
    <Status>Status</Status>
  </BUFRCREX_TableB_en>

@efucile
Copy link
Member Author

efucile commented May 7, 2020

This is a branch to work on this issue https://github.com/wmo-im/BUFR4/tree/legacyFormats
Contributions to fix the problem are welcome.

@tomkralidis
Copy link

@erget looks like the dummy entries were fixed in bebfa61. In addition the empty elements are the result of no content in the CSVs they were derived from.

@erget
Copy link
Member

erget commented May 19, 2020

@tomkralidis thanks for the heads-up. Since bebfa61 all tables import correctly in CSV and XML formats.

@marianmajan-ibl
Copy link

Hello @efucile. I tried to import xml files with our internal import tool (BUFRCREX_TableB_en.xml, BUFR_TableD_en.xml, CREX_TableD_en.xml). The import was successful, the results were the same as for the previous files (published by Atsushi, I assume).

@jbathegit
Copy link
Contributor

I can import the XML files as well, though for some reason my parser didn't like the empty elements and failed to recognize them as null strings, even after I added an extra space character to change them from, e.g. <Title_en/> to <Title_en />. But empty elements are legal XML constructs, so I can work around those on my end by just deleting them altogether from my own copies of the tables after I download them, and before running my parser.

On a related note, I have a question for @efucile - are you still planning to distribute XSD schema definition files with future version releases? I didn't see any such files in the xml subdirectory, so I tested your latest files by validating them against modified earlier copies that I'd already downloaded. Again, I can work around this on my end if need be, but it's generally good practice to distribute XSD files with XML files.

@efucile
Copy link
Member Author

efucile commented May 20, 2020

@jbathegit thanks for checking the xml files.

  1. we need to delete empty lines. Can you please let me know in which files we have empty lines?
  2. regarding xsd we had a discussion here script to produce legacy machine readable formats CCT#5 and was considered that xsd in this case are not bringing value and we should stop producing them.

@jbathegit
Copy link
Contributor

  1. There are empty elements (not empty lines) in all of the BUFR XML files, except for TableA. Examples include <Note_en/>, <Title_en/>, <EntryName_sub1_en/>, etc. An empty element is one with no content and where you don't have matching opening and closing tags such as <Note_en></Note_en> with no content between them, but rather a single abbreviated tag <Note_en/> which means the same thing. Again, it's a legal construct in XML, so I'm not saying that these must be deleted, and I can work around them on my end.

  2. OK, and again I can live without the XSD files, by either turning off the validation steps in my parser, or else by modifying earlier copies of these files.

@efucile
Copy link
Member Author

efucile commented May 20, 2020

@jbathegit thanks for explaining. Now it is clear to me.
If you prefer you can add the xsd files in the repo and I will make sure that they get into the distribution. I understand that they should remain the same if we don't add columns in the tables.
Just put them here https://github.com/wmo-im/BUFR4/tree/master/xml and create a new branch or pull request when finished.

@marijanacrepulja
Copy link
Collaborator

Hello @efucile. I did import txt files with internal library.
Though the reason of a column "No" is not present in the txt files, the library didn't like it. I can work around those on my end and create tables.
However, I have made changes to the flies split_B, split_D, split_CodeFlag to take into account column "No" and to be align with the structure of the files that have been provided previously.
In that case I don't need to make any changes in internal library for importing tables.

I see that the commit I made yesterday had failed. Should I make pull request so can be reviewed?

On a related note, are you still planning to distribute txt for CCT eg. Common_C11_20200506_en.txt with future version releases?

@efucile
Copy link
Member Author

efucile commented May 29, 2020

@marijanacrepulja we don't want the column No because every time you add one line you have to update the all table. This is inefficient, unnecessary and error prone. You can write a script that adds the No if this is a requirement and we can run it automatically, but we cannot have that column in the csv files. Where did you do the changes on the scripts? Please revert them.

@marijanacrepulja
Copy link
Collaborator

@efucile It is OK, there is no requirement for having column No. I did change on the scripts in branch legacyFormats and reverted them.
While a go I did changes in makeLegacyFormats.sh to not append header in the txt files while merging multiple csv files. After that Daniel was able to import tables. Have you been able to see changes or I need to make pull request?

@efucile
Copy link
Member Author

efucile commented May 29, 2020

I don’t see the changes. In which branch?

@marijanacrepulja
Copy link
Collaborator

It is in branch legacyFormats.
2205e6a

@amilan17
Copy link
Member

amilan17 commented Nov 4, 2020

Archive.zip
The BUFR XML and TXT files are regenerated using code provided by @SibylleK and @marijanacrepulja.
Empty XML tags, extra CSV headers are removed and a <No> tag was added to the XML.
The changes are implemented in the https://github.com/wmo-im/BUFR4/tree/generateMachineCodes branch.

@amilan17
Copy link
Member

amilan17 commented Nov 4, 2020

I also integrated the edit for CSVtoXML in CCT, but I don't think it's giving us the results we want:
wmo-im/CCT@f73d8e7#diff-24e776c74f105f27897f072849d27ca716f89f47e766b618d76b7d9244a224f0

@erget
Copy link
Member

erget commented Nov 5, 2020

Archive.zip
The BUFR XML and TXT files are regenerated using code provided by @SibylleK and @marijanacrepulja.
Empty XML tags, extra CSV headers are removed and a <No> tag was added to the XML.
The changes are implemented in the https://github.com/wmo-im/BUFR4/tree/generateMachineCodes branch.

XML

I've checked these but I still see the empty XML tags, e.g.

<Note_en/>

I believe before the XML tables had something like this:

<Note_en>""</Note_en>

As it is I can only import 113 elements from Table B, 571 elements for Table D, all others produce errors (meaning no code/flag tables!).

CSV

Code/flag tables import fine with only 1 error. It looks like the headers are still repeated though, I see them on lines 354, 1445, 1595, 1610, 1615, 2274, 2291, 2402, 2469, 2476, 2627, 3688, 3870, 3955, 4039, 4046, 4374, 4401, 4415, 4441, 4455, 5005, 5108, and 5341.

Table B imports fine but produces errors where the header line is repeated.

Table D imports OK (601 items imported) but I see that also here the header is repeated and this causes errors for the software.

@amilan17
Copy link
Member

amilan17 commented Nov 10, 2020

Please test this new zip of BUFR XML and TXT. I believe that all of the repeating headers are cleaned up and the empty XML elements are removed.

Other differences (aside from amendments) and questions:

  1. The "No" column is removed from the txt files.
  2. The code generates <No> tags for XML files, but the values differ from the the <No> values in version 34. They are added by the code, because Sibylle's software expects them. However, earlier this year, there was a decision to NOT include these values in the TXT files (See Enrico's comment above from May 29). Does this decision also apply to the XML? My preference is to NOT include them in the XMLs, because they make comparisons difficult. Please advise on best approach.
  3. For your awareness, versions are no longer encoded into the XML tags. For example, <BUFRCREX_34_0_0_CodeFlag_en> is. now just <BUFRCREX_CodeFlag_en>. This is good, because it also makes comparisons difficult.
  4. The reference to the schema is removed, for example, xsi:noNamespaceSchemaLocation="BUFRCREX_34_0_0_CodeFlag_en.xsd".

@erget
Copy link
Member

erget commented Nov 10, 2020

Hi @amilan17 I opened up the Table B CSV with a text editor and can see the repeated headers, so I don't think it's worthwhile to check the CSVs again - maybe the file is old?

Regarding your points (3) and (4) - that's great! That makes a lot more sense now.

Regarding the XML files, I still see the empty <Note_en/> elements. So in BUFRCREX_TableB_en.xml in the entry for 000001, I changed this line:

<Note_en/>

to this:

<Note_en>Foobar</Note_en>

Then the element imports fine. Could you change the script so that the <Note_en> element contains a value, as I believe was the case before?

I also still can't import the code/flag tables. I think this is also because most of the elements are empty:

  <BUFRCREX_CodeFlag_en>
    <FXY>001003</FXY>
    <ElementName_en>WMO Region number/geographical area</ElementName_en>
    <CodeFigure>0</CodeFigure>
    <EntryName_en>Antarctica</EntryName_en>
    <EntryName_sub1_en/> <!-- empty! -->
    <EntryName_sub2_en/> <!-- empty! -->
    <Note_en/> <!-- empty! -->
    <Status>Operational</Status>
  </BUFRCREX_CodeFlag_en>

Table D also produces the same successes and errors.

It seems that these files are unchanged, as all of the errors that were in the previous ones are still present in the most recent upload. Are you sure that BUFR-2020-Nov-10.zip was updated?

@amilan17
Copy link
Member

Oops. I'm checking now!

@amilan17
Copy link
Member

BUFR-2020-Nov-10.zip
I think this is the correct version to test.

@erget
Copy link
Member

erget commented Nov 10, 2020

Hi @amilan17 , that does it - everything now imports fine, thanks for the update!

@amilan17
Copy link
Member

@wmo-im/tt-tdcf -- Will you also review ASAP and review the questions I have? Thanks!
https://github.com/wmo-im/BUFR4/files/5516448/BUFR-2020-Nov-10.zip

Please test this new zip of BUFR XML and TXT. I believe that all of the repeating headers are cleaned up and the empty XML elements are removed.

Other differences (aside from amendments) and questions:

1. The "No" column is removed from the txt files.
2. The code generates <No> tags for XML files, but the values differ from the the <No> values in version 34. They are added by the code, because Sibylle's software expects them. However, earlier this year, there was a decision to NOT include these values in the TXT files (See Enrico's comment above from May 29). Does this decision also apply to the XML? My preference is to NOT include them in the XMLs, because they make comparisons difficult. Please advise on best approach.
3. For your awareness, versions are no longer encoded into the XML tags. For example, <BUFRCREX_34_0_0_CodeFlag_en> is. now just <BUFRCREX_CodeFlag_en>. This is good, because it also makes comparisons difficult.
4. The reference to the schema is removed, for example, xsi:noNamespaceSchemaLocation="BUFRCREX_34_0_0_CodeFlag_en.xsd".

@marijanacrepulja
Copy link
Collaborator

Hi @amilan17,
I have successfully imported txt files. Thanks for update!
Only one thing, I have spotted that in BUFR_TableD_en.txt there are empty lines gap between class 15 and 16. That comes from 3 empty lines in BUFR_TableD_en_15.csv Would be good if we can have them deleted.

@amilan17
Copy link
Member

@marijanacrepulja - Good catch. I removed the extra lines and it looks better now.
https://github.com/wmo-im/BUFR4/blob/generateMachineCodes/txt/BUFR_TableD_en.txt

@amilan17
Copy link
Member

Machine readable codes are published to https://community.wmo.int/activity-areas/wmo-codes/manual-codes/latest-version

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants