Issues with encoding in windows batch #8

Meibes · 2021-06-10T13:42:45Z

Hi,

I was trying to use ogr2osm in a Windows batch script but ran into a lot of encoding problems: the batch always created ANSI-encoded files, while my workflow needs UTF-8 encoded files. I managed to solve my issue by changing the following line:
self.f = open(self.filename, 'w', buffering = -1)
to
self.f = open(self.filename, 'w', buffering = -1, encoding="utf-8")
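As a minimal sketch of why the extra argument matters (the file name and sample text are made up for illustration and are not taken from ogr2osm): with an explicit encoding, the bytes on disk no longer depend on the platform default.

```python
import locale
import os
import tempfile

sample = "Straße"  # illustrative text containing a non-ASCII character

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.osm")  # hypothetical output file

    # Explicit encoding: the same bytes on Windows, Linux and macOS.
    with open(path, "w", buffering=-1, encoding="utf-8") as f:
        f.write(sample)

    # Read the raw bytes back to see what was actually written.
    with open(path, "rb") as f:
        data = f.read()

print(data)  # b'Stra\xc3\x9fe'

# The locale-dependent default that applies when encoding is omitted.
print(locale.getpreferredencoding(False))
```

On a Windows-1252 system the same write without the encoding argument would store the ß as the single byte 0xDF instead of the two UTF-8 bytes shown above.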

There is already a parameter called "encoding", but it seems to be used only for the source file. Could we extend this "encoding" parameter to apply to the destination file as well, or introduce another parameter for that? What are your thoughts? Or do you have a tip on how I can force the Windows batch to output UTF-8 without changing ogr2osm?
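One option that needs no change to ogr2osm, assuming it runs under Python 3.7 or later, is Python's UTF-8 mode: setting the environment variable PYTHONUTF8=1 (for example with "set PYTHONUTF8=1" earlier in the batch file) makes open() default to UTF-8. The sketch below only checks that a child interpreter picks the setting up; the helper name is made up for illustration.

```python
import os
import subprocess
import sys

def child_utf8_mode():
    """Start a child Python with PYTHONUTF8=1 and report its sys.flags.utf8_mode."""
    env = dict(os.environ, PYTHONUTF8="1")
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

print(child_utf8_mode())  # "1" when UTF-8 mode is active in the child
```

This affects every open() call without an explicit encoding in that process, so it is a workaround for a single batch job rather than a fix in the tool.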

Thanks for this awesome tool =)


roelderickx · 2021-06-10T17:23:17Z

Thanks for your bug report. This issue looks like a duplicate of pnorman#15, but your solution is different and you have found a test case where the current method has issues.

Some observations:

The encoding parameter of ogr2osm only specifies the encoding of the input file, not the encoding of the output file.
The documentation of the Python open() function specifies that the default encoding is used when the encoding parameter is omitted or None. This default is platform dependent. I can't test it on Windows, but at least on Linux it is typically UTF-8.
The OSM wiki page clearly suggests UTF-8 but does not strictly require an osm file to be encoded in UTF-8.
According to the W3C recommendation for XML, the expected encoding is UTF-8 if neither a byte order mark nor an encoding declaration is present, as is currently the case for ogr2osm.
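The last observation can be checked with a short sketch using Python's standard xml.etree.ElementTree parser (the tag value is made up for illustration): a document without a declaration parses when its bytes are UTF-8, and is rejected when the same content was written as Windows-1252.

```python
import xml.etree.ElementTree as ET

text = '<osm><tag k="name" v="Straße"/></osm>'

# No XML declaration and no byte order mark, so the parser assumes UTF-8.
utf8_doc = text.encode("utf-8")
utf8_value = ET.fromstring(utf8_doc)[0].get("v")
print(utf8_value == "Straße")

# The same document as a Windows-1252 platform default would write it.
cp1252_doc = text.encode("cp1252")
try:
    ET.fromstring(cp1252_doc)
    cp1252_ok = True
except ET.ParseError:
    # The lone byte 0xDF is not valid UTF-8, so the parser gives up.
    cp1252_ok = False
print(cp1252_ok)
```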

Given the last observation, ogr2osm is currently supposed to output UTF-8, translating from the input file encoding where necessary. To obtain consistent behaviour across operating systems it is therefore necessary to pass encoding='utf-8' as you suggested. I would then also explicitly specify the encoding in the header, i.e. <?xml version="1.0" encoding="utf-8"?>.
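A minimal sketch of that combination, not ogr2osm's actual writer: the function name, generator string and sample tag are hypothetical. The file is opened with the same encoding that the declaration states, so the two cannot drift apart.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

OUTPUT_ENCODING = "utf-8"

def write_osm(path, tags):
    """Write a tiny OSM-like file whose declaration matches its real encoding."""
    with open(path, "w", buffering=-1, encoding=OUTPUT_ENCODING) as f:
        f.write(f'<?xml version="1.0" encoding="{OUTPUT_ENCODING}"?>\n')
        f.write('<osm version="0.6" generator="example">\n')
        f.write('<node id="-1" lat="0" lon="0">\n')
        for key, value in tags.items():
            # quoteattr escapes the value and adds the surrounding quotes.
            f.write(f"<tag k={quoteattr(key)} v={quoteattr(value)}/>\n")
        f.write("</node>\n</osm>\n")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.osm")  # hypothetical output file
    write_osm(path, {"name": "東京"})

    # Inspect the declaration as raw bytes, then parse the file back.
    with open(path, "rb") as f:
        first_line = f.readline().strip()
    parsed_name = ET.parse(path).getroot().find("node/tag").get("v")

print(first_line)
print(parsed_name == "東京")
```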

I can confirm the test cases still pass on Linux with your suggested modification. Can you verify whether the test cases pass under Windows as well?

Meibes · 2021-06-11T05:33:48Z

Thanks for the fast answer!

As far as I know, both Linux and Mac use UTF-8 as their default encoding, while Windows uses ANSI / Windows-1252 (at least in the German version of Windows).
It seems some OSM tools do write UTF-8 in the header; here is an example from Overpass:

<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="Overpass API 0.7.56.9 76e5016d">
<note>The data included in this document is from www.openstreetmap.org. The data is made available under ODbL.</note>
<meta osm_base="2021-06-09T08:10:43Z"/>

After making these changes everything runs smoothly in the batch.

roelderickx · 2021-06-11T17:14:21Z

OK. I am not sure whether the cram tests can be run as-is under Windows, but can you try to convert at least test/shapefiles/japanese.shp and confirm whether the formatted result matches test/japanese.xml?

In the test script the output is formatted using xmllint before comparing:

ogr2osm --encoding shift_jis --gis-order -f test/shapefiles/japanese.shp
xmllint --format japanese.osm > japanese.xml
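Where xmllint is not available (for instance on a Windows machine), a rough stand-in for the comparison is to canonicalize both files with Python's xml.etree.ElementTree.canonicalize (Python 3.8+). This is a different technique from the one the test script uses, and the helper name is hypothetical. C14N also sorts attributes, so it is not an exact equivalent of the xmllint formatting step.

```python
import xml.etree.ElementTree as ET

def same_xml(path_a, path_b):
    """Compare two XML files after canonicalization, ignoring indentation."""
    canon_a = ET.canonicalize(from_file=path_a, strip_text=True)
    canon_b = ET.canonicalize(from_file=path_b, strip_text=True)
    return canon_a == canon_b

# Usage with the files named in the thread (assumes both exist locally):
# same_xml("japanese.osm", "test/japanese.xml")
```

Because the files are parsed rather than compared byte by byte, a wrongly encoded output shows up either as a parse error or as differing tag values.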

roelderickx · 2021-06-13T06:39:38Z

Meanwhile I managed to test the modification on Windows, and the test is conclusive. The proposed changes have been merged into master.
Thanks @Meibes for your investigation.

roelderickx mentioned this issue Jun 10, 2021

Open issues pnorman's version #3


roelderickx added a commit that referenced this issue Jun 12, 2021

#8 Set output encoding to utf-8

0620564

roelderickx closed this as completed Jun 13, 2021
