properly decode cp437 names in zip files #4835
Conversation
Needed for the German NeTEx files from DELFI.
Codecov Report: Base 62.14% // Head 62.13% // Decreases project coverage by 0.01%.
Additional details and impacted files:
@@ Coverage Diff @@
## dev-2.x #4835 +/- ##
=============================================
- Coverage 62.14% 62.13% -0.01%
- Complexity 13000 13005 +5
=============================================
Files 1650 1650
Lines 66064 66094 +30
Branches 7184 7187 +3
=============================================
+ Hits 41055 41068 +13
- Misses 22669 22688 +19
+ Partials 2340 2338 -2
☔ View full report at Codecov.
I'm not sure it's such a good idea to default to a very old character encoding. We probably want to make that configurable and default to UTF-8, don't we?
It is complicated :)

The format allows encoding names as either cp437 or UTF-8, signalled per entry by a flag. As per the documentation of PKWARE ZIP, names were originally cp437 only; later they added the option to encode names that can't be encoded in cp437 as UTF-8. But when Sun implemented the zip handling in Java they went directly to UTF-8, implementing just the minimum to get there. Java 7 and newer got the option to actually handle cp437-encoded names, and also to read files written by other implementations (as per the documentation at Apache Commons). This patch changes the behaviour for the cp437-flagged names, but not for the UTF-8-flagged ones.

Apache Commons documents mainly Windows Compressed Folders as violating the spec in an incompatible way. For files generated by such a broken implementation this patch changes the behaviour from one broken state to a differently broken state. When I tested it locally on Windows 10 I could not reproduce that this is still the case.

The alternative to this patch using java.util.zip is switching to Apache Commons, adding another runtime dependency.

The test data generator uses Apache Commons to craft minimal test files to show how the behaviour before and after fixing the encoding works. You can verify the old behaviour allowing UTF-8 decoding by commenting out a single line of code. Notice how the test fails only at the second file, with names encoded as cp437. You can easily see the flag and the different encoding with a hex editor when comparing both test files.

tl;dr: the patch fixes decoding of files with special characters in entry names created by 7-Zip, PKZIP, and WinZip. It may change one broken behaviour when handling files created with Windows Compressed Folders to be differently broken; likewise for some special versions and obscure software.
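For illustration, here is a minimal, self-contained sketch of the Java 7+ charset handling described above. The file and entry names are made up for the example, and cp437 ("Cp437"/"IBM437") availability depends on the charsets shipped with the JDK:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.nio.charset.Charset;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class Cp437ZipDemo {

  public static void main(String[] args) throws Exception {
    Charset cp437 = Charset.forName("Cp437");
    File zip = File.createTempFile("cp437-demo", ".zip");
    zip.deleteOnExit();

    // Writing with a non-UTF-8 charset means java.util.zip does not set
    // the EFS ("language encoding") flag on the entry, mimicking archives
    // produced by legacy tools.
    try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(zip), cp437)) {
      out.putNextEntry(new ZipEntry("Fahrpl\u00e4ne.xml"));
      out.closeEntry();
    }

    // Reading with the plain ZipFile(File) constructor would fail or garble
    // the name; passing the charset explicitly (Java 7+) decodes it correctly.
    try (ZipFile zf = new ZipFile(zip, cp437)) {
      System.out.println(zf.entries().nextElement().getName()); // Fahrpläne.xml
    }
  }
}
```

The round trip works because "ä" (0x84 in cp437) is not valid UTF-8 on its own, so only the explicit charset decodes the name as written.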
Is there an easy way to detect which encoding was used without introducing the Apache dependency? That would be the best solution in my view.
Instead of trying to be creative we could be naive and just try without, catch the ZipException, and then load the charset and retry. I'm not so great at Java. Is it ok if I just copy the 8 lines that read the directory, instead of coming up with a creative loop that first tries without, then loads the charset and retries, but only on one kind of exception? I don't like repeating the code, but it appears to be easier to maintain as it still fits on one screen. (33 lines of code)
I would be okay with that solution if the zip exception is relatively specific. I'm not the maintainer of the netex code, so @vpaturet @t2gran @hannesj @Bartosz-Kruba need to actually approve it. I do have an interest in eventually being able to parse these terrible Delfi feeds though. Not sure if you have the time, but we have video calls twice a week where we go through these PRs and they are often the fastest way to get the attention of busy people: https://github.com/opentripplanner/OpenTripPlanner/blob/dev-2.x/CONTRIBUTING.md
Handles UTF-8 encoding without the EFS flag; also uses the same entry name to reduce the hex diff of the test cases
…ith proper cp437 charset
It's one "catch all" exception for "issues with the ZIP", like CRC errors etc. In a normal environment I only expect this to try twice when there is an issue with the input file. Either it's a broken file, then the first time the exception is caught and the second time it's thrown; the user has to fix the input file - but this has nothing to do with the encoding - and then the cost goes away. Or it's one of the uncommon files using umlauts, and then it tries twice. But that additional cost should only happen when building the graph.
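A minimal sketch of the try-then-retry idea discussed above; the class and method names here are hypothetical, not the actual ZipFileDataSource code:

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;

public class LegacyZipReader {

  /**
   * Read the directory with the default UTF-8 decoding first; on a
   * ZipException (e.g. "invalid CEN header (bad entry name)") retry
   * once with cp437 for archives using the legacy encoding.
   */
  public static List<String> listEntryNames(File file) throws IOException {
    try {
      return readDirectory(new ZipFile(file));
    } catch (ZipException ze) {
      // Retry with the legacy DOS code page; if this fails too, the
      // archive is genuinely broken and the exception propagates.
      return readDirectory(new ZipFile(file, Charset.forName("Cp437")));
    }
  }

  private static List<String> readDirectory(ZipFile zip) throws IOException {
    try (zip) {
      List<String> names = new ArrayList<>();
      Enumeration<? extends ZipEntry> entries = zip.entries();
      while (entries.hasMoreElements()) {
        names.add(entries.nextElement().getName());
      }
      return names;
    }
  }
}
```

Factoring the directory loop into a private helper avoids copying the eight lines while keeping the retry logic trivial to read.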
By the way, I'm fixing the Windows prettier problem over at: #4839 |
While bringing forward my changes to handle NPEs I noticed that most have since been fixed, either in dev-2.x or in the branch at #3640 (which is active again, too). I'd love to see upstream OTP parse the DELFI feed in time for results from DEEZ appearing in the wild.
@@ -93,6 +96,27 @@ private void loadContent() {
        ZipEntry entry = entries.nextElement();
        content.add(new ZipFileEntryDataSource(this, entry));
      }
    } catch (ZipException ze) {
How long does it take for this exception to be thrown? Is it straight away or half-way into reading the file?
My understanding is that the whole directory of the ZIP archive is loaded when the first entry or the collection is accessed. On the following accesses contentLoaded is true and the cached directory is used.
So it's straight away (untested, yet).
I can confirm that the directory is read before the content of the files (if I remember correctly we access the files in random order, not sequentially).
I think we can go ahead and approve this - I think we all agree that using cp437 is a problem, and the provider of these feeds should not do that. But, the code change to support a workaround in this case is not big. @dekarl you should put some pressure on the data provider to fix this so we can remove this code in the future.
src/main/java/org/opentripplanner/datastore/file/ZipFileDataSource.java
I think CP437 is allowed by the zip spec - that is the problem!
I assigned @vpaturet as a reviewer in my place - I will be gone for the next week. This is good to go, as long as the 2 requested changes are fixed.
…urce.java Co-authored-by: Thomas Gran <t2gran@gmail.com>
src/main/java/org/opentripplanner/datastore/file/ZipFileDataSource.java
Oracle appears to have decided that Java 7 is enterprise software and that bug compatibility with the older java.util.zip was more important than handling perfectly valid standard ZIP files that happen to use non-ASCII characters for the contained file names. Maybe they did this because cp437 is not guaranteed to be available? But they added an option to make their implementation standards compliant by setting the default "non-UTF-8 charset" explicitly to cp437 instead of the broken "we signal cp437 but deliver UTF-8 instead". While at it they added an option to enable a workaround for compatibility with differently broken implementations by specifying any charset other than cp437. This PR enables that option for standards compliant behaviour, but does so in a backwards/bug compatible way.

But you are right. The dataset provider could simply stick to ASCII characters for file names at interfaces for maximum compatibility. Other than that, their ZIP file is compliant with the specifications and matches what common tools will create when using non-ASCII filenames for the contents.
Co-authored-by: Thomas Gran <t2gran@gmail.com>
Summary
The German NeTEx files from the NAP at DELFI use the legacy default encoding, code page 437, for their ZIP files. Java needs help to read such files.
https://bugs.java.com/bugdatabase/view_bug.do?bug_id=4415733 (problem statement from 2001, but without describing the solution that is available now in the JRE)
Documentation of the encoding can be found in Appendix D - Language Encoding (EFS) - in the format documentation at https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
Issue
I was inspired by #3640 to load German NeTEx data, but ran into multiple issues with the combined dataset.
Unit tests
I'm not sure how to add a unit test for this change.
I'm currently upstreaming my local patches and catching up with a year of refactorings, so this was tested to have the intended function before the refactoring was merged.
Documentation
Rationale has been documented next to the new code.
Changelog
https://github.com/opentripplanner/OpenTripPlanner/labels/skip%20changelog