
properly decode cp437 names in zip files #4835

Merged
merged 13 commits into dev-2.x from otp2_zipfile_cp437 on Feb 16, 2023

Conversation

dekarl
Contributor

@dekarl dekarl commented Feb 13, 2023

Summary

The German NeTEx files from the NAP at DELFI use the legacy default encoding, code page 437, for the entry names in their ZIP files. Java needs to be told the charset explicitly to read such files.

https://bugs.java.com/bugdatabase/view_bug.do?bug_id=4415733 (problem statement from 2001; it does not describe the solution that is now available in the JRE)

Documentation of the encoding can be found in Appendix D - Language Encoding (EFS) - in the format documentation at https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT

Issue

I was inspired by #3640 to load German NeTEx data, but ran into multiple issues with the combined dataset.

Unit tests

I'm not sure how to add a unit test for this change.

I'm currently upstreaming my local patches and catching up with a year of refactorings, so this was tested to have the intended function before the refactoring was merged in.

Documentation

Rationale has been documented next to the new code.

Changelog

https://github.com/opentripplanner/OpenTripPlanner/labels/skip%20changelog

needed for the German NeTEx files from DELFI
@dekarl dekarl requested a review from a team as a code owner February 13, 2023 17:52
@dekarl dekarl marked this pull request as draft February 13, 2023 17:55
@codecov

codecov bot commented Feb 13, 2023

Codecov Report

Base: 62.14% // Head: 62.13% // Decreases project coverage by -0.01% ⚠️

Coverage data is based on head (356de6f) compared to base (b0b5f36).
Patch coverage: 81.81% of modified lines in pull request are covered.

❗ Current head 356de6f differs from pull request most recent head a010204. Consider uploading reports for the commit a010204 to get more accurate results

Additional details and impacted files
@@              Coverage Diff              @@
##             dev-2.x    #4835      +/-   ##
=============================================
- Coverage      62.14%   62.13%   -0.01%     
- Complexity     13000    13005       +5     
=============================================
  Files           1650     1650              
  Lines          66064    66094      +30     
  Branches        7184     7187       +3     
=============================================
+ Hits           41055    41068      +13     
- Misses         22669    22688      +19     
+ Partials        2340     2338       -2     
Impacted Files Coverage Δ
...ntripplanner/datastore/file/ZipFileDataSource.java 80.00% <81.81%> (+0.58%) ⬆️
...ner/ext/siri/updater/azure/SiriAzureETUpdater.java 0.00% <0.00%> (ø)
...ner/ext/siri/updater/azure/SiriAzureSXUpdater.java 0.00% <0.00%> (ø)
...aph_builder/module/OsmBoardingLocationsModule.java 81.17% <0.00%> (+2.35%) ⬆️
...g/opentripplanner/api/parameter/QualifiedMode.java 68.42% <0.00%> (+2.63%) ⬆️
...ripplanner/model/transfer/ConstrainedTransfer.java 83.33% <0.00%> (+3.33%) ⬆️


@leonardehrenfried
Member

I'm not sure it's such a good idea to default to a very old character encoding. We probably want to make that configurable and default to UTF-8, don't we?

@dekarl
Contributor Author

dekarl commented Feb 14, 2023

I'm not sure it's such a good idea to default to a very old character encoding. We probably want to make that configurable and default to UTF-8, don't we?

It is complicated :)

The format allows entry names to be encoded as either cp437 or UTF-8, selected by a flag. This patch changes the handling of the former, but not the latter.
https://commons.apache.org/proper/commons-compress/zip.html#Recommendations_for_Interoperability

As per PKWARE's documentation, ZIP originally supported only cp437; later the option was added to encode names that can't be encoded in cp437 as UTF-8 instead. But when Sun implemented ZIP handling in Java, they went directly to UTF-8, implementing just the minimum to get there. Java 7 and newer gained the option to actually handle cp437-encoded names, and also to read files written by other implementations (as per the documentation at Apache Commons). A flag in the archive signals whether cp437 or UTF-8 encoding is used.

Apache Commons documents mainly Windows Compressed Folders as violating the spec in an incompatible way. For files generated by such a broken implementation, this patch changes the behaviour from one broken state to a differently broken one. When I tested locally on Windows 10, I could not reproduce that this is still the case.

The alternative to this patch using java.util.zip is switching to Apache Commons, adding another runtime dependency. The test data generator uses Apache Commons to craft minimal test files that show the behaviour before and after fixing the encoding. You can verify the old behaviour of always allowing UTF-8 decoding by commenting out a single line of code. Notice how the test fails only at the second file, whose names are encoded as cp437. You can easily see the flag and the different encoding with a hex editor when comparing both test files.

tl;dr: the patch fixes decoding of files with special characters created by 7zip, PKZIP, and WinZIP. It may change one broken behaviour into a differently broken one when handling files created by Windows Compressed Folders, and for some specific versions of obscure software.
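For illustration, reading such a legacy archive with the charset-aware API that Java 7 added can be sketched like this (the class and method names here are hypothetical, not taken from the PR):

```java
import java.io.File;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipFile;

public class Cp437ZipReader {

    // Java 7 added the ZipFile(File, Charset) constructor. Passing cp437
    // tells java.util.zip to decode legacy entry names per the original
    // ZIP spec instead of assuming UTF-8.
    static List<String> listNames(File archive) throws Exception {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(archive, Charset.forName("Cp437"))) {
            var entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }
}
```

Note that entries whose language encoding flag is set are still decoded as UTF-8 by the JRE; the charset argument only applies to entries without that flag.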

@dekarl dekarl marked this pull request as ready for review February 14, 2023 07:27
@leonardehrenfried
Member

Is there an easy way to detect which encoding was used without introducing the apache dependency? That would be the best solution in my view.

@dekarl
Contributor Author

dekarl commented Feb 14, 2023

Is there an easy way to detect which encoding was used without introducing the apache dependency? That would be the best solution in my view.

Instead of trying to be creative, we could be naive: just try without the charset, catch the ZipException, then load the charset and retry.

I'm not that great at Java. Is it OK if I just duplicate the 8 lines that read the directory, instead of coming up with a creative loop that first tries without the charset, then loads it and retries, but only on one kind of exception? I don't like repeating code, but it appears easier to maintain since it still fits on one screen (33 lines of code).

@leonardehrenfried
Member

leonardehrenfried commented Feb 14, 2023

I would be okay with that solution if the zip exception is relatively specific. I'm not the maintainer of the netex code, so @vpaturet @t2gran @hannesj @Bartosz-Kruba need to actually approve it.

I do have an interest in eventually being able to parse these terrible Delfi feeds though.

Not sure if you have the time, but we have video calls twice a week where we go through these PRs, and they are often the fastest way to get the attention of busy people: https://github.com/opentripplanner/OpenTripPlanner/blob/dev-2.x/CONTRIBUTING.md

handles utf-8 encoding without efs flag

also uses the same entry name to reduce the hex diff of the test cases
@dekarl
Contributor Author

dekarl commented Feb 14, 2023

I would be okay with that solution if the zip exception is relatively specific.

It's one "catch-all" exception for "issues with the ZIP", like CRC errors etc.

In a normal environment I only expect this to try twice when there is an issue with the input file.

Either it's a broken file: the first time the exception is caught, the second time it's thrown. The user has to fix the input file - this has nothing to do with the encoding - and then the cost goes away.

Or it's one of the uncommon files using umlauts, in which case it tries twice. That additional cost should only occur when building the graph.
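The try-then-retry approach discussed here could look roughly like this minimal sketch (class and method names are hypothetical; the actual change lives in ZipFileDataSource):

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;

public class ZipOpener {

    // First attempt uses the JRE default (UTF-8 entry names). If the
    // central directory cannot be decoded, java.util.zip throws the
    // catch-all ZipException and we retry with the spec's original cp437.
    // A genuinely broken archive throws again on the second attempt and
    // propagates to the caller.
    static List<String> readEntryNames(File archive) throws IOException {
        try {
            return namesOf(new ZipFile(archive));
        } catch (ZipException ze) {
            return namesOf(new ZipFile(archive, Charset.forName("Cp437")));
        }
    }

    private static List<String> namesOf(ZipFile zip) throws IOException {
        try (zip) {
            List<String> names = new ArrayList<>();
            var entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
            return names;
        }
    }
}
```

A well-formed UTF-8 archive never pays for the second attempt; only cp437-named (or broken) files trigger the retry.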

@leonardehrenfried
Member

By the way, I'm fixing the Windows prettier problem over at: #4839

@dekarl
Contributor Author

dekarl commented Feb 14, 2023

I do have an interest in eventually being able to parse these terrible Delfi feeds though.

While bringing my NPE-handling changes forward, I noticed that most of the issues have since been fixed either in dev-2.x or in the branch at #3640 (which is active again, too).

I'd love to see upstream OTP parse the DELFI feed in time for results from DEEZ appearing in the wild.

@@ -93,6 +96,27 @@ private void loadContent() {
ZipEntry entry = entries.nextElement();
content.add(new ZipFileEntryDataSource(this, entry));
}
} catch (ZipException ze) {
Member

@leonardehrenfried leonardehrenfried Feb 14, 2023


How long does it take for this exception to be thrown? Is it straight away or half-way into reading the file?

Contributor Author


My understanding is that the whole directory of the ZIP archive is loaded when the first entry or the collection is accessed. On subsequent accesses contentLoaded is true and the cached directory is used.

So it's straight away (untested, yet)

Member


I can confirm that the directory is read before the content of the files (if I remember correctly we access the files in random order, not sequentially).
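The eager-directory behaviour discussed above can be demonstrated with a small sketch (hypothetical class name): the central directory is parsed when the ZipFile is opened, so entry count and names are known before any entry stream has been read.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class EagerDirectoryDemo {

    // Opens the archive and returns the entry count. The central
    // directory is parsed during construction; no getInputStream()
    // call is made, so no entry content is read.
    static int countEntriesWithoutReadingContent(File archive) throws Exception {
        try (ZipFile zip = new ZipFile(archive)) {
            return zip.size();
        }
    }

    public static void main(String[] args) throws Exception {
        File tmp = File.createTempFile("demo", ".zip");
        tmp.deleteOnExit();
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(tmp))) {
            out.putNextEntry(new ZipEntry("a.xml"));
            out.write("<a/>".getBytes());
            out.closeEntry();
            out.putNextEntry(new ZipEntry("b.xml"));
            out.write("<b/>".getBytes());
            out.closeEntry();
        }
        System.out.println(countEntriesWithoutReadingContent(tmp)); // prints 2
    }
}
```

This is also why a bad entry name surfaces as an exception straight away rather than half-way through reading the file contents.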

t2gran
t2gran previously requested changes Feb 15, 2023
Member

@t2gran t2gran left a comment


I think we can go ahead and approve this - I think we all agree that using cp437 is a problem, and the provider of these feeds should not do that. But, the code change to support a workaround in this case is not big. @dekarl you should put some pressure on the data provider to fix this so we can remove this code in the future.

@leonardehrenfried
Member

I think CP437 is allowed by the zip spec - that is the problem!

@t2gran t2gran requested a review from vpaturet February 15, 2023 14:14
@t2gran
Member

t2gran commented Feb 15, 2023

I assigned @vpaturet as a reviewer in my place - I will be gone for the next week. This is good to go, as long as the 2 requested changes are fixed.

…urce.java

Co-authored-by: Thomas Gran <t2gran@gmail.com>
@t2gran t2gran added this to the 2.3 milestone Feb 15, 2023
@dekarl
Contributor Author

dekarl commented Feb 15, 2023

I think we all agree that using cp437 is a problem, and the provider of these feeds should not do that. But, the code change to support a workaround in this case is not big.

Oracle appears to have decided that Java 7 is enterprise software and that bug compatibility with the older java.util.zip was more important than handling perfectly valid standard ZIP files that happen to use non-ASCII characters in the contained file names. Maybe they did this because cp437 is not guaranteed to be available?
https://docs.oracle.com/javase/8/docs/api/java/nio/charset/Charset.html#standard

But they added an option to make their implementation standards-compliant by setting the default "non-UTF-8 charset" explicitly to cp437 instead of the broken "we signal cp437 but deliver UTF-8 instead". While at it, they added an option to enable a workaround for compatibility with differently broken implementations by specifying any charset other than cp437.

This PR enables that option for standards-compliant behaviour, but does so in a backwards/bug-compatible way.

But you are right. The dataset provider could simply stick to ASCII characters for file names at interfaces, for maximum compatibility.

Other than that, their ZIP file is compliant with the specifications and matches what common tools create when using non-ASCII filenames for the contents.
https://commons.apache.org/proper/commons-compress/zip.html#Recommendations_for_Interoperability
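The language encoding flag discussed in this thread (APPNOTE Appendix D, general purpose bit 11) can also be inspected without a hex editor. This hypothetical helper assumes the archive starts with a plain "PK\3\4" local file header (no spanning marker):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class EfsFlagReader {

    // Parses the general purpose bit flag of the first local file header.
    // Per APPNOTE Appendix D, bit 11 (0x0800) is the language encoding
    // flag (EFS): set means entry names are UTF-8, clear means cp437.
    static boolean isUtf8Flagged(Path zip) throws IOException {
        byte[] b = Files.readAllBytes(zip);
        if (b.length < 8 || b[0] != 'P' || b[1] != 'K' || b[2] != 3 || b[3] != 4) {
            throw new IOException("does not start with a local file header");
        }
        // the 2-byte flag field sits at offset 6, little-endian
        int flag = (b[6] & 0xFF) | ((b[7] & 0xFF) << 8);
        return (flag & 0x0800) != 0;
    }
}
```

Comparing two archives written with different name charsets shows the bit flipping, which is the same difference visible in a hex editor at offset 6.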

@dekarl dekarl requested a review from t2gran February 15, 2023 23:53
@vpaturet vpaturet self-requested a review February 16, 2023 14:16
@hannesj hannesj dismissed t2gran’s stale review February 16, 2023 15:18

Fixes included

@hannesj hannesj merged commit 3ea3031 into opentripplanner:dev-2.x Feb 16, 2023
@dekarl dekarl deleted the otp2_zipfile_cp437 branch February 16, 2023 16:07