properly decode cp437 names in zip files #4835
Conversation
Needed for the German NeTEx files from DELFI.
Codecov Report: Base 62.14% // Head 62.13% // Decreases project coverage by 0.01%.
Additional details and impacted files:
@@ Coverage Diff @@
## dev-2.x #4835 +/- ##
=============================================
- Coverage 62.14% 62.13% -0.01%
- Complexity 13000 13005 +5
=============================================
Files 1650 1650
Lines 66064 66094 +30
Branches 7184 7187 +3
=============================================
+ Hits 41055 41068 +13
- Misses 22669 22688 +19
+ Partials 2340 2338 -2
☔ View full report at Codecov.
I'm not sure it's such a good idea to default to a very old character encoding. We probably want to make that configurable and default to UTF-8, don't we?
It is complicated :)

The format allows encoding names as either cp437 or UTF-8, signalled per entry by a flag. As per the documentation of PKWARE ZIP, names were originally cp437 only; later they added the option to encode names that can't be encoded in cp437 as UTF-8. But when Sun implemented the zip handling in Java they went directly to UTF-8, implementing just the minimum to get there. Java 7 and newer got the option to actually handle cp437-encoded names, and also to read files written by other implementations (as per the documentation at Apache Commons). This patch changes the behaviour for the cp437-flagged names, but not for the UTF-8-flagged ones.

Apache Commons documents mainly Windows Compressed Folders as violating the spec in an incompatible way. For files generated by such a broken implementation this patch changes the behaviour from one broken state to a differently broken state. When I tested it locally on Windows 10 I could not reproduce that this is still the case.

The alternative to this patch using java.util.zip is switching to Apache Commons, adding another runtime dependency.

The test data generator uses Apache Commons to craft minimal test files to show how the behaviour before and after fixing the encoding works. You can verify the old behaviour allowing UTF-8 decoding by commenting out a single line of code. Notice how the test fails only at the second file, with names encoded as cp437. You can easily see the flag and the different encoding with a hex editor when comparing both test files.

tl;dr: the patch fixes decoding of files with special characters in entry names created by 7-Zip, PKZIP, and WinZip. It may change one broken behaviour when handling files created with Windows Compressed Folders to be differently broken; likewise for some special versions and obscure software.
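For illustration, here is a minimal, self-contained sketch of the Java 7+ charset handling described above. The file and entry names are made up for the example, and cp437 ("Cp437"/"IBM437") availability depends on the charsets shipped with the JDK:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.nio.charset.Charset;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class Cp437ZipDemo {

  public static void main(String[] args) throws Exception {
    Charset cp437 = Charset.forName("Cp437");
    File zip = File.createTempFile("cp437-demo", ".zip");
    zip.deleteOnExit();

    // Writing with a non-UTF-8 charset means java.util.zip does not set
    // the EFS ("language encoding") flag on the entry, mimicking archives
    // produced by legacy tools.
    try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(zip), cp437)) {
      out.putNextEntry(new ZipEntry("Fahrpl\u00e4ne.xml"));
      out.closeEntry();
    }

    // Reading with the plain ZipFile(File) constructor would fail or garble
    // the name; passing the charset explicitly (Java 7+) decodes it correctly.
    try (ZipFile zf = new ZipFile(zip, cp437)) {
      System.out.println(zf.entries().nextElement().getName()); // Fahrpläne.xml
    }
  }
}
```

The round trip works because "ä" (0x84 in cp437) is not valid UTF-8 on its own, so only the explicit charset decodes the name as written.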
Is there an easy way to detect which encoding was used without introducing the Apache dependency? That would be the best solution in my view.
Instead of trying to be creative we could be naive and just try without, catch the ZipException, and then load the charset and retry. I'm not so great at Java. Is it ok if I just copy the 8 lines that read the directory, instead of coming up with a creative loop that first tries without, then loads the charset and retries, but only on one kind of exception? I don't like repeating the code, but it appears to be easier to maintain as it still fits on one screen. (33 lines of code)
I would be okay with that solution if the zip exception is relatively specific. I'm not the maintainer of the netex code, so @vpaturet @t2gran @hannesj @Bartosz-Kruba need to actually approve it. I do have an interest in eventually being able to parse these terrible Delfi feeds though. Not sure if you have the time, but we have video calls twice a week where we go through these PRs and they are often the fastest way to get the attention of busy people: https://github.com/opentripplanner/OpenTripPlanner/blob/dev-2.x/CONTRIBUTING.md
Handles UTF-8 encoding without the EFS flag; also uses the same entry name to reduce the hex diff of the test cases
…ith proper cp437 charset
It's one "catch all" exception for "issues with the ZIP", like CRC errors etc. In a normal environment I only expect this to try twice when there is an issue with the input file. Either it's a broken file, then the first time the exception is caught and the second time it's thrown; the user has to fix the input file - but this has nothing to do with the encoding - and then the cost goes away. Or it's one of the uncommon files using umlauts, and then it tries twice. But that additional cost should only happen when building the graph.
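A minimal sketch of the try-then-retry idea discussed above; the class and method names here are hypothetical, not the actual ZipFileDataSource code:

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;

public class LegacyZipReader {

  /**
   * Read the directory with the default UTF-8 decoding first; on a
   * ZipException (e.g. "invalid CEN header (bad entry name)") retry
   * once with cp437 for archives using the legacy encoding.
   */
  public static List<String> listEntryNames(File file) throws IOException {
    try {
      return readDirectory(new ZipFile(file));
    } catch (ZipException ze) {
      // Retry with the legacy DOS code page; if this fails too, the
      // archive is genuinely broken and the exception propagates.
      return readDirectory(new ZipFile(file, Charset.forName("Cp437")));
    }
  }

  private static List<String> readDirectory(ZipFile zip) throws IOException {
    try (zip) {
      List<String> names = new ArrayList<>();
      Enumeration<? extends ZipEntry> entries = zip.entries();
      while (entries.hasMoreElements()) {
        names.add(entries.nextElement().getName());
      }
      return names;
    }
  }
}
```

Factoring the directory loop into a private helper avoids copying the eight lines while keeping the retry logic trivial to read.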
By the way, I'm fixing the Windows prettier problem over at: #4839 |
While bringing forward my changes to handle NPEs I noticed that most have since been fixed, either in dev-2.x or in the branch at #3640 (which is active again, too). I'd love to see upstream OTP parse the DELFI feed in time for results from DEEZ appearing in the wild.
@@ -93,6 +96,27 @@ private void loadContent() {
        ZipEntry entry = entries.nextElement();
        content.add(new ZipFileEntryDataSource(this, entry));
      }
    } catch (ZipException ze) {
How long does it take for this exception to be thrown? Is it straight away or half-way into reading the file?
My understanding is that the whole directory of the ZIP archive is loaded when the first entry or the collection is accessed. On the following accesses contentLoaded is true and the cached directory is used.
So it's straight away (untested, yet).
I can confirm that the directory is read before the content of the files (if I remember correctly we access the files in random order, not sequentially).
I think we can go ahead and approve this - I think we all agree that using cp437 is a problem, and the provider of these feeds should not do that. But, the code change to support a workaround in this case is not big. @dekarl you should put some pressure on the data provider to fix this so we can remove this code in the future.
src/main/java/org/opentripplanner/datastore/file/ZipFileDataSource.java
I think CP437 is allowed by the zip spec - that is the problem!
I assigned @vpaturet as a reviewer in my place - I will be gone for the next week. This is good to go, as long as the 2 requested changes are fixed.
…urce.java Co-authored-by: Thomas Gran <t2gran@gmail.com>
src/main/java/org/opentripplanner/datastore/file/ZipFileDataSource.java
Oracle appears to have decided that Java 7 is enterprise software and that bug compatibility with the older java.util.zip was more important than handling perfectly valid standard ZIP files that happen to use non-ASCII characters for the contained file names. Maybe they did this because cp437 is not guaranteed to be available? But they added an option to make their implementation standards compliant by setting the default "non-UTF-8 charset" explicitly to cp437 instead of the broken "we signal cp437 but deliver UTF-8 instead". While at it they added an option to enable a workaround for compatibility with differently broken implementations by specifying any charset other than cp437. This PR enables that option for standards compliant behaviour, but does so in a backwards/bug compatible way.

But you are right. The dataset provider could simply stick to ASCII characters for file names at interfaces for maximum compatibility. Other than that, their ZIP file is compliant with the specifications and matches what common tools will create when using non-ASCII filenames for the contents.
Co-authored-by: Thomas Gran <t2gran@gmail.com>
Summary
The German NeTEx files from the NAP at DELFI use the legacy default encoding, code page 437, for their ZIP files. Java needs help to read such files.
https://bugs.java.com/bugdatabase/view_bug.do?bug_id=4415733 (problem statement from 2001, but without describing the solution that is available now in the JRE)
Documentation of the encoding can be found in Appendix D - Language Encoding (EFS) - in the format documentation at https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
Issue
I was inspired by #3640 to load German NeTEx data, but ran into multiple issues with the combined dataset.
Unit tests
I'm not sure how to add a unit test for this change.
I'm currently upstreaming my local patches and catching up with a year of refactorings, so this was tested to have the intended function before the refactoring was merged.
Documentation
Rationale has been documented next to the new code.
Changelog
https://github.com/opentripplanner/OpenTripPlanner/labels/skip%20changelog