feat: 1840 invalid characters #1892

Open
wants to merge 6 commits into
base: master

Conversation

qcdyx
Contributor

@qcdyx qcdyx commented Oct 16, 2024

Summary:

Closes #1840

Expected behavior:
image

Please make sure these boxes are checked before submitting your pull request - thanks!

  • Run the unit tests with gradle test to make sure you didn't break anything
  • Add or update any needed documentation to the repo
  • Format the title like "feat: [new feature short description]". The title must follow the Conventional Commits specification (https://www.conventionalcommits.org/en/v1.0.0/).
  • Linked all relevant issues
  • Include screenshot(s) showing how this pull request works and fixes the issue(s)

@qcdyx qcdyx requested a review from davidgamez October 16, 2024 15:26
Contributor

📝 Acceptance Test Report

📋 Summary

✅ The rule acceptance has passed for commit aefbc02
Download the full acceptance test report here (report will disappear after 90 days).

📊 Notices Comparison

New Errors (5 out of 1602 datasets, ~0%) ✅

Details of new errors due to code change, which is less than the provided threshold of 1%.

Dataset Notice Code
br-rio-grande-do-sul-empresa-publica-de-transportes-e-circulacao-eptc-gtfs-7 invalid_characters
ch-unknown-lk2-gtfs-914 invalid_characters
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926 invalid_characters
nl-unknown-allgo-keolis-gtfs-1077 invalid_characters
pt-setubal-carris-metropolitana-gtfs-1874 invalid_characters
Dropped Errors (1 out of 1602 datasets, ~0%) ✅

Details of dropped errors due to code change, which is less than the provided threshold of 1%.

Dataset Notice Code
pt-setubal-carris-metropolitana-gtfs-1874 trip_distance_exceeds_shape_distance
New Warnings (0 out of 1602 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Warnings (4 out of 1602 datasets, ~0%) ✅

Details of dropped warnings due to code change, which is less than the provided threshold of 1%.

Dataset Notice Code
ch-unknown-lk2-gtfs-914 duplicate_route_name
ch-unknown-lk2-gtfs-914 fast_travel_between_consecutive_stops
nl-unknown-allgo-keolis-gtfs-1077 fast_travel_between_consecutive_stops
ch-unknown-lk2-gtfs-914 fast_travel_between_far_stops
nl-unknown-allgo-keolis-gtfs-1077 fast_travel_between_far_stops
ch-unknown-lk2-gtfs-914 missing_bike_allowance
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926 missing_timepoint_value
ch-unknown-lk2-gtfs-914 stop_has_too_many_matches_for_shape
nl-unknown-allgo-keolis-gtfs-1077 stop_has_too_many_matches_for_shape
pt-setubal-carris-metropolitana-gtfs-1874 stop_has_too_many_matches_for_shape
ch-unknown-lk2-gtfs-914 stop_too_far_from_shape
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926 stop_too_far_from_shape
nl-unknown-allgo-keolis-gtfs-1077 stop_too_far_from_shape
pt-setubal-carris-metropolitana-gtfs-1874 stop_too_far_from_shape
ch-unknown-lk2-gtfs-914 stop_too_far_from_shape_using_user_distance
nl-unknown-allgo-keolis-gtfs-1077 stop_too_far_from_shape_using_user_distance
pt-setubal-carris-metropolitana-gtfs-1874 stop_too_far_from_shape_using_user_distance
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926 stop_without_stop_time
nl-unknown-allgo-keolis-gtfs-1077 stop_without_stop_time
pt-setubal-carris-metropolitana-gtfs-1874 stop_without_stop_time
ch-unknown-lk2-gtfs-914 stops_match_shape_out_of_order
nl-unknown-allgo-keolis-gtfs-1077 stops_match_shape_out_of_order
pt-setubal-carris-metropolitana-gtfs-1874 stops_match_shape_out_of_order
pt-setubal-carris-metropolitana-gtfs-1874 trip_distance_exceeds_shape_distance_below_threshold

🛡️ Corruption Check

0 out of 1602 sources (~0 %) are corrupted.

⏱️ Performance Assessment

📈 Validation Time

Assess the performance in terms of seconds taken for the validation process.

Time Metric Dataset ID Reference (s) Latest (s) Difference (s)
Average -- 4.02 4.06 ⬆️+0.05
Median -- 1.40 1.46 ⬆️+0.06
Standard Deviation -- 11.53 11.32 ⬇️-0.21
Minimum in Reference Reports us-california-flex-v2-developer-test-feed-3-gtfs-1819 0.50 0.73 ⬆️+0.23
Maximum in Reference Reports gb-unknown-uk-aggregate-feed-gtfs-2014 297.19 287.84 ⬇️-9.36
Minimum in Latest Reports us-california-catalina-express-gtfs-299 0.60 0.55 ⬇️-0.06
Maximum in Latest Reports gb-unknown-uk-aggregate-feed-gtfs-2014 297.19 287.84 ⬇️-9.36

📜 Memory Consumption

Metric Dataset ID Reference Latest Difference
Average -- 486.19 MiB 480.76 MiB ⬇️-5.44 MiB
Median -- 245.95 MiB 246.85 MiB ⬆️+922.84 KiB
Standard Deviation -- 877.41 MiB 874.83 MiB ⬇️-2.58 MiB
Minimum in Reference Reports us-oregon-hut-airport-shuttle-gtfs-635 34.05 MiB 34.09 MiB ⬆️+40.00 KiB
Maximum in Reference Reports gb-unknown-uk-aggregate-feed-gtfs-2014 9.96 GiB 10.12 GiB ⬆️+161.15 MiB
Minimum in Latest Reports us-virginia-jaunt-inc-gtfs-1324 34.06 MiB 34.05 MiB ⬇️-16.00 KiB
Maximum in Latest Reports gb-unknown-uk-aggregate-feed-gtfs-2014 9.96 GiB 10.12 GiB ⬆️+161.15 MiB

@emmambd
Contributor

emmambd commented Oct 16, 2024

@tzujenchanmbd Curious about your thoughts on the acceptance tests. In cases where this is happening, it looks like it's because of how the producer is encoding accents (examples below).
Screenshot 2024-10-16 at 1 47 22 PM
Screenshot 2024-10-16 at 1 47 07 PM
Screenshot 2024-10-16 at 1 47 11 PM
Screenshot 2024-10-16 at 1 47 17 PM
Is there guidance we could provide in the notice on how to encode these characters so the issue doesn't occur?

Contributor

📝 Acceptance Test Report

📋 Summary

✅ The rule acceptance has passed for commit 477dbdd
Download the full acceptance test report here (report will disappear after 90 days).

📊 Notices Comparison

New Errors (5 out of 1602 datasets, ~0%) ✅

Details of new errors due to code change, which is less than the provided threshold of 1%.

Dataset Notice Code
br-rio-grande-do-sul-empresa-publica-de-transportes-e-circulacao-eptc-gtfs-7 invalid_characters
ch-unknown-lk2-gtfs-914 invalid_characters
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926 invalid_characters
nl-unknown-allgo-keolis-gtfs-1077 invalid_characters
pt-setubal-carris-metropolitana-gtfs-1874 invalid_characters
Dropped Errors (1 out of 1602 datasets, ~0%) ✅

Details of dropped errors due to code change, which is less than the provided threshold of 1%.

Dataset Notice Code
pt-setubal-carris-metropolitana-gtfs-1874 trip_distance_exceeds_shape_distance
New Warnings (0 out of 1602 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Warnings (4 out of 1602 datasets, ~0%) ✅

Details of dropped warnings due to code change, which is less than the provided threshold of 1%.

Dataset Notice Code
ch-unknown-lk2-gtfs-914 duplicate_route_name
ch-unknown-lk2-gtfs-914 fast_travel_between_consecutive_stops
nl-unknown-allgo-keolis-gtfs-1077 fast_travel_between_consecutive_stops
ch-unknown-lk2-gtfs-914 fast_travel_between_far_stops
nl-unknown-allgo-keolis-gtfs-1077 fast_travel_between_far_stops
ch-unknown-lk2-gtfs-914 missing_bike_allowance
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926 missing_timepoint_value
ch-unknown-lk2-gtfs-914 stop_has_too_many_matches_for_shape
nl-unknown-allgo-keolis-gtfs-1077 stop_has_too_many_matches_for_shape
pt-setubal-carris-metropolitana-gtfs-1874 stop_has_too_many_matches_for_shape
ch-unknown-lk2-gtfs-914 stop_too_far_from_shape
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926 stop_too_far_from_shape
nl-unknown-allgo-keolis-gtfs-1077 stop_too_far_from_shape
pt-setubal-carris-metropolitana-gtfs-1874 stop_too_far_from_shape
ch-unknown-lk2-gtfs-914 stop_too_far_from_shape_using_user_distance
nl-unknown-allgo-keolis-gtfs-1077 stop_too_far_from_shape_using_user_distance
pt-setubal-carris-metropolitana-gtfs-1874 stop_too_far_from_shape_using_user_distance
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926 stop_without_stop_time
nl-unknown-allgo-keolis-gtfs-1077 stop_without_stop_time
pt-setubal-carris-metropolitana-gtfs-1874 stop_without_stop_time
ch-unknown-lk2-gtfs-914 stops_match_shape_out_of_order
nl-unknown-allgo-keolis-gtfs-1077 stops_match_shape_out_of_order
pt-setubal-carris-metropolitana-gtfs-1874 stops_match_shape_out_of_order
pt-setubal-carris-metropolitana-gtfs-1874 trip_distance_exceeds_shape_distance_below_threshold

🛡️ Corruption Check

0 out of 1602 sources (~0 %) are corrupted.

⏱️ Performance Assessment

📈 Validation Time

Assess the performance in terms of seconds taken for the validation process.

Time Metric Dataset ID Reference (s) Latest (s) Difference (s)
Average -- 4.02 4.14 ⬆️+0.12
Median -- 1.38 1.43 ⬆️+0.05
Standard Deviation -- 11.61 11.86 ⬆️+0.25
Minimum in Reference Reports us-california-flex-v2-developer-test-feed-3-gtfs-1819 0.51 0.62 ⬆️+0.11
Maximum in Reference Reports gb-unknown-uk-aggregate-feed-gtfs-2014 300.20 291.45 ⬇️-8.76
Minimum in Latest Reports ar-buenos-aires-subterraneos-de-buenos-aires-subte-gtfs-6 0.53 0.54 ⬆️+0.01
Maximum in Latest Reports gb-unknown-uk-aggregate-feed-gtfs-2014 300.20 291.45 ⬇️-8.76

📜 Memory Consumption

Metric Dataset ID Reference Latest Difference
Average -- 487.60 MiB 476.31 MiB ⬇️-11.30 MiB
Median -- 248.03 MiB 245.48 MiB ⬇️-2.56 MiB
Standard Deviation -- 863.99 MiB 843.71 MiB ⬇️-20.28 MiB
Minimum in Reference Reports us-california-flex-v2-developer-test-feed-1-gtfs-1817 34.05 MiB 34.06 MiB ⬆️+8.00 KiB
Maximum in Reference Reports gb-unknown-uk-aggregate-feed-gtfs-2014 10.15 GiB 9.79 GiB ⬇️-366.70 MiB
Minimum in Latest Reports tr-kocaeli-metro-izmir-gtfs-1824 34.07 MiB 34.05 MiB ⬇️-16.00 KiB
Maximum in Latest Reports gb-unknown-uk-aggregate-feed-gtfs-2014 10.15 GiB 9.79 GiB ⬇️-366.70 MiB

@tzujenchanmbd

Some examples from the screenshots, with the correct name on the left and the name as it appears in the feed on the right:

  • Rotterdam, Selma Lagerlöfweg -> Rotterdam, Selma Lagerl�fweg
  • Rotterdam, Port Saïdstraat -> Rotterdam, Port Sa�dstraat
  • Estación Washington -> Estaci��n Washington

Problem example on maps: https://maps.app.goo.gl/YqnS2Gj9goeWN8GT9

So it seems the issue usually happens with accented characters such as "ó", "ö", and "ï" in Western European languages.

Perhaps the dev team can confirm, but I suspect this is caused by an encoding/decoding mismatch during the data production process. For example, the text may originally have been saved in a legacy encoding such as ISO-8859-1 or Windows-1252 (which cover characters commonly used in Western European languages, such as the accented characters é, ñ, ü, and special symbols) and then read using a different encoding (e.g. UTF-8). In that case, characters outside the ASCII range, like accented characters, will probably not decode correctly, producing errors like the replacement character (�).
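
The mismatch described above can be sketched in a few lines. This is Python for illustration only, and the function names are made up for this example; how each producer's pipeline actually handled the text is an assumption. Saving a name in ISO-8859-1 and reading the bytes back as UTF-8 with replacement gives the single "�" seen in the Rotterdam examples. The doubled "��" in "Estación" would be consistent with the opposite direction, where UTF-8 bytes go through a decoder that rejects each non-ASCII byte.

```python
def read_legacy_as_utf8(text: str, legacy: str = "iso-8859-1") -> str:
    """Save text in a legacy encoding, then read the bytes back as UTF-8."""
    raw = text.encode(legacy)
    # errors="replace" substitutes U+FFFD for each undecodable byte sequence.
    return raw.decode("utf-8", errors="replace")


def read_utf8_as_ascii(text: str) -> str:
    """Save text as UTF-8, then read the bytes back with an ASCII-only decoder."""
    # Each non-ASCII byte is replaced separately, so "ó" (two bytes) becomes two U+FFFD.
    return text.encode("utf-8").decode("ascii", errors="replace")
```

Calling `read_legacy_as_utf8("Rotterdam, Selma Lagerlöfweg")` yields the name with one replacement character in place of "ö", and `read_utf8_as_ascii("Estación Washington")` yields two in place of "ó".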

@davidgamez
Member

We assume that feeds are in UTF-8; replacement characters and other variations might appear because the text is not proper UTF-8. The legacy Google validator replaced characters that were not UTF-8 compatible with the replacement character (see the legacy validator code). Maybe they have this implemented somewhere in their data pipeline to guarantee that the text is properly rendered in the UI, even with some characters "replaced".

@emmambd
Contributor

emmambd commented Oct 16, 2024

Revisions after discussion with @tzujenchanmbd:

@davidgamez @qcdyx Is it feasible for us to parse the non-UTF-8 characters too? Ideally we could show them to the user in the notice table description, highlighted in bold so they know which characters are causing the problem.

  • Notice name: Invalid character (not plural, to match the style of our other notices)
  • Notice description: This field contains invalid characters, marked in bold. Text must be encoded in UTF-8 in order to be valid. When reading text, use the same encoding that was used to save it.

I also want to note that, judging from the acceptance tests above, feeds with this error appear to have unparseable files, which means other notices are dropped.
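
As a rough illustration of the "marked in bold" idea, here is a minimal Python sketch. The function name and the Markdown-style `**` marker are assumptions made for this example; the thread does not say how the report would render bold text, and this only covers replacement characters that survived decoding, not raw bytes that fail to decode.

```python
import re

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def highlight_invalid(value: str, marker: str = "**") -> str:
    """Wrap each run of replacement characters in a bold marker (hypothetical helper)."""
    pattern = re.escape(REPLACEMENT) + "+"
    return re.sub(pattern, lambda m: marker + m.group(0) + marker, value)
```

For the "Estación Washington" example, the two consecutive replacement characters would be wrapped as a single bold run, while a value without any replacement character is returned unchanged.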

@davidgamez
Member

I suggest creating a separate notice for non-UTF-8 text. There are two distinct situations: the first is the presence of the replacement character, which is itself a valid UTF-8 character, and the second is the presence of an invalid UTF-8 character. I suspect that when we see a replacement character, it comes from a tool that transformed the feed and replaced the characters it could not decode (invalid UTF-8, or text in another encoding) while converting to UTF-8 or to a different target encoding (violating the spec in the latter case).
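
The two situations can be told apart at the byte level. The following is a minimal Python sketch, not the validator's implementation; `TextStatus` and `classify` are hypothetical names introduced for this example.

```python
from enum import Enum


class TextStatus(Enum):
    OK = "ok"
    HAS_REPLACEMENT_CHAR = "has_replacement_char"  # valid UTF-8 that contains U+FFFD
    NOT_UTF8 = "not_utf8"  # bytes that fail strict UTF-8 decoding


def classify(raw: bytes) -> TextStatus:
    """Distinguish 'valid UTF-8 containing U+FFFD' from 'not valid UTF-8 at all'."""
    try:
        text = raw.decode("utf-8")  # strict mode: raises on any invalid sequence
    except UnicodeDecodeError:
        return TextStatus.NOT_UTF8
    if "\ufffd" in text:
        return TextStatus.HAS_REPLACEMENT_CHAR
    return TextStatus.OK
```

A reader that decodes with replacement instead of strict mode would collapse both cases into the first one, which is one reason a separate notice would need to look at the raw bytes.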

@davidgamez
Member

Regarding the dropped notices, they are expected because of the severity of the notice (Error).

@emmambd
Contributor

emmambd commented Oct 16, 2024

@davidgamez Makes sense. How's this for a revised notice description, so that it suggests rather than asserts that the issue is non-UTF-8 encoding:

Notice name: Invalid character (not plural, to match the style of our other notices)
Notice description: This field contains invalid characters, such as the replacement character ("�"). Check that text was properly encoded in UTF-8 as required by GTFS.
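
To make the proposed notice concrete, here is a minimal Python sketch of a check that scans an already-decoded CSV file for the replacement character. It is not the validator's code: `InvalidCharacterNotice`, its fields, and `find_invalid_characters` are hypothetical names chosen for this example.

```python
import csv
import io
from dataclasses import dataclass
from typing import List

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


@dataclass
class InvalidCharacterNotice:
    """Hypothetical notice shape; the real notice may carry different fields."""
    filename: str
    csv_row_number: int
    field_name: str
    field_value: str


def find_invalid_characters(filename: str, content: str) -> List[InvalidCharacterNotice]:
    """Report every field value that contains the replacement character."""
    notices = []
    reader = csv.DictReader(io.StringIO(content))
    # Row 1 is the header, so data rows are numbered from 2.
    for row_number, row in enumerate(reader, start=2):
        for field_name, value in row.items():
            if isinstance(value, str) and REPLACEMENT in value:
                notices.append(
                    InvalidCharacterNotice(filename, row_number, field_name, value)
                )
    return notices
```

Run against a stops.txt whose second row has "Estaci��n Washington" as stop_name, this would return one notice pointing at row 2 and the stop_name field.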

@davidgamez
Member

The notice name and description make sense to me.

@qcdyx
Contributor Author

qcdyx commented Oct 16, 2024

@davidgamez @emmambd image

Contributor

📝 Acceptance Test Report

📋 Summary

✅ The rule acceptance has passed for commit 2cf9ad2
Download the full acceptance test report here (report will disappear after 90 days).

📊 Notices Comparison

New Errors (5 out of 1602 datasets, ~0%) ✅

Details of new errors due to code change, which is less than the provided threshold of 1%.

Dataset Notice Code
br-rio-grande-do-sul-empresa-publica-de-transportes-e-circulacao-eptc-gtfs-7 invalid_character
ch-unknown-lk2-gtfs-914 invalid_character
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926 invalid_character
nl-unknown-allgo-keolis-gtfs-1077 invalid_character
pt-setubal-carris-metropolitana-gtfs-1874 invalid_character
Dropped Errors (1 out of 1602 datasets, ~0%) ✅

Details of dropped errors due to code change, which is less than the provided threshold of 1%.

Dataset Notice Code
pt-setubal-carris-metropolitana-gtfs-1874 trip_distance_exceeds_shape_distance
New Warnings (0 out of 1602 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Warnings (4 out of 1602 datasets, ~0%) ✅

Details of dropped warnings due to code change, which is less than the provided threshold of 1%.

Dataset Notice Code
ch-unknown-lk2-gtfs-914 duplicate_route_name
ch-unknown-lk2-gtfs-914 fast_travel_between_consecutive_stops
nl-unknown-allgo-keolis-gtfs-1077 fast_travel_between_consecutive_stops
ch-unknown-lk2-gtfs-914 fast_travel_between_far_stops
nl-unknown-allgo-keolis-gtfs-1077 fast_travel_between_far_stops
ch-unknown-lk2-gtfs-914 missing_bike_allowance
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926 missing_timepoint_value
ch-unknown-lk2-gtfs-914 stop_has_too_many_matches_for_shape
nl-unknown-allgo-keolis-gtfs-1077 stop_has_too_many_matches_for_shape
pt-setubal-carris-metropolitana-gtfs-1874 stop_has_too_many_matches_for_shape
ch-unknown-lk2-gtfs-914 stop_too_far_from_shape
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926 stop_too_far_from_shape
nl-unknown-allgo-keolis-gtfs-1077 stop_too_far_from_shape
pt-setubal-carris-metropolitana-gtfs-1874 stop_too_far_from_shape
ch-unknown-lk2-gtfs-914 stop_too_far_from_shape_using_user_distance
nl-unknown-allgo-keolis-gtfs-1077 stop_too_far_from_shape_using_user_distance
pt-setubal-carris-metropolitana-gtfs-1874 stop_too_far_from_shape_using_user_distance
mx-jalisco-secretaria-de-movilidad-del-estado-de-jalisco-gtfs-1926 stop_without_stop_time
nl-unknown-allgo-keolis-gtfs-1077 stop_without_stop_time
pt-setubal-carris-metropolitana-gtfs-1874 stop_without_stop_time
ch-unknown-lk2-gtfs-914 stops_match_shape_out_of_order
nl-unknown-allgo-keolis-gtfs-1077 stops_match_shape_out_of_order
pt-setubal-carris-metropolitana-gtfs-1874 stops_match_shape_out_of_order
pt-setubal-carris-metropolitana-gtfs-1874 trip_distance_exceeds_shape_distance_below_threshold

🛡️ Corruption Check

0 out of 1602 sources (~0 %) are corrupted.

⏱️ Performance Assessment

📈 Validation Time

Assess the performance in terms of seconds taken for the validation process.

Time Metric Dataset ID Reference (s) Latest (s) Difference (s)
Average -- 4.02 4.05 ⬆️+0.03
Median -- 1.39 1.44 ⬆️+0.05
Standard Deviation -- 11.59 11.40 ⬇️-0.19
Minimum in Reference Reports us-california-flex-v2-developer-test-feed-2-gtfs-1818 0.52 0.58 ⬆️+0.06
Maximum in Reference Reports gb-unknown-uk-aggregate-feed-gtfs-2014 301.68 289.98 ⬇️-11.70
Minimum in Latest Reports us-massachusetts-massachusetts-area-express-max-gtfs-431 0.54 0.54 ⬇️-0.00
Maximum in Latest Reports gb-unknown-uk-aggregate-feed-gtfs-2014 301.68 289.98 ⬇️-11.70

📜 Memory Consumption

Metric Dataset ID Reference Latest Difference
Average -- 494.48 MiB 479.71 MiB ⬇️-14.76 MiB
Median -- 247.23 MiB 245.94 MiB ⬇️-1.29 MiB
Standard Deviation -- 894.05 MiB 850.47 MiB ⬇️-43.58 MiB
Minimum in Reference Reports ph-unknown-hm-transport-inc-and-robinsons-malls-gtfs-1105 34.05 MiB 34.07 MiB ⬆️+24.00 KiB
Maximum in Reference Reports gb-unknown-uk-aggregate-feed-gtfs-2014 10.18 GiB 10.04 GiB ⬇️-146.33 MiB
Minimum in Latest Reports us-oregon-high-desert-point-gtfs-636 34.05 MiB 34.05 MiB ⬇️-8.00 KiB
Maximum in Latest Reports gb-unknown-uk-aggregate-feed-gtfs-2014 10.18 GiB 10.04 GiB ⬇️-146.33 MiB

Development

Successfully merging this pull request may close these issues.

Validator Accepts Replacement Character in stop_name Field
4 participants