
&quot; escaped double quotes in JSON/CSV #1769

Open
Tracked by #6429 ...
hangy opened this issue May 12, 2019 · 4 comments
Labels: API READ, API, Data export

Comments

@hangy
Member

hangy commented May 12, 2019

Summary:

If a field (e.g. product_name or brands) contains a value with ", then the JSON output and the CSV output for that product contain &quot; instead of the correctly escaped form of " for each format.

Steps to reproduce:

  • Create a product with " in the name.
  • Search for it using the advanced search and export the search result as CSV.
  • Open the Read API for the created product and view the JSON.

Expected behavior:

  • For JSON, double quotes should be escaped with a backslash: \".
  • For CSV, a quote character based on the employed CSV variant should be used. For example: "".
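
As a minimal illustration of the expected escaping, here is a sketch using Python's standard json and csv modules. The product name is made up, and this is not ProductOpener code; it only shows what each output format should contain.

```python
import csv
import io
import json

# Hypothetical product name containing double quotes.
product_name = 'Chips "Extra" 200g'

# JSON: the serializer escapes each double quote with a backslash.
json_output = json.dumps({"product_name": product_name})

# CSV (RFC 4180 style): the field is wrapped in quotes and
# each embedded double quote is doubled.
buffer = io.StringIO()
csv.writer(buffer).writerow(["product_name", product_name])
csv_output = buffer.getvalue()

print(json_output)
print(csv_output)
```

Neither output contains an HTML entity; the quote is escaped only by the rules of the format being written.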

Observed behavior:

In the CSV file and in the JSON output, the " is escaped as &quot;.

Someone who uses the JSON API or parses the CSV output should not need to parse fields/properties as HTML.

@hangy
Member Author

hangy commented May 12, 2019

The main reason seems to be that any input is HTML encoded before being written to the MongoDB and the .sto file.

I think it is pretty obvious that a simple string (unless marked explicitly as HTML input/output, but we don't have those in OFF) should not be HTML encoded in internal storage in the first place. However, changing this might not be trivial, because we do provide a full MongoDB dump for anyone to download, and changing the encoding behaviour of HTML entities could be seen as a breaking change.

However, we also shouldn't require anyone that just wants to use the JSON/CSV output to decode HTML entities before displaying a string value.
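
To make the burden on consumers concrete, here is a sketch of the extra step a client currently needs. The response body is a made-up example of the observed behaviour, not a real API response.

```python
import html
import json

# Hypothetical response body as currently observed:
# the string value arrives HTML encoded inside valid JSON.
response_body = '{"product_name": "Chips &quot;Extra&quot; 200g"}'

product = json.loads(response_body)
raw_name = product["product_name"]

# Extra step the consumer has to know about: decode HTML entities.
display_name = html.unescape(raw_name)

print(raw_name)
print(display_name)
```

A client that skips the html.unescape step ends up showing the entity text to its users.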

Which approach do you prefer, @stephanegigandet, @CharlesNepote, @teolemon, @openfoodfacts/openfoodfacts-server? From a standards and portability perspective, I'd prefer not HTML encoding anything before writing it to the database(s), and letting the output format decide the correct method of escaping values - but that approach does have compatibility issues with current consumers of the data. Did I miss anything?

@hangy
Member Author

hangy commented May 12, 2019

Regardless of how we want to store the data, an option to keep compatibility with existing API consumers could be to introduce a parameter or HTTP header to the API. For example, we might have ?legacyHtmlEncodeStrings=true for the current behaviour, and ?legacyHtmlEncodeStrings=false to disable HTML encoding of strings. true might be the default value for a transition period, after which we might default to false for a while, and after the documented transition period, all compatibility code might be removed.
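
A rough sketch of how such a switch could behave, written in Python purely for illustration. The function name and its argument are hypothetical, and the sketch assumes the stored values are raw (not HTML encoded); if storage stayed encoded, the switch would have to decode instead.

```python
import html
import json


def serialize_product(product: dict, legacy_html_encode_strings: bool = True) -> str:
    """Serialize a product to JSON, optionally HTML encoding string values.

    Hypothetical helper: `product` is assumed to hold raw, unencoded strings.
    """
    if legacy_html_encode_strings:
        # Legacy behaviour: reproduce the HTML-encoded strings consumers see today.
        product = {
            key: html.escape(value, quote=True) if isinstance(value, str) else value
            for key, value in product.items()
        }
    # json.dumps applies the JSON-specific escaping in both modes.
    return json.dumps(product)


product = {"product_name": 'Chips "Extra"'}
legacy_output = serialize_product(product, legacy_html_encode_strings=True)
plain_output = serialize_product(product, legacy_html_encode_strings=False)
print(legacy_output)
print(plain_output)
```

Flipping the default from True to False would then correspond to the end of the first transition period.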

One more thing to keep in mind is that if we did decide to not encode the values before storing them, we should review the ProductOpener source to ensure that strings are HTML encoded before they get displayed.
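
For illustration, encoding at display time could look like the sketch below. The function and the markup are hypothetical and are not taken from the ProductOpener templates.

```python
import html


def render_product_name(product_name: str) -> str:
    """Build an HTML fragment from a raw (unencoded) product name."""
    # Encode only at the point where the value is placed into HTML.
    return f'<h1 class="product-name">{html.escape(product_name)}</h1>'


fragment = render_product_name('Chips "Extra" <b>')
print(fragment)
```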

@stephanegigandet
Contributor

Related to that, the ingredient parser does not understand the quote entities, so ingredient lists like "20% &quot;Los" will create an ingredient tag 20-quot...
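
The effect can be reproduced with a toy tag normalizer. This is only a sketch: the normalization rule below (lowercase, runs of non-alphanumeric characters become a hyphen) is an assumption made for illustration and is not the actual ingredient parser.

```python
import html
import re


def to_tag(text: str) -> str:
    """Hypothetical normalizer: lowercase, then hyphenate non-alphanumeric runs."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


stored_value = "20% &quot;Los&quot;"  # HTML-encoded value as stored

# The entity name leaks into the tag when the stored value is used directly.
tag_from_encoded = to_tag(stored_value)

# Decoding the entities first gives a clean tag.
tag_from_decoded = to_tag(html.unescape(stored_value))

print(tag_from_encoded)
print(tag_from_decoded)
```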

@CharlesNepote
Member

&quot; is used for HTML and I don't see any reason to store &quot; in the database. Different formats have their own ways to encode quotes, but that should be done during the export process, not before, precisely because there ARE different ways to encode quotes in different formats.

About the idea of changing the API with ?legacyHtmlEncodeStrings=true, why make our API heavier to manage past problems? I think it would be better to have a clear process for this kind of change. For example:

  • discuss the change with the data consumers
  • publish the information in a place where every data consumer should be (mailing list?), with a deadline (at least 3 months before changing?)
  • make the change after the deadline

On the https://world.openfoodfacts.org/data page there is a section dealing with "Mailing list for data, API and exports": why not use it? This section should also be placed higher up in the document. It would also be useful to publish more information, such as a changelog for major API and database changes.

Projects
Status: To discuss and validate