Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add validation for all string input detecting unicode/XML "gremlins" #242

Closed
zzgvh opened this issue Jul 9, 2013 · 11 comments
Closed

Add validation for all string input detecting unicode/XML "gremlins" #242

zzgvh opened this issue Jul 9, 2013 · 11 comments

Comments

@zzgvh
Copy link
Contributor

zzgvh commented Jul 9, 2013

There are a number of unicode control characters that aren't allowed in XML. This creates a headache when text is copied into the admin forms, for example to the text fields in the Project and Organisation (where we've seen this problem occur)(I strongly suspect M$ word et al in this case 😛) when the data is later requested through the API in XML format.

To fix this we need proper validation of all string input. The place for this is in the validators that are called when an admin form is saved and which can be called when inputting data through the API.

A couple of references:
http://en.wikipedia.org/wiki/Valid_characters_in_XML
http://stackoverflow.com/questions/397250/unicode-regex-invalid-xml-characters

Older issue covering the same ground: #189

@adriancollier
Copy link
Contributor

Do we need to put this into the Guidelines for Partners sending us XML files?

@ghost ghost assigned zzgvh Sep 21, 2013
@adriancollier adriancollier added this to the 2.3.5 Uglyfruit milestone Mar 25, 2014
@adriancollier
Copy link
Contributor

@zzgvh any ideas on this? Should we add validation or add documentation or both?

@zzgvh
Copy link
Contributor Author

zzgvh commented Apr 2, 2014

Currently there is a script, https://github.com/akvo/akvo-rsr/blob/develop/akvo/api/xml_char_check.py, that queries the API for all projects, organisations and updates in XML format and reports if any objects that fail to render. I think it needs some small fixing to run, but it gets the job done. We could set it up to run once per X and send a message if it finds problems.

But in the long run we probably should do something since we're going to output more and more XML over time.

This raises a number of questions. When do we check? On input or on output. What do you do if you detect a gremlin? Delete it? Tell the user? Both?
Just deleting the problem characters is not very nice to the user.
Telling the user in a meaningful way means highlighting the exact problem characters. This requires non-standard error reporting.
Probably the simplest solution if we check on input is to replace the gremlins with � and tell the user about it.
Another way it to run the replacing on output so that the XML will contain the �. But it'll be harder for the user to fix the issue, since it's not even reported when entering the data.
The actual replacing implementation is not 100% clear but it seems the best route is to allow only http://en.wikipedia.org/wiki/Valid_characters_in_XML#Non-restricted_characters.

@adriancollier
Copy link
Contributor

Ah, I think I understand this a lot more now, I was thinking this was input API, but it's input Admin and output API that causes this.

I suggest that this should be added to input validation on save in the Admin. Makes sense to prevent invalid characters from entering the database.

As we move to more API based services and solutions, having a standard and accepted character set will become increasingly important.

Is this possible to add to the Admin Validation?

(Moving this to the 2.3.6 Release as it's potentially a larger issue)

@zzgvh
Copy link
Contributor Author

zzgvh commented Apr 4, 2014

It should be possible and it might be fairly easy, but will need more investigation to say for sure. I'm guessing the way to do it is to override the base validator for text fields so that every text field gets tested for gremlins. That the same validator can then be used for text coming in through the API (which isn't a problem for XML coming that way but might be for JSON data).

@KasperBrandt
Copy link
Contributor

Merged in #531

@KasperBrandt
Copy link
Contributor

Test plan

GIVEN a text field in the RSR admin
WHEN an invalid XML character is used (e.g. � )
AND the admin is saved
THEN a warning is shown

@rumca
Copy link
Contributor

rumca commented May 13, 2014

@KasperBrandt @zzgvh I could be wrong here (or it might be handled elsewhere) but aren't &, <, >, ", and ' all invalid XML characters? The admin seems to have no issue with saving these down at present

@KasperBrandt
Copy link
Contributor

@rumca @adriancollier Not sure if we should disallow these characters, because they pose no problem. They are encoded using utf-8, e.g. '<' is stored as '<', but in the actual HMTL / XML it will display &lt;.

However, other weird chars (such as �) cannot be displayed by XML in any way and make the XML invalid. These are currently filtered out.

@adriancollier
Copy link
Contributor

OK, it's my opinion that we should only restrict characters that will cause a problem in the Export and Import functions to XML, so if this works then we should leave it.

@KasperBrandt
Copy link
Contributor

Yep, just tested the XML Export and regular project page when inserting </html> to be sure, but that works.

@MichaelAkvo MichaelAkvo moved this to Done in RSR Dec 8, 2022
@MichaelAkvo MichaelAkvo added this to RSR Dec 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

No branches or pull requests

4 participants