-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix data in fields where HTML characters were double escaped #6158
Fix data in fields where HTML characters were double escaped #6158
Conversation
@driusan discussion required for :
|
@sruthymathew123 @nicolasbrossard can you check if any of the IBIS data contains these characters ? running the tool without any arguments should find such fields |
0de2dfa
to
5296092
Compare
if (!is_array($newValue)) { | ||
$newValue = htmlspecialchars($newValue); | ||
} | ||
// // Always sanitize user-controlled input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this fix because it removes an important security measure. I understand though that the double-escaping bug has caused data loss which is a severe issue.
I'll see if I can come up with an alternate fix this week. If I can't figure it out then I think we'll have to go ahead with this and find another way to address XSS issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What data loss? Double escaped entries can be restored with a script. It's very unlikely that a patient entered something like &
on a form.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
due to the number of &
added several fields got truncated by mySQL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@driusan This was discussed in a group chat on Slack and you can view the details there. <
, >
, and &
seem to be common characters used by clinicians when entering instruments by hand (!) and so there is a need to transcribe them literally into forms.
Since they can't be black-listed, we need to do better at displaying these characters to the front-end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ridz1208 The problem arose because of MySQL 5.1 though right? If CCNA can't/won't upgrade then this might be a project override rather than something that needs to go into core.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's mostly the "
that are used by clinicians.
and @johnsaigle not sure I see what can be an override. everyone still has the issue, it's just that the other people's data wont get truncated if they are on more recent versions of mysql, they will still have the double escaping issue. and given that this bug existed since LORIS 17, I think we should cover a wide range of versions in our fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After doing some research in earlier PRs and some basic testing, I think we can go ahead with this fix:
The htmlspecialchars()
line was added in an earlier phase of LORIS. Right now, the extensive use of strict typing prevents many injection points that this previously addressed. Additionally, the use of ReactJS provides sanitization of output by default on many data points. Finally, HTMLEscapeArray in the DB class ensures that almost all data going into the database is escaped.
Removing this line may slightly increase the risk of XSS attacks but I think the risk is small and outweighed by the potential for data loss that it creates. It was always a bit of a hack and I think it's one that we can probably do without now.
Beyond that, I have some comments + questions.
Also, can you please run PHPCS against this? You would do so by adding a line in test/run-php-linter.sh
.
@@ -0,0 +1,280 @@ | |||
<?php |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<?php | |
<?php declare(strict_types=1); |
* which was previously scored but no longer meets the criteria above (i.e. if the | ||
* administration was changed from `all` to `none`, the score will not be removed). | ||
* | ||
* @package behavioural |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @package behavioural |
Not needed
$repair =true; | ||
} | ||
$useHistory =false; | ||
if (in_array('use-history', $argv, true)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the logic for these if-statements you should probably throw an error in case too many arguments are passed, for example if someone tries to do both report
and repair
} catch (Exception $e) { | ||
printError( | ||
"There was an error instantiating instrument $instrumentName. | ||
This instrument will be skipped." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be helpful to print the stack trace of the exception
$query = "SELECT primaryVals as CommentID, col, tbl, new, changeDate | ||
FROM history | ||
WHERE primaryCols='CommentID' AND | ||
primaryVals IN ($listCIDs) AND | ||
tbl IN ($listTables) AND | ||
col IN ($listFields)"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure this query would work, lets say
- commentID XYZ has an issue with instrument A.
- commentID WXY has an issue with instrument B.
Then your results will include
- CommentId XYZ instrument B
- Commentid WXY instruemnt A
Both of which might not be problematic. I think this query needs to be exact and exhaustive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having the query be exact is very costly and I wasn't actually able to get results because the server kept disconnecting...
This way, the query is not as costly and we do most of the heavy lifting in the script itself by comparing to the array of confirmedTruncations
I know you mentioned it might be an issue with resourcing. Perhaps we can discuss offline and with IT and try to see if an exact query will work when more resources are placed on the history DB.
// Reverse the array obtained from history data so that it is in anti-chronological order | ||
// Does not preserve keys, assuming that the keys are indexes & not relevant info | ||
$historyData = array_reverse($result, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jesscall there is no garantee that the array returned in in chronological order. you should instead use an order by in the query itself
@@ -152,22 +216,28 @@ | |||
|
|||
|
|||
|
|||
foreach ($escapedEntries as $instrumentName=>$cids) { | |||
printOut("Opening $instrumentName"); | |||
foreach ($escapedEntries as $tableName=>$cids) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jesscall replacing instrument_name with tablename might be problematic since you need the actual instrument name to instanciate the instrument
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm, i see...
the reasoning for using table name in certain cases is very specific to how CCNA has set up Follow-Up instruments.
I can make adjustments so that tableName
is used specifically for constructing the queries to information_schema
and history
. However, I don't think the logic for doing so is apparent in the Core-LORIS use case, will adding a comment that indicates a project-specific choice be sufficient ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think there are project -specific stuff here. Every LORIS project might be using different instrument and table names. there is no reason here preventing you from associating each instruemnt to a table and use that mapping to instantiate them properly
its just a question of what to use where.
still working on this. trying to see what i can do about the query to history, might potentially be able to rework the logic of the script entirely, depending if my query works @ridz1208 i'll let you know when its ready for testing |
@ridz1208 prod had 500 error because of the query that was being run... i think we should keep the script as is, even though the query is returning more data than needed. We can parse through it on PHP side and not put too much strain on the DB and server which has limited memory. Can you do a prelim code review and let me know if there's anything else i should change or improve? |
I have added to this PR a "lite" version of the tool designed to only report on issues. this script can do the checking using object instanciation or raw SQL queries. the raw SQL way obviously does not allow you to have a database table name different then the instrument name so it might omit some instruments in your database. if you decide to use the SQL way to avoid object instanciation errors, you can use the
let me know if it detects anything. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed the code attentively, haven't run it but look Ok.
LGTM
if (array_key_exists($field, $confirmedTruncations[$tbl][$cid]) && | ||
strlen($val) < $maxFieldLengths[$tbl][$field] && | ||
!isset($untruncatedValues[$tbl][$cid][$field]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (array_key_exists($field, $confirmedTruncations[$tbl][$cid]) && | |
strlen($val) < $maxFieldLengths[$tbl][$field] && | |
!isset($untruncatedValues[$tbl][$cid][$field]) | |
if (isset($val) && | |
isset($confirmedTruncations[$tbl][$cid][$field]) && | |
strlen($val) < $maxFieldLengths[$tbl][$field]['CHARACTER_MAXIMUM_LENGTH'] && | |
!isset($untruncatedValues[$tbl][$cid][$field]) |
array_key_exists($field, $confirmedTruncations[$tableName][$cid]) && | ||
array_key_exists($field, $untruncatedValues[$tableName][$cid]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
array_key_exists($field, $confirmedTruncations[$tableName][$cid]) && | |
array_key_exists($field, $untruncatedValues[$tableName][$cid]) | |
isset($confirmedTruncations[$tableName][$cid][$field]) && | |
isset($untruncatedValues[$tableName][$cid][$field]) |
} | ||
if (!empty($set) && $repair) { | ||
$instrumentInstance->_save($set); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | |
} | |
unset($set); |
Brief summary of changes
This fixes the issue where HTML special characters were escaped twice before being saved causing every save to re-escape the
&
of the previously escaped values.The problematic characters are
&><"
Example, if an instrument field contains a
>
:after first save value becomes
&gt;
after second save
&amp;amp;gt;
Causing some values to look like this on projects running cronjobs nightly
&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;gt;
Testing instructions (if applicable)
Link(s) to related issue(s)