Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

do not index large case property values #30009

Merged
merged 1 commit into from
Jul 7, 2021
Merged

do not index large case property values #30009

merged 1 commit into from
Jul 7, 2021

Conversation

calellowitz
Copy link
Contributor

@calellowitz calellowitz commented Jul 5, 2021

Summary

Followup from https://dimagi-dev.atlassian.net/browse/SAAS-12367

This change will cause the case search index not to index case property values that are over 8191 characters. They will still be stored in the document, so will be retrieved when a case matches another query, but cannot be searched. This is due to a limit in lucene that indexed fields cannot exceed 32766 bytes. Given how case search is used (equality searching only), this is extremely unlikely to affect actual usage, since it would require someone inputting 8000 character exact matches in the application.

8192 was chosen because it is the max length allowed if all characters were 4 byte utf-8 characters, as recommended by ES.

ES docs on the setting https://www.elastic.co/guide/en/elasticsearch/reference/2.4/ignore-above.html

Safety Assurance

  • Risk label is set correctly
  • All migrations are backwards compatible and won't block deploy
  • The set of people pinged as reviewers is appropriate for the level of risk of the change
  • If QA is part of the safety story, the "Awaiting QA" label is used
  • I have confidence that this PR will not introduce a regression for the reasons below

Safety story

This is a simple settings change that will not affect any existing data, and can be applied with zero downtime.

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

To rollback this change, the PR must be reverted and then cchq <env> django-manage update_es_mapping case_search must be run on all environments on which it was deployed.

@dimagimon dimagimon added reindex/migration Reindex or migration will be required during or before deploy Risk: High Change affects files that have been flagged as high risk. labels Jul 5, 2021
@calellowitz calellowitz added product/invisible Change has no end-user visible impact and removed Risk: High Change affects files that have been flagged as high risk. reindex/migration Reindex or migration will be required during or before deploy labels Jul 5, 2021
@calellowitz
Copy link
Contributor Author

I did not include a migration for this to be automatically applied on deploy, as I was not sure what our recommended process was. I think that is safe and easy, but so is running it manually, and if we do have to rollback, the revert would be cleaner without a migration. Open to suggestions about what others prefer

Copy link
Contributor

@snopoke snopoke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They will still be stored in the document, so will be retrieved when a case matches another query

Have you confirmed this with testing?

I did not include a migration for this to be automatically applied on deploy, as I was not sure what our recommended process was. I think that is safe and easy, but so is running it manually, and if we do have to rollback, the revert would be cleaner without a migration. Open to suggestions about what others prefer

In the past we've always done these kind of changes manually. I could see an argument for doing it automatically for non-destructive changes.

@@ -34,7 +34,8 @@
"exact": {
"index": "not_analyzed",
"type": "string",
"null_value": ""
"null_value": "",
"ignore_above": 8191
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you also need it on the main value field?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe so. Specifically the error message referenced this field, not case properties as a whole Document contains at least one immense term in field="case_properties.value.exact", which makes sense because each field is indexed separately rather than as a whole, so it's specifically this string one that is having the issue, but as with the other question above, I will make sure to confirm on staging.

@calellowitz
Copy link
Contributor Author

Have you confirmed this with testing?

Confirmed that I can save both 10k character properties (greater than the setting, less than the lucene limit) and 40k character properties (above the lucene limit), and it appears in the doc shown in doc_in_es and in the relevant column of the case list explorer.

FYI: @snopoke @joxl

I am offline after tomorrow, so I am not going to deploy this, although I think it is good to go if someone else would like to push it forward. I think the sooner we deploy this the better, since in the meantime, relevant cases will not be updated in this index.

@snopoke snopoke merged commit 90c0da9 into master Jul 7, 2021
@snopoke snopoke deleted the ce/ignore-above branch July 7, 2021 07:57
@snopoke
Copy link
Contributor

snopoke commented Jul 7, 2021

I'll roll this out today

@snopoke
Copy link
Contributor

snopoke commented Jul 7, 2021

Update on prod successful.

@snopoke
Copy link
Contributor

snopoke commented Jul 7, 2021

@calellowitz I think this deserves a commcare-cloud changelog entry

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
product/invisible Change has no end-user visible impact
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants