Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate and fix data integrity issues in production due to deadlock bug #708

Open
4 tasks
Ahmed-Ghanam opened this issue Feb 12, 2025 · 2 comments
Open
4 tasks
Assignees
Labels
kind/analysis kind/bug Something isn't working

Comments

@Ahmed-Ghanam
Copy link
Contributor

Description

A previously identified bug in the SmsNotificationRepository.UpdateSendStatus function caused deadlocks, impacting the Altinn Notifications service. This bug resulted in unintended status updates of SMS orders in the database due to incorrect matching logic in update queries.

We need to investigate the data in production to determine if it has been affected and, if so, develop a strategy to correct any incorrect data.

Investigation

  1. Determine impact on production
    • Identify any SMS orders in the production database that may have had their status incorrectly updated.
    • Verify whether there are unintended status changes related to empty gateway references.
    • Assess whether the issue is still occurring or has been resolved by the recent bug fix (The pull request Do not update the SMS order based on the gateway reference. #707 must be deployed before taking this step).

References

Actions

  • Run SQL queries to detect inconsistencies in production.
  • Analyze log data to confirm correlation with deadlocks.
  • Propose a data correction plan if necessary.
  • Document findings and recommend preventive measures.
@Ahmed-Ghanam Ahmed-Ghanam self-assigned this Feb 12, 2025
@Ahmed-Ghanam Ahmed-Ghanam added kind/bug Something isn't working kind/analysis labels Feb 12, 2025
@Ahmed-Ghanam
Copy link
Contributor Author

Ahmed-Ghanam commented Feb 14, 2025

14.02.2025 - Use case to investigate
When a user places an order that includes either an organization number or national identity number for sending email or SMS notifications, the Notifications API calls the Register API to retrieve the recipient’s contact information. During processing, the Notifications API changes the order’s status from “New” to “Processing” and then issues one or more requests to the Register API. If the Register API is unavailable at that time, the system logs the exception but leaves the order in the “Sending” state—an invalid state that prevents proper follow-up and completion of the order.

Friday 14.02.2025
They are all test data.

@Ahmed-Ghanam
Copy link
Contributor Author

Ahmed-Ghanam commented Feb 14, 2025

Friday 14.02.2025

In the Altinn Notifications system, the client responsible for sending SMS messages communicates with LinkMobility for message delivery. If a successful response is not received from LinkMobility, the client distinguishes the failure mode: it returns a status of "Failed_InvalidRecipient" if the response begins with "Invalid RCV," and "Failed" for all other cases. In the last scenario, the sending service assigns an empty string as the gateway reference.

Simultaneously, the API that processes delivery reports records this empty string directly in the database. Consequently, whenever a delivery fails, the corresponding database record shows an empty gateway reference. Production data has identified a single instance of a failed delivery with this characteristic, suggesting that a script should be executed to update this record to a null value.

Moreover, the endpoint responsible for updating the delivery status for each SMS uses the empty string as the gateway reference and employs a query with an OR operator to determine which SMS records to update upon receiving a delivery report from LinkMobility. Due to the system’s operational schedule—where SMS messages are not sent between 17:00 and 09:00 Norwegian time—the cronjob queues SMS orders during this interval to be processed once the service resumes at 09:00. For example, if the cronjob generates ten SMS orders overnight, the Notifications API retrieves and sends them individually at 09:00 to the SMS service. Since these messages are test data, the SMS service responds quickly with a failure status and an empty gateway reference. As a result, the Notifications API attempts to update a large volume of records—more than 12,000 in AT22 and 4,338 in production—immediately after receiving the response. This expensive operation led to a deadlock because the update process did not complete before the API reattempted to update the same set of rows in a different order.

Notes

Solution

  • The single record with ID 35011 in the SMSNOTIFICATIONS table should be updated to use a null value instead of an empty string as the gateway reference.
  • The SMSNOTIFICATIONS table contains also 4,337 rows, each with an empty string as the gateway reference. We need to decide on the course of action for these records.

Suggestion

  • A job should be implemented to periodically delete test data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/analysis kind/bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant