Fix #1748: High db update load because of callback event circuit breaker
jnpsk committed Oct 18, 2024
1 parent 4c40901 commit b8c900a
Showing 20 changed files with 106 additions and 116 deletions.
9 changes: 5 additions & 4 deletions docs/Configuration-Properties.md
@@ -106,8 +106,9 @@ In certain scenarios, repeatedly attempting to dispatch callback events may be p
receiver's side. To address this, if multiple callback events with the same configuration fail consecutively, the
service temporarily halts further dispatch attempts and marks these events as failed without retrying. The number of
consecutive failures allowed before stopping dispatch is defined by the `failureThreshold` property, while the halt
period is configurable via the `resetTimeout` property. After this period, a callback dispatch attempt will be made again
to check the receiver's availability.
period is configurable via the `failureResetTimeout` property. After this period, a callback dispatch attempt will be
made again to check the receiver's availability. If the `failureThreshold` is set to `-1`, the functionality is not
enabled.

PowerAuth dispatches a callback as soon as a change in operation or activation status is detected. Each newly created
callback is passed to a configurable thread pool executor for dispatch. Even if the thread pool's queue is full, the
@@ -132,8 +133,8 @@ to callback events with max attempts set to 1, such callback events are never sc
| `powerauth.service.callbacks.threadPoolMaxSize` | `2` | Maximum number of threads in the thread pool used by the executor. |
| `powerauth.service.callbacks.threadPoolQueueCapacity` | `1000` | Queue capacity of the thread pool used by the executor. |
| `powerauth.service.callbacks.forceRerunPeriod` | | Time period after which a currently processed callback event is considered stale and should be scheduled to rerun. |
| `powerauth.service.callbacks.failureThreshold` | `200` | The number of consecutive failures allowed for callback events with the same configuration. |
| `powerauth.service.callbacks.resetTimeout` | `60s` | Time period after which a Callback URL Event will be dispatched, even if failure threshold has been reached. |
| `powerauth.service.callbacks.failureThreshold` | `200` | The number of consecutive failures allowed for callback events with the same configuration. If set to `-1`, unlimited number of failures is allowed. |
| `powerauth.service.callbacks.failureResetTimeout` | `60s` | Time period after which a Callback URL Event will be dispatched, even if failure threshold has been reached. |
| `powerauth.service.callbacks.clients.cache.refreshAfterWrite` | `5m` | Callback REST clients are cached and automatically evicted if updated through the Callback Management API on a single node. Time-based refreshing mechanism is a fallback in clustered environments. |

The backoff period after the `N-th` attempt is calculated as follows:
2 changes: 0 additions & 2 deletions docs/Database-Structure.md
@@ -185,8 +185,6 @@ Stores callback URLs - per-application endpoints that are notified whenever an a
| max_attempts | INTEGER | - | Maximum number of attempts to dispatch a callback. |
| initial_backoff | VARCHAR(64) | - | Initial backoff period before the next send attempt, stored as a ISO 8601 string. |
| retention_period | VARCHAR(64) | - | Minimal duration for which is a completed callback event persisted, stored as a ISO 8601 string. |
| timestamp_last_failure | DATETIME | - | The timestamp of the most recent failed callback event associated with this configuration. |
| failure_count | INTEGER | DEFAULT 0 NOT NULL | The number of consecutive failed callback events associated with this configuration. |
| enabled | BOOLEAN | - | Indicator specifying whether the Callback URL should be used. |
| timestamp_created | DATETIME | DEFAULT NOW() NOT NULL | Timestamp when the record was created. |
| timestamp_last_updated | DATETIME | - | Timestamp of the last update of the record via the Callback Management API. |
7 changes: 0 additions & 7 deletions docs/PowerAuth-Server-1.9.0.md
@@ -55,13 +55,6 @@ options for the retry strategy with an exponential backoff algorithm. Namely:

These settings at the individual callback level overrides the global default settings at the application level.

### Add Columns to Enable Callback Failures Monitoring

Following columns has been added to the `pa_application_callback` table to enable monitoring of callback dispatch
failures:
- `failure_count` to hold the number of consecutive failed callbacks of the same configuration, and
- `timestamp_last_failure` to store the timestamp of the most recent failed callback attempt.

### Add Column Indicating If a Callback Is Enabled

A new column `enabled` has been added to the `pa_application_callback` table to indicate whether a Callback URL is

@@ -110,32 +110,6 @@
<createSequence sequenceName="pa_app_callback_event_seq" startValue="1" incrementBy="50" cacheSize="20"/>
</changeSet>

<changeSet id="8" logicalFilePath="powerauth-java-server/1.9.x/20240704-callback-event-table.xml" author="Jan Pesek">
<preConditions onFail="MARK_RAN">
<not>
<columnExists tableName="pa_application_callback" columnName="timestamp_last_failure" />
</not>
</preConditions>
<comment>Add timestamp_last_failure column to pa_application_callback table.</comment>
<addColumn tableName="pa_application_callback">
<column name="timestamp_last_failure" type="timestamp(6)" />
</addColumn>
</changeSet>

<changeSet id="9" logicalFilePath="powerauth-java-server/1.9.x/20240704-callback-event-table.xml" author="Jan Pesek">
<preConditions onFail="MARK_RAN">
<not>
<columnExists tableName="pa_application_callback" columnName="failure_count" />
</not>
</preConditions>
<comment>Add failure_count column to pa_application_callback table.</comment>
<addColumn tableName="pa_application_callback">
<column name="failure_count" type="integer" defaultValueNumeric="0">
<constraints nullable="false" />
</column>
</addColumn>
</changeSet>

<changeSet id="10" logicalFilePath="powerauth-java-server/1.9.x/20240704-callback-event-table.xml" author="Jan Pesek">
<preConditions onFail="MARK_RAN">
<not>
Binary file modified docs/images/arch_db_structure.png
10 changes: 0 additions & 10 deletions docs/sql/mssql/migration_1.8.0_1.9.0.sql
@@ -58,16 +58,6 @@ GO
CREATE SEQUENCE pa_app_callback_event_seq START WITH 1 INCREMENT BY 50;
GO

-- Changeset powerauth-java-server/1.9.x/20240704-callback-event-table.xml::8::Jan Pesek
-- Add timestamp_last_failure column to pa_application_callback table.
ALTER TABLE pa_application_callback ADD timestamp_last_failure datetime2(6);
GO

-- Changeset powerauth-java-server/1.9.x/20240704-callback-event-table.xml::9::Jan Pesek
-- Add failure_count column to pa_application_callback table.
ALTER TABLE pa_application_callback ADD failure_count int CONSTRAINT DF_pa_application_callback_failure_count DEFAULT 0 NOT NULL;
GO

-- Changeset powerauth-java-server/1.9.x/20240704-callback-event-table.xml::10::Jan Pesek
-- Add enabled column to pa_application_callback table.
ALTER TABLE pa_application_callback ADD enabled bit CONSTRAINT DF_pa_application_callback_enabled DEFAULT 1 NOT NULL;
8 changes: 0 additions & 8 deletions docs/sql/oracle/migration_1.8.0_1.9.0.sql
@@ -46,14 +46,6 @@ CREATE INDEX pa_app_cb_event_ts_del_idx ON pa_application_callback_event(timesta
-- Create a new sequence pa_app_callback_event_seq
CREATE SEQUENCE pa_app_callback_event_seq START WITH 1 INCREMENT BY 50 CACHE 20;

-- Changeset powerauth-java-server/1.9.x/20240704-callback-event-table.xml::8::Jan Pesek
-- Add timestamp_last_failure column to pa_application_callback table.
ALTER TABLE pa_application_callback ADD timestamp_last_failure TIMESTAMP(6);

-- Changeset powerauth-java-server/1.9.x/20240704-callback-event-table.xml::9::Jan Pesek
-- Add failure_count column to pa_application_callback table.
ALTER TABLE pa_application_callback ADD failure_count INTEGER DEFAULT 0 NOT NULL;

-- Changeset powerauth-java-server/1.9.x/20240704-callback-event-table.xml::10::Jan Pesek
-- Add enabled column to pa_application_callback table.
ALTER TABLE pa_application_callback ADD enabled BOOLEAN DEFAULT 1 NOT NULL;
8 changes: 0 additions & 8 deletions docs/sql/postgresql/migration_1.8.0_1.9.0.sql
@@ -46,14 +46,6 @@ CREATE INDEX pa_app_cb_event_ts_del_idx ON pa_application_callback_event(timesta
-- Create a new sequence pa_app_callback_event_seq
CREATE SEQUENCE IF NOT EXISTS pa_app_callback_event_seq START WITH 1 INCREMENT BY 50 CACHE 20;

-- Changeset powerauth-java-server/1.9.x/20240704-callback-event-table.xml::8::Jan Pesek
-- Add timestamp_last_failure column to pa_application_callback table.
ALTER TABLE pa_application_callback ADD timestamp_last_failure TIMESTAMP(6) WITHOUT TIME ZONE;

-- Changeset powerauth-java-server/1.9.x/20240704-callback-event-table.xml::9::Jan Pesek
-- Add failure_count column to pa_application_callback table.
ALTER TABLE pa_application_callback ADD failure_count INTEGER DEFAULT 0 NOT NULL;

-- Changeset powerauth-java-server/1.9.x/20240704-callback-event-table.xml::10::Jan Pesek
-- Add enabled column to pa_application_callback table.
ALTER TABLE pa_application_callback ADD enabled BOOLEAN DEFAULT TRUE NOT NULL;

@@ -18,11 +18,13 @@

package io.getlime.security.powerauth.app.server.configuration;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import io.getlime.security.powerauth.app.server.database.model.entity.CallbackUrlEntity;
import io.getlime.security.powerauth.app.server.service.callbacks.CallbackUrlRestClientCacheLoader;
import io.getlime.security.powerauth.app.server.service.callbacks.model.CachedRestClient;
import io.getlime.security.powerauth.app.server.service.callbacks.model.FailureStats;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
@@ -56,4 +58,16 @@ public LoadingCache<String, CachedRestClient> callbackUrlRestClientCache(
.build(cacheLoader);
}

/**
* Configuration of the cache for gathering failure statistics during callback processing.
* {@link CallbackUrlEntity#getId()} is used as a cache key.
*
* @return Cache for FailureStats.
*/
@Bean
public Cache<String, FailureStats> callbackFailureStatsCache() {
return Caffeine.newBuilder()
.build();
}

}

@@ -90,11 +90,11 @@ public class PowerAuthCallbacksConfiguration {
* Number of allowed Callback URL Events failures in a row. When the threshold is reached no other
* events with the same Callback URL configuration will be posted.
*/
private Integer failureThreshold = 200;
private int failureThreshold = 200;

/**
* Period after which a Callback URL Event will be dispatched even though failure threshold is reached.
*/
private Duration resetTimeout = Duration.ofSeconds(60);
private Duration failureResetTimeout = Duration.ofSeconds(60);

}

@@ -121,18 +121,6 @@ public class CallbackUrlEntity implements Serializable {
@Convert(converter = DurationConverter.class)
private Duration retentionPeriod;

/**
* Timestamp of last callback failure.
*/
@Column(name = "timestamp_last_failure")
private LocalDateTime timestampLastFailure;

/**
* Number of failed callbacks in a row.
*/
@Column(name = "failure_count", nullable = false)
private Integer failureCount;

/**
* Whether the callback is enabled and can be used.
*/

@@ -39,22 +39,6 @@ public interface CallbackUrlRepository extends CrudRepository<CallbackUrlEntity,

List<CallbackUrlEntity> findByApplicationIdAndTypeOrderByName(String applicationId, CallbackUrlType type);

@Modifying
@Query("""
UPDATE CallbackUrlEntity c
SET c.failureCount = c.failureCount + 1, c.timestampLastFailure = :timestampLastFailure
WHERE c.id = :id
""")
void incrementFailureCount(String id, LocalDateTime timestampLastFailure);

@Modifying
@Query("""
UPDATE CallbackUrlEntity c
SET c.failureCount = 0, c.timestampLastFailure = NULL
WHERE c.id = :id
""")
void resetFailureCount(String id);

@Modifying
@Query("""
UPDATE CallbackUrlEntity c

@@ -111,7 +111,6 @@ public CreateCallbackUrlResponse createCallbackUrl(CreateCallbackUrlRequest requ
entity.setType(CallbackUrlTypeConverter.convert(request.getType()));
entity.setCallbackUrl(request.getCallbackUrl());
entity.setAttributes(request.getAttributes());
entity.setFailureCount(0);
final EncryptableString encrypted = callbackUrlAuthenticationEncryptor.encrypt(request.getAuthentication(), entity.getApplication().getId());
entity.setAuthentication(encrypted.encryptedData());
entity.setEncryptionMode(encrypted.encryptionMode());

@@ -18,13 +18,14 @@

package io.getlime.security.powerauth.app.server.service.callbacks;

import com.github.benmanes.caffeine.cache.Cache;
import io.getlime.security.powerauth.app.server.configuration.PowerAuthCallbacksConfiguration;
import io.getlime.security.powerauth.app.server.database.model.entity.CallbackUrlEntity;
import io.getlime.security.powerauth.app.server.database.model.entity.CallbackUrlEventEntity;
import io.getlime.security.powerauth.app.server.database.model.enumeration.CallbackUrlEventStatus;
import io.getlime.security.powerauth.app.server.database.repository.CallbackUrlEventRepository;
import io.getlime.security.powerauth.app.server.database.repository.CallbackUrlRepository;
import io.getlime.security.powerauth.app.server.service.callbacks.model.CallbackUrlEvent;
import io.getlime.security.powerauth.app.server.service.callbacks.model.FailureStats;
import lombok.AllArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;
@@ -47,8 +48,8 @@
public class CallbackUrlEventResponseHandler {

private final CallbackUrlEventRepository callbackUrlEventRepository;
private final CallbackUrlRepository callbackUrlRepository;
private final PowerAuthCallbacksConfiguration powerAuthCallbacksConfiguration;
private final Cache<String, FailureStats> callbackFailureStatsCache;

/**
* Handle successful Callback URL Event attempt.
@@ -68,7 +69,7 @@ public void handleSuccess(final CallbackUrlEvent callbackUrlEvent) {
callbackUrlEventEntity.setAttempts(callbackUrlEventEntity.getAttempts() + 1);
callbackUrlEventEntity.setStatus(CallbackUrlEventStatus.COMPLETED);
callbackUrlEventRepository.save(callbackUrlEventEntity);
callbackUrlRepository.resetFailureCount(callbackUrlEventEntity.getCallbackUrlEntity().getId());
resetFailureCount(callbackUrlEventEntity.getCallbackUrlEntity().getId());
}

/**
@@ -104,7 +105,7 @@ public void handleFailure(final CallbackUrlEvent callbackUrlEvent, final Throwab
}

callbackUrlEventRepository.save(callbackUrlEventEntity);
callbackUrlRepository.incrementFailureCount(callbackUrlEntity.getId(), LocalDateTime.now());
incrementFailureCount(callbackUrlEntity.getId());
}

/**
@@ -125,4 +126,27 @@ private static Duration calculateExponentialBackoffPeriod(final int attempts, fi
return Duration.ofMillis(Math.min(backoffMillis, maxBackoff.toMillis()));
}

private void incrementFailureCount(final String callbackUrlId) {
final int failureThreshold = powerAuthCallbacksConfiguration.getFailureThreshold();
if (failureThreshold == -1) {
logger.debug("Failure stats are turned off for Callback URL processing");
return;
}

callbackFailureStatsCache.asMap().compute(callbackUrlId, (key, cachedFailureStats) -> {
if (cachedFailureStats == null) {
return new FailureStats(1, LocalDateTime.now());
} else {
return new FailureStats(cachedFailureStats.failureCount() + 1, LocalDateTime.now());
}
});

}

private void resetFailureCount(final String callbackUrlId) {
callbackFailureStatsCache.asMap().computeIfPresent(callbackUrlId,
(key, cachedFailureStats) -> new FailureStats(0, cachedFailureStats.timestampLastFailure())
);
}

}
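
The in-memory failure counters introduced in this handler can be approximated with plain JDK types. The following is a minimal sketch, assuming a `ConcurrentHashMap` in place of the Caffeine cache's `asMap()` view; the class name `FailureCounterSketch`, its `get` accessor, and the nested `FailureStats` record are illustrative stand-ins and not part of the commit.

```java
import java.time.LocalDateTime;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class FailureCounterSketch {

    // Hypothetical stand-in for the FailureStats model referenced by the commit.
    record FailureStats(int failureCount, LocalDateTime timestampLastFailure) {}

    // Assumption: a plain concurrent map replaces the Caffeine Cache#asMap() view.
    private final ConcurrentMap<String, FailureStats> stats = new ConcurrentHashMap<>();
    private final int failureThreshold;

    FailureCounterSketch(final int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    // Atomically records one more failure; records nothing when the threshold is -1.
    void incrementFailureCount(final String callbackUrlId) {
        if (failureThreshold == -1) {
            return;
        }
        stats.compute(callbackUrlId, (key, cached) ->
                new FailureStats(cached == null ? 1 : cached.failureCount() + 1, LocalDateTime.now()));
    }

    // Resets the counter only for an existing entry and keeps its last failure timestamp.
    void resetFailureCount(final String callbackUrlId) {
        stats.computeIfPresent(callbackUrlId, (key, cached) ->
                new FailureStats(0, cached.timestampLastFailure()));
    }

    // Returns the current stats, or null if no failure has been recorded for the id.
    FailureStats get(final String callbackUrlId) {
        return stats.get(callbackUrlId);
    }
}
```

Because `compute` and `computeIfPresent` run atomically per key, concurrent dispatch threads do not lose increments, and no database update is issued per callback result. The counters live in memory only, so in this sketch they are per node and are lost on restart.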

@@ -18,6 +18,7 @@

package io.getlime.security.powerauth.app.server.service.callbacks;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.LoadingCache;
import com.wultra.core.rest.client.base.RestClient;
import com.wultra.core.rest.client.base.RestClientException;
@@ -30,6 +31,7 @@
import io.getlime.security.powerauth.app.server.service.callbacks.model.CachedRestClient;
import io.getlime.security.powerauth.app.server.service.callbacks.model.CallbackUrlConvertor;
import io.getlime.security.powerauth.app.server.service.callbacks.model.CallbackUrlEvent;
import io.getlime.security.powerauth.app.server.service.callbacks.model.FailureStats;
import io.getlime.security.powerauth.app.server.service.util.TransactionUtils;
import jakarta.annotation.PostConstruct;
import lombok.AllArgsConstructor;
@@ -63,6 +65,7 @@ public class CallbackUrlEventService {
private final CallbackUrlEventRepository callbackUrlEventRepository;
private final CallbackUrlEventResponseHandler callbackUrlEventResponseHandler;
private final LoadingCache<String, CachedRestClient> restClientCache;
private final Cache<String, FailureStats> callbackFailureStatsCache;

private final PowerAuthServiceConfiguration powerAuthServiceConfiguration;
private final PowerAuthCallbacksConfiguration powerAuthCallbacksConfiguration;
@@ -175,13 +178,21 @@ public int obtainMaxAttempts(final CallbackUrlEntity callbackUrlEntity) {
* @return True if the callback should be processed, false otherwise.
*/
public boolean failureThresholdReached(final CallbackUrlEntity callbackUrlEntity) {
final Integer failureThreshold = powerAuthCallbacksConfiguration.getFailureThreshold();
final Duration resetTimeout = powerAuthCallbacksConfiguration.getResetTimeout();
final String callbackUrlId = callbackUrlEntity.getId();
final FailureStats failureStats = callbackFailureStatsCache.getIfPresent(callbackUrlId);
if (failureStats == null) {
logger.debug("No failure stats available yet for Callback URL processing: id={}", callbackUrlId);
return false;
}

final int failureThreshold = powerAuthCallbacksConfiguration.getFailureThreshold();
final Duration resetTimeout = powerAuthCallbacksConfiguration.getFailureResetTimeout();

final Integer failureCount = callbackUrlEntity.getFailureCount();
final LocalDateTime timestampLastFailure = Objects.requireNonNullElse(callbackUrlEntity.getTimestampLastFailure(), LocalDateTime.MAX);
final int failureCount = failureStats.failureCount();
final LocalDateTime timestampLastFailure = failureStats.timestampLastFailure();

if (failureCount >= failureThreshold && LocalDateTime.now().minus(resetTimeout).isAfter(timestampLastFailure)) {
logger.debug("Callback URL reached failure threshold, but before specified reset timeout period, id={}", callbackUrlId);
return false;
}
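
The threshold check in this hunk can be read as a small pure function. The sketch below is written under stated assumptions: the class `CircuitBreakerSketch`, the nested `FailureStats` record, and the explicit `now` parameter are hypothetical, and the final `return` is a guess at the part of the method that the hunk does not show.

```java
import java.time.Duration;
import java.time.LocalDateTime;

final class CircuitBreakerSketch {

    // Hypothetical stand-in for the FailureStats model referenced by the commit.
    record FailureStats(int failureCount, LocalDateTime timestampLastFailure) {}

    // Returns true when dispatch should be skipped for this Callback URL.
    static boolean failureThresholdReached(final FailureStats stats, final int failureThreshold,
                                           final Duration resetTimeout, final LocalDateTime now) {
        if (stats == null) {
            // No failure has been recorded yet, so dispatch proceeds.
            return false;
        }
        final boolean overThreshold = stats.failureCount() >= failureThreshold;
        final boolean timeoutElapsed = now.minus(resetTimeout).isAfter(stats.timestampLastFailure());
        if (overThreshold && timeoutElapsed) {
            // The halt period is over: allow an attempt to probe the receiver.
            return false;
        }
        // Assumption: the remainder of the method, not visible in the hunk, reduces to this comparison.
        return overThreshold;
    }
}
```

Passing `now` as a parameter is a choice made for this sketch so that the decision can be tested deterministically; the code in the commit calls `LocalDateTime.now()` directly.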
