Delivery policy and TriggerEngine revision proposal #554

Closed
Annopaolo opened this issue Mar 18, 2021 · 5 comments
Labels
app:trigger_engine This issue or pull request is about astarte_trigger_engine application discussion This issue needs to be investigated/discussed (it might be already fixed, invalid or duplicated) enhancement New feature or request

@Annopaolo
Collaborator

Motivation

AMQP triggers and HTTP triggers are treated differently: an AMQP trigger generates a message which is sent to an exchange, and the handling of delivery errors is up to the third-party client. An HTTP trigger, by contrast, generates a message and must also handle the delivery of that message.
Therefore, AMQP and HTTP triggers should be two separate things, and HTTP triggers should be provided with a policy to address possible errors.

Definitions

  • Routing rule: the old AMQP trigger. Mechanism for the delivery of an AMQP message to an exchange upon the occurrence of specified conditions.
  • Trigger: the old HTTP trigger. Mechanism for the generation of actions upon the occurrence of specified conditions. An action delivers an HTTP message to a recipient.
  • (Delivery) Policy: structured description of how data from triggers are delivered to the recipient.

Description

A policy is an attribute of a trigger. The main focus of the policy is error handling upon delivery of messages. The payload is described by the action of the trigger. Each policy has a unique name.
Due to constraints on the size of data, a policy cannot in general guarantee that a message will reach its destination (e.g. when messages are sent to a non-existing resource). However, if the destination is available, a policy guarantees the reception of the message, unless explicitly stated otherwise.

The main attributes of a policy are the error_handlers to apply if a message does not reach its destination and the retry_queues to use if messages are to be resent after a failed delivery.

Main components

  • Error handling is done by a list of handlers. Each handler refers to one or more delivery errors and describes the retreat strategy Astarte should take when they occur.
    It is possible to select a range of error numbers (e.g. HTTP client errors: [400 ... 499]) and explicitly exclude some codes in the range.
    There are two possible retreat strategies: none (ignore the error) or retry. When retrying, an associated retry queue must be specified. The default retreat strategy is none on all errors.

  • The retry queue holds messages whose delivery has failed and that must be resent.
    To bound memory usage, the size of the retry queue must be fixed, so it is not optional. This gives an upper bound on the amount of space a retry queue uses.
    Every message in a retry queue can be resent up to a specified number of times.
    If the number of messages in the retry queue exceeds its size, the maximum capacity policy specifies which messages to delete. The maximum capacity policy is either fifo (default) or lifo.
    The order of the retry queue can be based on the timestamp of either the first or the last delivery failure.
    It is possible to have more than one handler related to the same queue. In this case, all messages in the queue share the same number of retries, as specified by the queue.

Syntax

Policy:

{
    "name" : <string>,
    "error_handlers" : [<handler>],
    "retry_queues" : [<string>]
}

If error_handlers is empty, then messages failing to reach their destination are lost.

Handler:

{
    "error_type" : [<int>] | '[' <int> '...' <int> ']' | "*",
    "excluding" : [<int>],
    "retreat_strategy" : "none" | "retry",
    "retry_queue": <string>
}

The error_type list or range must contain only valid HTTP error codes. Every error not covered by a handler defaults to the none retreat strategy. "*" is a shortcut for all errors.
excluding is optional.
If retreat_strategy is not retry, then retry_queue must not be set.
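
To make the matching rules above concrete, here is a minimal Python sketch of how a handler list could be evaluated. It is only an illustration, not an implementation plan: the function names are hypothetical, the range form is assumed to be encoded as a `{"from": ..., "to": ...}` object (the `[<int> ... <int>]` notation is not valid JSON), and "all errors" is assumed to mean status codes 400 to 599.

```python
def handler_matches(handler: dict, status: int) -> bool:
    """Return True if `handler` covers the HTTP status code `status`."""
    error_type = handler["error_type"]
    if error_type == "*":
        # Assumption: "all errors" means every 4xx and 5xx status code.
        covered = 400 <= status <= 599
    elif isinstance(error_type, dict):
        # Assumed encoding of the '[' <int> '...' <int> ']' range form.
        covered = error_type["from"] <= status <= error_type["to"]
    else:
        covered = status in error_type
    # `excluding` is optional and removes codes from the covered set.
    return covered and status not in handler.get("excluding", [])


def retreat_strategy(handlers: list, status: int) -> tuple:
    """Return (strategy, retry_queue) for `status`.

    Errors not covered by any handler fall back to ("none", None).
    """
    for handler in handlers:
        if handler_matches(handler, status):
            return handler["retreat_strategy"], handler.get("retry_queue")
    return "none", None
```

The first matching handler wins in this sketch; the proposal does not say what should happen when two handlers cover the same code, so that choice is an assumption too.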

Retry queue:

{
    "name" : <string>,
    "retry_times" : <int>,
    "order" : "first_failure" | "last_failure",
    "maximum_capacity" : <int>,
    "mc_policy" : "fifo" | "lifo"
}

Each message in the queue will be resent up to retry_times times, unless the maximum_capacity is exceeded.
order describes which timestamp to use for ordering the queue: that of the first_failure or that of the last_failure.
maximum_capacity sets the size of the retry queue, while mc_policy describes the policy for deleting messages that exceed it.
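
As a rough illustration of how retry_times, maximum_capacity and mc_policy interact, the following Python sketch models a retry queue in memory. The class and method names are hypothetical, and the order attribute is left out for brevity (entries are simply kept in insertion order).

```python
from dataclasses import dataclass, field


@dataclass
class RetryQueue:
    retry_times: int
    maximum_capacity: int
    mc_policy: str = "fifo"  # "fifo" drops the oldest entry, "lifo" the newest
    entries: list = field(default_factory=list)  # (message, attempts), oldest first

    def enqueue(self, message, attempts=0):
        """Add a failed message; return the messages dropped as a result."""
        if attempts >= self.retry_times:
            # The retry budget is spent: the message is not enqueued again.
            return [message]
        self.entries.append((message, attempts))
        dropped = []
        # Enforce the fixed size, which bounds the memory used by the queue.
        while len(self.entries) > self.maximum_capacity:
            index = 0 if self.mc_policy == "fifo" else -1
            dropped.append(self.entries.pop(index)[0])
        return dropped
```

Note that with lifo the message that has just failed is itself the newest one, so on a full queue it is dropped immediately.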

Examples

The simplest policy:

{
    "name" : "simple_policy",
    "error_handlers" : [],
    "retry_queues" : []
}

This policy gives no delivery guarantee for any HTTP error.

A more complex policy:

{
    "name" : "complex_policy",
    "error_handlers" : [
            {
                "error_type" : "*",
                "excluding" : [402, 403, 404],
                "retreat_strategy" : "none"
            },
            {
                "error_type" : [404],
                "retreat_strategy" : "retry",
                "retry_queue": "queue_1"
            },
            {
                "error_type" : [402, 403],
                "retreat_strategy" : "retry",
                "retry_queue": "queue_2"
            }
        ],
    "retry_queues" : [
            {
                "name" : "queue_1",
                "retry_times" : 5,
                "order" : "first_failure",
                "maximum_capacity" : 500,
                "mc_policy" : "fifo"
            },
            {
                "name" : "queue_2",
                "retry_times" : 1,
                "order" : "last_failure",
                "maximum_capacity" : 100,
                "mc_policy" : "lifo"
            }
        ]
}

If an HTTP error other than 402, 403 or 404 occurs, Astarte will do nothing.
If an HTTP 404 error occurs, Astarte will try to resend the message up to 5 times, using a retry queue holding up to 500 messages, ordered by the timestamp of the first failure;
if the queue exceeds that capacity, the oldest messages will be deleted, even if they were resent fewer than 5 times.
If an HTTP 402 or 403 error occurs, Astarte will try to resend the message once, using a retry queue holding up to 100 messages, ordered by the timestamp of the last failure;
if the queue exceeds that capacity, the newest messages will be deleted, even if they were never resent.

@rbino rbino added app:trigger_engine This issue or pull request is about astarte_trigger_engine application discussion This issue needs to be investigated/discussed (it might be already fixed, invalid or duplicated) enhancement New feature or request labels Mar 22, 2021
@Annopaolo
Collaborator Author

Annopaolo commented Mar 23, 2021

A possible implementation may revise the current delivery of messages from Data Updater Plant to Trigger Engine.
At the moment, all HTTP data are published by Data Updater Plant to the astarte_events exchange, in a single trigger_engine queue. Trigger Engine then synchronously processes each message in order and acks the broker.
A revised implementation might allow for more than one AMQP queue (possibly user-defined), and for nacks to track failed deliveries upon HTTP errors.
This means that retry_queues are to be mapped to AMQP queues, so some of the proposed attributes (e.g. "order" and "mc_policy") might not be implemented. In general, retry_queue attributes would be closer to those of RabbitMQ queues.
Some combinations of policies and queues (1-to-1, n-to-m) are to be discussed.

Revised Syntax (depending on the type of policy/queue combination)

Policy: unchanged.
Handler:

    "error_type" : [<int>] | '[' <int> '...' <int> ']' | "*",
    "excluding" : [<int>],
    "retreat_strategy" : "none" | "retry",
    "message_ttl" : <int>

message_ttl is optional and sets an expiration time in milliseconds for each message.

Queue:

    "name" : <string>,
    "retry_times" : <int>,
    "maximum_capacity" : <int>,
    "mc_policy" : "remove_old" | "reject_publish",
    "increase_priority_on_failure" : "true" | "false",
    "max_priority_level" : <1 ... 10>

A RabbitMQ queue either discards failed messages or re-enqueues them near their original position (see here).
The default behavior when the queue capacity is exceeded is to delete the oldest messages, but the queue may be set to reject new messages until publication is possible again (see here). This might not be a desirable feature, as newer messages may be lost.
There is no built-in tracking of retry_times, so this will probably be implemented by adding a retry_time field to the message.
It could be useful to increase the priority of failed messages in order to resend them as soon as possible; in that case, a maximum priority level must be set (see here). Note, however, that this may alter the order of messages.
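
A minimal sketch of the retry_time idea, in Python and independent of any AMQP client: the counter travels with the message and is checked on every failed delivery. The function name and the representation of a message as a dict are assumptions made for illustration only.

```python
def on_delivery_failure(message: dict, retry_times: int):
    """Return the message to re-enqueue, or None once the retry budget is spent.

    The number of resends so far is carried in the `retry_time` field,
    which is absent on a message that has never been retried.
    """
    attempts = message.get("retry_time", 0)
    if attempts >= retry_times:
        return None
    # Build a new message instead of mutating the one that failed.
    return {**message, "retry_time": attempts + 1}
```

With retry_times set to 2, a message is re-enqueued after its first and second failure and dropped after the third.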

Possible combinations of policies and queues

Assuming 1,000 messages with 10 KB of payload, a RabbitMQ queue with Astarte's internal messages could use up to 10 MB of memory (see here). In a production environment, we may assume 30-50 installed Astarte interfaces.

  • 1-to-1: a policy, a queue
    • Pro: an upper bound on memory usage is easy to estimate, and the order of messages is preserved;
    • Con: every retried message has the same retry_times and the same message_ttl.
  • 1-to-many: a policy, many queues
    • Pro: the client can customize retry_times and message_ttl for each kind of error;
    • Con: the client may receive messages out of order, and there is no bound on memory usage.
  • many-to-many: policies and queues are not strictly related
    • Pro: as customizable as the 1-to-many case, and there could be a hardcoded upper bound on the number of queues;
    • Con: queues need an installation procedure (with its overhead), and the client may receive messages out of order.

@Annopaolo
Collaborator Author

Annopaolo commented Mar 23, 2021

Other interesting points:

  • it should be possible to install a policy in the same way an interface is installed, so that interfaces can point to policies without declaring them at the same moment;
  • "increase_priority_on_failure" and "max_priority_level" have no interesting use cases;
  • "reject_publish" policy is useless;
  • Probably Astarte Cloud can lead to using more than ~50 interfaces;
  • A simple implementation of queues without using RabbitMQ may be considered.

@Annopaolo
Collaborator Author

Annopaolo commented Mar 26, 2021

Revised proposal

Definitions

  • Routing rule: the old AMQP trigger. Mechanism for the delivery of an AMQP message to an exchange upon the occurrence of specified conditions.
  • Trigger: the old HTTP trigger. Mechanism for the generation of actions upon the occurrence of specified conditions. An action delivers an HTTP event to a recipient.
  • Events: messages sent by a trigger.
  • (Delivery) Policy: structured description of how data from triggers are delivered to the recipient.
  • (Event) queue: the queue of events to be delivered using a policy.

Description

The main focus of the policy is error handling upon delivery of events. The payload is described by the action of the trigger. Each policy has a unique name and must be installed like an interface.
A trigger must specify an installed policy in the required policy field.
Due to constraints on the size of data, a policy cannot in general guarantee that an event will reach its destination (e.g. when events are sent to a non-existing resource).

A policy describes what to do in case of delivery errors and how undelivered events are enqueued.
A policy is mapped to one queue, and queues are not shared among policies.

Main components

  • Error handling is done by a list of handlers. Each handler refers to a group of delivery errors and describes the strategy Astarte should take when they occur.
    It is possible to select the constants for client errors, server errors or all errors, or to explicitly write an array of error codes.
    There are two possible retreat strategies: discard the event or retry. When retrying, events will be enqueued. The default retreat strategy is to discard on all errors.

  • The event queue holds events to be delivered.
    To bound memory usage, the size of the event queue must be fixed, so it is not optional. This gives an upper bound on the amount of space used.
    Every event in the event queue can be resent up to a specified number of times. Different retry counts within the same queue could disrupt the order of events and are therefore not allowed.
    If the number of events in the queue exceeds its size, the maximum capacity policy specifies which events to delete.
    It is possible to specify an event TTL to further lower the space requirement of the queue.

Syntax

Policy:

{
    "name" : <string>,
    "error_handlers" : [<handler>],
    "retry_times" : <int>,
    "maximum_capacity" : <int>,
    "event_ttl" : <int>
}

If error_handlers is empty, then events failing to reach their destination are lost.
retry_times is optional, but required if a handler specifies the retry strategy.
Each event in the queue will be resent up to retry_times times, unless the maximum_capacity is exceeded. In that case, older events will be deleted from the queue until the number of enqueued events is below maximum_capacity.
event_ttl is optional and may cause an event to be discarded even if neither of the former conditions is met.

Handler:

{
    "on" : "client_error" | "server_error" | "any_error" | [<int>],
    "strategy" : "discard" | "retry"
}

The on field refers to the kind of HTTP errors that will be handled: client_error (400-499), server_error (500-599), any_error (400-599), or a custom array of error codes. Different handlers must not have overlapping on fields.
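
The non-overlap rule could be checked when a policy is installed. The Python sketch below expands each on value into a set of status codes and reports the handlers that collide; the function names are hypothetical and this is only one possible way to perform the validation.

```python
from itertools import combinations

# Status code ranges for the named groups, as given in the proposal.
NAMED_GROUPS = {
    "client_error": range(400, 500),
    "server_error": range(500, 600),
    "any_error": range(400, 600),
}


def codes_for(on) -> set:
    """Expand an `on` value into the set of HTTP status codes it covers."""
    if isinstance(on, str):
        return set(NAMED_GROUPS[on])
    return set(on)


def overlapping_handlers(handlers: list) -> list:
    """Return index pairs of handlers whose `on` fields share a status code."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(handlers), 2)
        if codes_for(a["on"]) & codes_for(b["on"])
    ]
```

A policy would be accepted only if overlapping_handlers returns an empty list. In particular, any_error cannot be combined with any other handler.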

Examples

A simple policy:

{
    "name" : "simple_policy",
    "error_handlers" : [],
    "maximum_capacity" : 100
}

This policy gives no delivery guarantee for any HTTP error. At most 100 events can be in the queue at any time.

A more complex policy:

{
    "name" : "complex_policy",
    "error_handlers" : [
            {
                "on" : "server_error",
                "strategy" : "discard"
            },
            {
                "on" : "client_error",
                "strategy" : "retry"
            }
        ],
    "retry_times" : 5,
    "maximum_capacity" : 100,
    "event_ttl" : 10
}

If an HTTP server error occurs, Astarte will do nothing. At most 100 events can be in the queue at any time.
If an HTTP client error occurs, Astarte will try to resend the event up to 5 times.
If more than 100 events are present in the queue, the oldest ones will be deleted (even if they were resent fewer than 5 times in the case of HTTP client errors). If any event stays in the queue for longer than 10 seconds, it will be discarded.

Implementation details

The implementation will revise the current delivery of messages from Data Updater Plant to Trigger Engine.
At the moment, all HTTP data are published by Data Updater Plant to the astarte_events exchange, in a single trigger_engine queue. Trigger Engine then synchronously processes each message in order and acks the broker.
The implementation will map each event queue to an AMQP queue on the astarte_events exchange. It would be worth allowing nacks, in order to track failed deliveries upon HTTP errors.
Attributes of event queues can be mapped to RabbitMQ queue attributes, with the exception of retry_times, which will be implemented on top of RabbitMQ.
RabbitMQ's lazy queues might be considered to lower RAM requirements by offloading events to disk.
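
For reference, this is roughly how the policy attributes could translate into RabbitMQ optional queue arguments. The Python sketch only builds the arguments dictionary that would be passed when declaring the queue; the function name is hypothetical, and event_ttl is assumed to be expressed in seconds (as in the example above), while x-message-ttl is in milliseconds. retry_times has no RabbitMQ counterpart and is therefore not mapped.

```python
def queue_arguments(policy: dict, lazy: bool = False) -> dict:
    """Translate policy attributes into RabbitMQ optional queue arguments."""
    arguments = {
        # Fixed queue size: bounds the number of events kept by the broker.
        "x-max-length": policy["maximum_capacity"],
        # When the queue is full, delete the oldest events first.
        "x-overflow": "drop-head",
    }
    if "event_ttl" in policy:
        # Assumption: event_ttl is in seconds; RabbitMQ expects milliseconds.
        arguments["x-message-ttl"] = policy["event_ttl"] * 1000
    if lazy:
        # Lazy queue mode moves events to disk to lower RAM usage.
        arguments["x-queue-mode"] = "lazy"
    return arguments
```

For the complex policy above, this yields a queue limited to 100 events, dropping from the head, with a message TTL of 10000 ms.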

Here is the JSON schema.

@Annopaolo
Collaborator Author

See also #117

@Annopaolo
Collaborator Author

Closed by #583.
