
[QUERY] Internal AsEnumerable in EventDataBatch #20568

Closed
rzepinskip opened this issue Apr 21, 2021 · 10 comments · Fixed by #21881
Assignees
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. Event Hubs needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that

Comments

@rzepinskip

Query/Question
I am migrating code from the Microsoft.Azure.EventHubs SDK to Azure.Messaging.EventHubs, and we create and send EventDataBatch instances. Previously, this class had a public ToEnumerable (docs) method, while the current version of the library only has an internal AsEnumerable (source code). What is the reasoning behind this change?

ToEnumerable was useful for:

  1. Verifying items sent in tests - you can track items added to a batch by EventHubsModelFactory.EventDataBatch in the new library, but then you have to keep a reference between an instance of EventDataBatch and its items to verify them in the Send call.
  2. Implementing a failover scenario (other than sending to another Event Hub [1]) - in case of a Send failure, you could inspect the items in a batch and log or forward them.

[1] Is calling CreateBatchAsync() on one Event Hub and sending the resulting object to another Event Hub even supported? I tested that it works (as long as these Event Hubs have the same maximum batch size, e.g. Standard tier), but I am not sure whether this is intended.
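
To make use case 2 concrete, here is a toy sketch of the failover pattern we have in mind. This is plain .NET, not the Event Hubs SDK; `SendWithFallback` and the use of `string` as a stand-in for an event are made up for illustration:

```csharp
using System;
using System.Collections.Generic;

// Toy illustration of the failover idea (not the SDK API): the application
// tracks the events it added to a batch, and on a send failure it gets that
// list back so it can log the events or forward them elsewhere.
static class FailoverSketch
{
    public static IReadOnlyList<string> SendWithFallback(
        IReadOnlyList<string> trackedEvents,
        Action<IReadOnlyList<string>> send)
    {
        try
        {
            send(trackedEvents);
            return Array.Empty<string>(); // all published; nothing to recover
        }
        catch (Exception)
        {
            return trackedEvents; // caller can log these or send them elsewhere
        }
    }
}
```

With ToEnumerable, the tracked list came for free from the batch itself; without it, the application has to maintain the list alongside the batch.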

Environment:

  • Name and version of the Library package used: Azure.Messaging.EventHubs 5.4.0
  • Hosting platform or OS and .NET runtime version (dotnet --info output for .NET Core projects): Azure Cloud Service, .NET Framework 4.7.2
  • IDE and version: Visual Studio 16.9.4
@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Apr 21, 2021
@jsquire jsquire self-assigned this Apr 21, 2021
@jsquire jsquire added Client This issue points to a problem in the data-plane of the library. Event Hubs labels Apr 21, 2021
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Apr 21, 2021
@jsquire
Member

jsquire commented Apr 21, 2021

Hi @rzepinskip. Thank you for your feedback.

What is the reasoning behind this change?

When you add an event to the batch, it has to be measured to enforce the size limitations for publishing to the service. To ensure that the batch is reliable and those measurements remain accurate, we cannot allow changes to events after they've been accepted into the batch.

As a result, we make a defensive copy and intentionally do not allow access to the copied events to ensure that they cannot be mutated and potentially invalidate the batch. This also allows us to make certain optimizations to the batch, such as avoiding serializing the event multiple times by storing only the serialized form instead of copying the EventData model.

This also fixes a confusing behavior in the legacy library, in which the EventData that you read from the batch may not be the same data that is published: the event could be accessed and mutated, but the changes were not reflected in the copy that the batch sends to the service.
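
As a toy illustration of that defensive copy (plain .NET stand-ins, not the SDK types): mutating the caller's event after TryAdd does not change what the batch holds, so the measured size stays valid.

```csharp
using System.Collections.Generic;

// Toy stand-ins (not the SDK types) showing why a defensive copy keeps the
// batch reliable: changes made to the caller's event after TryAdd are not
// reflected in the copy the batch holds.
class MutableEvent
{
    public string Body { get; set; } = "";
    public MutableEvent Clone() => new MutableEvent { Body = Body };
}

class ToyBatch
{
    private readonly List<MutableEvent> _accepted = new List<MutableEvent>();

    public bool TryAdd(MutableEvent e)
    {
        _accepted.Add(e.Clone()); // copy on accept, analogous to EventDataBatch.TryAdd
        return true;
    }

    public IReadOnlyList<MutableEvent> Accepted => _accepted;
}
```

With these stand-ins, mutating `e.Body` after `batch.TryAdd(e)` leaves `batch.Accepted[0].Body` unchanged, which is the isolation the real batch relies on for its size accounting.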

Is calling a CreateBatchAsync() on one Event Hub and sending resulting object to another Event Hub even supported?

Yes, with the caveat that you mentioned. The batch itself is not associated with a specific Event Hub, but it queries the one associated with the producer that creates it in order to reflect the size limit. From the client's perspective, the size is controlled by the service and may change from Event Hub to Event Hub. Speaking practically, that's not how the service actually works today, as the size limit is controlled by the SKU.

@jsquire jsquire added the needs-author-feedback Workflow: More information is needed from author to address the issue. label Apr 21, 2021
@rzepinskip
Author

rzepinskip commented Apr 22, 2021

@jsquire Thank you for the prompt response.

As a result, we make a defensive copy and intentionally do not allow access to the copied events to ensure that they cannot be mutated and potentially invalidate the batch.

Right, I just found the copy remark in the TryAdd docs. In the code I see a Clone() in EventDataBatch.TryAdd(), and the internally used AsEnumerable() returns the list of copied EventData instances stored in AmqpEventBatch.

I think the ability to read and the ability to change data are two different things, and removing the former to guard against the latter comes as a surprise to me. Would you be willing to bring read access back? I see two possible ways:

  1. Adding something like ToReadonlyEnumerable() with an AsEnumerable<EventData>().Select(x => x.Clone()) implementation
  2. Introducing a read-only version of EventData (a read-only Properties dictionary, etc.) and returning it from a now-public AsEnumerable<T>().

This also allows us to make certain optimizations to the batch, such as avoiding serializing the event multiple times and instead storing only the serialized form instead of coping the EventData model.

I am confused. In AmqpEventBatch I see a List<EventData> being stored, and the AmqpMessage created (is this the serialization you refer to?) is used only for size calculation and then discarded. The AmqpMessage creation also occurs on SendAsync() (code). This seems like serializing the event multiple times (when the old SDK did it only once).

From the client's perspective, the size is controlled by the service and may change from Event Hub to Event Hub. Speaking practically, that's not how the service actually works today, as the size limit is controlled by the SKU.

So the client is able to support a more granular scenario (an Event Hub-level size limit instead of a SKU-level one), but are there any plans to utilize it (and hence break the use case I mentioned)?

@jsquire
Member

jsquire commented Apr 22, 2021

Adding something like ToReadonlyEnumerable() with an AsEnumerable<EventData>().Select(x => x.Clone()) implementation

The challenge with that approach is that EventData itself is mutable. The clone that is returned can be modified by callers, but the changes would not be reflected in what is published to the Event Hubs service. This was the behavior that the legacy library had, which we had considered when designing the new library. We found that it caused confusion in user studies and had also generated similar feedback for the legacy SDK.

It also caused confusion due to the lack of strong identity for an event within the Event Hubs ecosystem in general. Because there's no identity concept, developers would often use a reference comparison like batch.AsEnumerable().Any(evt => evt == myEvent) to check "is this event in the batch?", which would return false even if that event had been added, due to the clone. Unfortunately, there's no way to override equality in a meaningful way, because two instances with the same payload and properties may or may not represent the same event.
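
To sketch the identity problem with plain .NET stand-ins (ToyEvent is a hypothetical type, not the SDK's EventData): reference equality fails against a cloned copy, and payload equality cannot distinguish "same event" from "same data".

```csharp
using System;

// Hypothetical stand-in for EventData, used only to illustrate the equality problem.
class ToyEvent
{
    public string Body { get; set; } = "";
    public ToyEvent Clone() => new ToyEvent { Body = Body };
}

class IdentityDemo
{
    static void Main()
    {
        var original = new ToyEvent { Body = "payload" };
        var storedInBatch = original.Clone(); // what a defensive-copying batch holds

        // "Is my event in the batch?" via reference comparison: always false.
        Console.WriteLine(ReferenceEquals(original, storedInBatch)); // False

        // Value comparison can't help either: a distinct event with the same
        // payload is indistinguishable from a clone of the original.
        var differentEventSamePayload = new ToyEvent { Body = "payload" };
        Console.WriteLine(storedInBatch.Body == differentEventSamePayload.Body); // True
    }
}
```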

Introducing a read-only version of EventData (a read-only Properties dictionary, etc.) and returning it from a now-public AsEnumerable<EventData>().

This helps with the first challenge but is still likely to suffer from the second. It also has two additional considerations: it precludes the optimization discussed in the next response, and it introduces another type into the hierarchy, increasing overall complexity. It's possible, but it would have to be accepted as a uniform change across languages and approved by the SDK architect board. From previous discussions, I see this as likely to meet a strong challenge that we probably don't have enough justification for at the moment.

I'll introduce this as a design-discussion item early next week and mention you on the issue. Any real-world scenarios that you or other community members can provide would be helpful to make the case, as would community members upvoting via the thumbs-up reaction.

I am confused. In AmqpEventBatch I see a List<EventData> being stored, and the AmqpMessage created (is this the serialization you refer to?) is used only for size calculation and then discarded.

Yes, that is what I'm referring to. What we're doing now is not what we'd like to be doing. We ran into problems with our transport library that forced us to hold onto the EventData model and pay the serialization cost twice. We'd rather not be doing that, and would prefer to be able to serialize once and then hold onto only the serialized version in the interest of better performance and reducing allocations. This is tracked by #9569, which also offers additional context.

So client is able to support more granular (Event Hub level size limit instead of SKU level) scenario but are there any plans to utilize it (and hence breaking the use case I mentioned)?

This isn't a decision made by the client. We react to the maximum allowed size that the service tells us each time we open an AMQP link. The service can send whatever value that it would like at that point and the client is expected to enforce it. Today the size is controlled by the SKU used for the namespace. I do not know of any plans for that to change, but I can't speak authoritatively to what the service may or may not do in the future.

@rzepinskip
Author

The challenge with that approach is that EventData itself is mutable. The clone that is returned can be modified by callers, but the changes would not be reflected in what is published to the Event Hubs service. This was the behavior that the legacy library had, which we had considered when designing the new library. We found that it caused confusion in user studies and had also generated similar feedback for the legacy SDK.

I understand the "changes not reflected" argument, but isn't this the same situation as in EventDataBatch.TryAdd? ToReadonlyEnumerable could carry the same "frozen state" remark.

It also caused confusion due to the lack of strong identity for an event within the Event Hubs ecosystem in general. Because there's no identity concept, developers would often use a reference comparison like batch.AsEnumerable().Any(evt => evt == myEvent) to check "is this event in the batch?", which would return false even if that event had been added, due to the clone. Unfortunately, there's no way to override equality in a meaningful way, because two instances with the same payload and properties may or may not represent the same event.

I didn't know identity was the issue. My initial understanding was: on the receiving side you could rely on the Event Hubs namespace + name + partition + sequence number combination, but having it on the producer side requires a custom implementation (e.g. a Guid entry in Properties), which does not seem like a big deal.

I'll introduce this as a design-discussion item early next week and mention you on the issue. Any real-world scenarios that you or other community members can provide would be helpful to make the case, as would community members upvoting via the thumbs-up reaction.

Thank you very much!

This isn't a decision made by the client. We react to the maximum allowed size that the service tells us each time we open an AMQP link. The service can send whatever value that it would like at that point and the client is expected to enforce it. Today the size is controlled by the SKU used for the namespace. I do not know of any plans for that to change, but I can't speak authoritatively to what the service may or may not do in the future.

I understand. I just wanted to know whether our "send the batch to a different Event Hub than the one that created it" scenario may stop working in future versions.

@ghost ghost added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-author-feedback Workflow: More information is needed from author to address the issue. labels Apr 23, 2021
@jsquire
Member

jsquire commented Apr 23, 2021

I have to admit that, after letting this spin in my head for a while, I'm more convinced that opening this up is likely to create more problems and confusion than it solves. The requested functionality can be added to an application in a fairly straightforward way, which has the added benefit of giving developers of the application more explicit awareness of the potentially unclear behavior.

The usage pattern that I'm thinking of would look something like:

    var connectionString = "<< SOME CONNECTION STRING >>";
    var hub = "<< SOME HUB >>";
    
    await using var producer = new EventHubProducerClient(connectionString, hub);
    
    // Create a batch and use it as the source for an application-specific wrapper that
    // tracks the source events as they're added.
    
    using var batch = new ObservableEventDataBatch(await producer.CreateBatchAsync());

    // Add some events to the batch.

    foreach (var body in new[] { "One", "Two", "Three" })
    {
        if (!batch.TryAdd(new EventData(new BinaryData(body))))
        {
            throw new Exception($"Could not add all events to the batch.  Failed at: { body }.");
        }
    }
    
    // Iterate through the events that were added.

    foreach (var eventData in batch.Events)
    {
        Debug.WriteLine($"Event Body: { eventData.EventBody.ToString() }");
    }

    Debug.WriteLine($"There are { batch.Count } events in the batch.");
    Debug.WriteLine($"The total size of the batch, in bytes, is { batch.SizeInBytes }");

    // Thanks to implicit conversion, the observable batch can be sent just like any 
    // EventDataBatch instance.
    
    await producer.SendAsync(batch);

Where ObservableEventDataBatch is a small wrapper around the EventDataBatch:

public class ObservableEventDataBatch : IDisposable
{
    private readonly List<EventData> _events = new();
    private readonly EventDataBatch _batch;

    public IReadOnlyList<EventData> Events { get; }

    public int Count => _batch.Count;
    public long SizeInBytes => _batch.SizeInBytes;
    public long MaximumSizeInBytes => _batch.MaximumSizeInBytes;

    public ObservableEventDataBatch(EventDataBatch sourceBatch)
    {
        _batch = sourceBatch ?? throw new ArgumentNullException(nameof(sourceBatch));
        Events = _events.AsReadOnly();
    }

    public bool TryAdd(EventData eventData)
    {
        if (_batch.TryAdd(eventData))
        {
            _events.Add(eventData);
            return true;
        }

        return false;
    }

    public void Dispose() => _batch.Dispose();

    public static implicit operator EventDataBatch(ObservableEventDataBatch observable) => observable._batch;
}

I'm still committed to doing a design-discussion write-up to gather more perspectives, but I wanted to be transparent in my thinking and suggest a work-around that could be used today.

@jsquire
Member

jsquire commented Apr 23, 2021

I didn't know identity was the issue. My initial understanding was: on receiving side you could rely on EventHub namespace + name + partition + sequence number combination but having it on producer requires custom implementation (e.g. Guid entry in Properties) which does not seem like a big deal.

The subject of identity for an event comes up quite often. While the sequence number + entity properties combination that you mention can be used to identify a specific instance of an event in the eyes of the broker, that's not quite the same thing as understanding the application's intent. To illustrate:

var json = await GetDataFromSomeServiceAsync();
var first = new EventData(new BinaryData(json));
var second = new EventData(first.EventBody);

await producer.SendAsync(first);
await producer.SendAsync(second);

Those events have the same data and even share the same body instance. Does the application consider those "the same event"? The broker does not. Should the receiving application scrub duplicate data, or should it accept the broker's view that "there really are two of these"?

What about:

var eventData = new EventData(new BinaryData(await GetJsonFromServiceAsync()));

try
{
    await producer.SendAsync(eventData);
}
catch (TimeoutException)
{
    await producer.SendAsync(eventData);
}

The same scenario plays out when the publishing application crashes and needs to resume, as well as in other cases. In that same category, Event Hubs offers an "at least once" guarantee; a service failure can cause duplication (it's very rare, but not impossible).

The guidance for this seems straightforward: use an application property that assigns some unique identifier to your data that your application(s) recognize. However, we see this come up consistently in questions and feedback where folks are confused or don't wish to own responsibility for identity. When designing the API for the SDK, we try our best to take this into account and not introduce additional areas of confusion.
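
A minimal sketch of that guidance, under stated assumptions: the property key "app-event-id", the class names, and the IDictionary stand-in for EventData.Properties are all made up for illustration. The producer stamps a unique ID, and the receiver deduplicates on it.

```csharp
using System;
using System.Collections.Generic;

// Hedged sketch of application-owned identity (not an SDK feature): the producer
// stamps a unique ID into the event's properties, and the receiver tracks seen
// IDs to detect redeliveries and duplicates.
class AppIdentity
{
    public const string IdKey = "app-event-id";

    public static void Stamp(IDictionary<string, object> properties) =>
        properties[IdKey] = Guid.NewGuid().ToString();
}

class Deduplicator
{
    private readonly HashSet<string> _seen = new HashSet<string>();

    // True the first time an ID is observed; false for redeliveries/duplicates.
    public bool IsFirstDelivery(IDictionary<string, object> properties) =>
        properties.TryGetValue(AppIdentity.IdKey, out var id) && _seen.Add((string)id);
}
```

The design choice here is that the application, not the broker, defines what "the same event" means, which is exactly the responsibility some developers would prefer not to own.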

@jsquire jsquire added needs-author-feedback Workflow: More information is needed from author to address the issue. and removed needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team labels Apr 23, 2021
@rzepinskip
Author

rzepinskip commented Apr 23, 2021

The requested functionality can be added to an application in a fairly straightforward way, which has the added benefit of giving developers of the application more explicit awareness of the potentially unclear behavior.

We went down the same route a few hours ago internally and stumbled upon one minor issue: you may get your objects out of sync if somebody does ObservableEventDataBatch.Batch.TryAdd(EventData) instead of ObservableEventDataBatch.TryAdd(EventData). In our case, making ObservableEventDataBatch.Batch internal was good enough (as we expose a wrapper around the producer from our library, so users deal with ObservableEventDataBatch and not EventDataBatch; we just have to avoid this mistake internally in the library code). The other solution may be to make EventDataBatch private, store SendEventOptions, and use Send(ObservableEventDataBatch.Events, SendEventOptions)?

The guidance for this seems straightforward, use an application property that assigns some unique identifier to your data that your application(s) recognize. However, we see this come up consistently as questions and feedback where folks are confused or don't wish to own responsibility for identity. When designing the API for the SDK, we try to do our best to take this into account and not introduce additional areas of confusion.

Do you plan to add some form of ID field into Event Hubs then or just provide guidance for custom implementation in the docs?

@ghost ghost added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-author-feedback Workflow: More information is needed from author to address the issue. labels Apr 23, 2021
@jsquire
Member

jsquire commented Apr 26, 2021

We went down the same route a few hours ago internally and stumbled upon one minor issue: you may get your objects out of sync if somebody does ObservableEventDataBatch.Batch.TryAdd(EventData) instead of ObservableEventDataBatch.TryAdd(EventData).

Agreed; there's no absolute way to prevent this that I can see. Even if we hide the member, you could still cast to EventDataBatch and see the behavior. It's another potentially confusing behavior and another trade-off.

Do you plan to add some form of ID field into Event Hubs then or just provide guidance for custom implementation in the docs?

AMQP offers a MessageId annotation, which we'll be promoting to a first-class property on EventData as part of #20105. We could default the value and allow callers to update it with their own concept (which we may do), but that's really not much better than a custom application property without the broker recognizing it and promoting it as the identity of an event within the ecosystem.

@jsquire jsquire added the needs-author-feedback Workflow: More information is needed from author to address the issue. label Apr 26, 2021
@ghost ghost removed the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Apr 26, 2021
@rzepinskip
Author

Agreed; there's no absolute way to prevent this that I can see. Even if we hide the member, you could still cast to EventDataBatch and see the behavior.

Well, I personally would avoid casting and go with one of these (probably the first one):

  1. An internal EventDataBatch member + a Send extension method for EventHubProducerClient
  2. Not exposing EventDataBatch at all, but this requires storing CreateBatchOptions and a custom CreateBatchAsync.

(Omitted CancellationToken for brevity)
[1]

public class ObservableEventDataBatch : IDisposable
{
    private readonly List<EventData> _events = new();
    internal EventDataBatch Batch { get; }

    public IReadOnlyList<EventData> Events { get; }
    public int Count => Batch.Count;
    public long SizeInBytes => Batch.SizeInBytes;
    public long MaximumSizeInBytes => Batch.MaximumSizeInBytes;

    public ObservableEventDataBatch(EventDataBatch sourceBatch)
    {
        Batch = sourceBatch ?? throw new ArgumentNullException(nameof(sourceBatch));
        Events = _events.AsReadOnly();
    }

    public bool TryAdd(EventData eventData)
    {
        if (Batch.TryAdd(eventData))
        {
            _events.Add(eventData);
            return true;
        }

        return false;
    }

    public void Dispose() => Batch.Dispose();
}

public static class EventHubProducerClientExtensions
{
    public static Task SendAsync(this EventHubProducerClient producer, ObservableEventDataBatch observableBatch)
    {
        return producer.SendAsync(observableBatch.Batch);
    }
}

[2]

public class ObservableEventDataBatch : IDisposable
{
    private readonly List<EventData> _events = new();
    private readonly EventDataBatch _batch;

    public IReadOnlyList<EventData> Events { get; }
    public SendEventOptions Options { get; }
    public int Count => _batch.Count;
    public long SizeInBytes => _batch.SizeInBytes;
    public long MaximumSizeInBytes => _batch.MaximumSizeInBytes;

    internal ObservableEventDataBatch(EventDataBatch sourceBatch, CreateBatchOptions options = null)
    {
        _batch = sourceBatch ?? throw new ArgumentNullException(nameof(sourceBatch));
        Events = _events.AsReadOnly();
        Options = options; // We do not have access to EventDataBatch.SendOptions
    }

    public bool TryAdd(EventData eventData)
    {
        if (_batch.TryAdd(eventData))
        {
            _events.Add(eventData);
            return true;
        }

        return false;
    }

    public void Dispose() => _batch.Dispose();
}

public static class EventHubProducerClientExtensions
{
    public static async ValueTask<ObservableEventDataBatch> CreateObservableBatchAsync(this EventHubProducerClient producer, CreateBatchOptions options = null)
    {
        return new ObservableEventDataBatch(await producer.CreateBatchAsync(options), options);
    }

    public static Task SendAsync(this EventHubProducerClient producer, ObservableEventDataBatch observableBatch)
    {
        return producer.SendAsync(observableBatch.Events, observableBatch.Options);
    }
}

AMQP offers a MessageId annotation, which we'll be promoting to a first-class property on EventData as part of #20105.

Seems useful; I will follow it.

It was a nice discussion - thanks for the detailed responses! Feel free to close the issue, as I do not have anything more to add - the EventDataBatch wrapper is the way forward for us.

@ghost ghost added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-author-feedback Workflow: More information is needed from author to address the issue. labels Apr 27, 2021
@jsquire
Member

jsquire commented Apr 30, 2021

After having some internal discussions, this is unlikely to receive enough support to be adopted officially into the client library and supported across languages. The best way forward for now will be to include this as a sample to illustrate the approach and capture additional feedback. I think this works out better than promoting it to a pure design discussion.

I've opened #20779 and captured much of the discussion here to help drive that sample. Please feel free to use that issue for further discussion. I'm going to close this out, since we've got our work-around and next steps.

@jsquire jsquire closed this as completed Apr 30, 2021
@github-actions github-actions bot locked and limited conversation to collaborators Mar 27, 2023