
DurableTask.AzureStorage API to enumerate instances #187

Merged · 11 commits · Jun 5, 2018

Conversation

TsuyoshiUshio (Collaborator)

Hi @cgillum and @gled4er,

I'm opening this pull request to start discussing the details.

I have two topics to discuss.

  1. How many attributes do we need?

OrchestrationState has a lot of members. Which attributes do we need for this purpose?

  2. Style of query

We'll do a full scan of the instance table. Is that OK, or is there a cleverer way to fetch the in-flight instances? Currently, I just return every record in the instance table.

This pull request corresponds to this issue:
Azure/azure-functions-durable-extension#316

cgillum (Member) left a comment:

Thank you for this PR! I have some initial feedback.

To answer your second question, I think a full scan is okay. Even if we try to filter to just active instances, the table storage performance will be the same since Azure Tables do not support secondary indexes. If you want to do filtering to save CPU on the customer's worker VM, I suggest adding another parameter which allows for flexible filtering in the table storage query.
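As a rough, hypothetical sketch of that suggestion (not code from this PR; the "RuntimeStatus" column name, the helper name, and the method shape are assumptions), the filter could be pushed into the table query itself so only matching rows are returned to the worker:

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Table;

static async Task<IList<OrchestrationInstanceStatus>> QueryInstancesAsync(
    CloudTable instancesTable, string runtimeStatus = null)
{
    var query = new TableQuery<OrchestrationInstanceStatus>();
    if (!string.IsNullOrEmpty(runtimeStatus))
    {
        // The service still scans the table, but only matching rows come back,
        // so the worker VM does not spend CPU deserializing rows it will discard.
        query = query.Where(TableQuery.GenerateFilterCondition(
            "RuntimeStatus", QueryComparisons.Equal, runtimeStatus));
    }

    var rows = new List<OrchestrationInstanceStatus>();
    TableContinuationToken token = null;
    do
    {
        // Azure Tables return results in segments; follow the continuation token.
        TableQuerySegment<OrchestrationInstanceStatus> segment =
            await instancesTable.ExecuteQuerySegmentedAsync(query, token);
        token = segment.ContinuationToken;
        rows.AddRange(segment.Results);
    }
    while (token != null);

    return rows;
}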

foreach (DynamicTableEntity entity in instanceTableEntities)
{
    var instance = new OrchestrationInstance();
    instance.InstanceId = entity.Properties["PartitionKey"].ToString();
cgillum (Member):

Take a look at the previous GetStateAsync method above. Since it already has logic to convert table storage rows into OrchestrationState objects, I think it would be best to reuse that code as much as possible. That way if we need to change the logic for reading from table storage, we can do it in one place instead of two places.

TsuyoshiUshio (Collaborator, author):

Thanks! I’ll fix that!

gled4er (Contributor):

Hello @cgillum,

Would it make sense to add an additional partition to the Instances table with partition key "OnTheFly" (and the instance ID as the row key) that contains all the instances currently in flight? Each instance would also keep its own dedicated row with its instance ID as the partition key. We could have a setting that enables the duplicate update of the instance in both the "OnTheFly" partition and its own partition. When the instance completes, we would delete the row from the "OnTheFly" partition. That is one more call, but we would know both the partition key and the row key.

The query for retrieving the in-flight instances would then be straightforward. I am worried that the Instances table will grow over time and a query that does a full table scan each time will have a noticeable performance impact.

Thank you!
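A rough sketch of this idea (hypothetical - the "OnTheFly" partition name is illustrative, instancesTable and instanceId are assumed to be in scope, and the replies below decide against this design):

// Retrieving the in-flight instances becomes a single-partition query:
var onTheFlyQuery = new TableQuery<OrchestrationInstanceStatus>().Where(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "OnTheFly"));

// On completion, the duplicate row is removed with a point delete, since both
// the partition key and the row key (the instance ID) are known:
var duplicateRow = new DynamicTableEntity("OnTheFly", instanceId) { ETag = "*" };
await instancesTable.ExecuteAsync(TableOperation.Delete(duplicateRow));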

TsuyoshiUshio (Collaborator, author):

Good point! Let's discuss that. I'm also worried about performance, since Table storage is only indexed by the partition and row keys and the instance table will keep growing. :)

TsuyoshiUshio (Collaborator, author):

Hi @gled4er, one question. OrchestrationInstanceStatus is the domain class for the instance table, right? Then which attribute is the instance ID? I guess PartitionKey. Is that right?

gled4er (Contributor) commented on May 23, 2018:

Hi, let's discuss today in person.

cgillum (Member) commented on May 23, 2018:

I'm a little concerned about creating a new partition for "on the fly" for several reasons. I'm not sure if I know how to communicate it clearly. The high-level issue is that it pollutes the data in the instances table. It also creates complexity in the data model (and I prefer to avoid complexity whenever possible). In general, data in the partition column should be as balanced as possible, and "on-the-fly" partitions will create a lot of imbalance and will eventually result in different kinds of performance issues. The design is also specific to a particular use case and may not add value to other use cases.

My thought is that we should keep this feature simple and not change the structure of the data. Yes, there will be performance issues with large numbers of rows; however, I think this is unavoidable when using Azure Table storage, so it's better if we just document it. For many users, the simple implementation will still be very valuable. If there is a need for huge numbers of instances, then we should recommend the Event Grid approach to put the data into a different, optimized data store based on the customer's specific needs (maybe they will use Cosmos DB, or something else).

gled4er (Contributor) commented on May 23, 2018:

Hello @cgillum,

Thank you very much for the explanation.

I agree.

Thank you!

TsuyoshiUshio (Collaborator, author):

@cgillum, @gled4er, I have one suggestion. How about adding some configuration to remove old instances, e.g. 1 day or 7 days or similar, in host.json? Then we could remove old records that are already done, and the number of records would stay bounded. You also wouldn't need to add a new column; we would just asynchronously remove the records that are no longer used.
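A rough sketch of what such a purge could look like (hypothetical - the "RuntimeStatus" and "CreatedTime" column names, the 7-day threshold, and instancesTable being in scope are all assumptions; the idea ends up tracked as a separate feature per the reply below):

// Find completed instances older than the configured threshold...
DateTimeOffset cutoff = DateTimeOffset.UtcNow.AddDays(-7);
string purgeFilter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("RuntimeStatus", QueryComparisons.Equal, "Completed"),
    TableOperators.And,
    TableQuery.GenerateFilterConditionForDate("CreatedTime", QueryComparisons.LessThan, cutoff));
var purgeQuery = new TableQuery<OrchestrationInstanceStatus>().Where(purgeFilter);

// ...and delete them asynchronously, one segment at a time.
TableContinuationToken token = null;
do
{
    TableQuerySegment<OrchestrationInstanceStatus> segment =
        await instancesTable.ExecuteQuerySegmentedAsync(purgeQuery, token);
    token = segment.ContinuationToken;
    foreach (OrchestrationInstanceStatus row in segment)
    {
        row.ETag = "*"; // unconditional delete
        await instancesTable.ExecuteAsync(TableOperation.Delete(row));
    }
}
while (token != null);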

TsuyoshiUshio (Collaborator, author):

Hi @cgillum, one more thing. Could you enable CI for this pull request? I refactored the code to keep it DRY (I just did an extract-method refactoring); however, it fails a test. Looking at the test code, the failure seems to be a timeout, which looks like flakiness. I'd like to see what happens.

cgillum (Member) commented on May 24, 2018:

This repo unfortunately does not have CI enabled, so we have to rely on local testing. :( Hopefully this can be fixed in the near future - but I'll need to talk to the repo owners about it.

Regarding the suggestion for removing old instances, let's consider it as a separate feature. We have a GitHub issue tracking something like this: Azure/azure-functions-durable-extension#17

TsuyoshiUshio (Collaborator, author):

@cgillum Makes sense! Sure.

TsuyoshiUshio (Collaborator, author):

Hi @cgillum and @gled4er,

I have a question about testing. I added the query method to AzureTableTrackingStore.cs, AzureStorageOrchestrationService, TaskHubClient, and IOrchestrationServiceClient. Except for the first one, all of the others are just delegation.

I probably need to write a test in AzureStorageScenarioTests.cs using TestOrchestrationClient.
Since this method fetches all instances from the instance table, the test would look like this:

  1. Clear the instance table
  2. Insert data (or start an orchestrator)
  3. Fetch all the data and compare the expected count and contents

However, if the tests are executed in parallel, clearing the instance table might cause a problem. In that case, how do you test the method? Do you have a similar example in your tests?

cgillum (Member) commented on May 27, 2018:

I don't think the tests are executed in parallel because all of them clear the task hub in the very beginning. Is that not what you're observing?

cgillum (Member) left a comment:

This is looking good. I like how it is simple and easy to understand. One concern I have is about changing DurableTask.Core. I recommend we try not to change it yet and try to test the client-side by calling AzureStorageOrchestrationService.GetStateAsync directly. That way we don't need to convince the core DTFx team about adding new public APIs that need to be implemented by all providers.


// TODO do we need these?
this.stats.StorageRequests.Increment();
this.stats.TableEntitiesRead.Increment(instancestatuses.Count - previousCount);
cgillum (Member):

Regarding the TODO: Yes, we need these. We track these statistics to know how much load we're putting on the Azure Storage account.

TsuyoshiUshio (Collaborator, author):

Thanks!

var query = new TableQuery<OrchestrationInstanceStatus>();
TableContinuationToken token = null;

var instancestatuses = new List<OrchestrationInstanceStatus>(100);
cgillum (Member):

nit: capitalize the "S" - instanceStatuses.

TsuyoshiUshio (Collaborator, author):

Done. However, it was eventually removed. lol

/// Get states of the all orchestration instances
/// </summary>
/// <returns>List of <see cref="OrchestrationState"/></returns>
Task<IList<OrchestrationState>> GetStateAsync();
cgillum (Member):

We have to be very careful about making changes to DurableTask.Core. If we need any changes in this area, then we need the core DTFx team to approve them - and that may not be easy since this design might not work well for other implementations, like DurableTask.ServiceBus. Should we try to implement this first only in DurableTask.AzureStorage to avoid any potential conflict?

TsuyoshiUshio (Collaborator, author):

Thanks, I understand. I removed this commit from my branch. I added it because OrchestratorClient uses TaskHub. I'll start by creating a test. :)

var instancestatuses = new List<OrchestrationInstanceStatus>(100);
var stopwatch = new Stopwatch();
var requestCount = 0;
bool finishedEarly = false;
cgillum (Member):

It looks like stopwatch, requestCount, and finishedEarly aren't really used for anything. Can they be removed?

TsuyoshiUshio (Collaborator, author):

Done!

/// Get states of the all orchestration instances
/// </summary>
/// <returns>List of <see cref="OrchestrationState"/></returns>
public async Task<IList<OrchestrationState>> GetStateAsync()
cgillum (Member):

The other methods similar to this one are named GetOrchestrationStateAsync. I think we should name this one the same.

TsuyoshiUshio (Collaborator, author):

Done

{
    foreach (OrchestrationInstanceStatus entity in instancestatuses)
    {
        var state = await ConvertFromAsync(entity, entity.PartitionKey);
cgillum (Member):

Instead of calling ConvertFromAsync inside a second foreach loop, would it be more efficient to do it inside the previous foreach loop (so that we only loop through the list of instances once instead of twice)?

TsuyoshiUshio (Collaborator, author) commented on May 28, 2018:

Done!

One concern: I used LINQ for the refactoring, and I'm not sure whether you like LINQ here. I did it like this to make it a single loop:

var tasks = segment.AsEnumerable<OrchestrationInstanceStatus>().Select(async x => await ConvertFromAsync(x, x.PartitionKey));
OrchestrationState[] result = await Task.WhenAll(tasks);
orchestrationStates.AddRange(result);

TsuyoshiUshio (Collaborator, author):

Hi @cgillum

To enable testing, I added one method to TestOrchestrationClient.cs. You said we should be careful about adding things to Core; however, TestOrchestrationClient depends on TaskHub, so I needed to add this method. It's not an elegant approach, but I can't find any better solution.

internal AzureStorageOrchestrationService GetServiceClient()
{
    return (AzureStorageOrchestrationService)this.client.serviceClient;
}

ba1afae

Any feedback is welcome.
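For context, a minimal sketch of how a scenario test could use this helper (hypothetical - the orchestration name, timeout, and assertion are placeholders that assume the existing TestOrchestrationHost infrastructure in AzureStorageScenarioTests):

// Start a single orchestration and wait for it to finish.
TestOrchestrationClient client =
    await host.StartOrchestrationAsync(typeof(Orchestrations.SayHelloInline), "World");
await client.WaitForCompletionAsync(TimeSpan.FromSeconds(30));

// Reach into the underlying service client to enumerate all instance states.
AzureStorageOrchestrationService service = client.GetServiceClient();
IList<OrchestrationState> states = await service.GetOrchestrationStateAsync();

// The task hub is cleared at the start of the test, so only this test's instance exists.
Assert.AreEqual(1, states.Count);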

TsuyoshiUshio (Collaborator, author):

@cgillum What is the next step? I have two options.

  1. Implement OrchestratorClient on the DurableTaskFramework extension repo
  2. Implement TaskHub and other Core methods.

Option 1 should probably be fine. Then how can I test it? Create a NuGet package from this branch and put it in a local feed, then implement OrchestratorClient and test it using AzureStorageOrchestrationService directly, and then move on to option 2?

TsuyoshiUshio (Collaborator, author):

One concern for the next step is how we handle the branching. This branch is based on the azure-storage branch, and you might want to release this change with the azure-storage branch. After the branch is merged, we can reference it from the Durable Framework extension repo. Do you have a strategy for managing the two repos?

cgillum (Member) commented on May 28, 2018:

I still need to figure out the exact branching and release strategy, but I have some ideas. My goal is that DurableTask.AzureStorage can be updated and released independently as much as possible.

For now, I think the next step for you is to build your DurableTask.AzureStorage nuget package locally and reference it locally from Microsoft.Azure.WebJobs.Extensions.DurableTask. This is the normal workflow I follow, and it will unblock the next phase of development.

Once we are happy with the PR for Microsoft.Azure.WebJobs.Extensions.DurableTask, then we can merge your changes in DurableTask.AzureStorage into azure-storage. From there, I can publish to the myget feed, and that will unblock the azure-functions-durable-extension CI. Once the CI looks good, we can decide whether to go ahead and merge azure-storage into master and publish the updated DurableTask.AzureStorage to nuget.org.

Will that work for you? Let me know if you need help with instructions for deploying the DurableTask.AzureStorage nuget package locally. Kanio might know how to do this as well.

I think your suggestion to isolate the changes in DurableTask.AzureStorage is good and it's exactly what I would have done. I'll try to review it in more detail tomorrow. For now I think you should be unblocked though.
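For reference, one way to do the local packaging step described above (the commands and paths are illustrative, not from the thread):

# Pack the DurableTask.AzureStorage project into a local folder feed
dotnet pack src/DurableTask.AzureStorage/DurableTask.AzureStorage.csproj -c Release -o C:\local-packages

# Then add C:\local-packages as a NuGet package source for the
# Microsoft.Azure.WebJobs.Extensions.DurableTask solution and reference
# the locally built package version from its project file.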

TsuyoshiUshio (Collaborator, author):

Thank you, @cgillum.

I'll test it and start implementing the logic in the Durable Functions repo.
Then, can I fork a branch from the dev branch as usual?

cgillum changed the title from "On the fly Orchestrator status" to "DurableTask.AzureStorage API to enumerate instances" on Jun 3, 2018
cgillum (Member) commented on Jun 3, 2018:

When you are ready, please increment the package version in DurableTask.AzureStorage.csproj. I will then merge your change, create a nuget package, and publish it to the staging location. If the CI in azure-functions-durable-extension passes, then we can publish this package to nuget.org.

TsuyoshiUshio (Collaborator, author):

Hi @cgillum,

I tested locally and it works well. I updated the version of the AzureStorage package.

One thing to share: I encountered Azure/azure-webjobs-sdk#1492.
I solved it with the workaround below; that issue is not related to this pull request.

 return new HttpResponseMessage(System.Net.HttpStatusCode.OK) { Content = new StringContent(result) };

When you update/publish the nuget package, please let me know. :) I'm also starting to investigate the documentation changes for this new feature.

TsuyoshiUshio (Collaborator, author) commented on Jun 4, 2018:

Hi @cgillum, something to share.

When we use this feature from the HTTP API, the request might look like this (code= is the host key of the Azure Functions app):

"http://localhost:7071/runtime/webhooks/DurableTaskExtension/instances/?taskHub=DurableFunctionsHub&connection=Storage&code=es4Bb3qqtascBKLtKo5z41IP7Ne3QuooyWIoVXSpaqcZ2E47bFlSZg==",

I was wondering how customers get the TaskHub/Connection/Code values. The simplest solution is to use the CreateCheckStatusResponse API; however, that requires an "instanceId". It might not be a big deal, but I'd like to share it with you in advance.

cgillum (Member) commented on Jun 5, 2018:

TaskHub and Connection should be known, since they are configured statically. The code can be fetched from the Azure Functions host admin APIs. It's a good point that this may not be easy to acquire. We will need to think about how difficult this might be for users. Hopefully we can use tools to help make it easier. We could also consider creating a new API for getting your new instance query URL.

cgillum merged commit 3f10f0d into Azure:azure-storage on Jun 5, 2018
TsuyoshiUshio (Collaborator, author):

Yeah. That will be the next pull request. :)

simonporter pushed a commit that referenced this pull request Jun 19, 2018
* DurableTask.AzureStorage API to enumerate instances (#187)

* DurableTask.AzureStorage ETW trace improvements (#192)

* Adding 4MB limit check for Azure Storage Table batch insert (#191)

* DurableTask.AzureStorage: Alternate fix for the 4 MB max entity size which covers more scenarios. (#194)

* Updated Newtonsoft.Json to v11.0.2, WindowsAzure.Storage to v8.6.0. (#193)

* Fixed issues with the ETW event source and added test reliability improvements.