
Use generators to stream output #23

Closed
iainelder wants to merge 10 commits from the streaming_output branch

Conversation

iainelder
Collaborator

Back in #11 there was an idea:

I wonder if moving to lazy loading and yielding where possible would add a bit more performance - I intentionally set the "assume roles" and "do work" into two disparate steps on the first pass.

I've created this draft pull request to see what you think of this attempt to solve it using generators.

It avoids collecting all the errors and results into a single huge return object and instead yields each one.

Later I would like to measure its memory usage compared to the current v1.4.0. I believe that generators may allow us to program a more memory-efficient version of the library and so provide a solution for #20.

See demo.py for how to use it in client code.
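
For context, here is a minimal sketch of the kind of client code demo.py contains. The decorator usage follows botocove's documented pattern; the describe_vpcs example and the assumption that the CoveOutput values are yielded lazily on this branch are illustrative, not a settled interface.

import json
from itertools import chain

from botocove import cove


@cove()
def describe_vpcs(session):
    # Runs once per assumed role in the organization.
    return session.client("ec2").describe_vpcs()


def main() -> None:
    output = describe_vpcs()
    # On this branch the results and exceptions are yielded one at a
    # time, so each account's record can be written to stdout as soon
    # as it is ready instead of being accumulated in memory first.
    for record in chain(output["Results"], output["Exceptions"]):
        print(json.dumps(record, default=str))


if __name__ == "__main__":
    main()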

$ poetry run python demo.py > ~/tmp/inventory.json
Assuming sessions: 100%|███████████████████████████████████████████████████| 6/6 [00:01<00:00,  5.65it/s]
Executing function: 100%|██████████████████████████████████████████████████| 6/6 [00:00<00:00,  6.05it/s]
Assuming sessions: 100%|███████████████████████████████████████████████████| 6/6 [00:00<00:00,  6.04it/s]
Executing function: 100%|██████████████████████████████████████████████████| 6/6 [00:00<00:00,  7.08it/s]
$ cat ~/tmp/inventory.json | jq
{
  "Id": "111111111111",
  "Arn": "arn:aws:organizations::...",
  "Email": "...",
  "Name": "...",
  "Status": "ACTIVE",
  "AssumeRoleSuccess": true,
  "RoleSessionName": "OrganizationAccountAccessRole",
  "Result": {
    "Vpcs": [
      {
        "CidrBlock": "172.31.0.0/16",
        "DhcpOptionsId": "dopt-11111111111111111",
        "State": "available",
        "VpcId": "vpc-11111111111111111",
        "OwnerId": "11111111111111111",
        "InstanceTenancy": "default",
        "CidrBlockAssociationSet": [
          {
            "AssociationId": "vpc-cidr-assoc-11111111111111111",
            "CidrBlock": "172.31.0.0/16",
            "CidrBlockState": {
              "State": "associated"
            }
          }
        ],
        "IsDefault": true
      }
    ]
  }
}
...
{
  "Id": "222222222222",
  "Arn": "arn:aws:organizations::....",
  "Email": "....",
  "Name": "....",
  "Status": "ACTIVE",
  "AssumeRoleSuccess": true,
  "RoleSessionName": "OrganizationAccountAccessRole",
  "ExceptionDetails": "ClientError('An error occurred (NoSuchEntity) when calling the GetUser operation: The user with name does_not_exist cannot be found.')"
}

@connelldave
Owner

Definitely open to the idea. My main concerns are breaking the API contract in the 1.x version, as this will either need to be released as a toggle or a 2.0 release (don't want to break other people's code!), and it's a tricky one to test and get a decent feedback cycle on with it requiring a large number of accounts to repro the behaviour.

Possibly just having a smaller integration test for general benchmarking would help here: I'm interested to understand exactly where the memory leak (?) is happening and what optimisations we can make. It's possible that just tweaking the number of concurrent workers might help? At 1k+ account scale though, the code the user runs being optimised is going to be important too. Let me know what you find from this change!
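
One low-effort way to get a rough memory benchmark without a large organization is to sample peak usage around the cove call with the standard library's tracemalloc. The following is a sketch against the v1.4.0 list-based output; the IAM call is just an illustrative per-account function.

import tracemalloc

from botocove import cove


@cove()
def list_users(session):
    # Any per-account API call works here; IAM is just an example.
    return session.client("iam").list_users()


def main() -> None:
    tracemalloc.start()
    output = list_users()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"accounts succeeded: {len(output['Results'])}")
    print(f"peak traced memory: {peak / 1_000_000:.1f} MB")


if __name__ == "__main__":
    main()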

Being able to test exactly where we yield for performance might be influenced by the above too. I don't have a great deal of time at the moment, but I'm super interested in a good resolution to the issue.

@iainelder
Collaborator Author

my main concerns are breaking the API contract in the 1.x version, as this will either need to be released as a toggle or 2.0 release (don't want to break other people's code!)

I'm with you on that one. Existing code using this library should continue to work, so a toggle option in the cove parameters seems like a good way to control it. It would default to the current behavior (e.g. output_type="CoveResults") and could be set to streaming when it makes a difference (output_type="CoveResultsGenerator"?).

Internally the CoveResults object could be built from a CoveResultsGenerator, probably just by passing the generators to list as happens in 1.4.0.
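
A rough sketch of what that could look like; the output_type parameter and the helper name are hypothetical, not part of botocove today.

from typing import Dict, Iterator


def finalise_output(
    streamed: Dict[str, Iterator[dict]],
    output_type: str = "CoveResults",
):
    # Hypothetical helper: "output_type" and this function are only a
    # suggestion, not an existing botocove API.
    if output_type == "CoveResultsGenerator":
        # Streaming mode: hand the generators straight to the caller.
        return streamed
    # Default mode: exhaust each generator into a list so the return
    # value keeps the same shape as v1.4.0.
    return {key: list(values) for key, values in streamed.items()}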

At 1k+ account scale though, the code the user runs being optimized is going to be important too. Let me know what you find from this change!

That's another good point. I haven't scrutinized my aws-org-inventory tool for optimal performance. It further transforms the result object from botocove, so that could contribute to the memory use as well.

I don't have a great deal of time at the moment

Especially around this time of year! That's fine; we have other priorities :-) If I discover anything useful, I'll update here.

@iainelder
Collaborator Author

it's a tricky one to test and get a decent feedback cycle on with it requiring a large number of accounts to repro the behaviour.

We can solve this, at least for this kind of testing, by allowing duplicate target_ids. See dacd90c.

demo.py now takes a member account ID and a number of repetitions as input. Botocove assumes the role in the same member account N times and executes the function for each assumed role. See 7d46271.

In this way we can measure the effect on memory consumption without needing to create hundreds of otherwise empty accounts in our personal organizations.

I've rewritten the generator code so that the output is almost the same shape as in v1.4.0. The CoveOutput shape is still used, but its values are now generators instead of lists. The demo code prints each item in the iteration instead of just printing the whole list or generator, so the same client code can be used to compare the two implementations.
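
Roughly, the updated demo does something like the following. This is a simplified sketch; the real code is in demo.py, and the argument handling and key iteration here are illustrative.

import json
import sys

from botocove import cove


def main(account_id: str, repetitions: int) -> None:
    # Duplicate target_ids so one member account stands in for N accounts.
    target_ids = [account_id] * repetitions

    @cove(target_ids=target_ids)
    def describe_vpcs(session):
        return session.client("ec2").describe_vpcs()

    output = describe_vpcs()
    # On this branch the CoveOutput values are generators, so print each
    # item as it arrives rather than materialising whole lists first.
    for key in ("Results", "Exceptions", "FailedAssumeRole"):
        for item in output[key]:
            print(json.dumps(item, default=str))


if __name__ == "__main__":
    main(sys.argv[1], int(sys.argv[2]))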

I'll present some experimental data here when I have time to prepare it.

@iainelder
Collaborator Author

I presented the experimental data in #20.

This is no longer my active branch for experiments, so I'll close this PR and we can continue the conversation in #20.

@iainelder iainelder closed this Jan 18, 2022
@iainelder iainelder deleted the streaming_output branch November 29, 2022 16:53
@iainelder iainelder restored the streaming_output branch November 29, 2022 16:54