
Use generators to stream output #23

Closed
iainelder wants to merge 10 commits from the streaming_output branch

Conversation

iainelder
Collaborator

Back in #11 there was an idea:

I wonder if moving to lazy loading and yielding where possible would add a bit more performance - I intentionally set the "assume roles" and "do work" into two disparate steps on the first pass.

I've created this draft pull request to see what you think of this attempt to solve it using generators.

It avoids collecting all the errors and results into a single huge return object and instead yields each one.

Later I would like to measure its memory usage compared to the current v1.4.0. I believe that generators may allow us to program a more memory-efficient version of the library and so provide a solution for #20.

See demo.py for how to use it in client code.
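
For context, here is a minimal sketch of the kind of client code demo.py contains. The decorator usage follows botocove's documented pattern; the describe_vpcs example and the assumption that the CoveOutput values are yielded lazily on this branch are illustrative, not a settled interface.

import json
from itertools import chain

from botocove import cove


@cove()
def describe_vpcs(session):
    # Runs once per assumed role in the organization.
    return session.client("ec2").describe_vpcs()


def main() -> None:
    output = describe_vpcs()
    # On this branch the results and exceptions are yielded one at a
    # time, so each account's record can be written to stdout as soon
    # as it is ready instead of being accumulated in memory first.
    for record in chain(output["Results"], output["Exceptions"]):
        print(json.dumps(record, default=str))


if __name__ == "__main__":
    main()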

$ poetry run python demo.py > ~/tmp/inventory.json
Assuming sessions: 100%|███████████████████████████████████████████████████| 6/6 [00:01<00:00,  5.65it/s]
Executing function: 100%|██████████████████████████████████████████████████| 6/6 [00:00<00:00,  6.05it/s]
Assuming sessions: 100%|███████████████████████████████████████████████████| 6/6 [00:00<00:00,  6.04it/s]
Executing function: 100%|██████████████████████████████████████████████████| 6/6 [00:00<00:00,  7.08it/s]
$ cat ~/tmp/inventory.json | jq
{
  "Id": "111111111111",
  "Arn": "arn:aws:organizations::...",
  "Email": "...",
  "Name": "...",
  "Status": "ACTIVE",
  "AssumeRoleSuccess": true,
  "RoleSessionName": "OrganizationAccountAccessRole",
  "Result": {
    "Vpcs": [
      {
        "CidrBlock": "172.31.0.0/16",
        "DhcpOptionsId": "dopt-11111111111111111",
        "State": "available",
        "VpcId": "vpc-11111111111111111",
        "OwnerId": "11111111111111111",
        "InstanceTenancy": "default",
        "CidrBlockAssociationSet": [
          {
            "AssociationId": "vpc-cidr-assoc-11111111111111111",
            "CidrBlock": "172.31.0.0/16",
            "CidrBlockState": {
              "State": "associated"
            }
          }
        ],
        "IsDefault": true
      }
    ]
  }
}
...
{
  "Id": "222222222222",
  "Arn": "arn:aws:organizations::....",
  "Email": "....",
  "Name": "....",
  "Status": "ACTIVE",
  "AssumeRoleSuccess": true,
  "RoleSessionName": "OrganizationAccountAccessRole",
  "ExceptionDetails": "ClientError('An error occurred (NoSuchEntity) when calling the GetUser operation: The user with name does_not_exist cannot be found.')"
}

@connelldave
Owner

Definitely open to the idea. My main concerns are breaking the API contract in the 1.x version, as this will either need to be released as a toggle or a 2.0 release (don't want to break other people's code!), and it's a tricky one to test and get a decent feedback cycle on with it requiring a large number of accounts to repro the behaviour.

Possibly just having a smaller integration test for general benchmarking would help here: I'm interested to understand exactly where the memory leak (?) is happening and what optimisations we can make. It's possible that just tweaking the number of concurrent workers might help? At 1k+ account scale though, the code the user runs being optimised is going to be important too. Let me know what you find from this change!
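
One low-effort way to get a rough memory benchmark without a large organization is to sample peak usage around the cove call with the standard library's tracemalloc. The following is a sketch against the v1.4.0 list-based output; the IAM call is just an illustrative per-account function.

import tracemalloc

from botocove import cove


@cove()
def list_users(session):
    # Any per-account API call works here; IAM is just an example.
    return session.client("iam").list_users()


def main() -> None:
    tracemalloc.start()
    output = list_users()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"accounts succeeded: {len(output['Results'])}")
    print(f"peak traced memory: {peak / 1_000_000:.1f} MB")


if __name__ == "__main__":
    main()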

Being able to test exactly where we yield for performance might be influenced by the above too. I don't have a great deal of time at the moment, but I'm super interested in a good resolution to the issue.

@iainelder
Collaborator Author

my main concerns are breaking the API contract in the 1.x version, as this will either need to be released as a toggle or 2.0 release (don't want to break other people's code!)

I'm with you on that one. Existing code using this library should continue to work, so a toggle option in the cove parameters seems like a good way to control it. It would default to the current behavior (e.g. output_type="CoveResults") and could be set to streaming when it makes a difference (output_type="CoveResultsGenerator"?).

Internally the CoveResults object could be built from a CoveResultsGenerator, probably just by passing the generators to list as happens in 1.4.0.
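
A rough sketch of what that could look like; the output_type parameter and the helper name are hypothetical, not part of botocove today.

from typing import Dict, Iterator


def finalise_output(
    streamed: Dict[str, Iterator[dict]],
    output_type: str = "CoveResults",
):
    # Hypothetical helper: "output_type" and this function are only a
    # suggestion, not an existing botocove API.
    if output_type == "CoveResultsGenerator":
        # Streaming mode: hand the generators straight to the caller.
        return streamed
    # Default mode: exhaust each generator into a list so the return
    # value keeps the same shape as v1.4.0.
    return {key: list(values) for key, values in streamed.items()}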

At 1k+ account scale though, the code the user runs being optimized is going to be important too. Let me know what you find from this change!

That's another good point. I haven't scrutinized my aws-org-inventory tool for optimal performance. It further transforms the result object from botocove, so that could contribute to the memory use as well.

I don't have a great deal of time at the moment

Especially around this time of year! That's fine; we have other priorities :-) If I discover anything useful, I'll update here.

@iainelder
Collaborator Author

it's a tricky one to test and get a decent feedback cycle on with it requiring a large number of accounts to repro the behaviour.

We can solve this, at least for this kind of testing, by allowing duplicate target_ids. See dacd90c.

demo.py now takes a member account ID and a number of repetitions as input. Botocove assumes the role in the same member account N times and executes the function for each assumed role. See 7d46271.

In this way we can measure the effect on memory consumption without needing to create hundreds of otherwise empty accounts in our personal organizations.

I've rewritten the generator code so that the output is almost the same shape as in v1.4.0. The CoveOutput shape is still used, but its values are now generators instead of lists. The demo code prints each item in the iteration instead of just printing the whole list or generator, so the same client code can be used to compare the two implementations.
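
Roughly, the updated demo does something like the following. This is a simplified sketch; the real code is in demo.py, and the argument handling and key iteration here are illustrative.

import json
import sys

from botocove import cove


def main(account_id: str, repetitions: int) -> None:
    # Duplicate target_ids so one member account stands in for N accounts.
    target_ids = [account_id] * repetitions

    @cove(target_ids=target_ids)
    def describe_vpcs(session):
        return session.client("ec2").describe_vpcs()

    output = describe_vpcs()
    # On this branch the CoveOutput values are generators, so print each
    # item as it arrives rather than materialising whole lists first.
    for key in ("Results", "Exceptions", "FailedAssumeRole"):
        for item in output[key]:
            print(json.dumps(item, default=str))


if __name__ == "__main__":
    main(sys.argv[1], int(sys.argv[2]))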

I'll present some experimental data here when I have time to prepare it.

@iainelder
Collaborator Author

I presented the experimental data in #20.

This is no longer my active branch for experiments, so I'll close this PR and we can continue the conversation in #20.

@iainelder iainelder closed this Jan 18, 2022
@iainelder iainelder deleted the streaming_output branch November 29, 2022 16:53
@iainelder iainelder restored the streaming_output branch November 29, 2022 16:54