ENH Implement LoadContext to handle multiple instances #209

E-Aho · 2022-11-04T21:35:27Z

Fixes #206.

PR to add ability to persist a single instance that is referenced in multiple places.

Notes:

1. Currently, this does save the state of the object multiple times, but only initialises once. This is done due to the issue discussed in Add Method type in dispatch calls #195 around not knowing which instance will be loaded first, particularly in complex cases with interwoven dependencies.
2. ~~At the moment, I've only added __id__ in the places needed to solve the xfail test, but if others are happy with how this looks I will expand this to other get_state_* methods :)~~ This is now global, done at the high level get_state and get_instance functions so works for all types.

Regarding 1: I looked into any ways to store DAGs in JSON and bumped into a relatively new addition to JSON syntax, JSON-LD, which also happens to have a Python implementation. I feel like if we want to only save these objects once, we would need some kind of DAG implementation to hold dependencies. This, however, would need a major rework and feels like it might be overkill for the time being, but it's worth holding in mind.

BenjaminBossan · 2022-11-07T11:15:16Z

I think this looks pretty good, like the simplest way to solve the problem.

> if others are happy with how this looks I will expand this to other get_state_* methods

This would mean that we need to add something like this everywhere, right?

def get_instance_foo(state, load_state):
    saved_id = state.get("__id__")
    if saved_id and saved_id in load_state.memo:
        # an instance has already been loaded, just return the loaded instance
        return load_state.get_instance(saved_id)
    ...
    if saved_id:
        load_state.memoize(loaded_obj, saved_id)
    return loaded_obj

So is it something we want to move into a decorator to avoid the boilerplate?

> Currently, this does save the state of the object multiple times

Let's leave it like this for now and bother about the problem later.

> I looked into any ways to store DAGs in JSON and bumped into a relatively new addition to JSON syntax, JSON-LD,

I only took a glance but I hope we can manage to solve it without such a solution.

E-Aho · 2022-11-07T11:39:01Z

> This would mean that we need to add something like this everywhere, right?

Thankfully no, because of the way I structured it, the memo check is at the top level get_instance function, before it gets sent to the specific get_instance_foo for whatever type that is, so it will be able to handle any memoized things with minimal boilerplate.

We could switch to a decorator pattern, but I felt it would be easier to just pass __id__ as one of the state parameters for any types we want to persist as a singular instance. So, to implement elsewhere, you'd just need to add res["__id__"] = id(obj) in get_state_foo.
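To make the top-level memo check concrete, here is a minimal sketch of the load side. It is not the code in this PR: `LoadContext`, `get_object`, the `LOADERS` table, and the `__type__`/`content` keys are hypothetical names chosen for illustration. It only shows that the type-specific loaders never touch the memo, and that a state saved twice with the same `__id__` is initialised once.

```python
from typing import Any, Callable, Dict


class LoadContext:
    """Hypothetical load-time context: maps a saved ``__id__`` to the loaded object."""

    def __init__(self) -> None:
        self.memo: Dict[int, Any] = {}

    def memoize(self, obj: Any, saved_id: int) -> None:
        self.memo[saved_id] = obj

    def get_object(self, saved_id: int) -> Any:
        return self.memo[saved_id]


def get_instance_list(state: dict, load_context: LoadContext) -> list:
    # Type-specific loader: recurses through the top-level function and
    # contains no memo handling of its own.
    return [get_instance(item, load_context) for item in state["content"]]


def get_instance_json(state: dict, load_context: LoadContext) -> Any:
    # Plain JSON values are returned as they are.
    return state["content"]


# Hypothetical dispatch table from a type tag to its loader.
LOADERS: Dict[str, Callable[[dict, LoadContext], Any]] = {
    "list": get_instance_list,
    "json": get_instance_json,
}


def get_instance(state: dict, load_context: LoadContext) -> Any:
    saved_id = state.get("__id__")
    if saved_id is not None and saved_id in load_context.memo:
        # Already loaded once: hand back the very same object.
        return load_context.get_object(saved_id)

    loaded = LOADERS[state["__type__"]](state, load_context)

    if saved_id is not None:
        load_context.memoize(loaded, saved_id)
    return loaded


# The shared list's state appears twice in the outer state (saved twice),
# but it is only constructed once on load.
shared = {"__type__": "list", "__id__": 1,
          "content": [{"__type__": "json", "content": 3}]}
outer = {"__type__": "list", "__id__": 2, "content": [shared, shared]}

result = get_instance(outer, LoadContext())
assert result[0] is result[1]
```

One limitation of this sketch: the object is memoized only after it has been fully built, so reference cycles are not handled.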

> I only took a glance but I hope we can manage to solve it without such a solution.

I hope there's a nice solution I've missed too, but it feels like once you start getting into edge cases with nested dependencies and such, you do need some kind of DAG implementation...

BenjaminBossan · 2022-11-07T13:09:10Z

> Thankfully no, because of the way I structured it, the memo check is at the top level get_instance function

Ah yes, nice, I missed that.

> you'd just need to add res["__id__"] = id(obj) in get_state_foo.

Okay, this might be something we would want to move out of the get_state_foo functions, since it's easy to forget and not statically checked.

> I hope there's a nice solution I've missed too, but it feels like once you start getting into edge cases with nested dependencies and such, you do need some kind of DAG implementation...

Hmm, not sure, let's see. Maybe we can learn how pickle implements it and copy that approach.
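For reference, the behaviour we would be copying is easy to observe from the outside. The pickle documentation states that the module keeps track of objects it has already serialized, so later references to the same object are not serialized again. A small check using only the standard library:

```python
import pickle

shared = {"alpha": 1.0}
container = [shared, shared]  # the same dict referenced twice

restored = pickle.loads(pickle.dumps(container))

# Both slots point at one restored dict, not at two equal copies.
assert restored[0] is restored[1]
# The restored dict is a new object, equal to but distinct from the original.
assert restored[0] == shared
assert restored[0] is not shared
```

This only demonstrates the observable result; how closely a JSON-based format can mirror pickle's memo is still an open question here.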

E-Aho · 2022-11-07T20:30:24Z

I've added in a decorator that can be used to wrap get_state functions that we want to be able to persist as single instances.

I initially tried having it in the high level get_state function, but ran into issues I think were being caused with IDs being garbage collected and reused during persist for things which were not equivalent.

BenjaminBossan · 2022-11-08T09:43:59Z

> I initially tried having it in the high level get_state function, but ran into issues I think were being caused with IDs being garbage collected and reused during persist for things which were not equivalent.

Interesting, do you still have the code for this? It's not clear to me why moving to a decorator solves this.

Assuming that code looks similar to what you have now:

        result = func(obj, save_state)
        result["__id__"] = id(obj)

did you try if this works instead?

        __id__ = save_state.memoize(obj)
        result = func(obj, save_state)
        result["__id__"] = __id__

E-Aho · 2022-11-08T11:08:44Z

> Interesting, do you still have the code for this? It's not clear to me why moving to a decorator solves this.

What worked was less just moving to a decorator, more not persisting certain types with an id. When we start to persist __id__ for:

get_state_dict
get_state_list
get_state_ndarray

Some tests start failing in a flaky way (sometimes they would work fine on reruns). I jumped into the debugger and saw that objects that shouldn't have the same id did, and it seemed to be objects that were being created during get_state (dicts, lists etc).

My intuition was that CPython might be garbage collecting these, then reusing the memory address, giving the same id.

I'll play with what you suggested tonight, and might try using a more unique ID for objects as well :)

BenjaminBossan · 2022-11-08T11:22:30Z

> My intuition was that CPython might be garbage collecting these, then reusing the memory address, giving the same id.

I think this is the most likely explanation, which is why I suggested to memoize the object first. This should prevent gc and make the id stable.
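A rough sketch of what I mean, with made-up names (`SaveContext`, `persist_id`, and this toy `get_state_list` are assumptions for illustration, not the code in this PR). In CPython, `id()` is only guaranteed to be unique among objects that are alive at the same time, so keeping a reference in the memo for the duration of the dump rules out reuse:

```python
from functools import wraps
from typing import Any, Callable, Dict


class SaveContext:
    """Hypothetical save-time context that keeps memoized objects alive."""

    def __init__(self) -> None:
        self.memo: Dict[int, Any] = {}

    def memoize(self, obj: Any) -> int:
        obj_id = id(obj)
        # Storing the object itself (not just its id) prevents it from being
        # freed, so no other object can receive the same id while the
        # context exists.
        self.memo.setdefault(obj_id, obj)
        return obj_id

    def clear_memo(self) -> None:
        # Called once dumping is over so the objects can be collected again.
        self.memo.clear()


def persist_id(func: Callable[[Any, SaveContext], dict]) -> Callable[[Any, SaveContext], dict]:
    """Hypothetical decorator that stamps ``__id__`` onto a get_state result."""

    @wraps(func)
    def wrapper(obj: Any, save_context: SaveContext) -> dict:
        obj_id = save_context.memoize(obj)  # memoize first, then build state
        result = func(obj, save_context)
        result["__id__"] = obj_id
        return result

    return wrapper


@persist_id
def get_state_list(obj: list, save_context: SaveContext) -> dict:
    return {"__type__": "list", "content": list(obj)}


ctx = SaveContext()

# Temporaries created on the fly, as can happen inside get_state.
# Without the memo, each list could be freed right after the call and
# its address handed to the next one.
states = [get_state_list([i], ctx) for i in range(100)]
ids = [state["__id__"] for state in states]
assert len(set(ids)) == 100

ctx.clear_memo()
```

The cost is that every memoized object stays in memory until `clear_memo` is called.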

E-Aho · 2022-11-08T11:27:34Z

I suppose either method would work, but memoizing every object would mean basically never garbage collecting,

> > My intuition was that CPython might be garbage collecting these, then reusing the memory address, giving the same id.
>
> I think this is the most likely explanation, which is why I suggested to memoize the object first. This should prevent gc and make the id stable.

That should work, although it would also mean holding a lot more things in memory at all times, which might give a minor performance hit. If we don't need to actually hold the object, just get a unique id, it might be easier to just use a UUID, but I'm happy to just memoize the objects as well.

BenjaminBossan · 2022-11-08T11:36:10Z

> I suppose either method would work, but memoizing every object would mean basically never garbage collecting,

The memory is cleared once dumping is over, so gc should work after that. But it's true that during the dumping process, we could potentially see a memory spike, though I assume that most objects are held in memory anyway. If I'm not missing something, it should only really matter for attributes that are generated on the fly and require a lot of memory.

E-Aho · 2022-11-08T11:37:56Z

> > I suppose either method would work, but memoizing every object would mean basically never garbage collecting,
>
> The memory is cleared once dumping is over, so gc should work after that. But it's true that during the dumping process, we could potentially see a memory spike, though I assume that most objects are held in memory anyway. If I'm not missing something, it should only really matter for attributes that are generated on the fly and require a lot of memory.

No, that's true, I'm overthinking it and these on-the-fly objects being held in memory should be a negligible memory hit.

I'll rework it this eve and check if it helps solve the flaky tests

E-Aho · 2022-11-08T19:07:45Z

Memoizing seems to have solved the problem without needing any UUID rework, good call @BenjaminBossan!

I've added a test that checks it works for a few different object types, let me know if there's anything else you think would be worth testing!

E-Aho · 2022-11-08T19:22:35Z

RE: CodeCov: as far as I can tell from the report, this PR is at 100% coverage and is only failing on an "indirect coverage change", so I think the failing CodeCov/project check is just noise. Let me know if this isn't the case.

BenjaminBossan

Looks very good to me, I just have some minor comments/questions.

Regarding the naming, it's a bit unfortunate that we now have state and LoadState. When adding SaveState, I didn't think about this issue. Maybe we can find a better name (for both for consistency)?

skops/io/tests/test_persist.py

skops/io/_dispatch.py

E-Aho · 2022-11-09T12:05:27Z

On the names, how do you feel about changing them to SaveContext and LoadContext @BenjaminBossan?

BenjaminBossan · 2022-11-09T12:34:03Z

> On the names, how do you feel about changing them to SaveContext and LoadContext @BenjaminBossan?

Yes, I prefer that. Pinging @adrinjalali wdyt?

BenjaminBossan

Thanks a lot, fantastic work.

@adrinjalali would you please also give this a look?

adrinjalali

Other than the minor comments, looks brilliant.

skops/io/tests/test_persist.py

skops/io/_utils.py

E-Aho · 2022-11-11T19:43:13Z

Thanks for the PR reviews guys ❤️ I think it should be ready to merge in!

E-Aho added 2 commits November 4, 2022 20:35

First pass at LoadState with src

2c7faf4

Fix old xfail test for bound method

5109f14

Add persist id decorator

2854fe8

Update test to fix problem with clashing test

9945c48

E-Aho added 8 commits November 8, 2022 12:56

Merge branch 'main' into FIX-multiple-instance-persistance

b766041

Memoize temp objects to avoid ids being reused

4121a35

Merge branch 'FIX-multiple-instance-persistance' of github.com:E-Aho/…

7698774

…skops into FIX-multiple-instance-persistance

Small test for non-bound-method persistance

1311027

Add parametrized test for multiple reference object

d493cd1

Rename objects in test for clarity

7bc4f7a

Reorder some parts of logic to persist JSON and string objects as sin…

a2798ae

…gletons

Reorder try excepts

d27449c

BenjaminBossan reviewed Nov 9, 2022

skops/io/tests/test_persist.py (outdated, resolved)

skops/io/_dispatch.py (outdated, resolved)

E-Aho added 3 commits November 9, 2022 18:13

Address PR comments

3715d17

Rename SaveState and LoadState

459e68a

Update docstrings to use context, not state

df3848a

BenjaminBossan approved these changes Nov 10, 2022

E-Aho changed the title ~~FIX Implement LoadState to handle multiple instances~~ FIX Implement LoadContext to handle multiple instances Nov 10, 2022

E-Aho force-pushed the FIX-multiple-instance-persistance branch from aeb5b12 to df3848a Compare November 10, 2022 16:39

Merge branch 'main' into FIX-multiple-instance-persistance

f851c95

adrinjalali reviewed Nov 11, 2022

skops/io/tests/test_persist.py (resolved)

skops/io/_utils.py (outdated, resolved)

E-Aho added 2 commits November 11, 2022 19:39

Remove newline in docstring

5566f0d

Merge branch 'FIX-multiple-instance-persistance' of github.com:E-Aho/…

622ad98

…skops into FIX-multiple-instance-persistance

E-Aho mentioned this pull request Nov 11, 2022

Bug: loading certain scipy functions fails #184

Open

adrinjalali changed the title ~~FIX Implement LoadContext to handle multiple instances~~ ENH Implement LoadContext to handle multiple instances Nov 14, 2022

adrinjalali merged commit 80be9ce into skops-dev:main Nov 14, 2022

E-Aho deleted the FIX-multiple-instance-persistance branch November 14, 2022 12:59
