deepcopy: update the meaning for objects like BigInt #24875
Conversation
I think this is still true: that our serializer makes a new BigInt should be seen as simply an implementation detail which shouldn't be assumed generally (for example, a hypothetical serializer might use a special representation for common numbers like 0 and 1). On the other hand, it seems like …
If you want to copy a …
This is the question at stake here, depending on whether we consider …
You introduce the word "usually", so I don't think this helps. This just opens the door for more and more arbitrary exceptions and less and less predictable behavior of `deepcopy`. I don't understand why this has to be discussed again. Everything was said in #16999.
I introduce the word "usually" so that "fully independent object" is not taken too strictly or as a binding definition, which I think was the motivation for Yichao's PR.
I'm fine with removing this part of the docstring altogether. For me, the part I introduced with "atoms" and "containers" makes more sense, and in this respect, …
Because no consensus was found in that PR. Yichao's work was to fix a bug in a way that agrees with his interpretation of the docstring of `deepcopy`.
I disagree. deepcopy is extremely convenient if you construct a big data structure in a nontrivial way, and afterwards want to improve its memory layout for faster access times.
People are using deepcopy for a reason. No tool should presume it knows better than the user. Edit: I think it would be more important to document the traversal order (I think DFS?) of deepcopy, so that users can better anticipate the resulting memory layout. Offering both DFS and BFS order would be nice, but probably not worth the effort.
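The traversal-order question above can be probed empirically. This is a sketch, not documented behavior: it assumes the current customization hook `Base.deepcopy_internal(x, ::IdDict)`, and `Tagged` and `visited` are names invented for the illustration.

```julia
# Sketch: observe the order in which deepcopy reaches leaves, by logging
# from a custom deepcopy_internal method for a throwaway type.
struct Tagged
    name::Symbol
end

const visited = Symbol[]

function Base.deepcopy_internal(t::Tagged, stackdict::IdDict)
    push!(visited, t.name)
    return t   # immutable and stateless, so returning it unchanged is fine
end

deepcopy(Any[Tagged(:a), Any[Tagged(:b), Tagged(:c)], Tagged(:d)])
@show visited   # on current Julia this comes out depth-first
```

If deepcopy recurses into nested containers before moving to the next sibling, `visited` ends up as `[:a, :b, :c, :d]`, which is the DFS order conjectured in the comment above.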
That's very interesting, and as far as I can tell the first argument given in favor of copying the internals of `BigInt`.
I intended to expand on my motivations here tomorrow, but the gist of it is that I believe there are two classes of use cases for `deepcopy`. So I could imagine using a keyword argument to let the user choose, like …
I thought cleaning up memory fragmentation and over-allocated Vectors was the main use case for deepcopy? If I want a shallow (fast) copy I would use copy instead of deepcopy. While the shuffle is somewhat pathological, I often have temporary allocs or need to apply permutations (e.g. sorting) during construction of big objects, and afterwards need to do a lot of in-place read/write operations. I'm not a BigInt user though, so maybe the BigInt community has different use cases?
Of course, but if your …
Fair enough. So in the end, you think deepcopy should rather try to approximate a lazy copy? Then one should probably create a separate function, i.e. deepcopy vs lazydeepcopy, and then think about whether one can do some hare-brained copy-on-write scheme for this. A lazydeepcopy would actually be quite awesome, and possibly achievable by using shared (mmap-ed) arrays.
I was not aware at all of this use case; it was not mentioned in the discussions concerning …
I didn't think in these terms... but lazy copy would be a possible implementation of my understanding of `deepcopy`.
I think a lot of the problem is that …
After thinking about it, you are right. Deepcopy should split into a deepcopy_fast and a different deepcopy_mem (whatever their names end up being). Then the future development of both is also clear: deepcopy_mem should attempt to talk to the allocator (e.g. make new old-generation pools, create the objects as old, allow for easy paging-out of the copy, maybe play some offset games to prevent false aliasing, ensure that the temporary allocs for the object_id dict do not fragment the same pools as the deepcopy). It should also become part of the performance tips. deepcopy_fast, on the other hand, can look for more unreliable speedups, like skipping the bignum copies or (one can dream) copy-on-write. Especially a deepcopy_fast of a deepcopy_mem object should allow COW, because the deepcopy_mem is nicely contained and not splattered all over the heap.
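A minimal sketch of the proposed split. `deepcopy_mem` and `deepcopy_fast` are names coined in this comment, not real Base functions; both are placeholders over today's `deepcopy`, with one shortcut added to illustrate the "unreliable speedups" idea:

```julia
# Hypothetical sketch of the deepcopy_mem / deepcopy_fast split proposed
# above. Neither function exists in Base; these are illustrative stand-ins.
deepcopy_mem(x) = deepcopy(x)    # today's behavior: copy everything, compact layout

deepcopy_fast(x) = deepcopy(x)   # generic fallback
deepcopy_fast(x::BigInt) = x     # skip copying the de-facto-immutable bignum
```

For example, with `n = big(10)^30`, `deepcopy_fast(n) === n` holds (no copy), while `deepcopy_mem(n)` still produces an equal value.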
Seems like this is something that should be addressed in DataFrames
I think you mean DWIM. Edit: the wiktionary link looks like a better reference source for those not familiar with dwim.
Shouldn't it be split into …
I would guess that some people would use this version to know that they can mutate e.g. … Also, I believe that the two versions would share their implementation in many cases, so it could be more economic for implementors to have only one function, with a keyword for discriminating the variant, which can be ignored or passed down recursively when it doesn't matter, e.g., in a non-rigorous way and forgetting about …:

```julia
# strict=true to deepcopy as deep as possible, even BigInt internals
function deepcopy(x::Array; strict=false)
    dest = similar(x)
    for i in eachindex(x)
        dest[i] = deepcopy(x[i], strict=strict)
    end
    return dest
end
```

although …
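A runnable variant of this keyword-toggle idea. To avoid redefining `Base.deepcopy` in a demo, the name `strictdeepcopy` is invented here; the `BigInt` method encodes this thread's proposal (identity unless `strict=true`), and the `Array` method recurses without identity tracking, so it is a sketch only:

```julia
# Hypothetical helper illustrating the strict keyword; strictdeepcopy is a
# made-up name, not part of Base.
strictdeepcopy(x; strict::Bool=false) = deepcopy(x)   # generic fallback

# This thread's proposal: BigInt is de-facto immutable, so identity suffices
# unless the caller insists on copying the internals too.
strictdeepcopy(x::BigInt; strict::Bool=false) = strict ? deepcopy(x) : x

function strictdeepcopy(x::Array; strict::Bool=false)
    dest = similar(x)
    for i in eachindex(x)
        dest[i] = strictdeepcopy(x[i]; strict=strict)   # no IdDict: sketch only
    end
    return dest
end
```

With this sketch and `v = BigInt[big(1)]`, the non-strict copy satisfies `strictdeepcopy(v)[1] === v[1]`, while `strict=true` falls back to today's copying behavior.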
@vtjnash It could certainly be changed in DataFrames if we wanted to. I mentioned it because that choice depends on how we define `deepcopy`.
I suppose …
FWIW, I haven't read all of this thread but I don't think that …
Also, in the future, we reserve the right to make bignums actually immutable, so anyone who is relying on the current behavior is going to have their code broken. We may as well break it now.
Since you did not read the entire thread, I'll repeat myself. Please reconsider …
Currently, deepcopy has the extremely useful side effect of getting rid of heap fragmentation and making subsequent access (in DFS == deepcopy order) much more cache-friendly. Regardless of the default behavior of deepcopy, this should remain conveniently accessible. In order to be more realistic, mentally replace shuffle! by some sort!.
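A minimal sketch of the workflow described in this comment: build a structure, permute it (standing in for the sort!), then deepcopy to re-allocate its pieces contiguously in traversal order. No timings are asserted, since the benefit concerns heap layout, which plain assertions can't capture.

```julia
# Sketch of the defragmentation use case: after permuting, heap placement of
# the inner vectors no longer matches access order; deepcopy re-allocates
# them freshly, in traversal (index) order.
data = [rand(8) for _ in 1:1000]
perm = sortperm([sum(v) for v in data])   # stand-in for some sort!
permute!(data, perm)

compacted = deepcopy(data)   # fresh, near-sequential allocations
@assert compacted == data && compacted[1] !== data[1]
```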
With a private bignum copy API you would just replace …
@rfourquet I really like your strict version.

@StefanKarpinski, on depthlimit: in principle, deepcopy is called on a directed graph, following object references and respecting object identity (graph homomorphism). What is the proposed behavior if an object X is reachable by paths of different lengths? Say:

Root <-> A1 <-> A2 <-> X

Now we must copy everything (if we want to preserve object identity), regardless of the depth limit and length of the chain. Preserving object identity by depth level must do something like "take the transitive closure, choose a favorite way of assigning levels in the Hasse diagram, go that far for the copy but not further". This sounds far too fancy for base.
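The graph property at stake here can be checked directly; a small sketch with invented names (`X`, `A1`, `A2` mirror the chain above), showing that an object reachable via several paths is copied exactly once, preserving identity within the copy:

```julia
# deepcopy respects object identity: X is reachable both directly from A1
# and through A2, yet the copy contains exactly one new X.
X = [0]
A2 = Any[X]
A1 = Any[A2, X]
root = Any[A1]

c = deepcopy(root)
cA1 = c[1]
@assert cA1[2] === cA1[1][1]   # both paths lead to the single copy of X
@assert cA1[2] !== X           # which is independent of the original
```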
I'm not sure I see the problem with depth-limiting …
You get a copy of something if it is reachable within that number of recursive steps; otherwise you don't. (ftfy.) It's a problem mostly just in that it's subtle, and would give even worse performance than …
True. What about the strict option? So, say, copy and deepcopy don't copy `BigInt`s unless strict=true? Then the default could change in a later version, without breaking anything, if strict turns out to be more useful in practice. Also, won't the same question apply to strings if they ever become immutable (cf. #22193)?
The strict thing seems pretty weird to me. What does it mean in general? When can you pass the strict keyword to copy and deepcopy? Strings are already immutable on master. |
Strict deepcopy would mean "really, really make a copy; make it so that the resulting object tree is indistinguishable from serializing-and-deserializing, even if I abuse pointers, and including heap placement". This means that we need to copy the actual bytes of bignums and strings into newly allocated space (which is allocated in DFS order, hence pretty predictably placed on the heap, and the deepcopy works even if someone later uses an unsafe_store! to mutate bignums in place).

Non-strict would mean "give me a semantic deepcopy, as fast as possible", which allows us to be faster and allocate less memory. In 99% of cases, we have no optimization for strict=false and both do the same thing.

After such a change, A = deepcopy(A, strict=true) would be the preferred way of cleaning up an object tree with bad placement (maybe because A was constructed over a long time, or because the user applied a permutation like sorting some arrays), and B = deepcopy(A, strict=false) would be the preferred way of pulling a copy for algorithmic reasons.

Currently, only strict=true is implemented; this pull request introduces the first optimization for strict=false, and originally proposed to make the strict=true behavior entirely inaccessible. I want the optional argument because I think that strict copies are quite useful. I think that a change from default strict=true to strict=false is more breaking than the other way (because of code doing evil pointer tricks, or crucially relying on good memory patterns); if non-strict is supposed to become the default, then it would probably be better to do this now.

Regarding strings: can we implement a deepcopy of strings without copying the actual buffer? Would you feel comfortable if a copy of a string only copied the pointer (with strict=false)? Wouldn't you feel that some way of "nope, I am going to unsafe_store! into this string/bignum, please allocate and memcopy!" should be provided by the language (e.g. via strict=true)?
Ultimately, there only is a difference for "weird" types that need a custom deepcopy function, like bignums (which abuse pointers and finalizers), or possibly types that directly hook into the runtime, or are builtin.
It's a bit short on time for designing the …
We define … We define … I'm not sure if we should define a default fallback as …
I think we can leave this as is for now and introduce a …
Design notes for …
Very outdated.
Starting from bug #16667, two fixes have been proposed: #16826, which implements `deepcopy(::BigInt)` as the identity, and #16999, which implements `deepcopy` by creating a totally independent new `BigInt` object.

As far as I could understand, the rationale for the latter (which was the one merged) was to follow a strict interpretation of the docstring for `deepcopy`, which mentions in particular a "fully independent object" and also "Calling deepcopy on an object should generally have the same effect as serializing and then deserializing it." @yuyichao argued for example that this definition forces us to implement `deepcopy` such that its result is indistinguishable from serializing and then deserializing.

I believe the correct fix to the initial bug is the first PR, as I see no point in forcing wasting resources on copying the internals of `BigInt` when the semantics and the API for this type are to be immutable. If the definition of `deepcopy` prevents defining `deepcopy(x::BigInt) = x`, then the definition needs to be fixed, which is what I tried to do here (suggestions for improvements are very welcome), in addition to defining `deepcopy` as the identity. I also cherry-picked the tests implemented by @ScottPJones in his PR.

It's possible that I missed a use case for the stricter interpretation (copying the internals), but none has been exhibited so far. On the other hand, the use case for `deepcopy == identity` is simply performance, for when one has to deepcopy a structure containing `BigInt`s.
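For concreteness, the two competing definitions differ only in object identity, never in value. A small sketch; since the identity outcome depends on which PR's behavior is in effect, only value equality is asserted:

```julia
# Under #16999 (merged), deepcopy returns a fully independent BigInt;
# under this PR's proposal, deepcopy(x::BigInt) would return x itself.
# Either way, the value is unchanged.
x = big(2)^100
y = deepcopy(x)
@assert y == x   # equal in value under either proposal
```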