should Base.summarysize include alignment? #32881

Open
sjoelund opened this issue Aug 13, 2019 · 7 comments · Fixed by #32886
Labels
needs decision: A decision on this change is needed

Comments

@sjoelund

Base.summarysize seems inconsistent and feels like the return value cannot be relied upon. The documentation also says different things than the actual implementation.

It says it will compute the memory used by unique objects.

However, strings are not checked for uniqueness (I would expect it to find references that are identical, not identical content):

julia> a="abc"
"abc"

julia> Base.summarysize([a,a,a])
97

julia> Base.summarysize(["abc","def","ghi"])
97
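
A minimal sketch of how one might check the sharing described above. It only uses `pointer` and `Base.summarysize`; the variable names are illustrative, and the exact byte counts depend on the Julia version, so none are shown here.

```julia
# Compare a vector that holds one String three times with a vector of
# three distinct Strings. Byte counts vary between Julia versions.
a = "abc"
shared = [a, a, a]
distinct = ["abc", "def", "ghi"]

# Compare the data pointers to confirm that two slots refer to the same buffer.
same_buffer = pointer(shared[1]) == pointer(shared[2])

shared_size = Base.summarysize(shared)
distinct_size = Base.summarysize(distinct)

println("same buffer: ", same_buffer)
println("shared: ", shared_size, " bytes, distinct: ", distinct_size, " bytes")
```

If unique String objects are counted once, the shared vector should report no more than the distinct one; if both report the same number, sharing is being ignored.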

Then it says it will compute the memory used by objects. As far as I know, memory allocations are aligned, so I would expect all sizes to be multiples of at least 4, but the string "a" has a size of 9:

julia> Base.summarysize("a")
9
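
To make the alignment expectation concrete, here is a small sketch that rounds a byte count up to an alignment boundary. `align_up` is a hypothetical helper, and the alignments 4, 8 and 16 are illustrative values only, not a statement about what Julia's allocator actually uses.

```julia
# Round a byte count up to the next multiple of `alignment`.
# `align_up` is a hypothetical helper, not part of Base.
align_up(nbytes::Integer, alignment::Integer) = cld(nbytes, alignment) * alignment

for alignment in (4, 8, 16)
    println("9 bytes aligned to ", alignment, " -> ", align_up(9, alignment))
end
```

With any of these alignments the 9 bytes reported for "a" would grow to 12 or 16.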

Then there is the fact that sizeof() is used to calculate the size of each field and of array elements. An array of 100 elements whose element type is a union of 3 different types, all with sizeof == 0, does not consume 0 memory:

julia> struct ABC end

julia> Base.summarysize([ABC(),nothing])
40

julia> Base.summarysize([ABC(),nothing,nothing])
40

Wouldn't it make sense to compute this in a C function where the user cannot overload sizeof operators, etc., and actually return the size that the data consumes in memory (including overhead)? The garbage collector should know the sizes of all objects, after all. A function like that should give the user a much better hint of how much memory an object consumes.

I tested the code in

Julia Version 1.3.0-alpha.0
Commit 6c11e7c2c4 (2019-07-23 01:46 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

and

Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
@JeffBezanson
Member

summarysize calls Core.sizeof, which is not overloadable. There are some specific things we can fix here though:

  • only count unique String objects, and add alignment
  • add the space used by the union selector bytes in Array

@mbauman
Member

mbauman commented Aug 13, 2019

Didn't we at some point explicitly state in the docs that it's a good-effort rough estimate and should not be relied upon? Or perhaps that was just a Jameson comment on GitHub that never got incorporated into the docs. In any case, it seems like we can improve the good-effort to best-effort in some places, and perhaps for the others we should document that it'll never be 100% perfect for the remainder.

@StefanKarpinski
Member

StefanKarpinski commented Aug 13, 2019

This made me think of a different, more precise way to implement this functionality. Instead of using heuristics to try to guess which memory belongs to which object, track an ordered set of non-overlapping, non-adjacent intervals of memory which are in use. When a new bit of memory is added to the data structure, it is either merged with one or more existing intervals that it touches, or it is inserted as a new, isolated interval in the structure. Then at the end, the total memory reached is the sum of the lengths of the intervals. This gives the precise amount of memory that is reachable from any set of objects. It also handles the recursion aspect: you recur into references only if they point into memory that is not already in some interval.
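
The interval bookkeeping described above could be sketched roughly as follows. This is only an illustration of the merging step: `IntervalSet`, `add!` and `total_bytes` are hypothetical names, ranges are half-open `[lo, hi)` address ranges stored in a sorted vector, and the part that walks real objects and obtains their addresses and sizes is not shown.

```julia
# Sorted list of half-open byte ranges [lo, hi) that never overlap or touch.
struct IntervalSet
    ranges::Vector{Tuple{UInt,UInt}}
end
IntervalSet() = IntervalSet(Tuple{UInt,UInt}[])

# Insert [lo, hi), merging it with every stored range it overlaps or touches.
function add!(set::IntervalSet, lo::Integer, hi::Integer)
    lo, hi = UInt(lo), UInt(hi)
    lo < hi || return set
    r = set.ranges
    # Ranges are sorted by start and disjoint, so they are also sorted by end.
    i = searchsortedfirst(r, (lo, lo); by = last)   # first range ending at or after lo
    j = searchsortedlast(r, (hi, hi); by = first)   # last range starting at or before hi
    if i > j
        insert!(r, i, (lo, hi))                     # isolated: nothing to merge
    else
        merged = (min(lo, r[i][1]), max(hi, r[j][2]))
        splice!(r, i:j, [merged])                   # replace ranges i..j with the merged one
    end
    return set
end

# Total number of bytes covered by all ranges.
function total_bytes(set::IntervalSet)
    total = UInt(0)
    for (lo, hi) in set.ranges
        total += hi - lo
    end
    return Int(total)
end

s = IntervalSet()
add!(s, 0, 16)
add!(s, 32, 48)
add!(s, 16, 24)          # touches [0, 16), giving [0, 24)
println(total_bytes(s))  # 40
add!(s, 20, 40)          # bridges both ranges, giving [0, 48)
println(total_bytes(s))  # 48
```

A sorted vector keeps the sketch short; a balanced tree would avoid the linear cost of `splice!` when many ranges are stored.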

@JeffBezanson
Member

Yes, it's a rough estimate and shouldn't be relied on for anything serious. There are some real reasons for that, for example that whether immutable objects are copied is subject to change. But we can always improve it to be as accurate as reasonably possible.

Another fundamental issue is that we can't really follow the user's mental model of object ownership. For example say X points to A and B, and Y points to B and C. 3 objects are reachable from each of X and Y, but there are only 5 objects total. And perhaps B "belongs" to X in some semantic sense, but Y's reference is incidental.
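
The X/Y example can be reproduced with a toy object graph. This sketch counts reachable objects by identity using an `IdDict`; the `Node` type and the helper functions are hypothetical and only count `Node` instances, not the strings or vectors they contain.

```julia
# Toy graph node; `Node` and the helpers below are illustrative only.
mutable struct Node
    name::String
    children::Vector{Node}
end
Node(name) = Node(name, Node[])

# Record every Node reachable from `root`, keyed by object identity.
function reachable!(seen::IdDict{Any,Nothing}, root::Node)
    haskey(seen, root) && return seen
    seen[root] = nothing
    for child in root.children
        reachable!(seen, child)
    end
    return seen
end

# Number of distinct Nodes reachable from any of the given roots.
function count_reachable(roots::Node...)
    seen = IdDict{Any,Nothing}()
    for r in roots
        reachable!(seen, r)
    end
    return length(seen)
end

A, B, C = Node("A"), Node("B"), Node("C")
X = Node("X", [A, B])
Y = Node("Y", [B, C])

println(count_reachable(X))     # 3: X, A, B
println(count_reachable(Y))     # 3: Y, B, C
println(count_reachable(X, Y))  # 5: B is counted once
```

The per-root counts sum to 6 while the combined count is 5, which is why a per-object size cannot simply be added up across objects that share data.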

A somewhat real example is:

julia> Base.summarysize(Int)
176

julia> Base.summarysize([Int,Int])
56

We can tell you how big the representation of the Int type is, but if you have an array of them it doesn't quite seem right to charge the array that amount. Similar for Modules. To the extent we use heuristics, that's what they're about. Just computing total reachability is not hard, since we don't have object interior pointers or overlapping objects.

@sjoelund
Author

Well, it would be useful to have a function that computes the size using total reachability, at least for data objects (I understand why types, functions, and modules would be hard to count).

Base.summarysize is quite slow due to counting unique objects (I assume). It takes longer for me to run summarysize on a 100 MB+ object than it does to construct the object (I am not sure how big it really is, because summarysize cannot be trusted and there are many shared strings and structs without fields in there). But it's OK if it's slow, as long as it produces slightly more realistic figures for memory consumption.

It is today quite hard to figure out how much memory your object uses in the end and how much memory pressure could be saved by doing for example some memoization or caching. Total number of allocated bytes using @time I find a better estimate than summarysize.

julia> Base.summarysize(Int)
176

julia> Base.summarysize([Int,Int])
56

I would have sort of expected that Int was built in and constant (and thus have size 0). But if it were of size 176, I would have expected the [Int,Int] object to have size 40+8+8+176: the size of the array needed to fit the two pointers, plus the type itself, which has size 176 and is unique, assuming pointer_from_objref(Int) == pointer_from_objref(Int).

@JeffBezanson
Member

> It takes longer for me to run summarysize on a 100 MB+ object than it does to construct the object

Yes, that is expected due to counting unique objects as you said, and also because the summarysize code is reused for all types and not specialized. Computing it very quickly would not be worth the compilation overhead.

> Total number of allocated bytes using @time I find a better estimate than summarysize.

Yes, that number should be quite accurate, but measures something totally different, including garbage objects. If you like that number, you wouldn't want [Int] to include the 176 bytes, since your code never allocates that object. Maybe the size of types should depend on whether they are hash-consed (we do that internally for some types) --- if so, give 0, otherwise count the size for each unique object.
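
A rough sketch of the difference between the two measurements, using `@allocated` (the same byte counter that `@time` reports) next to `Base.summarysize`. The `build` function is a made-up example; it grows a vector with `push!`, which typically reallocates the buffer several times, so part of what was allocated is already garbage when the function returns. No numbers are shown because they depend on the Julia version.

```julia
# Hypothetical workload: grow a vector one element at a time.
function build(n)
    out = Int[]
    for i in 1:n
        push!(out, i)
    end
    return out
end

build(10)                                # warm up so compilation is not measured
allocated = @allocated build(100_000)    # bytes allocated during the call, garbage included
result = build(100_000)
retained = Base.summarysize(result)      # estimate for the object that survives

println("allocated during call: ", allocated, " bytes")
println("summarysize of result: ", retained, " bytes")
```

The first number describes the work done to build the object, the second describes what is still reachable afterwards, so they answer different questions.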

Modules seem to be a different case. It can be useful to do summarysize(MyModule) to see roughly how much memory is taken by loading a certain package. But I don't think we should recur into Modules for other references to them.

But I'm not arguing --- we should make these improvements to summarysize.

JeffBezanson added a commit that referenced this issue Aug 13, 2019
- 0-field mutable structs take 1 word
- include alignment in object sizes
- take uniqueness into account for Strings
- include union selector bytes for Arrays
JeffBezanson added a commit that referenced this issue Aug 15, 2019
- 0-field mutable structs take 1 word
- take uniqueness into account for Strings
- include union selector bytes for Arrays
JeffBezanson reopened this Aug 15, 2019
@JeffBezanson
Member

Mostly fixed by #32886, but repurposing to discuss whether alignment/gc overhead should be included.

JeffBezanson changed the title from "Base.summarysize has inconsistent behavior" to "should Base.summarysize include alignment?" Aug 15, 2019
JeffBezanson added the "needs decision" label Aug 15, 2019
staticfloat reopened this Aug 15, 2019
5 participants