json.nim: smaller init size #15048
Conversation
There was a recent `rightSize` change in tables.nim, so the existing value (4) was creating tables that were too large.
The real bug is in slotsNeeded in hashcommon.nim: slotsNeeded should add 1, not 4 (no idea where 4 came from). Then requesting a 1-item table will give 2 slots, 2 items -> 4 slots, 3 items -> 8 (not ideal, could be 4), 4 -> 8, etc. A 4-item table needs an empty slot, so 8 is the smallest possible. As written, slotsNeeded yields 8 for a 1-3 item table, 16 for a 4-item table, etc. It doesn't matter for large tables of course, but for tiny tables it does. Then in mustRehash, instead of …
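The growth arithmetic quoted above can be checked with a small sketch. This is Python rather than the actual hashcommon.nim code (which is not preserved on this page); the `nextPowerOfTwo(count * 3 div 2 + pad)` shape is an assumption inferred from the slot counts quoted in the comment, with `pad=4` as the old constant and `pad=1` as the proposed fix:

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two >= n (assumes n >= 1)."""
    p = 1
    while p < n:
        p *= 2
    return p

def slots_needed(count: int, pad: int) -> int:
    # Sketch of slotsNeeded: 1.5x headroom plus a constant pad,
    # rounded up to a power of two. pad=4 is the old behavior,
    # pad=1 is the proposed fix.
    return next_power_of_two(count * 3 // 2 + pad)

# Old rule (pad=4): 8 slots for a 1-3 item table, 16 for 4 items.
for n in (1, 2, 3):
    assert slots_needed(n, 4) == 8
assert slots_needed(4, 4) == 16

# New rule (pad=1): 1 -> 2 slots, 2 -> 4, 3 -> 8, 4 -> 8.
assert [slots_needed(n, 1) for n in (1, 2, 3, 4)] == [2, 4, 8, 8]
```

The asserts reproduce exactly the slot counts quoted in the comment, including the "not ideal" case where 3 items land in 8 slots though 4 would suffice.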
@c-blake your opinion on the above? (You're the author of ….)
I believe I left … I agree that "minimal tables" are sometimes useful when you have zillions of small tables. I could see json being one such case. So, I am +1 on the idea of lowering the minimum size. I would only say two things: A) make sure that mustRehash and slotsNeeded are inverses in the proper sense, in that …, and B) it might be possible to do even better than the idea here and have an empty table use no space at all (like …).

Beyond these two things, while I have your attention, I feel I should say that in my … For example, if you happen to have a great hash, like …

Besides that search depth measurement and growth rule, you do need a "memory attack" block for cases where, for example, someone set …

Anyway, a hash over-tuned for expected keys that winds up wrong about realized keys can cause other problems and …
As a double-check, I changed slotsNeeded and mustRehash in hashcommon.nim to:
Then added this tiny table test to tables.nim:
It works. But if …

What is harder to check is that it is resizing to the minimum possible size, because there is no t.slots to get the number of slots. I made t.data public so t.data.len is visible, added an echo to the test, and got this output (echoed after the insert occurs):

Sorry for the length; I didn't know a good way to verify the resizing behavior.
I added this test to tables.nim (this was preliminary; see actual test 8 posts down):
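The test code itself was not preserved on this page. A Python sketch of the kind of consistency check being described follows; `slots_needed` and `must_rehash` are stand-ins for the Nim procs, and the 2/3 load-factor bound and the "+2" free-slot margin are taken from later comments in this thread, not from the actual source:

```python
def next_power_of_two(n: int) -> int:
    p = 1
    while p < n:
        p *= 2
    return p

def slots_needed(count: int) -> int:
    # Proposed rule: add 1, not 4.
    return next_power_of_two(count * 3 // 2 + 1)

def must_rehash(slots: int, counter: int) -> bool:
    # Grow when load exceeds 2/3 or fewer than 2 slots are free.
    return slots * 2 < counter * 3 or slots - counter < 2

# One direction of the slotsNeeded/mustRehash inverse property:
# a table sized for n items must not demand a rehash while it holds
# n-1 items and the n-th is about to go in.
for n in range(1, 2000):
    slots = slots_needed(n)
    assert not must_rehash(slots, n - 1)

# The tightness direction is not exact: e.g. 3 items get 8 slots
# even though a 4-slot table holding 2 items would not rehash.
assert not must_rehash(4, 2)
```

As noted later in the thread, checking only one direction misses "tightness" (non-wasting of space), which is why two loops were suggested.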
Those two ifs would become doAssert. There are 2000 tests and 42 failures I'm going to look at. @c-blake, does this test look right? I had to add a slots proc, and since mustRehash uses the current table's slots and counts, it's a little weird.
Well, I don't know why you are even calling initTable and t.len and all that. I would think you can just call slotsNeeded (formerly rightSize) and mustRehash. That's what my old test did. It's probably in git history...

Another thing (maybe implicit in my prior comment) I should say is that there is absolutely NO WAY for …

Since we are talking about optimizing the way growth works, well, someone should mention these expert knobs again. Many language runtimes don't even let you presize (like Python, for example), which can make a pretty big perf delta. And the Nim stdlib currently doesn't "autoshrink" a table, or even interrogate it for its load factor so client code can rebuild a denser one. So, if you delete a bunch of stuff and then iterate over it a bunch of times, you will not have great performance. Right now deleters have to "shadow" the table population as they add/set/del with a wrapper object since …
mustRehash doesn't take args; it uses the current table size and counts. The 2nd if should have used <=, not ==. Now there are 6 failures and they are for cases where there are 0 or 1 items in the table - probably special cases because of the + 2 in mustRehash:
I feel like mustRehash used to take the cur size + 1 or something. Maybe that's why they deleted my tests. ;-)
See the new version of the tiny table test below. It doesn't detect changing the +2 to +1 in mustRehash, probably because of the >= 2 tests, but it does assert if +2 is changed to +3, or if the *2/*3 factors are changed.
That's looking better. Maybe …
+1 does not work if the size requested is 0 or 1. For the 0 case, the initial size will be 1. When mustRehash is called to add a new item, 1 - 0 is not less than 1, so no resize. But you cannot add an item to a 1-slot table, because a search will hang on missing keys. For the 1 case, the initial size will be 2. Before the first insert, in mustRehash 2 - 0 is not less than 1, so no resize - correct. On the 2nd insert, 2 - 1 < 1 is still false, so no resize again. But this is incorrect, because there would be no vacant slot and you'd get a hang again. See the test added below for tiny tables:
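The hang described above can be demonstrated with a toy linear-probing search (this is an illustration of the failure mode, not the stdlib code or the actual test): a miss terminates only when the probe sequence reaches an empty slot, so a completely full table loops forever.

```python
def probe_missing(slots: int, used: int, max_probes: int) -> bool:
    """Toy linear-probe search for an absent key: walk the table
    until an empty slot is found. Returns True if the search
    terminates within max_probes steps, False if it would spin."""
    table = [True] * used + [False] * (slots - used)  # True = occupied
    i = 0
    for _ in range(max_probes):
        if not table[i % slots]:
            return True          # empty slot reached: "not found", done
        i += 1
    return False                 # every slot occupied: scan never ends

# A 1-slot table holding 1 item: a lookup miss never terminates.
assert probe_missing(1, 1, 1000) is False
# A 2-slot table holding 1 item keeps a vacant slot, so misses stop.
assert probe_missing(2, 1, 1000) is True
```

This is why the free-slot margin in mustRehash has to be 2, not 1: the resize must fire before an insert consumes the last vacancy.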
After thinking about it, it may be a good idea to change the items requested in json.nim all the way down to 1. Then if someone uses lots of single-item objects, there will only be two slots instead of 4 (when 2 items are requested on init). I tested this by modifying json.nim to use (2), as committed, and the old hashcommon:
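The benchmark numbers themselves are not preserved here, but the slot-count arithmetic behind the comparison can be checked directly. A Python sketch, assuming the `nextPowerOfTwo(count * 3 div 2 + 1)` rule discussed earlier in this thread:

```python
def next_power_of_two(n: int) -> int:
    p = 1
    while p < n:
        p *= 2
    return p

def slots_needed(count: int) -> int:
    # Proposed sizing rule: 1.5x headroom plus 1, rounded up.
    return next_power_of_two(count * 3 // 2 + 1)

# json.nim requesting 1 item instead of 2 halves each tiny table:
assert slots_needed(1) == 2   # initTable(1) -> 2 slots
assert slots_needed(2) == 4   # initTable(2) -> 4 slots
assert slots_needed(4) == 8   # the original request of 4 -> 8 slots
```

So for workloads with zillions of single-item JSON objects, requesting 1 saves half the slot storage per object relative to requesting 2.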
With json(1) and new hashcommon:
The test for slotsNeeded matching mustRehash should be this (gotta still do the insert):
I found my old test. I guess TC removed it, and I neglected to restore it in my reversal of all his pseudorandom probing churn back in February/March: 8c22518. It seems that test was more one-sided anyway, not checking "tightness"/non-wasting of space. So maybe there should be two loops, one checking one side and another the other, or like you have in your test above. Honestly, I think moving …
Also, since we do not know how big …

This change has broader implications, like checks on lookups for an empty table. (It might have a tiny impact on larger tables, since they would always have to check to support the possibility that nothing is there, and people do seem to care more about even tiny perf deltas on larger tables than about memory-optimizing tiny ones.)
One thing I like about passing the table to functions like mustRehash is that if the decision becomes more complicated, mustRehash will have access to all the data needed. You do this in adix. So if loadFactor, maxProbeDepth, etc are added as initTable options, mustRehash will have access to them without adding new parameters.
What implications? These checks are already present in the table code. For lookups, it checks to see if t.data.len is 0 and returns false (lookup failed). For inserts, it calls checkIfInitialized before calling mustRehash, and if t.data.len is 0, it calls initTable with the defaultInitialSize.
This already happens, as explained, though instead of calling initTable with 1 (which would create a 2-slot table), it uses defaultInitialSize. That's probably right for most use cases, though for the JSON case, creating a 64-slot table may not be ideal. So I had to test that :-) by changing checkIfInitialized to create a 1-item table instead and running the JSON benchmark. It didn't make any difference.
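The lazy-initialization pattern described in the last two comments can be sketched as follows. This is a toy illustration, not the stdlib implementation; `DEFAULT_INITIAL_SIZE = 64` is an assumption taken from the 64-slot figure mentioned above, and `LazyTable` is a hypothetical name:

```python
DEFAULT_INITIAL_SIZE = 64   # assumed from the 64-slot figure above

class LazyTable:
    """Toy sketch of the checkIfInitialized pattern: the slot array
    is not allocated until the first insert, so an empty table uses
    (almost) no space."""
    def __init__(self):
        self.data = []      # no slot array allocated yet
        self.items = {}     # stand-in for the real slot bookkeeping

    def lookup(self, key):
        if len(self.data) == 0:     # empty-table fast path on lookups
            return None
        return self.items.get(key)

    def insert(self, key, val):
        if len(self.data) == 0:     # checkIfInitialized equivalent
            self.data = [None] * DEFAULT_INITIAL_SIZE
        self.items[key] = val

t = LazyTable()
assert t.lookup("a") is None            # no allocation triggered
assert len(t.data) == 0
t.insert("a", 1)                        # first insert allocates
assert len(t.data) == DEFAULT_INITIAL_SIZE
assert t.lookup("a") == 1
```

The per-lookup emptiness check is the "tiny impact on larger tables" mentioned earlier, and swapping `DEFAULT_INITIAL_SIZE` for a 1-item request is the experiment described in this comment.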
I stand corrected. :-) It's true that having a handle on the table gives you more options. But adix also resizes based on search depth, not a guess at load factor to control search depth. So this slotsNeeded-mustRehash inverse relation is not important there. My re-org idea was really contingent upon staying with the current "control bad perf with a load factor guess" approach instead of "bound it directly, but with harder-to-predict space usage". No one really responded to that idea. :-)
Anyway, a TLDR for @narimiran - I am 100% behind a 0/2-slot minimum table, and we seem to be on the way to having a good, smooth, minimal growth policy in the old load-factor-control regime, but there are many possibilities for how smart/expert we want stdlib tables to be.
See #14995 (comment)