caching: order and use latest node when cache more than one #5197
Conversation
As @giovannipizzi commented in #5178 (comment), I think the current implementation only caches by order for the …
Right now I cannot think of any situation in which getting the last cached node would not be the desired effect (especially since all nodes are supposed to be equivalent, and currently the only possible difference is whether their remote folder is still there or not). More than that, even if there were a situation in which one would like a different node, just getting the first node the querybuilder "arbitrarily" chooses is probably not what you wanted either. So for me, always getting the last created node is already an improvement and has a good chance of always being the right thing to do.
Now, my only question with this is whether passing `kwargs` here through all these functions is a good idea. I feel like, although versatile, using `kwargs` can sometimes end up obscuring the flow of information through the program, and I now personally try to avoid it unless there is no other solution. Moreover, given what I said in the previous paragraph, I'm not sure we even want to control this from the higher-level functions instead of having it always on. So I would maybe just start with `_iter_all_same_nodes` always returning a list ordered by date (a sketch of what this could look like follows below), and we can add the versatility later if it is needed.
It would perhaps be good to have confirmation on this idea from all the participants in the other discussion (@greschd @sphuber @chrisjsewell @giovannipizzi). No need to review the code (unless you want to); the priority is just to say whether you agree or not with that statement. (* I know it is not exactly "random", but it is unreliable.)
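For concreteness, here is a minimal sketch (not the actual implementation) of what "always returning a list ordered by date" could look like with the `QueryBuilder`, assuming the hash of the candidate nodes is stored in the `_aiida_hash` extra:

```python
from aiida.orm import Node, QueryBuilder


def iter_same_nodes_newest_first(node_hash):
    """Yield all stored nodes with the given hash, most recently created first."""
    builder = QueryBuilder()
    builder.append(Node, filters={'extras._aiida_hash': node_hash})
    # Order by creation time, newest first, so the most recent equivalent
    # node is always the first candidate returned.
    builder.order_by({Node: [{'ctime': {'order': 'desc'}}]})
    for (node,) in builder.iterall():
        yield node
```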
I am not sure, to be honest. I don't think it does any harm, but to me it is almost as arbitrary as picking a random one. @unkcpz found a use case where one particular node was being selected that he didn't want to be selected, and in that case a patch would be to take the last one. But there are plenty of other imaginable scenarios where this wouldn't be the case. Rather, the real solution here should be to exclude that particular node from being a valid cache, which is what the real underlying problem was. So even though this implementation couldn't hurt, it adds unnecessary complexity to solve an XY problem. Instead, I think we should just make it easier to exclude nodes from being a valid cache, as described in #5189. I have a PR that is almost finished and would go for that option instead.
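For reference, a hypothetical sketch of what excluding a node as described in #5189 might look like; the exact API is whatever that PR ends up introducing:

```python
from aiida.orm import load_node

node = load_node(1234)        # placeholder pk of the unwanted cache source
node.is_valid_cache = False   # assumed settable flag along the lines of #5189
```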
Well, I think these are two different problems. It is absolutely good to have an API to control the caching of the processes. But the API still needs to be used by users; I agree that it is always an improvement to have a default (or a less 'random' way of choosing the caching node) so that we know which node is used for caching.
That is the problem, though, as I see it: choosing the last one is just as random as choosing any other one from a list of identical nodes. Note the stress on identical, because that is what we are dealing with here. We have a list of identical objects, and we need to pick one. Why would always picking the last one be any less random than picking any other one? If they are truly identical, you can't tell anyway. Anyway, if the others also think this is fine to add, I won't stop it.
As @ramirezfranciscof said, I think it is good to get the opinion of the others on the mechanism first before going into review, but since I already skimmed it, I will leave some comments now so as not to forget them.
```diff
@@ -678,7 +678,7 @@ def store(self, with_transaction: bool = True) -> 'Node':  # pylint: disable=arg
         self._backend_entity.clean_values()

         # Retrieve the cached node.
-        same_node = self._get_same_node() if use_cache else None
+        same_node = self._get_same_node(order_by={'ctime': 'desc'}) if use_cache else None
```
Could do with a comment on why this is being ordered.
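For instance, the requested comment could read something like this (suggested wording, not taken from the PR):

```python
# Retrieve the cached node. When several equivalent nodes exist, order them by
# creation time and take the most recent one: all candidates are identical by
# construction, but the newest is the most likely to still have its remote
# working directory available.
same_node = self._get_same_node(order_by={'ctime': 'desc'}) if use_cache else None
```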
tests/orm/node/test_calcjob.py
```python
    return _create_cacheable_node


@pytest.mark.usefixtures('clear_database_before_test')
def test_caching(create_cacheable_node):
```
I am wondering if this is the best way to test the new behavior. The fact that the node created last was used for the caching is no guarantee that this happened because of proper ordering. It could have happened by chance, and so it would be easy to introduce a regression that this test doesn't detect.
Maybe what we should do here is to mock `Node._iter_all_same_nodes` to the following:
```python
def _iter_all_same_nodes(self, allow_before_store=False, **kwargs):
    assert kwargs['order_by'] == {'ctime': 'desc'}
```
and then just call `node.store()`. This will guarantee that the ordering is passed. Of course this ties the test somewhat to the implementation, but at least it is a real unit test that will catch a regression if this were ever to be changed by accident.
If this is the better way to test it, maybe also just move it to `tests/orm/node/test_node.py`, as we are testing a method of the `Node` base class and it is not something specific to the `CalcJobNode` subclass.
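A runnable version of that suggestion could look roughly as follows. This is a sketch: the fixture name and its behaviour (returning an unstored, cacheable node) are assumptions about the surrounding test infrastructure, not part of the suggestion itself:

```python
import pytest

from aiida.orm import Node


@pytest.mark.usefixtures('clear_database_before_test')
def test_caching_order(create_cacheable_node, monkeypatch):
    """Check that storing passes the ctime ordering down to `_iter_all_same_nodes`."""
    recorded = {}

    def mock_iter_all_same_nodes(self, allow_before_store=False, **kwargs):
        recorded['order_by'] = kwargs.get('order_by')
        return iter([])  # behave as if no equivalent node exists

    monkeypatch.setattr(Node, '_iter_all_same_nodes', mock_iter_all_same_nodes)

    node = create_cacheable_node()  # assumed: fixture returns an unstored, cacheable node
    node.store()

    # Also fails if the mock was never reached, e.g. because caching was disabled.
    assert recorded['order_by'] == {'ctime': 'desc'}
```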
I would agree with you that in cases like selecting a number from a list of integers this is irrelevant, since whether you take the first one or any other makes no difference. If you now agree that there is some value in selecting with a consistent criterion, then I think having that criterion be the most recently created node, and not another one, is less contentious. Not only is this a natural universal selection criterion (together with "alphabetically" and... not many others, no? 🤔), but in this case it has the advantage of a higher likelihood of getting a node with an existing remote folder. Sure, it does not actually guarantee it, and you could even say the increase in likelihood is not that big really, but still a "very small advantage" is better than "no particular advantage whatsoever" for any other criterion I can think of 😅.
As @unkcpz said, I think these are addressing two separate aspects of the problem, and having both can be useful for different reasons (in particular, those in the paragraphs above). Also, I don't think this necessarily adds unnecessary complexity to the code (bah, not sure what you mean by complexity here, but including the ordering of nodes could be just a one-line addition, and the impact on performance should not be very large).
I'm not so sure about this: right now, we can essentially make a query that can return as soon as the first match is found. If we guarantee using the latest, then in the worst case (i.e., if there is no index on the time [1]) we'd have to go through all matching nodes and order them. I don't have any insight into query performance and what is or isn't indexed; I just wanted to point out that this statement is not self-evident. Other than that, I agree that returning the newest node is an improvement over the status quo.

[1] As a side note, if caching performance is an issue (it might well be, I never tested it on big DBs!), we should consider adding an index on the hash.
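On the footnote: a hash index could in principle be added as an expression index on the extras column. A hypothetical sketch for a PostgreSQL backend follows; the table and column names (`db_dbnode`, JSONB `extras`) are assumptions about the schema, and the DSN is a placeholder:

```python
from sqlalchemy import create_engine, text

# Placeholder DSN; point this at the profile's PostgreSQL database.
engine = create_engine('postgresql://user:password@localhost/aiida_db')

with engine.begin() as conn:
    # Expression index on the '_aiida_hash' key of the JSONB extras column,
    # so that equality lookups on the node hash can use an index scan.
    conn.execute(text(
        'CREATE INDEX IF NOT EXISTS idx_dbnode_aiida_hash '
        "ON db_dbnode ((extras ->> '_aiida_hash'))"
    ))
```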
Mmm, you think so? I mean, my logic was that when there is no equivalent hashed node, you are already going through all of the possible calcjobs and checking if the hash coincides, no? So although I understand why you say this would be slower than finding the first hit, I think it would still be roughly comparable to the most common hash-search query.
I think it could make a big difference if it's a data type with lots of duplicates, like, for example, a simple base type. This is a slightly exaggerated example, because we actually search for process nodes, not base types. Actually, thinking about this, does anyone have experience with the caching performance on really large DBs? This was always a concern we've pushed to "when it becomes problematic". Compared to the initial design, what makes this somewhat worse is that -- if I remember correctly -- back then, different …
Except that …
My 2 cents: I agree these are 2 independent problems, and (after checking for performance) I agree that returning the last one is a good change for this PR (but it would be good to do some benchmarking on a large DB?). If we use the …
Practically, I would suggest creating a few identical nodes in an otherwise big DB (millions of nodes), and then searching for any equivalent node (like now) and for the most recent one (like suggested), and measuring the speed. If there is no real problem, I'd go ahead with this (but push for the API to 'uncache' nodes and document how to use it, e.g. in the case of @unkcpz).
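A hypothetical sketch of such a measurement with the `QueryBuilder` (the hash value is a placeholder, and this assumes the hash lives in the `_aiida_hash` extra):

```python
import time

from aiida.orm import Node, QueryBuilder


def time_first_match(node_hash, newest_first=False):
    """Time how long it takes to fetch one equivalent node."""
    builder = QueryBuilder()
    builder.append(Node, filters={'extras._aiida_hash': node_hash})
    if newest_first:
        builder.order_by({Node: [{'ctime': {'order': 'desc'}}]})
    start = time.perf_counter()
    builder.first()  # fetch a single matching node
    return time.perf_counter() - start


print('any match:   ', time_first_match('<hash>'))                     # current behaviour
print('newest match:', time_first_match('<hash>', newest_first=True))  # proposed behaviour
```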
Thanks @giovannipizzi. I am happy to do the check and the performance test and then go and finish this PR, but I have the following practical questions.

What is the meaning of …

Is there any big DB available somewhere exactly for this kind of performance testing, or do we usually test on the production DB that we are working on? If there is no such testing-purpose DB, we should have one, I think.
The `ctime` field is indexed (`db_index=True`):

```python
ctime = m.DateTimeField(default=timezone.now, db_index=True, editable=False)
```

The …
I think the biggest one out there is still the …
As a final clarification: my point was not that we should never order the equivalent nodes by …
Back to the original caching issue that I wanted to work around with this PR: I now agree with @sphuber that this should NOT be the standard way to control the caching. So I personally prefer that this be closed.
I have quite a few databases of different sizes. What # of nodes are you looking for? :)
Closing this and redirecting it to the discussion in #5223. For now, to work around the issue I mentioned in #5178, I query for the same node manually and use the latest node, invalidating caching on all the other old nodes, first by removing the …
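A sketch of that manual workaround, assuming the hash is stored in the `_aiida_hash` extra (removing it makes a node invisible to cache lookups); `node_hash` is a placeholder and `delete_extra` is the 1.x-style API:

```python
from aiida.orm import Node, QueryBuilder


def keep_only_latest_as_cache(node_hash):
    """Leave only the most recently created equivalent node as a cache source."""
    builder = QueryBuilder()
    builder.append(Node, filters={'extras._aiida_hash': node_hash})
    builder.order_by({Node: [{'ctime': {'order': 'desc'}}]})
    nodes = [entry[0] for entry in builder.all()]
    for old_node in nodes[1:]:  # every equivalent node except the newest
        old_node.delete_extra('_aiida_hash')
```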
@mbercx thanks! As Giovanni mentioned, maybe 'millions' of nodes?
I think the …