
caching: order and use latest node when cache more than one #5197

Closed
wants to merge 3 commits from caching-latest

Conversation

@unkcpz (Member) commented Oct 25, 2021

No description provided.

@unkcpz (Member, Author) commented Oct 25, 2021

As @giovannipizzi commented in #5178 (comment):

I see. Question to discuss: is using the latest node always the 'right' thing to do? (Maybe yes, but happy to get feedback, e.g. from @greschd). Maybe this should be configurable? (I mean in 'verdi config', per profile, where maybe the default is 'pick the latest' but you have options to e.g. pick the oldest, if we think there's a usecase for that).

Otherwise, one other thing to do is to define a function to search for all nodes that have the same hash and remove the hash key for those that match certain rules (e.g.: remove the hash from all duplicates except the newest?) Or this could create problems (maybe we should do this only for CalcJob nodes and not for Data nodes?).

I think the current implementation only applies the ordering to CalcJobNode caching, since the _cachable property is set to False on Node and is only re-enabled on CalculationNode (which covers CalcJobs and calcfunctions).
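For reference, a condensed view of how those class attributes are laid out, as described above (simplified; intermediate base classes omitted):

class Node:
    _cachable = False  # caching is disabled by default for all nodes

class CalculationNode(Node):
    _cachable = True  # re-enabled for process nodes (CalcJobs and calcfunctions)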

@ramirezfranciscof (Member) left a review:

Right now I cannot think of any situation in which getting the last cached node would not be the desired effect (especially since all nodes are supposed to be equivalent, and currently the only possible difference is whether their remote folder is still there or not). More than that, even if there were a situation in which one would like a different node, just getting the first node the querybuilder "arbitrarily" chooses is probably not what you wanted either. So for me, always getting the last created node is already an improvement and has a good chance of always being the right thing to do.

Now, my only question is whether passing kwargs here through all these functions is a good idea. I feel that, although versatile, using kwargs can sometimes end up obscuring the flow of information through the program, and I now personally try to avoid it unless there is no other solution. Moreover, given what I said in the previous paragraph, I'm not sure we even want to control this from the higher-level functions instead of having it always on. So I would maybe just start with _iter_all_same_nodes always returning a list ordered by date, and we can add the versatility later if it is needed.
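To illustrate, a minimal sketch (not the actual aiida-core source; the signature and the `_aiida_hash` extra key are assumptions) of an _iter_all_same_nodes that always returns the matches ordered by creation time, newest first:

from aiida.orm import QueryBuilder

def _iter_all_same_nodes(self):
    """Iterate over all stored nodes with the same hash, newest first."""
    node_hash = self.get_hash()
    if not node_hash or not self._cachable:
        return iter(())

    builder = QueryBuilder()
    # The hash is assumed to be stored in the `_aiida_hash` extra.
    builder.append(self.__class__, filters={'extras._aiida_hash': node_hash}, tag='node')
    builder.order_by({'node': {'ctime': 'desc'}})  # latest created node comes first

    return (entry[0] for entry in builder.iterall())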

@ramirezfranciscof (Member) commented:

It would perhaps be good to have confirmation on this idea:

getting the latest node is an improvement over getting the first node the QB randomly* found

from all the participants in the other discussion (@greschd @sphuber @chrisjsewell @giovannipizzi). No need to review the code (unless you want to); the priority is just to say whether or not you agree with that statement.

(* I know it is not exactly "random", but it is unreliable)

@sphuber (Contributor) commented Oct 29, 2021

I am not sure, to be honest. I don't think it does any harm, but to me it is almost as arbitrary as picking a random one. @unkcpz found a use case where one particular node was being selected that he didn't want to be selected, and in that case a patch would be to take the last one. But there are plenty of other imaginable scenarios where this wouldn't be the case. Rather, the real solution here should be to exclude that particular node from being a valid cache, which is what the real underlying problem was. So even though this implementation couldn't hurt, it adds unnecessary complexity to solve an XY problem. Instead, I think we should just make it easier to exclude nodes from being a valid cache as described in #5189. I have a PR that is almost finished and would go for that option instead.

@unkcpz (Member, Author) commented Oct 29, 2021

Well, I think those are two different problems. It is absolutely good to have an API to control the caching of processes, but that API still needs to be used by users. I agree that it is always an improvement to have a default (a less 'random' way of choosing the caching node) so that we know which node is used for caching.

@sphuber (Contributor) commented Oct 29, 2021

I agree that it is always an improvement to have a default (a less 'random' way of choosing the caching node)

That is the problem though as I see it: choosing the last one is just as random as choosing any other one, from a list of identical nodes. Note the stress on identical, because that is what we are doing here. We have a list of identical objects, and we need to pick one. Why would always picking the last one be any less random than any other one? If they are truly identical, you can't tell anyway.

Anyway, if the others also think this is fine to add, I won't stop it.

@sphuber (Contributor) left a review:

As @ramirezfranciscof said, I think it is good to get the opinion of the others on the mechanism first before going into review, but since I already skimmed it, I will leave some comments so as not to forget them.

@@ -678,7 +678,7 @@ def store(self, with_transaction: bool = True) -> 'Node': # pylint: disable=arg
         self._backend_entity.clean_values()
 
         # Retrieve the cached node.
-        same_node = self._get_same_node() if use_cache else None
+        same_node = self._get_same_node(order_by={'ctime': 'desc'}) if use_cache else None
@sphuber commented on the diff:

Could do with a comment on why this is being ordered.

return _create_cacheable_node

@pytest.mark.usefixtures('clear_database_before_test')
def test_caching(create_cacheable_node):
@sphuber commented on the diff:

I am wondering if this is the best way to test the new behavior. The fact that the node that was created last was used for the caching is no guarantee that this happened because of a proper ordering. It could have happened by chance, and so it would be easy to introduce a regression that this test doesn't detect.

Maybe what we should do here is to mock Node._iter_all_same_nodes as follows:

def _iter_all_same_nodes(self, allow_before_store=False, **kwargs):
    assert kwargs['order_by'] == {'ctime': 'desc'}

and then just call node.store(). This will guarantee that the ordering is passed. Of course this ties the test somewhat to the implementation, but at least it is a real unit test that will catch a regression if this were ever to be changed by accident.

If this is the better way to test it, maybe also just move it to tests/orm/node/test_node.py as we are testing a method of the Node base class, and it is not something specific to the CalcJobNode subclass.
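For instance, a rough sketch of such a unit test using pytest's monkeypatch fixture (the test name and the use of the create_cacheable_node fixture are assumptions; only the asserted kwarg comes from this PR):

from aiida.orm import Node

def test_store_passes_ordering(monkeypatch, create_cacheable_node):
    """Fail if store() stops requesting the equivalent nodes ordered by ctime."""

    def _mock_iter_all_same_nodes(self, allow_before_store=False, **kwargs):
        assert kwargs['order_by'] == {'ctime': 'desc'}
        return iter(())  # pretend no equivalent node exists

    monkeypatch.setattr(Node, '_iter_all_same_nodes', _mock_iter_all_same_nodes)
    create_cacheable_node().store()  # triggers the mocked lookup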

@ramirezfranciscof (Member) commented Oct 29, 2021

That is the problem though as I see it: choosing the last one is just as random as choosing any other one, from a list of identical nodes. Note the stress on identical, because that is what we are doing here. We have a list of identical objects, and we need to pick one. Why would always picking the last one be any less random than any other one? If they are truly identical, you can't tell anyway.

I would agree with you that in cases like selecting a number from a list of integers this is irrelevant, since whether you take the first 4 or the second 4 you find, they are literally the same constant. The problem is that the cached nodes aren't actually guaranteed to be completely identical; this is just a "best approximation" we can make with the information we can track. And this is not only because of the remote folder problem: even if they were run with the same code node, someone could have re-compiled the executable it points to, changed the default environment variables of the computer, etc. (basically any of the "leaks" in provenance we already discussed in the past). Don't get me wrong, I still think it is a good enough approximation to make the caching feature very useful, but as long as we have these references to external objects that we can't completely track, I think there is considerable value in having some consistent criterion when taking one of many "equivalent" things.

If you now agree that there is some value in selecting with a consistent criterion, then I think having that criterion be the most recently created node rather than anything else is less contentious. Not only is this a natural universal selection criterion (together with "alphabetically" and... not many others, no? 🤔), but in this case it has the advantage of a higher likelihood of getting a node with an existing remote folder. Sure, it does not actually guarantee it, and you could even say the increase in likelihood is not that big really, but still a "very small advantage" is better than "no particular advantage whatsoever" for any other criterion I can think of 😅.

So even though this implementation couldn't hurt, it adds unnecessary complexity to solve an XY problem. Instead, I think we should just make it easier to exclude nodes from being a valid cache as described in #5189. I have a PR that is almost finished and would go for that option instead.

As @unkcpz said, I think these are addressing two separate aspects of the problem, and having both can be useful for different reasons (in particular, the paragraphs above). Also, I don't think this necessarily adds any unnecessary complexity to the code (bah, not sure what you mean by complexity here, but including the ordering of nodes could just be a one-line addition and the impact on performance should not be much).

@greschd (Member) commented Oct 29, 2021

and the impact on performance should not be much

I'm not so sure about this: right now, we can essentially make a query that can return as soon as the first match is found. When we guarantee using the latest, in the worst case (i.e., if there is no index on time [1]) we'd have to go through all matching nodes and order them.

I don't have any insight into query performance and what is or isn't indexed; I just wanted to point out that this statement is not self-evident.

Other than that, I agree that returning the newest node is an improvement over the status quo.

[1] As a side note, if caching performance is an issue (it might well be, I never tested that on big DBs!), we should consider adding an index on the hash.
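If it ever does become problematic, a hypothetical sketch of adding such an index (an expression index on the hash stored in the JSONB extras column; the table and key names are assumed from the aiida-core schema, and the connection string is made up):

from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:password@localhost/aiida_db')  # hypothetical DSN

with engine.begin() as connection:  # transaction commits on exit
    connection.execute(text(
        'CREATE INDEX IF NOT EXISTS ix_dbnode_hash '
        "ON db_dbnode ((extras ->> '_aiida_hash'))"
    ))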

@ramirezfranciscof (Member) commented:

I'm not so sure about this: right now, we can essentially make a query that can return as soon as the first match is found. When we guarantee using the latest, in the worst case (i.e., if there is no index on time [1]) we'd have to go through all matching nodes and order them.

Mmm, you think so? I mean, my logic was that when there is no equivalent hashed node, you are already going through all of the possible calcjobs and checking if the hash coincides, no? So although I understand why you say this would be slower than finding the first hit, I think it would still be roughly comparable to the most common hash-search query.

@greschd (Member) commented Oct 29, 2021

I think it could make a big difference for a data type with lots of duplicates, like for example Int. If you're looking for any Int(1), it will be found fairly quickly, whereas finding the latest one is only fast if there is a time index. Otherwise, it'll be O(N) in the number of Int nodes in the DB.

This is a slightly exaggerated example, because we actually search for process nodes, not base types.

Actually, thinking about this, does anyone have experience with the caching performance on really large DBs? This was always a concern we've pushed to "when it becomes problematic". Compared to the initial design, what makes this somewhat worse is that -- if I remember correctly -- back then, different CalcJobs had different node types, and we would filter by that.

@sphuber (Contributor) commented Oct 29, 2021

I think it could make a big difference if it's a data type with lots of duplicates, like for example Int

Except that Data nodes are not cachable anyway and so are never considered, so the problem of performance doesn't apply here.

@giovannipizzi (Member) commented:

My 2 cents: I agree these are 2 independent problems, and (after checking the performance) I agree that returning the last one is a good change for this PR (but it would be good to do some benchmarking on a large DB?). If we use the ctime, and as long as it's indexed (to double-check), it should be OK, but performance is always hard to predict (also, if the query for the same hash returns few nodes, the DB will most probably decide to do a sequential scan regardless of the presence or absence of an index).
Two comments:

  • I would still suggest not relying on this unless you're sure the last one is really always the right thing to do, but rather disabling caching for nodes that shouldn't be used (use case: I have 2 identical calculations run without caching, then I clean only the most recent one - this new approach would fail). But combined with the other suggestion by Sebastiaan in his PR, this would also be solved.
  • We're not really checking truly identical nodes, but equivalent nodes, as mentioned by Francisco. In addition, besides the fact that RemoteData are special in this sense (as they link to data that is not part of the hash), even the time itself is not part of the hash (nor is the UUID), so the nodes are not really identical, and making a choice makes sense.

Practically, I would suggest creating a few identical nodes in an otherwise big DB (millions of nodes), then searching both for any equivalent node (like now) and for the most recent one (like suggested), and measuring the speed. If there is no real problem, I'd go ahead with this (but push for the API to 'uncache' nodes and document how to use it, e.g. in the case of @unkcpz).
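A rough sketch of such a timing comparison with the QueryBuilder (the hash value is a placeholder; the rest follows the public QueryBuilder API):

import timeit

from aiida.orm import CalcJobNode, QueryBuilder

TARGET_HASH = 'abc123'  # placeholder: the hash shared by the few identical nodes

def query_any():
    """Current behavior: take whatever node the QueryBuilder happens to return first."""
    builder = QueryBuilder()
    builder.append(CalcJobNode, filters={'extras._aiida_hash': TARGET_HASH})
    return builder.first()

def query_latest():
    """Proposed behavior: order by ctime and take the most recently created node."""
    builder = QueryBuilder()
    builder.append(CalcJobNode, filters={'extras._aiida_hash': TARGET_HASH}, tag='calc')
    builder.order_by({'calc': {'ctime': 'desc'}})
    return builder.first()

for label, query in (('any', query_any), ('latest', query_latest)):
    print(label, timeit.timeit(query, number=10))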

@unkcpz (Member, Author) commented Nov 2, 2021

Thanks @giovannipizzi. I am happy to do the check and the performance test and then go and finish this PR, but I have the following practical questions.

If we use the ctime, and as long as it's indexed (to double-check), it should be OK

What does it mean for ctime to be indexed? Indexed in the PostgreSQL database, correct? You may need to elaborate on that, since I am not familiar with the implementation of that part of aiida-core.

Practically, I would suggest creating a few identical nodes in an otherwise big DB (millions of nodes), then searching both for any equivalent node (like now) and for the most recent one (like suggested), and measuring the speed.

Is there any big DB available somewhere for this kind of performance testing, or do we usually test on the production DB we are working on? If there is no such testing DB, we should have one, I think.

@sphuber (Contributor) commented Nov 2, 2021

What does it mean for ctime to be indexed? Indexed in the PostgreSQL database, correct? You may need to elaborate on that, since I am not familiar with the implementation of that part of aiida-core.

The ctime is a column (storing the datetime of a node's creation) on the DbNode table, which is where nodes are stored. An index can be placed on a column to make queries more efficient. You can see this in the model declaration. For example, for Django, in aiida.backends.djsite.db.models.DbNode you can see that ctime is declared as:

ctime = m.DateTimeField(default=timezone.now, db_index=True, editable=False)

The db_index=True means that Django will instruct Postgres to create an index, so that answers @giovannipizzi's question for Django. Unfortunately, if you look at SqlAlchemy, there is no index on the ctime of a node. It has long been known that the database schemas of the two implementations are not identical, and @chrisjsewell has been working on a PR to make them identical before we drop Django. So this does complicate the situation for this PR a bit, since you would really have to test with SqlAlchemy, which currently doesn't have the required indices.
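A quick way to check what is actually indexed is to ask PostgreSQL directly; a sketch assuming direct access to the profile's database (the connection string is made up, and the db_dbnode table name is assumed from the aiida-core schema):

from sqlalchemy import create_engine, inspect

engine = create_engine('postgresql://user:password@localhost/aiida_db')  # hypothetical DSN
inspector = inspect(engine)

for index in inspector.get_indexes('db_dbnode'):
    print(index['name'], index['column_names'])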

Is there any big DB available somewhere for this kind of performance testing, or do we usually test on the production DB we are working on? If there is no such testing DB, we should have one, I think.

I think the biggest one out there is still the sdb one, but since that is so big, it is very unwieldy to create a clone. Just copying over the repository can take around 4 days. So until we have the new repository, which should make this a lot faster and more manageable, we would probably have to find a smaller database. I don't have any medium-sized databases that I could give you unfortunately, but maybe @qiaojunfeng and @mbercx know of some?

@sphuber (Contributor) commented Nov 2, 2021

As a final clarification: my point was not that we should never order the equivalent nodes by ctime to determine the one used as the cache source, but that the way it was presented here was a bad solution to the problem it was trying to solve. Even if we merge this change, I think it should be made very clear that this should in no shape or form be used as a way for a user to control caching. So I am not sure it should even be mentioned in the caching docs, at least not in the ones that are most important for the user.

@unkcpz (Member, Author) commented Nov 3, 2021

Coming back to the original caching issue that I wanted to work around with this PR, I now agree with @sphuber that this should NOT be the standard way to control caching. Although at the moment I have this PR's code in my production environment, it only serves as a workaround. Once #5207 is merged to control the caching of specific CalcJob instances, in my workflow I would disable caching every time the clean() function is called. For the already cached nodes, I will call clear_hash() on all the equivalent nodes.

So I personally prefer that this be closed.

@mbercx (Member) commented Nov 3, 2021

I don't have any medium-sized databases that I could give you unfortunately, but maybe @qiaojunfeng and @mbercx know of some?

I have quite a few databases of different sizes. What # nodes are you looking for? :)

@unkcpz (Member, Author) commented Nov 8, 2021

Closing this and redirecting to the discussion in #5223.

For now, to work around the issue I mentioned in #5178, I query for the equivalent nodes manually and use the latest one by invalidating caching on all the other, older nodes: first by removing the _hash from the extras, and, once #5207 is merged, by setting is_valid_cache=False on the node instance.
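A rough sketch of that manual workaround (the hash value is a placeholder; clear_hash() and the settable is_valid_cache from #5207 are assumed from the aiida-core API):

from aiida.orm import CalcJobNode, QueryBuilder

target_hash = 'abc123'  # placeholder: the hash of the equivalent calculations

builder = QueryBuilder()
builder.append(CalcJobNode, filters={'extras._aiida_hash': target_hash}, tag='calc')
builder.order_by({'calc': {'ctime': 'desc'}})  # newest node first

nodes = [entry[0] for entry in builder.all()]
for node in nodes[1:]:  # keep only the most recent node as a valid cache source
    node.clear_hash()  # removes the `_aiida_hash` extra
    # once #5207 is merged, one could instead set: node.is_valid_cache = False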

@unkcpz closed this on Nov 8, 2021
@unkcpz (Member, Author) commented Nov 8, 2021

I have quite a few databases of different sizes. What # nodes are you looking for? :)

@mbercx thanks! As Giovanni mentioned, maybe 'millions' of nodes?

@unkcpz deleted the caching-latest branch on November 8, 2021
@mbercx (Member) commented Nov 8, 2021

@mbercx thanks! As Giovanni mentioned, maybe 'millions' of nodes?

I think the 3dd database had around 3-4 million nodes when I synced it to prnmarvelsrv. I'll explain on Slack where you can find its repository and psql dump.
