
filesystem testing functions init #146

Closed
wants to merge 3 commits into from

Conversation

@sl5035 commented Feb 28, 2023

Sorry for the delay, but I was having a hard time reconfiguring my git workflow since my merges had become entangled. I hope this branch gets the workflow back on track.

@sbillinge (Collaborator)

Yup, looks good. Make sure you have the no-commit-to-main pre-commit hook set in the repo, just in case.

OK, so now we would like to start to write tests for the methods, but also think about what we want the api to be able to do beyond what it was doing before (which was just getting data from yml and/or json files). We probably want to keep that functionality, but add cif reading, for example. Also, what will it hand back? I guess powder cif objects? That is what we would want tests for, I guess.
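
For context, the existing yml/json loading referred to above is roughly this shape; the function names here are illustrative sketches, not the actual client API, and load_cif is just a placeholder for the open question of what cif reading should hand back:

    import json

    import yaml  # pyyaml


    def load_yml(path):
        """Load one collection from a .yml file into a dict."""
        with open(path, encoding="utf-8") as f:
            return yaml.safe_load(f)


    def load_json(path):
        """Load one collection from a .json file into a dict."""
        with open(path, encoding="utf-8") as f:
            return json.load(f)


    def load_cif(path):
        """Placeholder for the new cif reading; what it returns (a dict, a
        powder cif object, ...) is exactly the question raised above."""
        raise NotImplementedError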

@sl5035 (Author) commented Mar 1, 2023

Hello Professor,
I have a quick question about the database object. In regolith we used it to store data as a defaultdict(lambda: defaultdict(dict)), but since we're working only with cif files now, I figured this nested structure may not be necessary. Should we still stick to this datatype, or is it something we can change and test?

Also, what is chained_db used for? Is it used to chain multiple databases? I was going through the all_documents() function but still couldn't figure it out.
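
For reference, the nested regolith-style structure mentioned above maps database name to collection name to documents; a minimal illustration with made-up keys:

    from collections import defaultdict

    # dbs[db_name][collection_name] is a plain dict of documents keyed by _id
    dbs = defaultdict(lambda: defaultdict(dict))
    dbs["local"]["cifs"]["Ni_fcc"] = {"_id": "Ni_fcc", "cif_file": "Ni_fcc.cif"}

    print(dbs["local"]["cifs"]["Ni_fcc"]["cif_file"])  # Ni_fcc.cif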

@sbillinge (Collaborator) left a review comment


Please see the inline comments. When these are fixed I will merge this. We will then work on each test individually on separate branches.

pass


def test_close():
@sbillinge (Collaborator)

How about this? I think it more explicitly tests that close is working, because it ensures that something is open before it closes it. I think it is probably better to pass in rc explicitly too, for greater readability (I should have done that with open, I guess):

    def test_close():
        fsc = FileSystemClient(rc)
        assert fsc.open
        assert fsc.dbs == rc.databases
        fsc.close()
        assert fsc.dbs is None
        assert fsc.closed

@sl5035 (Author)

I couldn't reference rc since I had to make a new class. For now, I just checked whether fsc.dbs is an instance of nested dicts.


@pytest.mark.skip("Not written")
def test_update_one():
    pass
@sbillinge (Collaborator)

Please make sure all your files end with a trailing newline. Set up PyCharm (or whatever you use) to do it automatically.

def test_close():
    fsc = FileSystemClient(rc)
    fsc.close()

@sbillinge (Collaborator)

I think close this up? I'm not sure what pep8 has to say about this, but it keeps the tests grouped visually a bit better. Do what pep8 says, but close it up if it has nothing to say.

@sbillinge (Collaborator)

Great questions. So now it starts

Let's have the conversation this way: what do we want the methods to take as inputs, and what should they return as outputs? I suggest we do the following. Now that this has a skeleton test rig and is passing tests, I will merge it. Then let's create a branch for each function we are working on.

Let's say we start with find_one, since this is maybe the only one we need working initially. In general I would like to give it a db, a collection, and a filter, and have it return the (first) item from that collection in that db that matches that filter.
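
A minimal sketch of that general shape, assuming the db is a plain dict of collections (not the final implementation):

    def find_one(db, collection, filter):
        """Return the first document in db[collection] whose fields match all
        key/value pairs in filter, or None if nothing matches."""
        for doc in db[collection].values():
            if all(doc.get(k) == v for k, v in filter.items()):
                return doc
        return None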

Under the carpet we have lots of questions. Our basic data are cif files so how do we filter them? Do we want to store everything in json? What happens if someone adds a new cif file? We could have a flow like this:

  1. Check the number of entries in our cif collection against the number of cif files in the directory. If they are the same, assume there are no new cifs and the collection is valid.
  2. Apply the filter to the collection to find the entry.
  3. Get the entry and return it as a dict.

If 1) fails, assume there is a new cif file. Parse the cif file into json and add it to the collection.

Of course, there is quite a bit going on there, but this flow may be a good one. We could also check whether cifs have been updated: if we compute a SHA hash from each cif file and store it on disc, we could compare the SHA stored in the collection for each file against the SHA of the file itself. That would make the find very slow, so we may want a different way of doing it: either only do it once in a while, or check the last write date of the file and compare it to one stored in the db if that is quicker, etc. Things to think about.

Some of these things are gravy; we would like them in a more robust implementation. So to speed up development, make the skeletons of the tests for them with enough notes that we know what we want them to do, but don't implement them. For example, the code can be working if we just assume the db is fine and skip all of these checks. So make the test skeletons to capture this full flow, but only implement the fetch.
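
A rough sketch of that flow, with the staleness checks left as comments; parse_cif_to_dict stands in for the cif-to-json serialization and is not existing code:

    from pathlib import Path


    def parse_cif_to_dict(path):
        """Placeholder for the cif -> json/dict serialization."""
        return {"_id": path.stem, "cif_file": path.name}


    def find_one_cif(collection, cif_dir, filter):
        cif_files = sorted(Path(cif_dir).glob("*.cif"))

        # 1. cheap validity check: same number of entries as cif files on disk;
        #    if not, assume new cif files appeared and parse them in
        if len(cif_files) != len(collection):
            for cif in cif_files:
                if cif.stem not in collection:
                    collection[cif.stem] = parse_cif_to_dict(cif)

        # (gravy, not implemented) per-file staleness check, e.g. compare a SHA
        # stored in the collection against hashlib.sha256(cif.read_bytes()),
        # or compare stored vs. actual mtimes, and only do it once in a while

        # 2. and 3. apply the filter and return the first matching entry as a dict
        for entry in collection.values():
            if all(entry.get(k) == v for k, v in filter.items()):
                return entry
        return None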

Does this make sense?

@sbillinge (Collaborator) commented Mar 1, 2023 via email

@sbillinge (Collaborator)

OK, I thought some more. Here is a concrete plan. Maybe move this to an issue so we don't lose it when this PR closes.

  1. address the comments and I will merge this
  2. work on find_one: have it take a db, a collection, and a filter, and have it return a document/dict. This is the most general case and the general signature of find_one
  3. in the fs client, have it do this on a json-serialized version of the cifs. Zach wrote code to do this serialization; to reduce effort, just use his serialization, we don't have to reinvent it. For the test, make a tmpfile, write two cif jsons in there, and test that it gets the right one (see the sketch below). For a better test, filter on something that we might filter on in the real case (something like the cif file name)
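
A sketch of what that test might look like with pytest's tmp_path, reusing the find_one shape sketched earlier in the thread; the document contents and the filter key are illustrative:

    import json


    def test_find_one(tmp_path):
        # write two cif-derived json documents into a temporary "database"
        docs = {
            "Ni_fcc": {"_id": "Ni_fcc", "cif_file": "Ni_fcc.cif"},
            "Si_dia": {"_id": "Si_dia", "cif_file": "Si_dia.cif"},
        }
        for name, doc in docs.items():
            (tmp_path / f"{name}.json").write_text(json.dumps(doc))

        # load them back into a collection and filter on the cif file name
        collection = {
            p.stem: json.loads(p.read_text()) for p in tmp_path.glob("*.json")
        }
        found = find_one({"cifs": collection}, "cifs", {"cif_file": "Si_dia.cif"})
        assert found["_id"] == "Si_dia"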

We need to bite the bullet and build the runcontrol infrastructure too. To keep things simple, make runcontrol a dict and pass in what we need, using the schema in regolith (databases and so on). Later we may want to import runcontrol from regolith and reuse all the nice things, but that would add a lot of bloat to the dependencies (gooey, etc., that we won't be using).
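
For example, a dict-style runcontrol could be as small as this; the field names loosely follow the regolith databases schema and should be treated as illustrative rather than definitive:

    # minimal dict-style runcontrol carrying only what the client needs
    rc = {
        "databases": [
            {
                "name": "cifs",
                "url": ".",      # local filesystem "database"
                "path": "db",    # directory holding the cif/json documents
                "public": False,
                "local": True,
            }
        ]
    }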

Then we can merge that branch. After that we want to refactor the front-end to use the client, so replace the load_cifs() (or whatever it is called) with a loop that calls find_one, resetting the filter in each iteration of the loop. That can be on a separate branch. It won't need new tests, but all the old tests should still pass when it is working.
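
Roughly what that front-end loop might look like, reusing the find_one and dbs names from the sketches above; load_cifs and the cif names are placeholders, not actual code:

    # instead of a bulk load_cifs(), call find_one once per cif of interest,
    # resetting the filter each time through the loop
    structures = {}
    for cif_name in ["Ni_fcc.cif", "Si_dia.cif"]:
        filter = {"cif_file": cif_name}
        structures[cif_name] = find_one(dbs["local"], "cifs", filter)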

and so on....

@sl5035 (Author) commented Mar 1, 2023

Yep, makes sense! I'll ask more questions as they come up while working on the new branch!

@sl5035 sl5035 closed this by deleting the head repository Mar 6, 2023