
filesystem testing functions init #146

Closed
wants to merge 3 commits into from

Conversation

@sl5035 commented Feb 28, 2023

Sorry for the delay, but I was having a hard time reconfiguring my git workflow since my merges had become entangled. I hope this branch gets the workflow back on track.

@sbillinge (Collaborator)

Yup, looks good. Make sure you have the no-commit-to-main pre-commit hook set in the repo, just in case.

OK, so now we would like to start to write tests for the methods, but also think about what we want the api to be able to do beyond what it was doing before (which was just getting data from yml and/or json files). We probably want to keep that functionality, but add cif reading, for example. Also, what will it hand back? I guess powder cif objects? That is what we would want tests for, I guess.
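
For context, the existing yml/json loading referred to above is roughly this shape; the function names here are illustrative sketches, not the actual client API, and load_cif is just a placeholder for the open question of what cif reading should hand back:

    import json

    import yaml  # pyyaml


    def load_yml(path):
        """Load one collection from a .yml file into a dict."""
        with open(path, encoding="utf-8") as f:
            return yaml.safe_load(f)


    def load_json(path):
        """Load one collection from a .json file into a dict."""
        with open(path, encoding="utf-8") as f:
            return json.load(f)


    def load_cif(path):
        """Placeholder for the new cif reading; what it returns (a dict, a
        powder cif object, ...) is exactly the question raised above."""
        raise NotImplementedError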

@sl5035 (Author) commented Mar 1, 2023

Hello Professor,
I have a quick question about the database object. In regolith we used it to store data as a defaultdict(lambda: defaultdict(dict)), but since we're working only with cif files now, I figured this nested structure may not be necessary. Should we still stick to this datatype, or is it something we can change and test?

Also, what is chained_db used for? Is it used to chain multiple databases? I was going through the all_documents() function but still couldn't figure it out.
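
For reference, the nested regolith-style structure mentioned above maps database name to collection name to documents; a minimal illustration with made-up keys:

    from collections import defaultdict

    # dbs[db_name][collection_name] is a plain dict of documents keyed by _id
    dbs = defaultdict(lambda: defaultdict(dict))
    dbs["local"]["cifs"]["Ni_fcc"] = {"_id": "Ni_fcc", "cif_file": "Ni_fcc.cif"}

    print(dbs["local"]["cifs"]["Ni_fcc"]["cif_file"])  # Ni_fcc.cif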

@sbillinge (Collaborator) left a review comment


Please see the inline comments. When these are fixed I will merge this. We will then work on each test individually on separate branches.

pass


def test_close():
@sbillinge (Collaborator)

How about this? I think it more explicitly tests that close is working, because it ensures that something is open before it closes it. I think it is probably better to pass in rc explicitly too, for greater readability (I should have done that with open, I guess):

    def test_close():
        fsc = FileSystemClient(rc)
        assert fsc.open
        assert fsc.dbs == rc.databases
        fsc.close()
        assert fsc.dbs is None
        assert fsc.closed

@sl5035 (Author)

I couldn't reference rc since I had to make a new class. For now, I just checked whether fsc.dbs is an instance of nested dicts.


@pytest.mark.skip("Not written")
def test_update_one():
    pass
@sbillinge (Collaborator)

Please make sure all your files end with a trailing newline. Set up PyCharm (or whatever you use) to do it automatically.

def test_close():
    fsc = FileSystemClient(rc)
    fsc.close()

@sbillinge (Collaborator)

I think close this up? I'm not sure what pep8 has to say about this, but it keeps the tests grouped visually a bit better. Do what pep8 says, but close it up if it has nothing to say.

@sbillinge (Collaborator)

Great questions. So now it starts

Let's have the conversation this way: what do we want the methods to take as inputs, and what should they return as outputs? I suggest we do the following. Now that this has a skeleton test rig and is passing tests, I will merge it. Then let's create a branch for each function we are working on.

Let's say we start with find_one, since this is maybe the only one we need working initially. In general I would like to give it a db, a collection, and a filter, and have it return the (first) item from that collection in that db that matches that filter.
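
A minimal sketch of that general shape, assuming the db is a plain dict of collections (not the final implementation):

    def find_one(db, collection, filter):
        """Return the first document in db[collection] whose fields match all
        key/value pairs in filter, or None if nothing matches."""
        for doc in db[collection].values():
            if all(doc.get(k) == v for k, v in filter.items()):
                return doc
        return None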

Under the carpet we have lots of questions. Our basic data are cif files so how do we filter them? Do we want to store everything in json? What happens if someone adds a new cif file? We could have a flow like this:

  1. Check the number of entries in our cif collection against the number of cif files in the directory. If they are the same, assume there are no new cifs and the collection is valid.
  2. Apply the filter to the collection to find the entry.
  3. Get the entry and return it as a dict.

If 1) fails, assume there is a new cif file. Parse the cif file into json and add it to the collection.

Of course, there is quite a bit going on there, but this flow may be a good one. We could also check whether cifs have been updated: if we compute a SHA hash from each cif file and store it on disc, we could compare the SHA stored in the collection for each file against the SHA of the file itself. That would make the find very slow, so we may want a different way of doing it: either only do it once in a while, or check the last write date of the file and compare it to one stored in the db if that is quicker, etc. Things to think about.

Some of these things are gravy; we would like them in a more robust implementation. So to speed up development, make the skeletons of the tests for them with enough notes that we know what we want them to do, but don't implement them. For example, the code can be working if we just assume the db is fine and skip all of these checks. So make the test skeletons to capture this full flow, but only implement the fetch.
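
A rough sketch of that flow, with the staleness checks left as comments; parse_cif_to_dict stands in for the cif-to-json serialization and is not existing code:

    from pathlib import Path


    def parse_cif_to_dict(path):
        """Placeholder for the cif -> json/dict serialization."""
        return {"_id": path.stem, "cif_file": path.name}


    def find_one_cif(collection, cif_dir, filter):
        cif_files = sorted(Path(cif_dir).glob("*.cif"))

        # 1. cheap validity check: same number of entries as cif files on disk;
        #    if not, assume new cif files appeared and parse them in
        if len(cif_files) != len(collection):
            for cif in cif_files:
                if cif.stem not in collection:
                    collection[cif.stem] = parse_cif_to_dict(cif)

        # (gravy, not implemented) per-file staleness check, e.g. compare a SHA
        # stored in the collection against hashlib.sha256(cif.read_bytes()),
        # or compare stored vs. actual mtimes, and only do it once in a while

        # 2. and 3. apply the filter and return the first matching entry as a dict
        for entry in collection.values():
            if all(entry.get(k) == v for k, v in filter.items()):
                return entry
        return None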

Does this make sense?

@sbillinge (Collaborator) commented Mar 1, 2023 via email

@sbillinge (Collaborator)

OK, I thought some more. Here is a concrete plan. Maybe move this to an issue so we don't lose it when this PR closes.

  1. address the comments and I will merge this
  2. work on find_one: have it take a db, a collection, and a filter, and have it return a document/dict. This is the most general case and the general signature of find_one
  3. in the fs client, have it do this on a json-serialized version of the cifs. Zach wrote code to do this serialization; to reduce effort, just use his serialization, we don't have to reinvent it. For the test, make a tmpfile, write two cif jsons in there, and test that it gets the right one (see the sketch below). For a better test, filter on something that we might filter on in the real case (something like the cif file name)
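
A sketch of what that test might look like with pytest's tmp_path, reusing the find_one shape sketched earlier in the thread; the document contents and the filter key are illustrative:

    import json


    def test_find_one(tmp_path):
        # write two cif-derived json documents into a temporary "database"
        docs = {
            "Ni_fcc": {"_id": "Ni_fcc", "cif_file": "Ni_fcc.cif"},
            "Si_dia": {"_id": "Si_dia", "cif_file": "Si_dia.cif"},
        }
        for name, doc in docs.items():
            (tmp_path / f"{name}.json").write_text(json.dumps(doc))

        # load them back into a collection and filter on the cif file name
        collection = {
            p.stem: json.loads(p.read_text()) for p in tmp_path.glob("*.json")
        }
        found = find_one({"cifs": collection}, "cifs", {"cif_file": "Si_dia.cif"})
        assert found["_id"] == "Si_dia"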

We need to bite the bullet and build the runcontrol infrastructure too. To keep things simple, make runcontrol a dict and pass in what we need, using the schema in regolith (databases and so on). Later we may want to import runcontrol from regolith and reuse all the nice things, but that would add a lot of bloat to the dependencies (gooey, etc., that we won't be using).
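
For example, a dict-style runcontrol could be as small as this; the field names loosely follow the regolith databases schema and should be treated as illustrative rather than definitive:

    # minimal dict-style runcontrol carrying only what the client needs
    rc = {
        "databases": [
            {
                "name": "cifs",
                "url": ".",      # local filesystem "database"
                "path": "db",    # directory holding the cif/json documents
                "public": False,
                "local": True,
            }
        ]
    }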

Then we can merge that branch. After that we want to refactor the front-end to use the client, so replace the load_cifs() (or whatever it is called) with a loop that calls find_one, resetting the filter in each iteration of the loop. That can be on a separate branch. It won't need new tests, but all the old tests should still pass when it is working.
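
Roughly what that front-end loop might look like, reusing the find_one and dbs names from the sketches above; load_cifs and the cif names are placeholders, not actual code:

    # instead of a bulk load_cifs(), call find_one once per cif of interest,
    # resetting the filter each time through the loop
    structures = {}
    for cif_name in ["Ni_fcc.cif", "Si_dia.cif"]:
        filter = {"cif_file": cif_name}
        structures[cif_name] = find_one(dbs["local"], "cifs", filter)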

and so on....

@sl5035 (Author) commented Mar 1, 2023

Yep, makes sense! I'll ask more questions as they come up while working on the new branch!

@sl5035 sl5035 closed this by deleting the head repository Mar 6, 2023