Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem with the cache in GDriveFileSystem.find #229

Closed
simone-viozzi opened this issue Sep 2, 2022 · 7 comments · Fixed by #328
Closed

Problem with the cache in GDriveFileSystem.find #229

simone-viozzi opened this issue Sep 2, 2022 · 7 comments · Fixed by #328
Labels
bug Something isn't working fs fsspec implementation priority-p1

Comments

@simone-viozzi
Copy link
Contributor

simone-viozzi commented Sep 2, 2022

While working on #222, I discovered that find has a bug with the cache.

Let assume self.path=root/tmp/ and a folder structure like:

root/tmp/
        └── fo1/
           ├── file2.pdf
           └── fo2/
              ├── file3.pdf
              └── fo3/
                 └── file4.pdf

now let's do some tests:

f = fs.find('root/tmp/fo1/')
print(f)
> ['root/tmp/fo1/file2.pdf', 'root/tmp/fo1/fo2/file3.pdf', 'root/tmp/fo1/fo2/fo3/file4.pdf']

f = fs.find('root/tmp/fo1/fo2')
print(f)
> ['root/tmp/fo1/fo2/fo3/file4.pdf', 'root/tmp/fo1/fo2/file3.pdf']

and that is correct,
but if we do only find('root/tmp/fo1/fo2'):

f = fs.find('root/tmp/fo1/fo2')
print(f)
>[]

This happens because find relay on the cache, and at the start the cache is only populated with ids from one level down self.path

so in the last example, the content of the cache is:

{
'1IETDYYj23PgGaInZofa9MyANyBlOoiyh': 'tmp', 
'1k6u2-FStB6rOlq6hmDXlRl2aLES1l6vp': 'tmp/fo1', 
}

I think, because there is no tmp/fo1/fo2 (the starting path of find), query_ids stays empty and the method return an empty list.

The lines of code involved are:

def find(self, path, detail=False, **kwargs):
bucket, base = self.split_path(path)
seen_paths = set()
dir_ids = [self._ids_cache["ids"].copy()]
contents = []
while dir_ids:
query_ids = {
dir_id: dir_name
for dir_id, dir_name in dir_ids.pop().items()
if posixpath.commonpath([base, dir_name]) == base
if dir_id not in seen_paths
}
if not query_ids:
continue

@shcheklein shcheklein added bug Something isn't working fs fsspec implementation labels Sep 6, 2022
@shcheklein
Copy link
Member

@simone-viozzi yes, thanks for creating the ticket. find is deeply broken, unfortunately. It was originally implemented to serve DVC needs only and we never had time to get back and fix it properly :(

@simone-viozzi
Copy link
Contributor Author

simone-viozzi commented Sep 9, 2022

Any idea of a roadmap to fix it? With this broken #222 is pretty much useless, it will work unexpectedly 90% of the time.

I could try, but I still don't understand how the cache works right now, and how it should work.
If I had to implement the cache mechanism from scratch, I would most likely use something like this

@shcheklein
Copy link
Member

With this broken #222 is pretty much useless, it will work unexpectedly 90% of the time.

Hmm, I think cp doesn't depend on find. Cache itself is not broken, it's find implementation is bad (not general). I think tbh, that we could move forward and merge w/o cache. Especially the cp one.

@simone-viozzi
Copy link
Contributor Author

@shcheklein
Copy link
Member

okay, problems that I see:

  • maxdepth is not supported
  • withdirs is not supported (must have for expand path)
  • doesn't check bucket name find('.') and find('etasderwer') return the same result
  • it keeps appending ids to cache every time even for already known
  • cache is not locked (multithreading support) - this is a bit more advanced and less critical to start with

@simone-viozzi
Copy link
Contributor Author

@shcheklein
Copy link
Member

@simone-viozzi yes, I know. The default implementation was not good enough for the DVC that is the major driver / consumer of this fsspec for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working fs fsspec implementation priority-p1
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants