Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added type guards for MemoryFileSystem and name property to MemoryFile class #1574

Merged
merged 14 commits into from
Apr 19, 2024

Conversation

austinsimpsond41
Copy link
Contributor

Hi everyone,

I've made small changes to the MemoryFileSystem class. The goal of this PR is to fix issues that prevent users from using this class with the SimpleDirectoryReader from llama_index.

The first issue is that SimpleDirectoryReader will create Path objects and pass them to fs.open(...), but the MemoryFileSystem class assumes path will be a string. I added a quick call to stringify_path in _strip_protocol to fix this issue.

The second issue is llamaindex assumes the file will have a name property, so I added this to the MemoryFile class.

Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have time to add a couple of simple tests?
I am not certain why you would use Path with fsspec, since it is really meant for local filesystem files, and universal_pathlib (upath) attempts to do this for fsspec. Still, I'd rather not get in users' way :)

@austinsimpsond41
Copy link
Contributor Author

I'm thinking we can add a test for retrieving a file using Path. Is there anything else I'm missing?

@austinsimpsond41
Copy link
Contributor Author

Hi @martindurant,

I added a test for the Path changes. Please let me know if you want me to do anything else!

@martindurant
Copy link
Member

On requiring a .name, this is not implemented generally. I don't mind adding it here, but I caution against depending on it, especially given the more standard attribute .path.

Co-authored-by: Martin Durant <martindurant@users.noreply.github.com>
@austinsimpsond41
Copy link
Contributor Author

Good to know. I'm not using fsspec directly.

I'm looking at the implementation of SimpleDirectoryReader and PDFReader in the llamaindex package. In the SimpleDirectoryReader constructor, it converts any string based paths to pathlib.Path. It uses the local file system interface provided by default from fsspec, which probably explains why it's converting strings to Path.

As for depending on .name, that's happening in the PDFReader class. My guess is that class is written with the assumption that the file is coming from the LocalFileSystem class.

Maybe we can roll back the .name change, and I can submit a PR to llama_index instead to make the PDFReader depend on .path instead. I don't want to add non-standard behavior to your library if the cause of the issue is a misunderstanding in how the library works.

@martindurant
Copy link
Member

Maybe we can roll back the .name change, and I can submit a PR to llama_index instead to make the PDFReader depend on .path instead. I don't want to add non-standard behavior to your library if the cause of the issue is a misunderstanding in how the library works.

OK, that sounds like a plan - let me know what the response is.

@austinsimpsond41
Copy link
Contributor Author

Will do. In the meantime, I rolled back the .name property.

Co-authored-by: Martin Durant <martindurant@users.noreply.github.com>
@austinsimpsond41
Copy link
Contributor Author

I pulled your changes where we check if the path is an instance of Path. Would it be more reasonable to only accept PurePosixPaths? I assume that's the format memfs uses anyway. We wouldn't have to worry about the OS specific edge cases.

@martindurant
Copy link
Member

Would it be more reasonable to only accept PurePosixPaths?

I don't suppose people use this, and you would be back in the same situation as before. But yes, memfs in some places (e.g., directory components) assumes the path-string is posix-like.

@austinsimpsond41
Copy link
Contributor Author

People are using the library in this way. I'm essentially trying to fix using this class with memfs.

https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/readers/file/base.py

I don't understand what situation I'm back in, can you please explain?

@martindurant
Copy link
Member

I don't understand what situation I'm back in, can you please explain?

I mean, people passing Path rather than what we expect. If you instead make a case handling PurePosixPath, that would still leave Path unhandled.

@austinsimpsond41
Copy link
Contributor Author

I think the issue is that in some implementations in fsspec accept a Path object in .open. In fact, it's supported in the base AbstractFileSystem class. The first thing _strip_protocol does in AbstractFS is stringify the path. Any class that doesn't override the _strip_protocol method will support Path. In other words, sometimes a Path is expected.

As a result, the code I linked to will either work or not work depending on the selected implementation of AbstractFileSystem.

And I agree, half-handling path may not be as good as I initially thought. The idea was to support a Path object that makes sense for memfs

@martindurant
Copy link
Member

I think you should call stringify_path and don't worry about windows Paths looking strange - if calling code doesn't want that, they can either make their own strings or use posix paths.

@austinsimpsond41
Copy link
Contributor Author

The calling code is using PurePosixPaths if the file system isn't LocalFS. I changed memfs in my latest commit to check if it's a pure posix path and stringify it if it is. Does this work for you?

@martindurant
Copy link
Member

Why not pathlib.PurePath, then? That would catch Path and the expected PurePosixPath.

@austinsimpsond41
Copy link
Contributor Author

Sure, I can make that change.

@austinsimpsond41
Copy link
Contributor Author

There's an issue handling PureWindowsPath. make_posix_path depends on the current os, and not the type of the input. This means that if I call LocalFileSystem._strip_protocol with a PurePosixPath on windows, it will convert the posix path to a windows path and then try to convert it back to posix. I'm ending up with a value like "/c:/myfile/foo/bar" instead of "/myfile/foo/bar". I'd consider changing make_posix_path to check the type of the input, but I want to get your opinion.

@martindurant
Copy link
Member

Is there really any problem with paths looking like "/c:/myfile/foo/bar" ?

@austinsimpsond41
Copy link
Contributor Author

I think so. Intuitively, I'd expect PosixPath("/myfile/foo/bar") and WindowsPath("/myfile/foo/bar") to refer to the same file. I suppose we won't have a situation where someone's using both WindowsPath and PosixPath interchangeably.

Maybe we refactor the test, and do one for windows path and one for posix path?

@martindurant
Copy link
Member

I'd expect PosixPath("/myfile/foo/bar") and WindowsPath("/myfile/foo/bar") to refer to the same file

... in the memoryFS? I'm not sure why we need to make such guarantees.

@austinsimpsond41
Copy link
Contributor Author

Ok, I allowed for "/c:/.." in the paths and made tests for both posix and windows.

@martindurant martindurant merged commit b532078 into fsspec:master Apr 19, 2024
11 checks passed
@austinsimpsond41
Copy link
Contributor Author

@martindurant

Before I disappear back into the woodwork, thank you for taking the time to review my changes and offering feedback. Also, thank you for creating and maintaining this library in general.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants