Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[red-knot]: Add a VendoredFileSystem implementation #11863

Merged
merged 43 commits into from
Jun 18, 2024
Merged

Conversation

AlexWaygood
Copy link
Member

Summary

Work towards #11653.

This PR adds a VendoredFileSystem implementation to the ruff_db crate, the idea being that this is the filesystem we would use for resolving vendored typeshed stubs.

Some of this feels a bit awkward, but I'm not sure there's a better way. I ran into two problems while writing this PR:

  1. The ZipArchive struct exposed by the zip crate takes &mut self for most of the methods that query the underlying archive for information about files or directories within the archive. This means that in order to query zipped data from trait method implementations that take &self, we end up having to clone the underlying archive. I think that's okay, as the zip docs say that it should be cheap to clone (at least right now), but it didn't fill me with joy.
  2. The camino crate strips trailing slashes in a lot of its methods, as they're not semantically significant for most paths. However, for zip archives, trailing slashes are very significant: if a stdlib/ directory exists inside the zip, zip.by_name("stdlib") will not resolve to that directory (only zip.by_name("stdlib/") will). I've attempted to abstract over this difference to other kinds of paths, so that the VendoredFileSystem is consistent with the other FileSystem implementations in this respect. But it was slightly tricky to do so.

Test Plan

I added lots of parameterized tests.

@AlexWaygood AlexWaygood added the red-knot Multi-file analysis & type inference label Jun 13, 2024
@AlexWaygood AlexWaygood marked this pull request as draft June 13, 2024 19:27
@AlexWaygood AlexWaygood force-pushed the vendored-filesystem branch from 00edf56 to 0cbc590 Compare June 13, 2024 19:31
@AlexWaygood AlexWaygood marked this pull request as ready for review June 13, 2024 20:03
@AlexWaygood AlexWaygood force-pushed the vendored-filesystem branch from 8bc3384 to 4441c9e Compare June 13, 2024 20:04
Copy link
Contributor

github-actions bot commented Jun 13, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work.

Overall looks good, but we should address the following items before merging:

  • Use regular functions for tests instead of macros
  • Remove FileSystem implementation and use VendoredFilePath and VendoredFilePathBuf (and move the definitions into `vendored.rs)
  • Move the implementation out of the vendored directory.
  • If we can, remove the probing where we append a slash at the end of the path when reading failed. This seems like trial and error.

One open question for me is whether we should create a dedicated ruff_vendored crate that also contains all vendored files (and contains the build.rs). The benefit of that would be that integration tests can use the actual vendored files, which I think would be desired (but use stubed vendored files for unit tests). If that's undesired, I think leaving it in ruff_db is fine.

crates/ruff_db/src/file_system/vendored.rs Outdated Show resolved Hide resolved
Comment on lines 144 to 150
/// A path that has been normalized via the `normalize_vendored_path` function.
///
/// Trailing slashes are normalized away by `camino::Utf8PathBuf`s,
/// but trailing slashes are crucial for distinguishing between
/// files and directories inside zip archives.
#[derive(Debug, Clone, PartialEq, Eq)]
struct NormalizedVendoredPath(String);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could combine this with the VendoredFilePath and VendoredFilePathBuf where we require the use of VendoredFilePathBuf::new (that normalizes the string).

crates/ruff_db/src/file_system/vendored.rs Outdated Show resolved Hide resolved
crates/ruff_db/src/file_system/vendored.rs Outdated Show resolved Hide resolved
crates/ruff_db/src/file_system/vendored.rs Outdated Show resolved Hide resolved
crates/ruff_db/src/file_system/vendored.rs Outdated Show resolved Hide resolved
crates/ruff_db/src/file_system/vendored.rs Outdated Show resolved Hide resolved
crates/ruff_db/src/file_system/vendored.rs Outdated Show resolved Hide resolved
crates/ruff_db/src/file_system/vendored.rs Outdated Show resolved Hide resolved
crates/ruff_db/src/file_system/vendored.rs Outdated Show resolved Hide resolved
@AlexWaygood
Copy link
Member Author

  • Move the implementation out of the vendored directory.

I'm guessing you mean file_system directory? :)

@AlexWaygood AlexWaygood force-pushed the vendored-filesystem branch from 821cbef to bd94c60 Compare June 14, 2024 10:51
@AlexWaygood
Copy link
Member Author

One open question for me is whether we should create a dedicated ruff_vendored crate that also contains all vendored files (and contains the build.rs). The benefit of that would be that integration tests can use the actual vendored files, which I think would be desired (but use stubed vendored files for unit tests). If that's undesired, I think leaving it in ruff_db is fine.

I wasn't sure while writing this the extent to which we'd want to tie the VendoredFileSystem implementation to the exact details of the only repository we realistically expect to be vendoring (typeshed). In this implementation, I tried to keep things relatively abstract, so that it could theoretically work with any vendored zip archive. That means that I expect to have to build another layer of abstraction on top of this for working with typeshed specifically, because we have to take the VERSIONS file into account when deciding whether a file "exists" in typeshed. If a file stdlib/foo.pyi exists in the typeshed zip archive but VERSIONS tells us that it hadn't actually been added to the stdlib yet on the Python version the user has selected to be the target version, then the query to our typeshed abstraction should arguably report that the stdlib/foo.pyi doesn't exist at all. That's the reality at runtime on that Python version, after all: if the module hasn't been added to the stdlib yet, the file for that module physically doesn't exist at runtime on that Python version.

Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This overall looks good but I think we can make it much simpler by removing all code that's only necessary for handling directories. From what I understand, we won't need directory support to implement module resolution. All we need is to lookup a specific file path. Or are there operations that need to know about the existence of a directory?


#[derive(Debug)]
pub struct VendoredFileSystem {
inner: VendoredFileSystemInner,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I would remove the inner file type. It doesn't provide that much functionality but the nesting becomes somewhat hard to understand. It also allows to remove other wrapper types, reducing the overall code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got rid of one of the other wrapper types in f4f97d4, but I think I would prefer to keep using an inner file type for this. I tried changing it to a type alias, and in my opinion it made the VendoredFileSystem unduly coupled to the inner representation that used a lock/RefCell; I also found it less readable.

crates/ruff_db/src/vendored.rs Outdated Show resolved Hide resolved
crates/ruff_db/src/vendored.rs Outdated Show resolved Hide resolved
crates/ruff_db/src/vendored.rs Outdated Show resolved Hide resolved
crates/ruff_db/src/vfs.rs Show resolved Hide resolved
crates/ruff_db/src/vendored.rs Outdated Show resolved Hide resolved
crates/ruff_db/src/vendored.rs Outdated Show resolved Hide resolved
crates/ruff_db/src/vendored.rs Outdated Show resolved Hide resolved
crates/ruff_db/src/vendored.rs Outdated Show resolved Hide resolved
Comment on lines +161 to +163
fn lookup_path(&mut self, path: &NormalizedVendoredPath) -> Result<ZipFile> {
Ok(self.0.by_name(path.as_str())?)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should lookup_path take a &VendoredPath instead and do all the normalization?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a nice idea.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of the "trailing slashes are semantically significant for zip archives" thing combined with the "ZipFile::by_name() requires a mutable reference" thing, it's hard for me to see how we could do this while keeping the borrow checker happy and avoiding doing path normalization twice. It doesn't seem possible to call self.0.by_name() twice in lookup_path() while keeping the borrow checker happy, unfortunately (unless I'm missing something), and because of the trailing-slash thing, we need to call it twice in lookup_path() if we're going to do the normalization inside lookup_path().

@AlexWaygood AlexWaygood force-pushed the vendored-filesystem branch from 96159d9 to 6702893 Compare June 17, 2024 15:12
@AlexWaygood AlexWaygood requested a review from MichaReiser June 17, 2024 16:32
@AlexWaygood AlexWaygood force-pushed the vendored-filesystem branch from 6702893 to b0fe7c0 Compare June 17, 2024 16:35
@AlexWaygood AlexWaygood force-pushed the vendored-filesystem branch from b0fe7c0 to f05bcb2 Compare June 17, 2024 16:58
@MichaReiser MichaReiser force-pushed the salsa-memory-resolver branch from 2348fe6 to 2cd137b Compare June 18, 2024 12:05
@MichaReiser MichaReiser requested a review from dhruvmanila as a code owner June 18, 2024 12:05
@MichaReiser MichaReiser force-pushed the salsa-memory-resolver branch from 2cd137b to a43a8e6 Compare June 18, 2024 12:07
@AlexWaygood AlexWaygood removed the request for review from dhruvmanila June 18, 2024 12:09
Base automatically changed from salsa-memory-resolver to main June 18, 2024 12:12
crates/ruff_db/src/metadata.rs Outdated Show resolved Hide resolved
crates/ruff_db/src/vendored.rs Outdated Show resolved Hide resolved
crates/ruff_db/src/vendored.rs Show resolved Hide resolved
crates/ruff_db/src/vendored.rs Show resolved Hide resolved
Comment on lines 28 to 34
let normalized = normalize_vendored_path(path);
let inner_locked = self.inner.lock();
let mut archive = inner_locked.borrow_mut();
archive.lookup_path(&normalized).is_ok()
|| archive
.lookup_path(&normalized.with_trailing_slash())
.is_ok()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Maybe just self.metadata(path).is_some()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't resist micro-optimising this 😄 (this avoids calling zip_file.crc32() and zip_file.is_dir(), which aren't necessary if you only need to know whether the path exists or not)

AlexWaygood and others added 2 commits June 18, 2024 15:02
Co-authored-by: Micha Reiser <micha@reiser.io>
@AlexWaygood AlexWaygood force-pushed the vendored-filesystem branch from 38990d5 to 5d511bb Compare June 18, 2024 14:18
@AlexWaygood AlexWaygood enabled auto-merge (squash) June 18, 2024 14:18
@AlexWaygood AlexWaygood disabled auto-merge June 18, 2024 15:39
@AlexWaygood AlexWaygood enabled auto-merge (squash) June 18, 2024 15:43
@AlexWaygood AlexWaygood merged commit 1d73d60 into main Jun 18, 2024
16 checks passed
@AlexWaygood AlexWaygood deleted the vendored-filesystem branch June 18, 2024 15:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
red-knot Multi-file analysis & type inference
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants