Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gs: use 3-char prefixes #6759

Merged
merged 1 commit into from
Oct 11, 2021
Merged

gs: use 3-char prefixes #6759

merged 1 commit into from
Oct 11, 2021

Conversation

isidentical
Copy link
Contributor

@isidentical isidentical commented Oct 7, 2021

Resolves #6691. Normally it should have inherited this from the ObjectFSWrapper, but this seems like a leftover.

Resolves #6691. Normally it should have inherited this from the `ObjectFSWrapper`, but this seems like a leftover.
@isidentical isidentical requested a review from a team as a code owner October 7, 2021 15:40
@isidentical isidentical requested review from pared, efiop and pmrowla and removed request for pared October 7, 2021 15:40
@shcheklein
Copy link
Member

Quick question/suggestion - is there a way/should we to add a test for this?

@isidentical
Copy link
Contributor Author

Quick question/suggestion - is there a way/should we to add a test for this?

We already have 2 tests for this, but they cover the other way around.

@pytest.mark.parametrize("remote", full_clouds, indirect=True)
def test_pull_00_prefix(tmp_dir, dvc, remote, monkeypatch):
# Related: https://github.com/iterative/dvc/issues/6089
fs_type = type(dvc.cloud.get_remote_odb("upstream").fs)
monkeypatch.setattr(fs_type, "_ALWAYS_TRAVERSE", True, raising=False)
monkeypatch.setattr(fs_type, "LIST_OBJECT_PAGE_SIZE", 256, raising=False)
# foo's md5 checksum is 00411460f7c92d2124a67ea0f4cb5f85
# bar's md5 checksum is 0000000018e6137ac2caab16074784a6
tmp_dir.dvc_gen({"foo": "363", "bar": "jk8ssl"})
dvc.push()
clean(["foo", "bar"], dvc)
stats = dvc.pull()
assert stats["fetched"] == 2
assert set(stats["added"]) == {"foo", "bar"}
@pytest.mark.parametrize("remote", full_clouds, indirect=True)
def test_pull_no_00_prefix(tmp_dir, dvc, remote, monkeypatch):
# Related: https://github.com/iterative/dvc/issues/6244
fs_type = type(dvc.cloud.get_remote_odb("upstream").fs)
monkeypatch.setattr(fs_type, "_ALWAYS_TRAVERSE", True, raising=False)
monkeypatch.setattr(fs_type, "LIST_OBJECT_PAGE_SIZE", 256, raising=False)
# foo's md5 checksum is 14ffd92a6cbf5f2f657067df0d5881a6
# bar's md5 checksum is 64020400f00960c0ef04052547b134b3
tmp_dir.dvc_gen({"foo": "dvc", "bar": "cml"})
dvc.push()
clean(["foo", "bar"], dvc)
stats = dvc.pull()
assert stats["fetched"] == 2
assert set(stats["added"]) == {"foo", "bar"}

@efiop
Copy link
Contributor

efiop commented Oct 7, 2021

Very interesting that this broke because of 2, as I would expect that to be less efficient but still work 🤔 @pmrowla do we have some logic flaw or am I missing something?

On a side note, once we migrate to fs.find I guess we could default to searching for 00/0* by default (non-obj-clouds will just search 00/ anyway) to simplify the logic.

@pmrowla
Copy link
Contributor

pmrowla commented Oct 11, 2021

Very interesting that this broke because of 2, as I would expect that to be less efficient but still work 🤔 @pmrowla do we have some logic flaw or am I missing something?

It seems like there is probably still some bug in the gcsfs prefix implementation somewhere (that doesn't like prefix len 2) rather than this being on DVC's end.

@pmrowla pmrowla merged commit 8767f42 into master Oct 11, 2021
@pmrowla pmrowla deleted the gs-unset-traverse-prefix branch October 11, 2021 07:16
@efiop
Copy link
Contributor

efiop commented Oct 11, 2021

@pmrowla Was hoping to include a test here, strange that this hasn't been caught by our suit. Or do you think it should be tested on gcsfs side?

@pmrowla
Copy link
Contributor

pmrowla commented Oct 11, 2021

I think it needs to be tested on the gcsfs side. We use a mix of both 2 and 3 as prefix lengths on all of our other filesystems without seeing this issue.

@efiop efiop added the bugfix fixes bug label Oct 11, 2021
@nikitaorlovpicsart
Copy link

after update start seening this issue on random files

@efiop
Copy link
Contributor

efiop commented Oct 19, 2021

@nikitaorlovpicsart , could you elaborate?

@nikitaorlovpicsart
Copy link

  1. we use gs as remote
  2. we use most recent release
  3. issue mentioned in issue is still same for us, if we pull file directly, it works, but when we call pull in root folder, we get fail.

@nikitaorlovpicsart
Copy link

Every time all files start with 00

@efiop
Copy link
Contributor

efiop commented Oct 22, 2021

@nikitaorlovpicsart Could you show full verbose (-v) error you are getting? Also dvc doctor output, please.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix fixes bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dvc pull returns "failed to pull data" when the data exists on remote
5 participants