Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Relative path is incorrect when using Dataset.add_external_files for local file without the file:// prefix #1313

Open
geometrikal opened this issue Aug 14, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@geometrikal
Copy link

Describe the bug

Dataset.add_external_file will allow adding external files from a source url without the file:// prefix, e.g. /path/to/my/file.csv. It will create the link correctly in the metadat.json by prepending file:// however it will remove characters from the file name when creating the relative link.

To reproduce

dataset = Dataset.create(
        dataset_project="tes",
        dataset_name="test"
)
dataset.add_external_files("/mnt/f/0123456789.txt")
dataset.finalize(auto_upload=True)
print(dataset.list_files())
# ['789.txt']

Expected behaviour

Either throw an exception or correctly create the relative path

Environment

  • Server type self hosted
  • ClearML SDK Version 1.16.3
  • ClearML Server Version 1.16.1-499 • API: 2.30
  • Python Version 3.10
  • OS Linux
@geometrikal geometrikal added the bug Something isn't working label Aug 14, 2024
@eugen-ajechiloae-clearml
Copy link
Collaborator

Hi @geometrikal ! Thanks for reporting this. We will get back to you once this is fixed.

@d-vignesh
Copy link
Contributor

Hi @eugen-ajechiloae-clearml i would like to work on this. Can i pick it up. I am new to the repo, can you please help with some guidance on what has to be done here.

@eugen-ajechiloae-clearml
Copy link
Collaborator

eugen-ajechiloae-clearml commented Sep 5, 2024

Hi @d-vignesh ! Yes, all contributions are welcome, and no-one picked this one up as far as I'm aware.
It looks like when adding a file using add_external_files, the file name will be stripped of the first len("file://") characters. I I believe that we don't really support paths that are not prefixed by file:// at all, so, in this function:

def _add_external_files(

I would add a check: if the file is a local path without file:// then get the absolute path and add the file:// prefix.
This should be tested using absolute paths, relative paths, on both Windows and Linux/MacOS.

@d-vignesh
Copy link
Contributor

Hi @eugen-ajechiloae-clearml can be please clarify on below doubt,
when a source_url without file:// prefix is provided,

  1. the code defaults to file:// as base url and hence the metadata is having the expected name value
  2. during constructing the relative path
link = remote_object.get("name")
relative_path = link[len(source_url):]
if not relative_path:
     relative_path = source_url.split("/")[-1]

for example, if the source_url is /Users/docs/file1.txt, the link value will be file://Users/docs/file1.txt. Hence the relative path will be assigned as le1.txt in line 2. Instead if we use relative_path = source_url.split("/")[-1] always we should be getting the correct relative path. What scenario is the line relative_path = link[len(source_url):] handling in this case?
image

@eugen-ajechiloae-clearml
Copy link
Collaborator

Hi @d-vignesh ! There are 2 cases when adding external files: you can either add a "directory" or a direct path. In case the source URL is the direct path, we fallback to relative_path = source_url.split("/")[-1]

@d-vignesh
Copy link
Contributor

thank you @eugen-ajechiloae-clearml , got it. Please verify my understanding, that user can give both absolute path /User/project/file1.txt or relative path ./file1.txt if in project directory. In relative path case we need to get the absolute path as /User/project/file1.txt . Then finally append the file:// prefix. This modified path will then be used as the source_url.

@d-vignesh
Copy link
Contributor

Assuming the above understanding is right, the below logic is working for all scenarios

source_url = '../data/file2.txt'
parsed = urlparse(source_url)
if parsed.scheme == "":
    abs_path = os.path.abspath(source_url)
    source_url = f"file://{abs_path}"
print(source_url)

Adding this check at the beginning of the add_external_file function seems to resolve the issue. Am i missing any case?
image

@d-vignesh
Copy link
Contributor

Hi @eugen-ajechiloae-clearml , created a PR for the fix #1326. Can you please have a look at it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants