Allow for filepaths to include #161

rhatdan · 2024-10-07T19:21:58Z

When playing with omlmd I tried to push a file whose name contained a ":" and oras choked on it. I believe you should check whether the file exists and only split off the last ":" when looking for options, not the first colon.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>

halfline · 2024-10-07T19:56:12Z

oras/utils/fileio.py

@@ -346,7 +346,8 @@ def split_path_and_content(ref: str) -> PathAndOptionalContent:
    : return: A Tuple of the path in the reference, and the content-type if one found,
              otherwise None.
    """
-    if ":" not in ref:
+
+    if os.path.exists(ref) or ":" not in ref:


Another case to consider handling here: refs that start with a colon. I don't think they should be treated as a naked content type, right? Instead, I guess the entire ref should be considered a path?

@halfline you mean like :<content-type>? Wouldn't that be akin to an empty / missing file and a content type (is that allowed / is there a use case for it)? I'm wondering if we should specifically not allow that (e.g., if the path doesn't exist but the string starts with :, maybe we need to raise an error).

so in my mind, if someone e.g. does

╎❯ omlmd push http://quay.io/foo/foo:latest '.file-that-doesnt-exist'

and

╎❯ omlmd push http://quay.io/foo/foo:latest ':file-that-doesnt-exist'

in both cases the error message should be the same and the overall behavior should be the same.

But that's just a driveby subjective opinion. I haven't looked deeply into how split_path_and_content is used or could be used.

If the path doesn't exist and the colon is being used weirdly like that to not separate a filepath from a content type, I think an error should be raised.

I agree: if the file does not exist, with or without the colon removed, then raise the ENOENT error.
If the file exists with the colon, then use it. If the file exists with the colon and a legitimate option, then just use the file and let the user deal with it, i.e. change the name of the file.
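
The rule described in this comment can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code merged in this PR: `split_ref_preferring_existing` is a hypothetical helper, and splitting on the last colon follows the PR description rather than any confirmed implementation.

```python
import os
from typing import Optional, Tuple


def split_ref_preferring_existing(ref: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: return (path, content type or None)."""
    # A ref that names an existing file is used as-is, colons included.
    if os.path.exists(ref) or ":" not in ref:
        return ref, None
    # Otherwise treat only the last colon as the path/option separator.
    path, content_type = ref.rsplit(":", 1)
    if not os.path.exists(path):
        # Neither the full ref nor the stripped path exists: ENOENT.
        raise FileNotFoundError(ref)
    return path, content_type or None
```

With a file named `some:file` on disk (on a filesystem that permits colons in names), both `some:file` and `some:file:text/plain` resolve to that file; a ref whose stripped path is also missing raises `FileNotFoundError`.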

@rhatdan the discussion below is getting out of scope. If you'd like to add any additional tweaks based on the discussion here, please do. You'll also need to bump the version in oras/version.py and add a line to the changelog. When those are done I'll review once more and we can merge, and follow up discussion about reference formats, etc. should go in an issue.

tarilabs

LGTM

Proposed test

@pytest.mark.with_auth(False)
def test_file_contains_column(tmp_path, registry, credentials, target):
    """
    Test pushing and pulling files whose names contain a colon (':')
    """
    client = oras.client.OrasClient(hostname=registry, insecure=True)
    artifact = os.path.join(here, "artifact.txt")
    assert os.path.exists(artifact)

    try:
        contains_column = here / "some:file"
        with open(contains_column, "w") as f:
            f.write("hello world some:file")

        res = client.push(
            files=[contains_column],
            target=target
        )
        assert res.status_code in [200, 201]

        files = client.pull(target, outdir=tmp_path / "download")
        download = str(tmp_path / "download/some:file")
        assert download in files
        assert oras.utils.get_file_hash(str(contains_column)) == oras.utils.get_file_hash(
            download
        )
    finally:
        contains_column.unlink()

    try:
        contains_column = here / ":somefile"
        with open(contains_column, "w") as f:
            f.write("hello world :somefile")

        res = client.push(
            files=[contains_column],
            target=target
        )
        assert res.status_code in [200, 201]

        files = client.pull(target, outdir=tmp_path / "download")
        download = str(tmp_path / "download/:somefile")
        assert download in files
        assert oras.utils.get_file_hash(str(contains_column)) == oras.utils.get_file_hash(
            download
        )
    finally:
        contains_column.unlink()

    with pytest.raises(FileNotFoundError):
        client.push(
            files=[":doesnotexist"],
            target=target
        )

    with pytest.raises(FileNotFoundError):
        client.push(
            files=[".doesnotexist"],
            target=target
        )

( I don't have rights to add myself to this PR )

Other considerations

With this patch, this test case on main starts to fail:

oras-py/oras/tests/test_provider.py

Lines 133 to 137 in 8f52216

    
    testref = "path/to/config:application/vnd.oci.image.config.v1+json:extra"
    remote = oras.provider.Registry(hostname=registry, insecure=True)
    ref, content_type = remote._parse_manifest_ref(testref)
    assert ref == "path/to/config"
    assert content_type == "application/vnd.oci.image.config.v1+json:extra"

but honestly that looks like a wrong test case to me, since application/vnd.oci.image.config.v1+json:extra does not seem to be a valid mime-type 🤔 Should we drop that test case? I'm not sure the trailing :extra can be part of a valid mime-type, and I haven't found the requirement that motivates it.

HTH
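
The doubt about the trailing `:extra` can be checked against RFC 6838 section 4.2, which limits type and subtype names to letters, digits, and the symbols `! # $ & - ^ _ . +`, so a colon cannot appear in either. A small sketch of that check follows; the regex and `is_valid_media_type` are illustrative names, not part of oras-py, and parameters are deliberately not handled.

```python
import re

# RFC 6838 section 4.2: a restricted-name is one ALPHA / DIGIT followed by
# up to 126 restricted-name-chars (which include "." and "+").
_RESTRICTED_NAME = r"[A-Za-z0-9][A-Za-z0-9!#$&\-^_.+]{0,126}"
_MEDIA_TYPE = re.compile(rf"{_RESTRICTED_NAME}/{_RESTRICTED_NAME}")


def is_valid_media_type(value: str) -> bool:
    """Hypothetical check: type/subtype only, no parameters."""
    return _MEDIA_TYPE.fullmatch(value) is not None
```

Under this check, `application/vnd.oci.image.config.v1+json` is accepted, while the same string with `:extra` appended is rejected.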

vsoch · 2024-10-08T10:25:34Z

@tarilabs akin to our discussion above, I think we should allow maximally one :, and then if there is another one (and the filepath does not exist) an error should be raised. So your last example above (the newly failing test) should raise that error I think.

tarilabs · 2024-10-08T11:06:21Z

@vsoch we could transform the already existing test case on main, failing with this PR, as a failure case. If I understand that correctly, I agree.

I do not agree that we should allow maximally one :, since on Windows you want to allow C:\file:mime, and in other cases we want to allow my:file:containing:columns:mime.

Edit: or you meant "allow maximally one :" ...as a separator?

    with pytest.raises(FileNotFoundError, match=r".*does:not:exists.*"):
        client.push(
            files=["does:not:exists:mime"],
            target=target
        )

    with pytest.raises(FileNotFoundError, match=r".*does:not:exists.*"):
        client.push(
            files=["does:not:exists:mime+ext"],
            target=target
        )

Let me know how I could help on this PR! Thanks for the early feedback
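
To make the Windows case concrete, here is a rough sketch that skips the drive colon before looking for a separator. `split_after_drive` is a hypothetical helper; it uses `pathlib.PureWindowsPath` only so the example runs on any OS, and it does not check whether the file exists.

```python
from pathlib import PureWindowsPath
from typing import Optional, Tuple


def split_after_drive(ref: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: split on the last colon after a Windows drive."""
    # "C:" for r"C:\file", "" when there is no drive prefix. Note that a
    # one-letter first segment such as "a:b" is also read as a drive.
    drive = PureWindowsPath(ref).drive
    sep = ref.rfind(":", len(drive))
    if sep < 0:
        return ref, None
    return ref[:sep], ref[sep + 1:] or None
```

For example, `C:\file:mime` splits into `C:\file` and `mime`, and `my:file:containing:columns:mime` splits at its last colon.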

halfline · 2024-10-08T11:08:01Z

@tarilabs akin to our discussion above, I think we should allow maximally one :, and then if there is another one (and the filepath does not exist) an error should be raised. So your last example above (the newly failing test) should raise that error I think.

Again just driveby commenting with limited context so feel free to disregard but...

why would you allow one colon but not more than one colon? what if a filename has two? (eg machineid:hash-type:hash) what if the ref has one and a content type (eg hash-type:hash:content-type)?

if the plan is to delegitimize certain filenames to avoid ambiguity with filenames with embedded content types then why allow embedded colons at all?

Special-casing just Dan's case seems strange to me; it should either allow colons in general or not at all imo

also, are there any security ramifications? is this ever used in untrusted scenarios? just thinking about the case where a content type could be spoofed by having files named file and file:wrongtype on disk at the time of the call
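
The spoofing concern can be made concrete. The sketch below assumes an exists-first heuristic like the one in this PR's diff (`resolve` is a hypothetical stand-in, not oras-py code) and a filesystem that permits colons in file names. It shows the same ref string resolving differently depending on which files are on disk at the time of the call.

```python
import os
import tempfile
from typing import Optional, Tuple

Resolved = Tuple[str, Optional[str]]


def resolve(ref: str) -> Resolved:
    """Hypothetical exists-first split, for illustration only."""
    if os.path.exists(ref) or ":" not in ref:
        return ref, None
    path, content_type = ref.rsplit(":", 1)
    return path, content_type


def demo() -> Tuple[Resolved, Resolved]:
    """Resolve the same ref before and after a colliding file appears."""
    with tempfile.TemporaryDirectory() as tmp:
        plain = os.path.join(tmp, "file")
        ref = plain + ":wrongtype"
        open(plain, "w").close()
        before = resolve(ref)   # only "file" exists: content type is parsed
        open(ref, "w").close()  # now "file:wrongtype" exists as well
        after = resolve(ref)    # the whole ref is taken as the path
    return before, after
```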

vsoch · 2024-10-08T14:01:52Z

why would you allow one colon but not more than one colon? what if a filename has two? (eg machineid:hash-type:hash) what if the ref has one and a content type (eg hash-type:hash:content-type)?

More than one colon is fine if the file is found to exist. It's the cases where you've parsed a mimetype and there is an extra colon not associated with a file that should raise an error. That's likely a mistake.

halfline · 2024-10-08T15:48:20Z

More than one colon is fine if the file is found to exist. It's the cases where you've parsed a mimetype and there is an extra colon not associated with a file that should raise an error. That's likely a mistake.

But if the file doesn't exist it ultimately raises an error anyway, right (regardless of colons)? I guess the question is: should missing;file:type and missing:file:type both ultimately lead to FileNotFoundError, or should missing:file:type raise a ValueError instead?

Another complication is mime types can support parameters, and the values of parameters can have colons (e.g, just making something up, prolly not be real, ref="file.mp4:video/mpeg4; aspect_ratio=16:9") .

So perhaps this code should be validating the content type against the spec as well, and the kind of error returned perhaps should depend on if the content type is valid.
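
A quick illustration of why parameter values complicate the split, using the made-up ref from this comment. Neither a first-colon nor a last-colon split is safe in general: the first-colon split breaks file names that contain colons, and the last-colon split breaks parameter values that contain colons.

```python
ref = "file.mp4:video/mpeg4; aspect_ratio=16:9"

# Splitting on the first colon keeps the parameter value intact...
first = tuple(ref.split(":", 1))

# ...while splitting on the last colon cuts through "16:9".
last = tuple(ref.rsplit(":", 1))
```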

tarilabs · 2024-10-08T15:57:08Z

Another complication is mime types can support parameters, and the values of parameters can have colons (e.g, just making something up, prolly not be real, ref="file.mp4:video/mpeg4; aspect_ratio=16:9") .

Is that applicable here? 🤔

... If defined, the value MUST comply with RFC 6838, including the naming requirements in its section 4.2, ...

(source)

it seems to me it's only the media-type itself, not the parameters typically found in content-type.
wdyt?

vsoch · 2024-10-08T17:12:10Z

1. If the path starts with a colon, fail right away.
2. If there are one or more colons, split by the colon, just once (as the PR does here)
3. Then assume the first part is a media type, the second part is a file.
4. If the file does not exist, error.
5. If the media type is empty, also error.

We can discuss those more complex cases when they actually appear. I don't think it makes sense to solve a problem that doesn't exist yet.
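
The steps above can be sketched as follows. This is only an illustration: `split_once` is a hypothetical name, the split is assumed to happen at the first colon, and the first part is read as the file and the second as the media type, which is the order clarified later in the thread.

```python
import os
from typing import Optional, Tuple


def split_once(ref: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper following the listed steps."""
    # Step 1: a leading colon is rejected outright.
    if ref.startswith(":"):
        raise ValueError(f"Reference '{ref}' starts with a colon.")
    # Steps 2-3: split once; first part is the file, second the media type.
    path, sep, media_type = ref.partition(":")
    # Step 4: the file must exist.
    if not os.path.exists(path):
        raise FileNotFoundError(f"File '{path}' does not exist.")
    # Step 5: a colon followed by nothing is an empty media type.
    if sep and not media_type:
        raise ValueError(f"Empty media type in reference '{ref}'.")
    return path, media_type or None
```

Note that this variant cannot accept an existing file whose name contains a colon, which is the case the PR set out to support.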

halfline · 2024-10-08T17:32:32Z

Another complication is mime types can support parameters, and the values of parameters can have colons (e.g, just making something up, prolly not be real, ref="file.mp4:video/mpeg4; aspect_ratio=16:9") .

Is that applicable here? 🤔

I think the main use case for parameters is legacy text files (e.g. ISO-8859-1 or Windows-1252 instead of utf-8), maybe not applicable, or worth supporting...no idea.

Complications with parameters aside, I guess my main point, though, was, this merge request is focusing on validating the filename and doing I/O calls to make heuristic decisions, but it might be worthwhile to validate the media type too. Knowing if the media type is valid or invalid is useful in its own right, and can also potentially inform the heuristics as well.

vsoch · 2024-10-08T17:41:04Z

Complications with parameters aside, I guess my main point, though, was, this merge request is focusing on validating the filename and doing I/O calls to make heuristic decisions, but it might be worthwhile to validate the media type too. Knowing if the media type is valid or invalid is useful in its own right, and can also potentially inform the heuristics as well.

Totally agree! It's out of scope for this PR, but it would be great to open an issue (if you don't want to work on it) or PR directly to tackle it.

halfline · 2024-10-08T17:45:42Z

If the path starts with a colon, fail right away.

okay, but note files can legitimately start with colons. there may be none that will ever go through these functions, i don't know. (but I have some in my home dir right now for instance)

2. If there are one or more colons, split by the colon, just _once_ (as the PR does here)

makes sense to me

3. Then assume the first part is a media type, the second part is a file.

assume you mean that the other way around, but makes sense

4. If the file does not exist, error.

makes sense

5. If the media type is empty, also error.

I still think it probably makes sense to do more validation than just "is empty" since the docs say the format is a hard requirement. I wouldn't be surprised if there is some regex already crafted out there that validates to spec.

We can discuss those more complex cases when they actually appear. I don't think it makes sense to solve a problem that doesn't exist yet.

yea sorry not trying to derail things for cases that don't matter. maybe i'm pushing for over engineering here...

vsoch · 2024-10-08T17:48:54Z

We can move the "starts with" to be later, and yes other order is what I meant. Apologies - I keep switching state between deep programming and trying to be articulate - doesn't always work super well.

And generally whatever you think is best practice I am happy to review.

halfline · 2024-10-08T19:14:05Z

And generally whatever you think is best practice I am happy to review.

So I don't want to invest the time to build and test this right now, but my suggestion would be something like this, (though maybe slightly different if you and @tarilabs decide media type parameters aren't worthy of supporting)

 def split_path_and_content(ref: str) -> PathAndOptionalContent:
     """
     Parse a string containing a path and an optional content
 
     Examples
     --------
     <path>:<content-type>
     path/to/config:application/vnd.oci.image.config.v1+json
     /dev/null:application/vnd.oci.image.config.v1+json
     C:\\myconfig:application/vnd.oci.image.config.v1+json
 
     Or,
     <path>
     /dev/null
     C:\\myconfig
 
     :param ref: the manifest reference to parse (examples above)
     :type ref: str
     : return: A Tuple of the path in the reference, and the content-type if one found,
               otherwise None.
     """
-    if ":" not in ref:
-        return PathAndOptionalContent(ref, None)
-
-    if pathlib.Path(ref).drive:
-        # Running on Windows and Path has Windows drive letter in it, it definitely has
-        # one colon and could have two or feasibly more, e.g.
-        # C:\test.tar
-        # C:\test.tar:application/vnd.oci.image.layer.v1.tar
-        # C:\test.tar:application/vnd.oci.image.layer.v1.tar:somethingelse
-        #
-        # This regex matches two colons in the string and returns everything before
-        # the second colon as the "path" group and everything after the second colon
-        # as the "context" group.
-        # i.e.
-        # (C:\test.tar):(application/vnd.oci.image.layer.v1.tar)
-        # (C:\test.tar):(application/vnd.oci.image.layer.v1.tar:somethingelse)
-        # But C:\test.tar along will not match and we just return it as is.
-        path_and_content = re.search(r"(?P<path>.*?:.*?):(?P<content>.*)", ref)
-        if path_and_content:
-            return PathAndOptionalContent(
-                path_and_content.group("path"), path_and_content.group("content")
-            )
-        return PathAndOptionalContent(ref, None)
-    else:
-        path_content_list = ref.split(":", 1)
-        return PathAndOptionalContent(path_content_list[0], path_content_list[1])
+
+    # RFC 6838 Section 4.2: Media Type Syntax
+    # https://tools.ietf.org/html/rfc6838#section-4.2
+    #
+    # Media types are defined as:
+    #
+    # type-name = restricted-name
+    # subtype-name = restricted-name
+    # restricted-name = restricted-name-first *126restricted-name-chars
+    # restricted-name-first  = ALPHA / DIGIT
+    # restricted-name-chars  = ALPHA / DIGIT / "!" / "#" / "$" / "&" / "-" / "^" / "_"
+    # restricted-name-chars =/ "."
+    # restricted-name-chars =/ "+"
+    #
+    # So valid characters are letters, digits, and the symbols: ! # $ & - ^ _ . +
+    # Additionally, parameters can be added, which are not specified in detail in RFC 6838.
+    # For parameters, we'll accept any token or quoted string after a semicolon.
+
+    RESTRICTED_NAME_FIRST = r"(?:[A-Za-z0-9])"
+    RESTRICTED_NAME_CHARS = r"[A-Za-z0-9!#$&\-\^_.+]"
+    RESTRICTED_NAME = rf"{RESTRICTED_NAME_FIRST}{RESTRICTED_NAME_CHARS}{{0,126}}"
+    QUOTED_STRING = r'"(?:\\.|[^"\\])*"'
+    VALUE = rf"(?:{RESTRICTED_NAME}|{QUOTED_STRING})"
+    PARAMETER = rf"(?:;\s*{RESTRICTED_NAME}\s*=\s*{VALUE})"
+    MEDIA_TYPE = rf"{RESTRICTED_NAME}/{RESTRICTED_NAME}(?:{PARAMETER})*"
+
+    media_type_regex = re.compile(rf"{MEDIA_TYPE}\Z")
+
+    if not ref:
+        raise ValueError("Invalid manifest reference")
+
+    path = pathlib.Path(ref)
+    drive = path.drive
+
+    # Prefer the whole ref as a path if it exists on disk.
+    if os.path.exists(ref):
+        return PathAndOptionalContent(ref, None)
+
+    # Otherwise try each colon, from the last to the first (skipping a
+    # Windows drive colon), as the path / content-type separator.
+    search_start = len(drive)
+    search_end = ref.rfind(':', search_start)
+    while search_end >= 0:
+        path_candidate = ref[:search_end]
+        content_candidate = ref[search_end + 1:]
+
+        if os.path.exists(path_candidate):
+            if not content_candidate:
+                return PathAndOptionalContent(path_candidate, None)
+            if media_type_regex.fullmatch(content_candidate):
+                return PathAndOptionalContent(path_candidate, content_candidate)
+            raise ValueError(f"Invalid media type '{content_candidate}'.")
+
+        search_end = ref.rfind(':', search_start, search_end)
+
+    raise FileNotFoundError(f"Filename '{ref}' does not exist.")

tarilabs · 2024-10-08T19:33:47Z

decide media type parameters aren't worthy of supporting

To clarify, I don't think it's a decision here; I simply believe the OCI spec I've linked does not support params. Did you have a chance to look at the source spec I linked in #161 (comment), and would you reach the same conclusion?

halfline · 2024-10-09T00:06:42Z

decide media type parameters aren't worthy of supporting

To clarify, I don't think it's a decision here; I simply believe the OCI spec I've linked does not support params. Did you have a chance to look at the source spec I linked in #161 (comment), and would you reach the same conclusion?

my reading of the source you pointed me to says the media type must follow rfc 6838. It then directs the reader to section 4.2, with language like "including section 4.2", not "limited to section 4.2". I mean the parameter bit is the very next section, 4.3, and it's a clear continuation of 4.2, so I don't know why it wouldn't be part of things, but that's just my hot take.

tarilabs · 2024-10-10T07:45:59Z

thank you @vsoch @rhatdan

rhatdan requested review from vsoch and SteveLasker as code owners October 7, 2024 19:21

rhatdan mentioned this pull request Oct 7, 2024

Allow pushing from one type of AI Image to an OCI Image containers/ramalama#245

Merged

halfline reviewed Oct 7, 2024

tarilabs reviewed Oct 8, 2024

tarilabs added 2 commits October 9, 2024 08:31

add tests from PR review

3d82f52

Signed-off-by: tarilabs <matteo.mortari@gmail.com>

remove test case agreed likely a mistake

2188a58

as the column does not belong to the file Signed-off-by: tarilabs <matteo.mortari@gmail.com>

tarilabs mentioned this pull request Oct 9, 2024

adding test cases per PR review rhatdan/oras-py#1

Merged

bump the version in oras/version.py and add a line to the changelog

c72f79f

Signed-off-by: tarilabs <matteo.mortari@gmail.com>

rhatdan force-pushed the colon branch from 461c73a to c72f79f Compare October 9, 2024 10:09

halfline mentioned this pull request Oct 9, 2024

ref splitting code could perform more complete validation of content type #162

Open

vsoch approved these changes Oct 9, 2024

vsoch merged commit 66d57d3 into oras-project:main Oct 9, 2024
5 checks passed

tarilabs mentioned this pull request Oct 15, 2024

support : in model source containers/omlmd#22

Merged

tarilabs added a commit to tarilabs/ramalama that referenced this pull request Oct 15, 2024

deps: bump omlmd to incorporate oras-py changes

a566460

followup to: - oras-project/oras-py#161 - containers/omlmd#22 Signed-off-by: tarilabs <matteo.mortari@gmail.com>

tarilabs mentioned this pull request Oct 15, 2024

deps: bump omlmd to incorporate oras-py changes containers/ramalama#311

Merged
