Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Azure Blob Storage Scaler doesn't list blobs recursively #1789

Closed
jasonpaige opened this issue May 6, 2021 · 18 comments · Fixed by #2846
Closed

Azure Blob Storage Scaler doesn't list blobs recursively #1789

jasonpaige opened this issue May 6, 2021 · 18 comments · Fixed by #2846
Assignees
Labels
feature All issues for new features that have been committed to

Comments

@jasonpaige
Copy link

jasonpaige commented May 6, 2021

Proposal

I'm not sure if this is a bug in the current implementation but given the default values, if I upload a blob to foo/bar/blob.txt the scaler will not "see" the file in order to count it. I think (next to 0 Go knowledge) this is because

props, err := containerURL.ListBlobsHierarchySegment(ctx, azblob.Marker{}, blobDelimiter, listBlobsSegmentOptions)
calls ListBlobsHierarchySegment whereas ListBlobsFlatSegment tells the Azure API to "flatten" the list of files on the server side prior to returning the list.

If this is a bug then it would be great to get this fixed or docs updated to make this clear. However, if this is intended behaviour it would be great to have this as a new feature whereby a developer can pass a switch to the trigger metadata:

  triggers:
    - type: azure-blob
      metadata:
        blobContainerName: mycontainer
        blobCount: "5"
        blobPrefix: ""
        blobDelimiter: "/"
        ignoreBlobHierarchy: true

Use-Case

when I upload a blob which includes a directory structure, it is still included in the blobCount to trigger the Scale Target.

Given this directory tree:

.
├── baz.txt
├── bin.txt
└── foo
    ├── bar
    │   └── world.txt
    └── hello.txt

The call to GetAzureBlobListLength would return 2
With the proposed feature in place, GetAzureBlobListLength would return 4.

Anything else?

No response

@jasonpaige jasonpaige added feature-request All issues for new features that have not been committed to needs-discussion labels May 6, 2021
@tomkerkhove
Copy link
Member

Thanks for the report! @ahmelsayed can you look into this?

@timja
Copy link

timja commented May 7, 2021

Based on my knowledge of the Java SDK list by hierarchy is non recursive.

If you want to list everything under a hierarchy you need to check the isPrefix field in the response and then recursively call the the list by hierarchy function until isPrefix is false.

@jasonpaige
Copy link
Author

Looking at the REST API the Go lib is using that sounds like the same https://docs.microsoft.com/en-gb/rest/api/storageservices/enumerating-blob-resources

The following would have been in the response

<BlobPrefix>  
    <Name>myfolder/</Name>  
</BlobPrefix>

Which indeed needs to be fed into a subsequent request.

Reading the Azure docs for the java sdk you can see the subtle differences between calling the
Hierarchy https://docs.microsoft.com/en-us/java/api/com.azure.storage.blob.blobcontainerclient.listblobsbyhierarchy?view=azure-java-stable
vs Flattened https://docs.microsoft.com/en-us/java/api/com.azure.storage.blob.blobcontainerclient.listblobs?view=azure-java-stable

Looking back at the REST API docs which both these libraries will be calling underneath, the only difference appears to be that the Hierarchy provides a delimiter parameter.

I tested this out using the az rest command: az rest --url https://*******.blob.core.windows.net/testing\?restype\=container\&comp\=list

This returned the following once I'd uploaded some test files in sub directories:

<?xml version=“1.0” encoding=“utf-8"?>
<EnumerationResults ContainerName=“https://*******.blob.core.windows.net/testing”>
    <Blobs>
        <Blob>
            <Name>subdir/subsub/test6.txt</Name>
            <Url>https://*******.blob.core.windows.net/testing/subdir/subsub/test6.txt</Url>
            <Properties>
                <Last-Modified>Fri, 07 May 2021 15:35:43 GMT</Last-Modified>
                <Etag>0x8D9116DC7117ABE</Etag>
                <Content-Length>0</Content-Length>
                <Content-Type>text/plain</Content-Type>
                <Content-Encoding />
                <Content-Language />
                <Content-MD5>1B2M2Y8AsgTpgAmY7PhCfg==</Content-MD5>
                <Cache-Control />
                <BlobType>BlockBlob</BlobType>
                <LeaseStatus>unlocked</LeaseStatus>
            </Properties>
        </Blob>
        <Blob>
            <Name>subdir/test4.txt</Name>
            <Url>https://*******.blob.core.windows.net/testing/subdir/test4.txt</Url>
            <Properties>
                <Last-Modified>Fri, 07 May 2021 15:35:33 GMT</Last-Modified>
                <Etag>0x8D9116DC0F607C8</Etag>
                <Content-Length>0</Content-Length>
                <Content-Type>text/plain</Content-Type>
                <Content-Encoding />
                <Content-Language />
                <Content-MD5>1B2M2Y8AsgTpgAmY7PhCfg==</Content-MD5>
                <Cache-Control />
                <BlobType>BlockBlob</BlobType>
                <LeaseStatus>unlocked</LeaseStatus>
            </Properties>
        </Blob>
        <Blob>
            <Name>subdir/test5.txt</Name>
            <Url>https://*******.blob.core.windows.net/testing/subdir/test5.txt</Url>
            <Properties>
                <Last-Modified>Fri, 07 May 2021 15:35:36 GMT</Last-Modified>
                <Etag>0x8D9116DC2A71325</Etag>
                <Content-Length>0</Content-Length>
                <Content-Type>text/plain</Content-Type>
                <Content-Encoding />
                <Content-Language />
                <Content-MD5>1B2M2Y8AsgTpgAmY7PhCfg==</Content-MD5>
                <Cache-Control />
                <BlobType>BlockBlob</BlobType>
                <LeaseStatus>unlocked</LeaseStatus>
            </Properties>
        </Blob>
        <Blob>
            <Name>test1.txt</Name>
            <Url>https://*******.blob.core.windows.net/testing/test1.txt</Url>
            <Properties>
                <Last-Modified>Fri, 07 May 2021 15:35:21 GMT</Last-Modified>
                <Etag>0x8D9116DBA09FFAB</Etag>
                <Content-Length>0</Content-Length>
                <Content-Type>text/plain</Content-Type>
                <Content-Encoding />
                <Content-Language />
                <Content-MD5>1B2M2Y8AsgTpgAmY7PhCfg==</Content-MD5>
                <Cache-Control />
                <BlobType>BlockBlob</BlobType>
                <LeaseStatus>unlocked</LeaseStatus>
            </Properties>
        </Blob>
        <Blob>
            <Name>test2.txt</Name>
            <Url>https://*******.blob.core.windows.net/testing/test2.txt</Url>
            <Properties>
                <Last-Modified>Fri, 07 May 2021 15:35:18 GMT</Last-Modified>
                <Etag>0x8D9116DB7C9E0E1</Etag>
                <Content-Length>0</Content-Length>
                <Content-Type>text/plain</Content-Type>
                <Content-Encoding />
                <Content-Language />
                <Content-MD5>1B2M2Y8AsgTpgAmY7PhCfg==</Content-MD5>
                <Cache-Control />
                <BlobType>BlockBlob</BlobType>
                <LeaseStatus>unlocked</LeaseStatus>
            </Properties>
        </Blob>
        <Blob>
            <Name>test3.txt</Name>
            <Url>https://*******.blob.core.windows.net/testing/test3.txt</Url>
            <Properties>
                <Last-Modified>Fri, 07 May 2021 15:35:25 GMT</Last-Modified>
                <Etag>0x8D9116DBBFFE8C6</Etag>
                <Content-Length>0</Content-Length>
                <Content-Type>text/plain</Content-Type>
                <Content-Encoding />
                <Content-Language />
                <Content-MD5>1B2M2Y8AsgTpgAmY7PhCfg==</Content-MD5>
                <Cache-Control />
                <BlobType>BlockBlob</BlobType>
                <LeaseStatus>unlocked</LeaseStatus>
            </Properties>
        </Blob>
    </Blobs>
    <NextMarker />
</EnumerationResults>

Hitting this endpoint would yield the result I'm looking for of 6 instead of 3 which us returned when I call with the delimiter=/ like this:
az rest --url “https://*******.blob.core.windows.net/testing?restype=container&comp=list&delimiter=/&prefix=”

<?xml version=“1.0” encoding=“utf-8"?>
<EnumerationResults ContainerName=“https://*******.blob.core.windows.net/testing”>
    <Delimiter>/</Delimiter>
    <Blobs>
        <BlobPrefix>
            <Name>subdir/</Name>
        </BlobPrefix>
        <Blob>
            <Name>test1.txt</Name>
            <Url>https://*******.blob.core.windows.net/testing/test1.txt</Url>
            <Properties>
                <Last-Modified>Fri, 07 May 2021 15:35:21 GMT</Last-Modified>
                <Etag>0x8D9116DBA09FFAB</Etag>
                <Content-Length>0</Content-Length>
                <Content-Type>text/plain</Content-Type>
                <Content-Encoding />
                <Content-Language />
                <Content-MD5>1B2M2Y8AsgTpgAmY7PhCfg==</Content-MD5>
                <Cache-Control />
                <BlobType>BlockBlob</BlobType>
                <LeaseStatus>unlocked</LeaseStatus>
            </Properties>
        </Blob>
        <Blob>
            <Name>test2.txt</Name>
            <Url>https://*******.blob.core.windows.net/testing/test2.txt</Url>
            <Properties>
                <Last-Modified>Fri, 07 May 2021 15:35:18 GMT</Last-Modified>
                <Etag>0x8D9116DB7C9E0E1</Etag>
                <Content-Length>0</Content-Length>
                <Content-Type>text/plain</Content-Type>
                <Content-Encoding />
                <Content-Language />
                <Content-MD5>1B2M2Y8AsgTpgAmY7PhCfg==</Content-MD5>
                <Cache-Control />
                <BlobType>BlockBlob</BlobType>
                <LeaseStatus>unlocked</LeaseStatus>
            </Properties>
        </Blob>
        <Blob>
            <Name>test3.txt</Name>
            <Url>https://*******.blob.core.windows.net/testing/test3.txt</Url>
            <Properties>
                <Last-Modified>Fri, 07 May 2021 15:35:25 GMT</Last-Modified>
                <Etag>0x8D9116DBBFFE8C6</Etag>
                <Content-Length>0</Content-Length>
                <Content-Type>text/plain</Content-Type>
                <Content-Encoding />
                <Content-Language />
                <Content-MD5>1B2M2Y8AsgTpgAmY7PhCfg==</Content-MD5>
                <Cache-Control />
                <BlobType>BlockBlob</BlobType>
                <LeaseStatus>unlocked</LeaseStatus>
            </Properties>
        </Blob>
    </Blobs>
    <NextMarker />
</EnumerationResults>

Passing the parameter for delimiter= yields the same as the first result (6), however, I think I have tried passing this via the helm values previously with no joy. I'm going to attempt again once I sort my local minikube out....

@jasonpaige
Copy link
Author

Yup so testing using the following, I still don't get any jobs running when i upload a blob to a subdir

apiVersion: v1
kind: Secret
metadata:
  name: blob-storage-connection-string
  namespace: kedadebug
data:
  connection-string: ********************************
---
apiVersion: keda.k8s.io/v1alpha1
kind: TriggerAuthentication
metadata:
  name: az-test-blob-auth
  namesapce: kedadebug
spec:
  secretTargetRef:
    - key: connection-string
      name: blob-storage-connection-string
      parameter: connection
---
apiVersion: keda.k8s.io/v1alpha1
kind: ScaledObject
metadata:
  name: blob-scaledobject
  namespace: kedadebug
spec:
  scaleType: job
  pollingInterval: 5
  cooldownPeriod:  10
  minReplicaCount: 0
  maxReplicaCount: 5
  jobTargetRef:
    parallelism: 1
    completions: 1
    activeDeadlineSeconds: 600
    backoffLimit: 6
    template:
      spec:
        restartPolicy: "OnFailure"
        containers:
          - image: docker.io/library/hello-world
            name: hello-world
            imagePullPolicy: IfNotPresent
  triggers:
    - type: azure-blob
      metadata:
        # Required
        blobContainerName: testing
        # Optional
        blobCount: "1" # default 5
        blobPrefix: ""
        blobDelimiter: ""
      authenticationRef:
        name: az-test-blob-auth

Which would suggest to be that somewhere the value for blobDelimiter is being populated back with the default value of /

@jasonpaige
Copy link
Author

Again, not a golang dev by any stretch but this looks like the potential culprit:

if val, ok := config.TriggerMetadata["blobDelimiter"]; ok && val != "" {
meta.blobDelimiter = val
}

I think a value of "" is legitimate for config.TriggerMetadata["blobDelimiter"] so that I can count all of the blobs in a container, regardless of what structure they may be in (in sub directories although there's not exactly directories in azure blob store but not a prefix at least)

@kevinmatthews-kpmg
Copy link

It would be good to be able to a put a 'folder' (dynamic name) in the blob storage (with multiple files inside) and for that to count as 1 - maybe optional so it counts a folder as 1 or counts all of the files individually... i know they aren't folders as such, but its how azure represents them.

The use case behind this is we are using cognitive services and putting the results in to the blob storage for the keda function to take care of, but it puts folders in with random long names with files inside it.

@joachimgoris
Copy link

This would also be useful for us. We process batch requests and based on the amount of blobs scale our functions. We separate our blobs in folders so batches don't get mixed.
Making it possible for the blob scaler to find blobs recursively would simplify our process.

ahmelsayed added a commit that referenced this issue Aug 16, 2021
Closes #1789

Signed-off-by: Ahmed ElSayed <ahmels@microsoft.com>
ahmelsayed added a commit that referenced this issue Aug 16, 2021
Closes #1789

Signed-off-by: Ahmed ElSayed <ahmels@microsoft.com>
@ahmelsayed
Copy link
Contributor

Regarding recursive listing of blobs, @jasonpaige is right. The delimiter is valid as "" and setting that should allow recursive listing.

Regarding your scenario @kevinmatthews-kpmg, I'm assuming what you mean is for example if you have a container foo that looks like

/foo/root.txt

/foo/somefolder/run1/result.1.txt
/foo/somefolder/run1/retult.2.txt

You want to be able to count the "folders" under somefolder, which will be 1 or

/foo/root.txt

/foo/somefolder/run1/result.1.txt
/foo/somefolder/run1/retult.2.txt

/foo/somefolder/run2/result.1.txt
/foo/somefolder/run2/retult.2.txt

Would be 2 (run1 and run2), right?

type: azure-blob
metadata:
  blobContainerName: foo
  blobPrefix: "somefolder"
  blobDelimiter: "/"
  count: "blobs" # "blobs" for the current behavior, or "prefixes" to count "folders" under /{blobContainerName}/{blobPrefix}.....{blobDelimiter}

The prefixes count for the above config would be 2 (run1, run2) and the count for

type: azure-blob
metadata:
  blobContainerName: foo
  blobPrefix: ""
  blobDelimiter: "/"
  count: "blobs" # "blobs" for the current behavior, or "prefixes" to count "folders" under /{blobContainerName}/{blobPrefix}.....{blobDelimiter}

will be 1 (somefolder). If that matches the scenario you're describing @kevinmatthews-kpmg then we can open another issue to track this.

For this issue, I opened this PR: #2036

One complication: it seems that the scaler was appending blobDelimiter to blobPrefix, which isn't right. To avoid having a breaking change, I append the default blobDelimiter, i.e: /, to the blobPrefix if it's not there. But we might consider removing that in the future. It'll be a breaking change though, so we might only do it for v3?

@tomkerkhove
Copy link
Member

One complication: it seems that the scaler was appending blobDelimiter to blobPrefix, which isn't right. To avoid having a breaking change, I append the default blobDelimiter, i.e: /, to the blobPrefix if it's not there. But we might consider removing that in the future. It'll be a breaking change though, so we might only do it for v3?

Sounds good to me, let's open an issue for tracking this

@tomkerkhove
Copy link
Member

tomkerkhove commented Aug 17, 2021

You want to be able to count the "folders" under somefolder, which will be 1 or

/foo/root.txt

/foo/somefolder/run1/result.1.txt
/foo/somefolder/run1/retult.2.txt

/foo/somefolder/run2/result.1.txt
/foo/somefolder/run2/retult.2.txt

Would be 2 (run1 and run2), right?

type: azure-blob
metadata:
  blobContainerName: foo
  blobPrefix: "somefolder"
  blobDelimiter: "/"
  count: "blobs" # "blobs" for the current behavior, or "prefixes" to count "folders" under /{blobContainerName}/{blobPrefix}.....{blobDelimiter}

@ahmelsayed Instead of counting folders, I think we should support scaling based on blob count and container count which go recusively through all sub-containers.

If I configure somefolder, I want to be able to scale to 4 since there are 4 text files. In container mode, that would be 2 since I have run1 & run2.

@zroubalik
Copy link
Member

One complication: it seems that the scaler was appending blobDelimiter to blobPrefix, which isn't right. To avoid having a breaking change, I append the default blobDelimiter, i.e: /, to the blobPrefix if it's not there. But we might consider removing that in the future. It'll be a breaking change though, so we might only do it for v3?

Sounds good to me, let's open an issue for tracking this

I wouldn't be afraid to introduce this breaking change in the next release (with proper documentation), if y'all think that this change is something that would be beneficial for users in v2.

@kevinmatthews-kpmg
Copy link

For my scenario, we use cognitive services and the output of which is stored in a storage account, however it outputs it in the container with a random string for the "folder"... so lets say the structure is as follows:

/cog1234/index.json
/cog1234/files/test.json

/cog4531/index.json
/cog4531/files/test.json

I'd want the above to count the "folders" in the top level so in this case 2, this would allow me to fire up to 2 pods from the scaler, read the index.json file within each and process whatever files are in the files directory inside the top level directory.

Does that make sense?

Thanks

Kevin

@tomkerkhove
Copy link
Member

I certainly think it does, but it's different from @jasonpaige & @joachimgoris their scenario who want to scale based on the amount of blobs and not the containers.

So since this was originally reported for blob count, I'd recommend creating a new feature request and link to this one for context so we track both.

@ahmelsayed are you up for implementing both?

@tomkerkhove
Copy link
Member

We've agreed to make the following changes:

  • Add globPattern: <glob> option that takes place when specified, instead of the rest
    • This allows people to choose between using glob or delimiter/prefix, but not container name since that's required
  • Add recursive: "true" which ignores the delimiter, when specified

@jasonpaige
Copy link
Author

Thanks so much everyone, great to see this getting added to v2.5 as well! 👍
I'll close this issue off in light of #2036

@zroubalik
Copy link
Member

@jasonpaige thanks, but let's keep this open until the PR is not merged :)

@stale
Copy link

stale bot commented Oct 17, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale All issues that are marked as stale due to inactivity label Oct 17, 2021
@stale stale bot removed the stale All issues that are marked as stale due to inactivity label Oct 17, 2021
@tomkerkhove tomkerkhove added feature All issues for new features that have been committed to and removed feature-request All issues for new features that have not been committed to labels Oct 17, 2021
@tomkerkhove tomkerkhove added the stale All issues that are marked as stale due to inactivity label Oct 17, 2021
@stale stale bot removed the stale All issues that are marked as stale due to inactivity label Oct 17, 2021
@tomkerkhove tomkerkhove moved this to Backlog in Roadmap - KEDA Core Feb 10, 2022
@tomkerkhove tomkerkhove moved this from To Do to Proposed in Roadmap - KEDA Core Feb 11, 2022
@v-shenoy
Copy link
Contributor

I can take this up, @kedacore/keda-contributors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature All issues for new features that have been committed to
Projects
Archived in project
9 participants