Make collection name unique at a project level #2865

Open
christad92 opened this issue Jan 24, 2025 · 7 comments

Comments

@christad92

christad92 commented Jan 24, 2025

Currently, we enforce uniqueness for collection names at the instance level, but it would be better to enforce it at the project level. No two collections in a project should have the same name, but two or more collections in an OpenFn instance can share a name.

When sharing a collection with another project, we should block the share if the receiving project has a collection that has the same name as the shared collection.
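
As a rough illustration of the share rule above, here is a minimal sketch. The project shape (`{ name, collections }`) and the function name `canShareCollection` are hypothetical and only stand in for whatever the real data model looks like.

```javascript
// Hypothetical data shape: a project is { name, collections: [collectionNames] }.
// This is an illustration only, not the actual schema.
function canShareCollection(collectionName, receivingProject) {
  // Block the share when the receiving project already has a collection with that name.
  return !receivingProject.collections.includes(collectionName);
}

const staging = { name: 'staging', collections: ['record-cache'] };
const prod = { name: 'prod', collections: ['patients'] };

console.log(canShareCollection('record-cache', prod));    // true
console.log(canShareCollection('record-cache', staging)); // false
```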

@github-project-automation github-project-automation bot moved this to New Issues in v2 Jan 24, 2025
@christad92 christad92 changed the title Allow users to edit the ID of the collection when autogenerated ID exists Make collection name unique at a project level Jan 24, 2025
@christad92
Author

@theroinaochieng please I would like for us to prioritize this. It is a blocker for users.

Image

@theroinaochieng
Contributor

Hi Ayo, this is well noted. I'll have someone in the team pick this up.

@theroinaochieng theroinaochieng moved this from New Issues to Ready in v2 Feb 14, 2025
@theroinaochieng theroinaochieng moved this from Ready to Backlog in v2 Feb 14, 2025
@taylordowns2000 taylordowns2000 moved this from Backlog (Tech) to Product BL in v2 Feb 17, 2025
@stuartc
Member

stuartc commented Feb 19, 2025

@christad92 could you elaborate on why this is a blocker for users? The error is clear that the collection already exists.

Collections are globally unique for a few reasons.

When accessing a collection via the job, the user specifies it using its name: collections.get('my-specific-collection').

It would be easy to determine which collection the job is referring to via the project the job exists in.

However, given the existence of #2855, where users would be able to share collections between projects, we run into issues of disambiguation.

You would see two collections with the same name in the UI, and job code would not be able to distinguish which one is being referred to.

This can be solved using namespacing, so in a job you would write something like: collections.get('my-project-name/my-specific-collection').

This would also impact the CLI. The current access pattern looks like this:

openfn collections get my-collection \* --token $MY_OPENFN_PAT

This would have issues even if the collection wasn't shared with other projects: since you're accessing the collection as a user, we'd have to disambiguate across all the other projects that user collaborates on.

Again, this can be solved with namespacing. But this would be a breaking change for the CLI and jobs.

We could perhaps naively look up the collection name for the user or project and throw an error if it's ambiguous. That's better, but it still leaves users with jobs or CLI commands suddenly failing because someone made a collection in some other project they don't spend much time in.

So yeah, this is a little bit of a minefield.

Does this change the appetite for this request?

@christad92
Author

christad92 commented Feb 19, 2025

Thanks @stu

  • Yes, the message is clear but the error is confusing. "When you say it is a duplicate collection, what do you mean? I didn't create a credential and my list of available credentials is empty." My point here is that we should treat collection names basically like we do with workflows in a project.
  • About sharing collections, we preempted this and have recommended that we discourage receiving a shared collection if it has the same name as any of the existing collections in my project.

No, it doesn't change the appetite for the request. We should make collection names unique at the project level, not across the instance.

@stuartc
Member

stuartc commented Feb 24, 2025

Dropping some implementation notes here, to be turned into issues once discussed:

Allow Collections API to handle ambiguous collection names

We need to change the router to be able to resolve the difference between:

/api/collections/my-project-name/my-specific-collection
/api/collections/my-specific-collection
/api/collections/my-project-name/my-specific-collection/key123
/api/collections/my-specific-collection/key123

  1. Authenticate the PAT or RunToken.
  2. Find the collection using the project name, taken (in this order):
    1. From the path
    2. From the token context (either the user or the run's project)
  3. If more than one collection matches the query, return a 409 Conflict.
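
To make the lookup order concrete, here is a minimal sketch of steps 2 and 3 (authentication is left out). Everything in it is an assumption for illustration: the in-memory `collections` array, the `resolveCollection` name, the `tokenProjects` argument (the run's project for a run token, or the user's projects for a PAT), and the 404 for no match, which the notes above don't specify. It tries the project-qualified reading of the path first and falls back to the token context.

```javascript
// Hypothetical in-memory stand-in for the collections table.
const collections = [
  { project: 'project-a', name: 'my-specific-collection' },
  { project: 'project-b', name: 'my-specific-collection' },
  { project: 'project-a', name: 'other' },
];

// segments: the path parts after /api/collections/
// tokenProjects: project names the authenticated token has access to (assumed shape)
function resolveCollection(segments, tokenProjects) {
  const [first, second, ...rest] = segments;

  // 2.1 From the path: read the first two segments as project/collection.
  if (second !== undefined && tokenProjects.includes(first)) {
    const match = collections.find(
      (c) => c.project === first && c.name === second
    );
    if (match) {
      return { status: 200, collection: match, key: rest.join('/') || null };
    }
  }

  // 2.2 From token context: read the first segment as a bare collection name.
  const matches = collections.filter(
    (c) => c.name === first && tokenProjects.includes(c.project)
  );
  if (matches.length === 0) return { status: 404 }; // assumption: not covered in the notes
  if (matches.length > 1) return { status: 409 };   // 3. ambiguous name

  const key = [second, ...rest].filter((s) => s !== undefined).join('/') || null;
  return { status: 200, collection: matches[0], key };
}

console.log(resolveCollection(['my-specific-collection'], ['project-a', 'project-b']).status); // 409
console.log(resolveCollection(['my-specific-collection', 'key123'], ['project-a']).key);       // key123
```

One consequence of this ordering is that a collection key that happens to match a collection name in a project named like another collection would be read as the qualified form, so the real router may need a stricter rule.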

Provisioner Changes

Ensure we have good errors? I think this is fine already.

Collections Adaptor Changes

We need to check that if a user writes the following operation:

collections.get('my-project-name/my-specific-collection')

That the request is correctly built to call /api/collections/my-project-name/my-specific-collection
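
A minimal sketch of that check, assuming a hypothetical `buildCollectionPath` helper (this is not the adaptor's actual internals). Each segment is encoded on its own so the `/` between project and collection is preserved in the URL.

```javascript
// Hypothetical helper for illustration; the real adaptor may build requests differently.
function buildCollectionPath(name, key) {
  // Encode each segment separately so the "/" separator is kept as a path delimiter.
  const segments = name.split('/').map(encodeURIComponent);
  if (key !== undefined) segments.push(encodeURIComponent(key));
  return `/api/collections/${segments.join('/')}`;
}

console.log(buildCollectionPath('my-project-name/my-specific-collection'));
// /api/collections/my-project-name/my-specific-collection
console.log(buildCollectionPath('my-specific-collection', 'key123'));
// /api/collections/my-specific-collection/key123
```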

CLI Changes

When a user calls:

openfn collections get my-collection \* --token $MY_OPENFN_PAT

And there are two my-collection collections available to the user, display a clear error explaining the ambiguity.
The user can follow up with either:

openfn collections get project-name/my-collection ...

or:

openfn collections get my-collection ... --project project-name

This leads to another concern: there is no uniqueness requirement on a project's name, and in this case (without other optimisations) the user will need to provide a UUID to disambiguate.

I prefer the project-name/my-collection pattern since it will be 1:1 with how you write a job. But I'd like to get more information about how else a --project flag would be used.
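
For comparison, here is a rough sketch of how the two proposed forms could normalise to the same target. The `parseTarget` function is hypothetical, and the `--project` flag is only a proposal in this thread, not an existing CLI option.

```javascript
// Hypothetical argument normaliser. nameArg is the positional collection argument;
// projectFlag is the value of the proposed --project flag (undefined when absent).
function parseTarget(nameArg, projectFlag) {
  const slash = nameArg.indexOf('/');
  if (slash !== -1) {
    const project = nameArg.slice(0, slash);
    const collection = nameArg.slice(slash + 1);
    if (projectFlag !== undefined && projectFlag !== project) {
      throw new Error(`Conflicting projects: "${project}" and "${projectFlag}"`);
    }
    return { project, collection };
  }
  // No prefix: use the flag if given, otherwise leave the project unresolved.
  return { project: projectFlag ?? null, collection: nameArg };
}

console.log(parseTarget('project-name/my-collection'));
// { project: 'project-name', collection: 'my-collection' }
console.log(parseTarget('my-collection', 'project-name'));
// { project: 'project-name', collection: 'my-collection' }
```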

@taylordowns2000
Member

taylordowns2000 commented Feb 25, 2025

Thank you @stuartc and @theroinaochieng ! Here's the mockup from our call today with @christad92 .

This is really an epic that touches credential sharing, the CLI, the collections list view, and the way collections are referenced in runs. It's 3-4 days of work.

Image

@josephjclark
Contributor

Here's a perspective on this. I'm not sure it's been considered.

This has implications for managing prod/staging versions of projects! Something I've been thinking about a lot this week.

My staging project uses a collection called record-cache. That string is hard-coded in my workflow code: collections.get('record-cache').

When we deploy that project to prod, we want to use exactly the same code as on staging. But we almost definitely don't want to use the same actual collection.

So I think this is a vote for:

  1. Giving a collection a full/qualified name, of the form <project-name>/<collection-name>
  2. Allowing a step to reference the "local" or "relative" name
  3. So that by default a collection is bound to a project
  4. This also allows us to cross-reference a collection in another project (if we have permission)
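
To sketch what that could look like, assume a hypothetical `qualify` function that expands the name written in a step against the project the step runs in. The project names below are made up for the example.

```javascript
// Hypothetical resolver: turns the name used in a step into a qualified name.
// A name containing "/" is treated as already qualified; anything else is local.
function qualify(name, currentProject) {
  return name.includes('/') ? name : `${currentProject}/${name}`;
}

// The same step code binds to a different collection in each project.
console.log(qualify('record-cache', 'staging')); // staging/record-cache
console.log(qualify('record-cache', 'prod'));    // prod/record-cache

// A qualified name cross-references another project's collection.
console.log(qualify('shared-project/lookup', 'prod')); // shared-project/lookup
```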

We can't rename a project, right?

Labels
None yet
Projects
Status: Product Backlog
Development

No branches or pull requests

5 participants