Database format for registry ( Mongodb ) #1

Open
henilp105 opened this issue Dec 2, 2022 · 44 comments
Labels
area: database Related to the package database

Comments

@henilp105
Member

henilp105 commented Dec 2, 2022

I have considered the following format for the MongoDB database:

  1. fpmregistry as the database.
  2. It would have 3 collections named users, packages, and namespaces.
  3. Suggested format for the users collection:
{
    "name": "username",
    "email": "email",
    "password": "password hash",
    "role": ["user", "maintainer", "admin"],
    "createdAt": "Python datetime object",
    "lastLogin": "Python datetime object",
    "lastLogout": "Python datetime object",
    "authorOf": ["_id of package1", "_id of package2"],
    "maintainerOf": ["_id of package1", "_id of package2"],
    "pendingRequest": ["_id of package1", "_id of package2"],
    "sessionId": "uuid"
}

  4. Suggested format for the packages collection:

{
    "name": "package name",
    "namespace": "_id of namespace",
    "isDeprecated": "boolean",
    "tarball": "AWS S3 object name",
    "version": "version",
    "license": "license name",
    "createdAt": "Python datetime object",
    "updatedAt": "Python datetime object",
    "author": "_id object",
    "maintainers": ["_id of user1", "_id of user2"],
    "copyright": "copyright statement",
    "description": "package description",
    "tags": ["t1", "t2"],
    "dependencies": ["_id of package1", "_id of package2"]
}

  5. Suggested format for the namespaces collection:

{
    "namespace": "namespace name",
    "createdAt": "Python datetime object",
    "createdBy": "_id of user",
    "description": "namespace description",
    "tags": ["t1", "t2"],
    "authors": ["_id of a1", "_id of a2"],
    "packages": ["_id of package1", "_id of package2"]
}
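
To make the cross-references between the three proposed collections concrete, here is a minimal Python sketch in which plain dictionaries stand in for MongoDB documents. The sample values, the string ids, the `REFERENCES` table, and the helper `dangling_references` are assumptions made for illustration; they are not part of the proposal.

```python
from datetime import datetime, timezone

now = datetime.now(timezone.utc)

# Sample documents keyed by _id; the string ids stand in for MongoDB ObjectIds.
users = {
    "u1": {"name": "alice", "role": ["user", "maintainer"], "createdAt": now,
           "authorOf": ["p1"], "maintainerOf": ["p1"], "pendingRequest": []},
}
namespaces = {
    "n1": {"namespace": "demo-namespace", "createdAt": now, "createdBy": "u1",
           "authors": ["u1"], "packages": ["p1"]},
}
packages = {
    "p1": {"name": "demo-package", "namespace": "n1", "version": "0.1.0",
           "createdAt": now, "author": "u1", "maintainers": ["u1"],
           "dependencies": []},
}

# (collection, field, target collection) for every reference in the proposal.
REFERENCES = [
    ("users", "authorOf", "packages"),
    ("users", "maintainerOf", "packages"),
    ("users", "pendingRequest", "packages"),
    ("packages", "namespace", "namespaces"),
    ("packages", "author", "users"),
    ("packages", "maintainers", "users"),
    ("packages", "dependencies", "packages"),
    ("namespaces", "createdBy", "users"),
    ("namespaces", "authors", "users"),
    ("namespaces", "packages", "packages"),
]


def dangling_references(db):
    """List (collection, _id, field, target) for references that point nowhere."""
    problems = []
    for collection, field, target in REFERENCES:
        for doc_id, doc in db[collection].items():
            value = doc.get(field, [])
            ids = value if isinstance(value, list) else [value]
            for ref in ids:
                if ref not in db[target]:
                    problems.append((collection, doc_id, field, ref))
    return problems


db = {"users": users, "namespaces": namespaces, "packages": packages}
print(dangling_references(db))  # an empty list means every reference resolves
```

Because the same relationship is stored on both sides (for example `maintainerOf` on the user and `maintainers` on the package), a check like this is one way to notice when the two sides drift apart.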

I would like to ask all community members to review this database format and add suggestions.

Thanks and Regards,
Henil

CC @awvwgk @fortran-lang/admins @fortran-lang/fpm

@awvwgk
Member

awvwgk commented Dec 2, 2022

We were thinking about having proper namespaces for all packages to avoid name collisions and name squatting. Like for example a fortran-lang namespace under which the fpm package is published by a user admin42. Would it make sense to have namespaces as a separate collection?

@henilp105
Member Author

@awvwgk sure sir, either we could make a new collection namespace and link all the packages into it, or we could also add namespace to the packages collection as a field?

@awvwgk awvwgk added the area: database Related to the package database label Dec 2, 2022
@everythingfunctional
Member

"maintainer": "caomaco@gmail.com"

I think you might not want to include that here, as you've listed the packages a user is maintainer of, and you don't want duplicate, potentially conflicting info in your database.

@awvwgk
Member

awvwgk commented Dec 2, 2022

I think you might not want to include that here, as you've listed the packages a user is maintainer of, and you don't want duplicate, potentially conflicting info in your database.

This refers to the entry gathered from the manifest (https://fpm.fortran-lang.org/en/spec/manifest.html#project-maintainer). I also don't really like to specify it in the manifest (see fortran-lang/fpm-registry#24). If it feels redundant or conflicting in the manifest, maybe we should drop it from fpm.

@milancurcic
Member

I think the packages table should include a releases field, and the keys from the manifest should be taken out of it and placed in a new table called manifests, as the manifests will vary between different releases of the same package.
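
One possible reading of this suggestion, sketched in Python with plain dictionaries: the package document keeps only a list of releases, and the manifest of each release lives in a separate manifests collection keyed by package and version. The field names `releases`, `package`, `version`, and `manifest`, the sample data, and the helper `manifest_for` are all assumptions for illustration.

```python
# Package documents carry only release identifiers (hypothetical layout).
packages = [
    {"_id": "p1", "name": "demo-package", "releases": ["0.1.0", "0.2.0"]},
]

# One manifest document per release, since manifests differ between releases.
manifests = [
    {"package": "p1", "version": "0.1.0",
     "manifest": {"license": "MIT", "dependencies": {}}},
    {"package": "p1", "version": "0.2.0",
     "manifest": {"license": "MIT", "dependencies": {"other-package": "0.3.0"}}},
]


def manifest_for(package_id, version):
    """Return the manifest stored for one release of a package, or None."""
    for entry in manifests:
        if entry["package"] == package_id and entry["version"] == version:
            return entry["manifest"]
    return None


print(manifest_for("p1", "0.2.0"))
```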

@henilp105
Member Author

@milancurcic sir, maybe we could add another field in the packages collection named manifests, so that different packages may have different keys and we could directly store that in a JSON/Mongo array. I have planned to add a new row to the packages collection for each new version. (Maybe I am mixing up releases and versions here?)

@urbanjost

I see other repositories that distinguish between plugins, applications, and libraries. Of course, a single fpm(1) package may include all of these; but wondering if there should be a separate list for an entry. In a related vein, "fpm install" should allow executables to be selected (does not make as much sense for libraries and modules), potentially with a rename option.

@henilp105
Member Author

@urbanjost sir, we (@minhqdao, @awvwgk and I) discussed this just yesterday. We decided that, for now, the API would provide the options to the CLI, which in turn would handle them by showing the options in the CLI. @minhqdao for the CLI side, I am currently providing the API with a JSON response and the package options.

@urbanjost

That would be fine; I do not necessarily think multiple repositories are required; just some way to search and display by those categories. So the output of a query will be JSON? That makes sense for allowing applications to process the information, but I hope a default beautifier will be available to show the results as a nice ASCII table at a minimum. Having to install a bunch of infrastructure to just see the result of a query in a nice format would be a significant drawback, I would think. Great to know you have been considering the topic. Good luck!

@henilp105
Member Author

I do not necessarily think multiple repositories are required;

We have been thinking of making our own global package repository (like PyPI, crates.io). This repo would work as a backend repository for that website and also for the local registry, which would be integrated into fpm by @minhqdao.

So the output of a query will be JSON? That makes sense for allowing applications to process the information, but I hope a default beautifier will be available to show the results as a nice ASCII table at a minimum.

The response to an API call (please refer to #5) to the backend would be JSON, but it would be rendered nicely by a website or the fpm CLI.

Having to install a bunch of infrastructure to just see the result of a query in a nice format would be a significant drawback, I would think.

The CLI would be integrated into fpm directly and the website can be easily accessed from a browser, so we would not have to install any infrastructure to get the result of the query.
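
As a rough illustration of the "JSON in, ASCII table out" idea discussed here, the following Python sketch renders a JSON list of objects as a plain table using only the standard library. The function `ascii_table`, the column names, and the sample response body are hypothetical; the real CLI would live in fpm and is not shown.

```python
import json


def ascii_table(rows, columns):
    """Render a list of dicts as a plain ASCII table with the given columns."""
    cells = [[str(row.get(col, "")) for col in columns] for row in rows]
    # Each column is as wide as its header or its widest cell.
    widths = [max([len(col)] + [len(line[i]) for line in cells])
              for i, col in enumerate(columns)]
    rule = "+" + "+".join("-" * (w + 2) for w in widths) + "+"

    def fmt(values):
        return "| " + " | ".join(v.ljust(w) for v, w in zip(values, widths)) + " |"

    lines = [rule, fmt(columns), rule]
    lines += [fmt(line) for line in cells]
    lines.append(rule)
    return "\n".join(lines)


# Hypothetical JSON body, as a backend search endpoint might return it.
response_body = (
    '[{"name": "demo", "version": "0.1.0"},'
    ' {"name": "other-demo", "version": "1.2.3"}]'
)
print(ascii_table(json.loads(response_body), ["name", "version"]))
```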

@minhqdao
Contributor

minhqdao commented Dec 8, 2022

A few remarks on the users collection:

  1. I think it would be great if we stayed uniform with either camel case or using underscores, but not mix it (createdAt and is_admin -> created_at and is_admin). I think the standard here is to use underscores only.
  2. With login_at and logout_at, you mean last_login and last_logout? If yes, the latter are clearer.
  3. Instead of is_admin, I'd assign a role. In addition to the admin role we could add more if necessary, such as editor, maintainer etc.
  4. What is packages supposed to contain? Please clarify.
  5. Instead of is_maintainer, which implies a bool, I'd call it maintainer_of.

@minhqdao
Contributor

minhqdao commented Dec 8, 2022

Considering the packages collection:

  1. Do we need two different ids, one autogen by MongoDB and one by ourselves?

  2. Do we still need a name if there is a namespace reference?

  3. Shouldn't author only be part of packages, not namespace, to avoid duplication?

  4. Shouldn't maintainers be assigned to the entire namespace, instead of individual packages?

  5. "user": "_id object"

We are not going to save all the users who downloaded the package, are we?

  6. "maintainer": "caomaco@gmail.com",

You could use the user id to link those two and avoid duplication.

  7. "git-tag": "v1.7.0"

Don't forget to mark certain fields as optional. As the registry is supposed to be independent from git, the specification of a git tag should not be mandatory.

@minhqdao
Contributor

minhqdao commented Dec 8, 2022

I see other repositories that distinguish between plugins, applications, and libraries. Of course, a single fpm(1) package may include all of these; but wondering if there should be a separate list for an entry. In a related vein, "fpm install" should allow executables to be selected (does not make as much sense for libraries and modules), potentially with a rename option.

Yes, I agree. I think packages should contain a type field which is either plugin, application, library etc. That would make searching easier. However, I'm not sure where to get the information from, which type a fpm project is. 😅

@henilp105
Member Author

henilp105 commented Dec 8, 2022

@minhqdao Thanks for the suggestions, some points to ponder:

I think it would be great if we stayed uniform with either camel case or using underscores, but not mix it (createdAt and is_admin -> created_at and is_admin). I think the standard here is to use underscores only.

  1. The convention for MongoDB is camelCase.

Instead of is_admin, I'd assign a role. In addition to the admin role we could add more if necessary, such as editor, maintainer etc.

  1. I had considered adding a role, but that could give unwanted access to the entire cluster/collection. We would have to be very specific regarding the roles, thus I had considered it best to discuss it and then add it. (I would want to be extra cautious about security aspects, as it would be an open-source registry. 😅)

What is packages supposed to contain? Please clarify.

  1. It is supposed to contain various versions of packages. (Refer to API Routes for website and local registry/fpm #5.)

Do we need two different ids, one autogen by MongoDB and one by ourselves?

  1. No, but I had added that to ease the dependency resolution on your side.

Do we still need a name if there is a namespace reference?

  1. Yes, there might be multiple different packages in a single namespace.

@henilp105
Member Author

henilp105 commented Dec 8, 2022

@minhqdao, some more points to refer to:

As the registry is supposed to be independent from git, the specification of a git tag should not be mandatory.

  1. It would be mandatory only for taking the latest version from GitHub/GitLab, which is why this tag is used.

We are not going to save all the users who downloaded the package, are we?

  1. No, it would be an unauthenticated GET request only. We might consider saving the number of users/GET requests for a particular package, but we won't be saving the users (specifically).

Shouldn't author only be part of packages, not namespace, to avoid duplication?

  1. The author of a namespace and of a package might be different; there might be multiple authors of the namespace, and this redundancy is actually caused by its inherent use in fpm (refer above).

Shouldn't maintainers be assigned to the entire namespace, instead of individual packages?

  1. The maintainers of a namespace and of a package might be different; there might be multiple maintainers of the namespace.

Yes, I agree. I think packages should contain a type field which is either plugin, application, library etc. That would make searching easier. However, I'm not sure where to get the information from, which type a fpm project is. 😅

  1. I have found a solution by adding a new field, tags, to the packages collection; this would resolve this problem of searching.

@minhqdao
Contributor

minhqdao commented Dec 8, 2022

  1. The convention for MongoDB is camelCase.

The point was to be consistent. Then let's change all the remaining keys to camel case: is_admin -> isAdmin, sessionid -> sessionId, package_id -> packageId, dev-dependencies -> devDependencies, git-tag -> gitTag.
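
A small Python sketch of such a renaming pass, assuming keys are separated by underscores or hyphens. The helpers `to_camel_case` and `camelize_keys` are hypothetical; keys without a separator (such as `sessionid`) cannot be split automatically and would still need a manual rename, and keys starting with an underscore (such as MongoDB's `_id`) are left untouched.

```python
import re


def to_camel_case(key):
    """Convert a snake_case or kebab-case key to camelCase."""
    if key.startswith("_"):
        return key  # leave reserved keys such as "_id" alone
    head, *tail = re.split(r"[_-]+", key)
    return head + "".join(part[:1].upper() + part[1:] for part in tail)


def camelize_keys(document):
    """Return a copy of a document with all top-level keys in camelCase."""
    return {to_camel_case(key): value for key, value in document.items()}


old = {"_id": 1, "is_admin": False, "package_id": 7,
       "dev-dependencies": [], "git-tag": "v1.7.0", "createdAt": None}
print(camelize_keys(old))
```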

  1. Also I think the maintainer field in namespace should be maintainers (plural) to indicate that it's an array.

  2. Collection names in MongoDB are supposed to be plural, hence it should be namespaces instead of namespace.

  3. What about loginAt and logoutAt vs. lastLogin and lastLogout? The former sound quite arbitrary.

  4. I had considered adding a role, but that could give unwanted access to the entire cluster/collection. We would have to be very specific regarding the roles, thus I had considered it best to discuss it and then add it.

I can't see the point how isAdmin is safer than a role, but I'm happy to talk about it later during the call.

  1. No, but I had added that to ease the dependency resolution on your side.

I don't quite understand how a second id would help "dependency resolution on my side". Could you please clarify that a little further for me?

  1. No, it would be an unauthenticated GET request only. We might consider saving the number of users/GET requests for a particular package, but we won't be saving the users (specifically).

This seems to be part of analytics, so I thought we had decided to defer it until later. Besides, if it's not actual users with ids as indicated, then it should be named something as numberOfDownloads or downloadsCount.

  1. I have found a solution by adding a new field, tags, to the packages collection; this would resolve this problem of searching.

I'm not sure whether an application (or a plugin) can be, e.g., a library, too. If yes, then using an array is definitely the way to go. However, I'm still in favor of calling it types (or type, respectively) because it is less generic than tags.

@henilp105
Member Author

@minhqdao Some points to refer to:

  1. Also I think the maintainer field in namespace should be maintainers (plural) to indicate that it's an array.
  2. Collection names in MongoDB are supposed to be plural, hence it should be namespaces instead of namespace.
  3. What about loginAt and logoutAt vs. lastLogin and lastLogout? The former sound quite arbitrary.

  1. Surely, but sir, the major purpose was to discuss the database schema, not to focus on the nomenclature. Names can always be easily changed.

I'm not sure whether an application (or a plugin) can be, e.g., a library, too. If yes, then using an array is definitely the way to go. However, I'm still in favor of calling it types (or type, respectively) because it is less generic than tags.

  1. Yes, considering all those cases, I had added an array.

This seems to be part of analytics, so I thought we had decided to defer it until later. Besides, if it's not actual users with ids as indicated, then it should be named something as numberOfDownloads or downloadsCount.

  1. This has been deferred, thus I have not added any fields to the packages collection. (I was answering because you had asked.)

I can't see the point how isAdmin is safer than a role, but I'm happy to talk about it later during the call.

  1. In MongoDB, granting a user the admin role may give them access to inadvertently delete the entire cluster/collection, whereas our own field is only interpreted by our code, which would restrict the user's ability to write to or edit the entire cluster.

I don't quite understand how a second id would help "dependency resolution on my side". Could you please clarify that a little further for me?

  1. I am considering not exposing the _id field of any collection through the API, so MongoDB's auto-generated id won't be accessible to the API on your side. For version resolution on the backend side I would use the _id field. Thus, for different versions of the dependencies, it would be easier for you to find/fetch the dependencies on your side. (Refer to API Routes for website and local registry/fpm #5 for more details.)

I am mainly targeting the APIs currently, but I would like to have an idea of how you would integrate them into the terminal/fpm.
(For consistency I have added JSON as the response format, but I would be happy to make/add custom APIs for the fpm/terminal side for more access, and I am also considering adding rate limiting later.)

Thanks and Regards,
Henil Panchal

@everythingfunctional
Member

Based on discussions so far, I had a couple of thoughts.

In my experience designing database schemas (which is fairly limited), it's (usually) better for each item to say which collection it belongs to than for a collection to say which items it contains. Kind of the opposite from the way you'd typically design data structures in a programming language. And for many-to-many relationships it's often beneficial to have a separate table to track those (i.e. a user may maintain many packages, and packages may have many maintainers).
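
A minimal Python sketch of the separate link table for a many-to-many relationship, with a plain list standing in for a collection. The collection name `maintainerships`, its fields, and both helpers are assumptions for illustration.

```python
# Each link document records one (user, package) maintainer relationship.
maintainerships = [
    {"user": "u1", "package": "p1"},
    {"user": "u1", "package": "p2"},
    {"user": "u2", "package": "p1"},
]


def packages_maintained_by(user_id):
    """All packages a given user maintains."""
    return sorted(link["package"] for link in maintainerships
                  if link["user"] == user_id)


def maintainers_of(package_id):
    """All users who maintain a given package."""
    return sorted(link["user"] for link in maintainerships
                  if link["package"] == package_id)


print(packages_maintained_by("u1"))
print(maintainers_of("p1"))
```

With this layout the relationship is stored exactly once, so the user-side and package-side views cannot disagree.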

You will almost instantly want users with different levels of permissions/access. Having a role field will be easier to extend. I'm guessing you'll initially want roles like admin (can change user roles), curators (can move packages between namespaces, change maintainers, etc.), and users (can create namespaces and packages). At some point the permissions system gets real complicated, and you'll probably want a many-to-many relationship between users and roles, you'll have kinds of operations, and you'll just have a single point of entry "is this user allowed to perform this kind of action" that deals with all the complexity, but you can probably get by initially with just a "role" field for a user.
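
The single point of entry described above could start out as small as the following Python sketch. The action names, the `ROLE_PERMISSIONS` mapping, and the choice that roles do not inherit from each other (an admin who should also create packages carries the `user` role too) are assumptions, not decisions taken in this thread.

```python
# Hypothetical role-to-action mapping, loosely following the roles named above.
ROLE_PERMISSIONS = {
    "user": {"create_namespace", "create_package"},
    "curator": {"move_package", "change_maintainers"},
    "admin": {"change_user_role"},
}


def is_allowed(user, action):
    """Single entry point: may this user perform this kind of action?"""
    return any(action in ROLE_PERMISSIONS.get(role, set())
               for role in user.get("role", []))


alice = {"name": "alice", "role": ["user"]}
root = {"name": "root", "role": ["user", "admin"]}
print(is_allowed(alice, "create_package"))
print(is_allowed(alice, "change_user_role"))
print(is_allowed(root, "change_user_role"))
```

Because every caller goes through `is_allowed`, the mapping can later grow into a many-to-many users-to-roles model without touching the call sites.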

@henilp105
Member Author

@everythingfunctional Sure sir, we can add them. Actually, we had discussed this point briefly in the last meeting and had agreed upon 3 roles (user, maintainer, admin). But I would still be cautious about granting the various privileges to roles (we would have to discuss them).

@minhqdao
Contributor

minhqdao commented Dec 8, 2022

Surely, but sir, the major purpose was to discuss the database schema, not to focus on the nomenclature. Names can always be easily changed.

I was just thinking that it's easier for you to have it there right away instead of postponing part of the discussion for another time.

Considering the response format, I think that json should be the standard. The main goal in the beginning will be the resolution of a package, and the entire document should be parsed as a json. Everything that follows (deserialization of the JSON, mapping to objects, handling of the objects, such as retrieving the tarball from the bucket via its url etc.) will be done in the frontend.

However, I'm still somewhat puzzled by releases. Are these meant to be dev, beta, stable etc.? So they would also have different tarballs?

Custom APIs will make sense later, e.g. in combination with fpm search. We could parse a nicely formatted response right away instead of a JSON that would require formatting in the frontend.

@henilp105
Member Author

Considering the response format, I think that json should be the standard. The main goal in the beginning will be the resolution of a package, and the entire document should be parsed as a json. Everything that follows (deserialization of the JSON, mapping to objects, handling of the objects, such as retrieving the tarball from the bucket via its url etc.) will be done in the frontend.

  1. I have made a unique API for fetching and processing the packages on the backend side, so IMO these things won't need to be done on the frontend side (detailed in API Routes for website and local registry/fpm #5). Also, I would not want to send the entire document of a collection as a response, only the required information as JSON. We had also discussed initially doing it on the fpm/terminal side in the last meeting.
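
Sending "only the required information" can be sketched in Python as an allow-list projection applied before serializing. The tuple `PUBLIC_PACKAGE_FIELDS`, the helper `public_view`, and the sample document are hypothetical; which fields are actually public is not settled in this thread.

```python
import json

# Assumed allow-list of fields that are safe to expose to clients.
PUBLIC_PACKAGE_FIELDS = ("name", "version", "license", "description")


def public_view(package_doc):
    """Keep only allow-listed fields; _id and internal fields are dropped."""
    return {field: package_doc[field]
            for field in PUBLIC_PACKAGE_FIELDS if field in package_doc}


stored = {
    "_id": "internal-object-id",
    "name": "demo-package",
    "version": "0.1.0",
    "license": "MIT",
    "maintainers": ["internal-user-id"],
}
print(json.dumps(public_view(stored)))
```

An allow-list fails closed: a field added to the stored document later stays private until it is explicitly listed.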

However, I'm still somewhat puzzled by releases. Are these meant to be dev, beta, stable etc.? So they would also have different tarballs?

  1. I am a bit confused (refer to the above discussion with @milancurcic sir regarding this), thus I have also added this to today's agenda.

Thanks and Regards,
Henil Panchal

@henilp105
Member Author

@fortran-lang/admins @fortran-lang/fpm @minhqdao

  1. Should we keep gitTag and git url in the packages collection, or should we do it like PyPI or crates.io, where we keep only packages that are uploaded by users?
  2. Should we consider having a file like fpm.lock, like Rust's Cargo.lock or requirements.txt in Python, so we can ease the dependency resolution on the backend side?
  3. Should we make it compulsory for a user to either make a new namespace for their package or to link the package to a pre-existing namespace?

@everythingfunctional
Member

  • Should we keep gitTag and git url in packages collection, or should we keep it like pypi or crates where we keep only packages which are uploaded by users only?

Not really. Where a package tarball comes from is not the registry's responsibility. I think we should allow maintainers to specify a URL that gets displayed on a package's page for informational purposes, but it shouldn't have any influence on system behavior.

  • Should we consider having file like fpm.lock like rust's cargo.lock or requirements.txt in python , so we can ease the dependency resolution on the backend side ?

I think that's a feature that would be useful for developers, but shouldn't be used for version constraint resolution by the registry. It would result in unresolvable version conflicts.

  • Should we make it compulsory for a user to either make a new namespace for his package or to link package to a pre-existing namespace ?

I think the answer is yes, all packages must reside in a namespace.

@minhqdao
Contributor

Should we keep gitTag and git url in packages collection, or should we keep it like pypi or crates where we keep only packages which are uploaded by users only?

The idea of the registry is that its packages do not depend on any git repository. However, I would prefer having a git link so the user can easily find the repository in case the user needs more information, wants to open an issue, contribute etc.

@everythingfunctional
Member

However, I would prefer having a git link so the user can easily find the repository in case the user needs more information, wants to open an issue, contribute etc.

I agree, but I wouldn't name it git. While projects using git and hosted on GitHub are likely to be the most common, there's no requirement that a package use them. I'd just call it homepage, or something similar.

@minhqdao
Contributor

So will different releases have different version numbers? I'd prefer that for simpler version resolution.

@henilp105
Member Author

henilp105 commented Dec 12, 2022

@minhqdao yes, I had also been considering the same.

@milancurcic
Member

Sorry for the confusion and late response. This has probably been resolved through the discussion so far. What I meant by "release" is synonymous with "version", e.g. "0.1.0", "0.1.1", "0.2.0" etc. This would map to a git tag if the package is maintained on GitHub. I don't know if there's a need for a "release" that is separate from "version".

@minhqdao
Contributor

Yes, then a release would be same as a version here and we can drop release. Every version has its own tarball and manifest, @henilp105 has taken care of that.

Seems like we'd now stick to the major.minor.patch format + semantic versioning?
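
For the plain major.minor.patch format, version ordering can be sketched in a few lines of Python. This deliberately ignores the pre-release and build-metadata parts of full semantic versioning, and the helper names `parse_version` and `latest` are hypothetical.

```python
def parse_version(text):
    """Parse 'major.minor.patch' into a tuple of three ints."""
    parts = text.split(".")
    if len(parts) != 3 or not all(p.isascii() and p.isdigit() for p in parts):
        raise ValueError(f"not a major.minor.patch version: {text!r}")
    return tuple(int(p) for p in parts)


def latest(versions):
    """Return the highest version, comparing numerically rather than as text."""
    return max(versions, key=parse_version)


print(latest(["0.1.0", "0.10.0", "0.2.0"]))
```

Comparing the parsed tuples matters because a plain string comparison would rank "0.2.0" above "0.10.0".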

@perazz

perazz commented Dec 18, 2022

I think one missing point of this discussion is what level of integration the backend database should enable with fpm. MongoDB has a C API, so it would be straightforward to implement Fortran wrappers. However, that is additional work, while other choices (SQLite) have Fortran APIs already.

@henilp105
Member Author

@perazz we had considered using the backend (Flask) to interact with MongoDB and then using REST APIs to communicate with fpm. For example, when a user searches for a package, a GET request would be sent to the backend (Flask), which would search in MongoDB, resolve the dependencies on the backend side, and then return the results and packages.
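
The "resolve the dependencies on the backend side" step could, in its simplest form, look like the following framework-free Python sketch: a depth-first walk that returns a package together with its transitive dependencies, dependencies first. The in-memory `index`, the package names, and the function `resolve` are assumptions; version constraints are ignored entirely, and the Flask route and MongoDB query are not shown.

```python
def resolve(name, index):
    """Return the package and its transitive dependencies, dependencies first."""
    order, visiting, done = [], set(), set()

    def visit(pkg):
        if pkg in done:
            return
        if pkg in visiting:
            raise ValueError(f"dependency cycle involving {pkg!r}")
        if pkg not in index:
            raise KeyError(f"unknown package {pkg!r}")
        visiting.add(pkg)
        for dep in index[pkg]:
            visit(dep)
        visiting.discard(pkg)
        done.add(pkg)
        order.append(pkg)

    visit(name)
    return order


# Hypothetical index: package name -> names of its direct dependencies.
index = {
    "app": ["lib-a", "lib-b"],
    "lib-a": ["lib-c"],
    "lib-b": ["lib-c"],
    "lib-c": [],
}
print(resolve("app", index))
```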

@everythingfunctional
Member

I think one missing point of this discussion is what level of integration the backend database should enable with fpm.

I'm going to guess there are security implications for why we wouldn't want to expose the database directly.

@perazz

perazz commented Dec 18, 2022

So the dependencies wouldn’t be sorted out by fpm, but the package maintainer would need to specify them in the manifest? That seems what msys does afaict

@milancurcic
Member

Indeed, a client such as fpm or a browser frontend should never access a centralized DB directly because that would require the DB secrets (access keys) to be made available to the user. To solve this problem, a backend holds the DB secrets and sits between the client and the DB.

@arteevraina
Member

  1. Package Entity.
{
  "name": "package name",
  "createdAt": "Python datetime object"
}

Does this createdAt field inside the Package model give the datetime object of the first time this package was created by the author? If so, then I think we should also think of a way to track the publishing date of each release instead of saving only that of the package as a whole. So I think every version should have a publishedAt datetime object connected to it.

  1. We should think to add one more field to the Package Entity called likesCount where each authorized user can like the package (from the registry frontend) and this will help us in ranking the more popular packages of the same category based upon the likesCount. pub.dev registry also has this feature that helps the developer trust a package before using it.

  2. If for example, a User object has a "role": ["maintainer"], then should we never expect the value of "maintainerOf" key to be an empty array of the same User object?

@arteevraina
Member

Taking some inspiration from pub.dev again, how about having an isDiscontinued flag inside the Package class?

This might be useful for the packages that are already published and other projects are dependent on them but since they cannot be maintained, Author or any user with some amount of authorization can mark it as discontinued.

@minhqdao
Contributor

Yes, I also like the idea of an isDeprecated flag and a likesCount. These are also things that can easily be added later on.

@perazz

perazz commented Dec 20, 2022

@awvwgk sure sir, either we could make a new collection namespace and link all the packages into it, or we could also add namespace to the packages collection as a field?

I like the idea of a namespace to be a tag/field that could be either used to identify unique packages with the same name, or group packages into collections.

@arteevraina
Member

@awvwgk sure sir, either we could make a new collection namespace and link all the packages into it, or we could also add namespace to the packages collection as a field?

I like the idea of a namespace to be a tag/field that could be either used to identify unique packages with the same name, or group packages into collections.

Looks good. But, I am not able to understand whether that would help @minhqdao and you in fpm side actually other than searching.

For example: Just like @awvwgk mentioned different orgs have their own stdlib so then there will be a conflict on the fpm side.

@minhqdao
Contributor

I think we've agreed on having namespaces collections and grouping the packages as their children, which looks like a good solution to me.

In fpm, we'd do sth like mapping a dependency to a dependency type and adding an additional namespace field. Then we'd make sure that there are no two dependencies with both same package and namespace name. Something like that.
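
The uniqueness rule on the (namespace, package name) pair can be sketched in Python as follows. fpm itself is written in Fortran, so this only illustrates the check; the dependency layout and the helper `duplicate_dependencies` are hypothetical.

```python
from collections import Counter


def duplicate_dependencies(dependencies):
    """Return (namespace, name) pairs that appear more than once."""
    counts = Counter((dep["namespace"], dep["name"]) for dep in dependencies)
    return sorted(pair for pair, n in counts.items() if n > 1)


deps = [
    {"namespace": "org-a", "name": "stdlib"},
    {"namespace": "org-b", "name": "stdlib"},  # same name, other namespace: fine
    {"namespace": "org-a", "name": "stdlib"},  # exact duplicate: reported
]
print(duplicate_dependencies(deps))
```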

@arteevraina
Member

arteevraina commented Jan 9, 2023

We are thinking to add pending_requests [list] as a field to the users class. This list will contain the package_id's of the packages whose author has asked the user to join in as a maintainer.

This will help to approve and reject the request in the backend.

So, for example :

if the user approves to join in as maintainer for package_id -> package_id gets removed from the pending_requests list and that package_id gets added to the maintainerOf list.

else -> just remove the package_id from pending_requests list.

@minhqdao @perazz @henilp105

@henilp105
Member Author

henilp105 commented Jan 10, 2023

if the user approves to join in as maintainer for package_id -> package_id gets removed from the pending_requests list and that package_id gets added to the maintainerOf list.

@arteevraina we would also have to add the user's _id to the package's maintainers field.
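
Putting the approve/reject flow together with this addition, a minimal Python sketch on plain dictionaries might look like this. It uses the `pendingRequest` spelling from the schema at the top of the thread; the function `respond_to_request` and the sample ids are hypothetical, and a real backend would presumably apply the two changes as database updates rather than in-memory mutations.

```python
def respond_to_request(user, package, approve):
    """Apply a user's answer to a pending maintainer invitation (mutates both dicts)."""
    package_id = package["_id"]
    if package_id not in user["pendingRequest"]:
        raise ValueError("no pending request for this package")
    # Approved or rejected, the request is no longer pending.
    user["pendingRequest"].remove(package_id)
    if approve:
        if package_id not in user["maintainerOf"]:
            user["maintainerOf"].append(package_id)
        # Keep the package side in sync with the user side.
        if user["_id"] not in package["maintainers"]:
            package["maintainers"].append(user["_id"])


user = {"_id": "u2", "pendingRequest": ["p1", "p2"], "maintainerOf": []}
p1 = {"_id": "p1", "maintainers": ["u1"]}
p2 = {"_id": "p2", "maintainers": ["u1"]}

respond_to_request(user, p1, approve=True)
respond_to_request(user, p2, approve=False)
print(user, p1, p2)
```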

@henilp105
Member Author

@arteevraina we would also have to make the user's name field unique as we would have to show the user's page on /users/<username> with the packages and details (if user permits).

@arteevraina
Member

arteevraina commented Mar 7, 2023

I think it makes sense to maintain a list of admin and maintainer ids in a namespace document. We already have the roles field in the user's document, but that does not indicate which namespace the user is a maintainer/admin of. It only says that the user is a maintainer/admin, without enough information regarding the namespace.

@henilp105


8 participants