Scalability issue with missing feature to request "all offers for a known asset ID" #10

Closed
matgnt opened this issue Jan 30, 2023 · 13 comments

@matgnt
Collaborator

matgnt commented Jan 30, 2023

After receiving access to the repo last week, I'm now trying to dig deeper into the protocol spec. One question I have concerns the scalability of requests when the AssetId is already known. In the 'old' specification, I understood that it was part of the IDS protocol to request 'all offers for 1 asset' by adding the 'requestedElement' to the request to the catalog service.

Ref:
https://github.com/International-Data-Spaces-Association/IDS-G/tree/main/Communication/sequence-diagrams/data-connector-to-data-connector

With the 'new' spec, it looks to me like this part is 'outsourced' to the implementation via the very general 'ids:filter' expression:
https://github.com/International-Data-Spaces-Association/ids-specification/blob/main/catalog/message/catalog.request.message.json

I see this as crucial for scaling. Catalogs can become VERY big, and transferring the whole content even when the requester already knows the asset id is a problem. From my perspective, this scalability-critical part should not be outsourced to implementations.

My proposal is NOT to define all filters in IDS, but specify a way to 'filter' for one specific assetId only.

Any thoughts on this? Unfortunately, I cannot join the Thursday meeting this week because of a Catena-X Workshop. Maybe you can comment here if you have thoughts on this.

Thanks in advance,
Matthias Binzer

@jimmarino
Contributor

jimmarino commented Feb 1, 2023

This can be handled by a filter, so I don't think it is necessary to define any additional mechanism.

If the client already knows the dataset id, why does it need to make a further catalog request? In DCAT, offers are referenced from the Dataset (and are generally contained, although with RDF this can be expanded), so a client will always know which offers are available for a data set once it has access to the latter. All subsequent operations can be performed with just the asset id and offer id.

Also, the client presumably would know about a data set from a previous request and can cache it if needed.

@matgnt
Collaborator Author

matgnt commented Feb 2, 2023

Hi @jimmarino
yes, this is exactly the issue here. The spec says:

The CatalogRequestMessage may have a filter property which contains an implementation-specific query or filter expression type supported by the catalog service.

And without this implementation-specific filter, I could never get the offer id for a specific asset WITHOUT fetching the catalog (and parsing it on the consumer side...), right? And as you also say, the offer id is required for further operations (negotiation).

@jimmarino
Contributor

That's up to the specific implementation to support. Given the wide variety of query languages, filter expressions, etc., the testing burden this would entail, and higher priorities, we decided to refrain from standardizing the contents of the filter expression. Implementations can provide something similar to what you are describing and be fully spec compliant.

Note also that at some point, the client will need to obtain the catalog from the provider, either explicitly via a request or through some form of "implicit" context. Do you have a concrete use case that details the scenario you are alluding to?

@juliapampus
Contributor

juliapampus commented Feb 2, 2023

The catalog request provides a filter property:

{
  "@type": "ids:CatalogRequestMessage",
  "ids:filter": {}
}

The spec does not pre-define what filters are allowed. If you have a catalog implemented that allows searching for assets by ID, the request could look like this:

{
  "@type": "ids:CatalogRequestMessage",
  "ids:filter": {
    "id": "ASSET_ID"
  }
}

Or you may pre-define that an SQL statement is allowed. Then the response would not have to be a catalog with n objects but would contain only one asset:

{
  "@type": "dcat:Catalog",
  [...]
  "dcat:dataset": [
    {
      "@id": "ASSET_ID",
   [...]

It is just not part of the basic spec, but is allowed by the filter property.
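
As an illustration only, a catalog service that honours such an id filter could be sketched in Python roughly as follows. The filter key "id" follows the example above; the function name apply_id_filter, the in-memory catalog, and the rule that an empty filter returns everything are assumptions made for this sketch, not part of the spec.

```python
from typing import Any


def apply_id_filter(catalog: dict[str, Any], request: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the catalog restricted by the request's filter.

    Assumption: the only supported filter is {"id": <dataset id>};
    an empty or missing filter returns the full catalog.
    """
    wanted = (request.get("ids:filter") or {}).get("id")
    datasets = catalog.get("dcat:dataset", [])
    if wanted is not None:
        # Keep only the dataset whose "@id" matches the requested id.
        datasets = [d for d in datasets if d.get("@id") == wanted]
    return {**catalog, "dcat:dataset": datasets}


# Hypothetical two-entry catalog and a request filtering for one asset.
catalog = {
    "@type": "dcat:Catalog",
    "dcat:dataset": [{"@id": "ASSET_ID"}, {"@id": "OTHER_ID"}],
}
request = {"@type": "ids:CatalogRequestMessage", "ids:filter": {"id": "ASSET_ID"}}
filtered = apply_id_filter(catalog, request)
```

With this request, the response carries a single dataset instead of the whole catalog.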

@matgnt
Collaborator Author

matgnt commented Feb 2, 2023

I think it could also be an option to NOT make this a 'filter' - because of the variety of filter expressions you mentioned. I would agree here. But then we need to find another solution to get all offers for a specific asset id.

The use case is how we currently do it in bigger parts of Catena-X. We have digital twins, registered in the twin registry (all according to the Platform I40 'Asset Administration Shell' (AAS) specification...). Now, those twins get an EDC in front of them. That means a big part of what the 'catalog' would do in a pure EDC world is done in the AAS Registry, namely finding the Asset Id. So we always know the asset id but, of course, not the offer id, since this is dynamically created.

Fetching the whole catalog - as we do it right now - is an overhead that doesn't make sense. Doing this in pages and even caching doesn't solve the root cause of the problem. Now, we have ways to filter for the asset id, BUT this is EDC implementation specific - and this is what I think is not good. We would heavily depend on IMPLEMENTATION decisions instead of PROTOCOL decisions for a major part of our solution.
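
For illustration, the consumer-side workaround described above (page through the catalog, then search it for the known asset id) can be sketched in Python. This is a rough sketch under assumptions: the function name offer_ids_for_asset and the page structure are hypothetical, and the property "odrl:hasPolicy" for the offers of a dataset is assumed here, since the thread only states that offers are referenced from the dataset.

```python
from typing import Any, Iterable


def offer_ids_for_asset(pages: Iterable[list[dict[str, Any]]], asset_id: str) -> list[str]:
    """Scan catalog pages for one dataset and collect its offer ids.

    Assumption: each dataset lists its offers under "odrl:hasPolicy".
    """
    for page in pages:  # in the worst case every page is transferred
        for dataset in page:
            if dataset.get("@id") == asset_id:
                return [offer["@id"] for offer in dataset.get("odrl:hasPolicy", [])]
    return []


# Hypothetical catalog split into two pages.
pages = [
    [{"@id": "asset-1", "odrl:hasPolicy": [{"@id": "offer-a"}]}],
    [{"@id": "asset-2", "odrl:hasPolicy": [{"@id": "offer-b"}, {"@id": "offer-c"}]}],
]
found = offer_ids_for_asset(pages, "asset-2")
```

The cost grows with the size of the catalog rather than with the single dataset of interest, which is the overhead in question.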

@juliapampus
Contributor

juliapampus commented Feb 2, 2023

IMO it's not implementation specific, because C-X could define what a catalog request filtering for an assetID has to look like, not on implementation level but on protocol level. It could be defined, e.g., that 4 filters are pre-defined, 1 is mandatory to support (filtering by assetId), and the connectors (resp. catalog services) are allowed to implement x use-case-/system-specific filters. So any connector in your project could be implemented and used, following the "C-X-flavored" message schemes. This is the idea of IDS only defining the core and allowing for adaptations.

Still, I can follow that a filter by ID may be the most basic filter that should be pre-defined and set as mandatory by default...
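
A data-space profile of this kind could be sketched in Python as below. Everything here is hypothetical and only meant to illustrate the idea of one mandatory filter plus optional ones: the class FilterProfile, the constant CX_PROFILE, and the optional filter names are invented for this example and are not defined by IDS or Catena-X.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterProfile:
    """Filters a data space pre-defines on protocol level (hypothetical)."""

    mandatory: frozenset[str]
    optional: frozenset[str] = frozenset()

    def missing(self, supported: set[str]) -> set[str]:
        """Mandatory filters that a connector does not support."""
        return set(self.mandatory - supported)


# 4 pre-defined filters, 1 of them mandatory (names are made up).
CX_PROFILE = FilterProfile(
    mandatory=frozenset({"assetId"}),
    optional=frozenset({"name", "type", "createdAfter"}),
)

# A connector may add its own system-specific filters on top.
connector_a = {"assetId", "vendorSpecific"}
connector_b = {"name"}
missing_a = CX_PROFILE.missing(connector_a)
missing_b = CX_PROFILE.missing(connector_b)
```

Under these assumptions, connector_a conforms to the profile while connector_b lacks the mandatory assetId filter.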

@juliapampus
Contributor

juliapampus commented Feb 2, 2023

You will have the same need for a C-X-flavored vocabulary: an IDS dataset allows any attribute, but in C-X your assets may be domain-specific, and systems on both sides need to be able to process the information they receive. The same goes for the policies. That is a question of how to design the interfaces/levels of adoption. And we agreed at the very beginning that the first step of the spec only specifies what is absolutely essential for proper communication.

@jimmarino
Contributor

jimmarino commented Feb 2, 2023

The issue with defining what is inside the filter attribute is that it is a lot of work and difficult to get right, particularly given all of the other issues that need to be solved. We can't just define a "simple" mechanism. Consider the following questions that would arise:

  1. How are RDF namespaces handled when using a property?
  2. Do we only want to specify one language?
  3. Why not point to a standard query mechanism like SPARQL?
  4. If we do that, do we want to require implementations to support SPARQL?
  5. Why not Cypher or a SQL dialect?
  6. If one of the previous were used, how would they map to RDF (which DCAT is based on)?
  7. If we invent a filter language, what is the formal grammar?
  8. If we allow multiple filter languages, how is the filter language expressed in the request message?
  9. How would a filter language be tested as part of compliance verification?

By leaving this implementation specific, we allow future work to define filter expressions. I think it's also important to take a very conservative view of what to standardize to avoid premature standardization without concrete implementation experience. This approach also provides for implementation-specific innovation, which is important for standards to get uptake.

@ssteinbuss
Member

This seems to be a data-space-specific extension point. The document should highlight that this extension point (and maybe others) exists. For the time being, we might want to add examples to the document that guide the reader.

@ssteinbuss
Member

This discussion might take a while; @ssteinbuss will add an infobox for the time being.

@juliapampus
Contributor

juliapampus commented Feb 23, 2023

Catalog Protocol will be extended by a new message type:

{
   "@type":  "ids:DatasetRequestMessage",
   "@id": messageId,
   "ids:dataSet": datasetId
}

Catalog Protocol Https Binding will map this message type to

GET <connector>/catalog/dataset/:id 

with a response object of type dcat:Dataset
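
The mapping from the new message type to the HTTPS binding can be sketched in Python as follows. This is a minimal sketch, not the binding's definition: the function names, the example ids, and the host provider.example are hypothetical, and percent-encoding the dataset id in the path is an assumption not stated above.

```python
from urllib.parse import quote


def dataset_request_message(message_id: str, dataset_id: str) -> dict[str, str]:
    """Build the DatasetRequestMessage shown above as a plain dict."""
    return {
        "@type": "ids:DatasetRequestMessage",
        "@id": message_id,
        "ids:dataSet": dataset_id,
    }


def to_http_get(connector_base_url: str, message: dict[str, str]) -> str:
    """Map the message to GET <connector>/catalog/dataset/:id.

    Assumption: the dataset id is percent-encoded as one path segment.
    """
    dataset_id = quote(message["ids:dataSet"], safe="")
    return f"{connector_base_url.rstrip('/')}/catalog/dataset/{dataset_id}"


# Hypothetical ids and connector address.
message = dataset_request_message("urn:uuid:msg-1", "urn:uuid:dataset-1")
url = to_http_get("https://provider.example/", message)
```

A client that already knows the dataset id then needs one request for one dcat:Dataset instead of fetching and parsing the full catalog.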

@ssteinbuss
Member

@sebbader-sap requested an additional UML picture and will provide it ;-)

@matgnt
Collaborator Author

matgnt commented Mar 29, 2023

Closing this issue, since the changes have been merged into the 0.8 release already.

@matgnt matgnt closed this as completed Mar 29, 2023