From 487fcbf0e3b526ede4c92be329c46ae25783391b Mon Sep 17 00:00:00 2001 From: honghaoq Date: Mon, 18 Apr 2022 18:35:54 -0700 Subject: [PATCH 01/15] Indexer FIP --- FIPS/IndexerFIP.md | 227 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 227 insertions(+) create mode 100644 FIPS/IndexerFIP.md diff --git a/FIPS/IndexerFIP.md b/FIPS/IndexerFIP.md new file mode 100644 index 000000000..a99428b4b --- /dev/null +++ b/FIPS/IndexerFIP.md @@ -0,0 +1,227 @@ +# Indexer FIP + +[Indexer FIP](https://www.notion.so/7d156be0cd5643269241cf2d2e13f25e) + +## **Simple Summary** + +Indexers store mappings of content multi-hashes to provider data records. A client that wants to know where a piece of information is stored can query the indexer, using the CID or multi-hash of the content, and receive a provider record that tells where the client can retrieve the content and via which data transfer protocols. +Content can be queried at [cid.contact](http://cid.contact). + +## **Abstract** + +Filecoin stores a great deal of data, but without proper indexing of that data, clients cannot perform efficient retrieval. To improve Filecoin content discoverability, Indexer nodes are developed to store mappings of CIDs to content providers for content lookup upon retrieval request. + +To provide index to indexer node, there is an [Indexer Provider](https://lotus.filecoin.io/storage-providers/operate/index-provider/) service that runs alongside with Storage Provider, which tells the Indexer what contents this storage provider has. Index Provider sends updates to the Indexer via a series of [Advertisement](https://github.com/filecoin-project/storetheindex/blob/main/api/v0/ingest/schema/schema.ipldsch) messages. Each message references a previous advertisement so that as a whole it forms an advertisement chain. The state of the indexer is a function of consuming this chain from the initial Advertisement to the latest one, and delta of indexer content is updated periodically. 
+ +Indexer enables content discovery within Filecoin network today, it is also a fundamental component toward the end goal of universal retrieval across IPFS/Filecoin, i.e. to be able to efficiently retrieve Filecoin content from IPFS. Combining Indexer work for content discovery with IPFS/Filecoin interoperability work for content transfer, we aim to reach the end state of efficient and universal retrieval across IPFS and Filecoin network. + +## **Change Motivation** + +There are many reasons why Indexer is critical to the success of Filecoin network: + +1. **Content Discovery and Retrieval**: +Running Indexer is required for efficient content discovery and retrieval for Filecoin network, since Filecoin needs a CID to Storage Provider mapping to enable content retrieval. It is also an important step toward retrieval across IPFS and Filecoin, as Indexer will index both IPFS and Filecoin data and support efficient and universal retrieval across IPFS and Filecoin network as the end state. +2. **Better usage and growth of the network**: +By making data more accessible to the network, we could increase Filecoin data usage, which helps drive up adoption of the network to onboard more data, and then make more data discoverable and retrievable - it is a positive growth flywheel. + +## **Terminology** + +Before we dive into the design details, here are a list of concepts we need to cover for Indexer: + +- **Advertisement**: A record available from a publisher that contains, a link to a chain of multihash blocks, the CID of the previous advertisement, and provider-specific content metadata that is referenced by all the multihashes in the linked multihash blocks. The provider data is identified by a key called a context ID. +- **Announce Message**: A message that informs indexers about the availability of an advertisement. This is usually sent via gossip pubsub, but can also be sent via HTTP. 
An announce message contains the advertisement CID it is announcing, which allows indexers to ignore the announce if they have already indexed the advertisement. The publisher's address is included in the announce to tell indexers where to retrieve the advertisement from. +- **Context ID**: A key that, for a provider, uniquely identifies content metadata. This allows content metadata to be updated or delete on the indexer without having to refer to it using the multihashes that map to it. +- **Entries:** Represents the list of multihashes the content hosted by the storage provider. The entries are represented as an IPLD DAG, divided across a set of interlinked IPLD nodes, referred to as Entries Chunk. +- **Gossip PubSub**: Publish/subscribe communications over a libp2p gossip mesh. This is used by publishers to broadcast Announce Messages to all indexers that are subscribed to the topic that the announce message is sent on. For production publishers and indexers, this topic is `"/indexer/ingest/mainnet"`. +- **Indexer**: A network node that keeps a mappings of multihashes to provider records. +- **Metadata**: Provider-specific data that a retrieval client gets from an indexer query and passed to the provider when retrieving content. This metadata is used by the provider to identify and find specific content and deliver that content via the protocol (e.g. graphsync) specified in the metadata. +- **Provider**: Also called a Storage Provider, this is the entity from which content can be retrieved by a retrieval client. When multihashes are looked up on an indexer, the responses contain provider that provide the content referenced by the multihashes. A provider is identified by a libp2p peer ID. +- **Publisher**: This is an entity that publishes advertisements and index data to an indexer. It is usually, but not always, the same as the data provider. A publisher is identified by a libp2p peer ID. 
+- **Retrieval Addresses:** A list of addresses included in each advertisement that points to where you can retrieve from the advertised content. +- **Retrieval Client**: A client that queries an indexer to find where content is available, and retrieves that content from a provider. +- **Sync** (indexer with publisher): Operation that synchronizes the content indexed by an indexer with the content published by a publisher. A sync is initiated when an indexer receives and announces message, by an administrative command to sync with a publisher, or by the indexer when there have been no updates for a provider for some period of time (24 hours by default). + +## **Specification** + +**Data providers** have local indices of their content. They want to advertise the availability of this content so that any consumers can easily find it. Data providers will also want to revoke advertisements of content that they no longer provide. + +**Indexer nodes** want to discover data providers and to track updates provided by the data providers. They also want to handle incoming content requests and resolve them to providers who have that content. Indexers may reroute client requests to other indexers if they do not handle that content. + +**Clients** want to issue query style requests for content to an indexer node and receive a set of providers that have that content. The response should include any other information about how to choose between providers as well as information that the provider may have requested to present to them to authenticate the request? (e.g. deal ID) + +Indexer nodes discover new data providers and receive notification of content updates, by receiving notifications published on a known gossip pub-sub topic. Indexers can also discover this by looking at on-chain storage provider activity. The notifications let indexers know that new data is *available* and that index entries can be fetched from an existing or a new provider. 
The indexer decides if and when to pull the actual index data, and which kind of index, from a provider, based on its own policies pertaining to the data provider. + +**Overview design:** + +![](https://s3.us-west-2.amazonaws.com/secure.notion-static.com/44c88146-a25c-4732-92df-1c98b814bf63/Untitled.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIAT73L2G45EIPT3X45%2F20220419%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20220419T012946Z&X-Amz-Expires=86400&X-Amz-Signature=0f8321e348f5a6a8a275b069d179460f51ef208a7c1ef0987e149e6eda34055e&X-Amz-SignedHeaders=host&response-content-disposition=filename%20%3D%22Untitled.png%22&x-id=GetObject) + +**Data Provier Interface:** + +Data providers will have local indices of their content. Abstractly, we can think of the resources exposed by a data provider as a tree of the following: + +```jsx +Catalog (List of Indexes) +Catalog/ (Individual Index) +Catalog//multihashes (list of multihashes in the Index) + +Catalog//TBD (semantic for selection) +``` + +A provider has a Catalog of one or more indices, where the current index is the set of all multihashes known to the provider, and the other indices are past versions of it, each representing the multihashes at some previous time. Semantically, changes to the catalog of indexes by the provider are seen as happening in a globally ordered log of additions and revocations, each referencing the previous action. lndexers are able to track these changes to keep their view of a provider's Index current. + +**Advertisements:** + +Index providers will publish advertisements on new indices on the `ContentIndex` gossipsub pub-sub channel, which indexer nodes can subscribe to. The format of these advertisement messages of catalog entries looks like this: + +```jsx +{ Index: , + Previous: , + Provider: , + Signature: +} +``` + +When an indexer sees an advertisement, the indexer checks to see if it already has the latest Index ID. 
If it does, then the advertisement is ignored. Otherwise, the indexer sends a request to the data provider to get the set of changes starting from the previous Index ID in the advertisement up to the current index ID in the advertisement. The indexer internally resolves the libp2p.PeerID to current libp2p endpoints. + +Besides receiving advertisements over pub-sub, the indexer can request the latest advertisement directly from a data provider. This is useful when indexing for a discovered data provider that is not publishing advertisements. + +Note: The signature is computed over the current and the previous ID in each entry. This guarantees that the order is correct according to the signing provider. + +An index ID is always a multihash, extracted from a CID (since indexed content does not track how the content is encoded). This should be the multihash from the CID of a merkle-tree of content hosted by the provider. + +**Advertisement Chain:** + +The indexer keeps a record of the advertised Index multihashes it has received data for. The indexer continues walking the chain of Index multihashes backward until it receives a response containing an Advertisement with the `Previous` multihash being the last one the indexer has already seen, or when the end of the chain is reached (no Previous). When all the missing links in the chain of Indexes have been received, these are applied, in order, to update the indexer's records of multihashes and providers. + +**Index Request-Index Provider Response:** + +Over the Provider-Indexer protocol, an indexer requests the following: + +``` +{ "prev_index_id": , + "index_id": , + "start": n, // optional - start at this record (for pagination) + "count": n, // optional - maximum number of records to fetch +} +``` + +And indexer receives a possibly paginated response that has a subset of the change logs and the associated advertisement message: + +``` +{ "totalEntries": n, // size of provider's catalog + "error": "", // optional. 
indicate the request could not be served + "advertisement": { + Index: n3 , + Previous: n2 , + Provider: , Signature: + }, + "start": n, // starting entry number + "entries": [ + {"add_multihashes": [, ...], "metadata": "???", + "del_multihashes": [, ...]}, + {"add_multihashes": [, ...], "metadata": "???"}, +} +``` + +This is a set of changes that are applied to the index data, to go from the `prev_multihash` to the multihash in the request. Each entry contains a set of new multihashes that the provider provides, and possibly a set of multihashes to remove that are no longer provided. A limited-size (≤ 100 bytes) metadata field contains provider-specific data that pertains to the set of multihashes being added. The metadata is prefixed with a protocol ID number followed by data encoded data as per the protocol. The content of the meta data is up to the provider, but if more then the limited size is needed, then the metadata should contain an ID identifying mode complex data stored by the provider. + +It is worth noting that the same multihash may appear in entries of several different advertisements of a provider. For instance, if a Filecoin Storage Provider receives a new deal that includes a CID, that has a multihash, of some content it is already indexing, it may notify this update to the indexer node by including an entry for that multihash with new metadata (e.g. the new expiration date of the deal) in the next advertisement. +When the indexer node comes across an advertisement that includes a multihash that it is already indexing for the provider, it will simply update the metadata with that of the latest advertisement. Finally, if all the deals that include that multihash in a Storage Provider expire, it can notify the indexer of this by simply adding that multihash in the list of `del_multihashes` for the next advertisement. 
+ +The response may be paginated either by the client requesting a maximum number of entries or by the server delivering up to a configured maximum number of entries. To get the remaining entries, a subsequent request is made with its `start` set to the number of entries from the previous response. So, if `totalEntries` equals `100` and the response contained 50 entries, the follow-up request should specify `50` for `start` to get the remaining entries. This continues until the indexer has the complete response. + +**Client Interface:** + +The client interface is what indexer clients use to ask an indexer which providers are able to provide content identified by multihashes. Queries for a multihash should locate the providers within 10s of milliseconds. + +A client sends a request containing a set of multihashes. The indexer responds with a list of provider_ids for each multihash that was requested. + +Response data: + +``` +{ + "multihash_providers": { + : [ + {"provider_id": , "metadata": {???} }, ..], + : [ + {"provider_id": , "metadata": {???} }, ..], + : null, + ... + }, + "providers": [ + { "ID": , + "Addrs": [, ..] + }, + ... + ], + "providers_signature": +} +``` + +Response data contains encoded values returned from the in-memory cache. Responses may be paginated if there are a large number of values. + +## **Design Rationale** + +Key design considerations are: + +- **Indexer uses multihashes, not entire CIDs, to refer to indexed content** +The codec in a CID is independent of the content that is indexed or how the content is retrieved from a provider. +- **Indexer keeps an in-memory cache of multihashes that have been asked for by clients, unless configured for in-memory only operation** +Most multihashes in an index will not be requested, since clients will generally only ask for the top-level multihash for any object. Therefore only the set that is actually requested by clients is kept in memory. 
The remaining will be kept in space-efficient secondary storage. A generation cache eviction mechanism for the in-memory cache prevents unbounded growth. +- **[storethehash](https://github.com/ipld/go-storethehash) will serve as persistent storage** +Designed to efficiently store hash data, requires only two disk reads to get any data stored for a multihash. This is being evaluated against more mature systems because of a high confidence assumption that it will save significant space. However, that savings needs to be very significant to outweigh the risk and development cost of a new storage facility. The indexer will be constructed to allow different implementations to be chosen as appropriate for different deployments. Having an embedded storage implementation may reduce admin work and allow the indexer to be used more easily as a library. +- **Indexer will track multihash distribution and usage** +The indexer will keep statistics on multihashes that can be used to predict the growth rate of data stored and cached as well as the distribution of multihashes over different providers. The usage of multihashes will also be tracked and will show the demand (by clients) for various multihahses. +- **Indexer will not index `IDENTITY` Multihashes** +The Indexer will not index `IDENTITY` multihashes and filters out any such multihash from the received advertisements. See: [Appendix C, Handling of `IDENTITY` Multihashes](https://www.notion.so/Indexer-Node-Design-fbd1e7d3110c4b1fb154b31f2585e6ff). +- **Load minimization for Storage Provider** + + The indexer updates over graphsync, meaning once a day indexer connects and ask for delta of new CIDs found available on the nodes since last day, and store them to multiple providers, while indexer only need to get it from one of them. 
This saves bandwidth by deduping when CIDs are provided from multiple providers, so it will only take relatively small amount of bandwidth compared to actual data (Storage Provider can expect minimal bandwidth impact), with at most one or two additional connections that are short-lived for indexing. + + +## **Backwards Compatibility** + +This FIP does not change actors behavior so it does not require any Filecoin network update. + +## **Test Cases** + +The following test cases should be covered: + +- Index Provider functionality test: + - Ability to index storage deals + - Storage Provider able to publish index to an Indexer node +- Scalability test: indexer need to be able to handle index storage at scale (hundreds of billions of index) +- Availability test: + - 3-9s node uptime + - Responsiveness to indexer advertisement queries +- Latency test: + - Indexer query response latency: <300ms p95 + - Indexer ingestion latency: <60 sec p95 + +## **Security Considerations** + +This FIP does not touch underlying proofs or security. + +## **Incentive Considerations** + +No change to incentives. In the future this could support retrieval incentive. + +## **Product Considerations** + +No change to product considerations, except that increased content discoverability and retrieval capability is a net improvement to the Filecoin network. + +In the near future, the product should include controls on what contents can be retrieved, so Storage Providers would have the ability to turn off content that they don’t want to be accessible for retrieval. + +## **Implementation** + +Indexer: + +[https://github.com/filecoin-project/storetheindex](https://github.com/filecoin-project/storetheindex) + +Index Provider: + +[https://github.com/filecoin-project/index-provider](https://github.com/filecoin-project/index-provider) + +## **Copyright** + +Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/). 
\ No newline at end of file From 78dbce4de3c778438b609d2ab6e2f7d5ac7657db Mon Sep 17 00:00:00 2001 From: Andrew Gillis Date: Sat, 23 Apr 2022 05:46:48 -0700 Subject: [PATCH 02/15] Update IndexerFIP.md --- FIPS/IndexerFIP.md | 218 +++++++++++++++++++++++---------------------- 1 file changed, 113 insertions(+), 105 deletions(-) diff --git a/FIPS/IndexerFIP.md b/FIPS/IndexerFIP.md index a99428b4b..b17bfb46d 100644 --- a/FIPS/IndexerFIP.md +++ b/FIPS/IndexerFIP.md @@ -2,29 +2,29 @@ [Indexer FIP](https://www.notion.so/7d156be0cd5643269241cf2d2e13f25e) -## **Simple Summary** +## Simple Summary Indexers store mappings of content multi-hashes to provider data records. A client that wants to know where a piece of information is stored can query the indexer, using the CID or multi-hash of the content, and receive a provider record that tells where the client can retrieve the content and via which data transfer protocols. Content can be queried at [cid.contact](http://cid.contact). -## **Abstract** +## Abstract -Filecoin stores a great deal of data, but without proper indexing of that data, clients cannot perform efficient retrieval. To improve Filecoin content discoverability, Indexer nodes are developed to store mappings of CIDs to content providers for content lookup upon retrieval request. +Filecoin stores a great deal of data, but without proper indexing of that data, clients cannot perform efficient retrieval. To improve Filecoin content discoverability, Indexer nodes are developed to store mappings of CID multi-hashes to content provider records, for content lookup upon retrieval request. -To provide index to indexer node, there is an [Indexer Provider](https://lotus.filecoin.io/storage-providers/operate/index-provider/) service that runs alongside with Storage Provider, which tells the Indexer what contents this storage provider has. 
Index Provider sends updates to the Indexer via a series of [Advertisement](https://github.com/filecoin-project/storetheindex/blob/main/api/v0/ingest/schema/schema.ipldsch) messages. Each message references a previous advertisement so that as a whole it forms an advertisement chain. The state of the indexer is a function of consuming this chain from the initial Advertisement to the latest one, and delta of indexer content is updated periodically. +To provide index data to indexer nodes, there is the [Indexer Provider](https://lotus.filecoin.io/storage-providers/operate/index-provider/) that provides a software library for building an index data provider, and an index provider service that can run alongside a Storage Provider. Either of these may be used to tell the Indexer what content is retrievable from a storage provider. The Index Provider library or service sends updates to the Indexer via a series of [Advertisement](https://github.com/filecoin-project/storetheindex/blob/main/api/v0/ingest/schema/schema.ipldsch) messages. Each message references a previous advertisement so that as a whole it forms an advertisement chain. The indexer operates by consuming this chain from the oldest Advertisement not yet consumed, to the latest one. The Indexer fetches new advertisements periodically or in response to a notification from the Index Provider. Indexer enables content discovery within Filecoin network today, it is also a fundamental component toward the end goal of universal retrieval across IPFS/Filecoin, i.e. to be able to efficiently retrieve Filecoin content from IPFS. Combining Indexer work for content discovery with IPFS/Filecoin interoperability work for content transfer, we aim to reach the end state of efficient and universal retrieval across IPFS and Filecoin network.
-## **Change Motivation** +## Change Motivation -There are many reasons why Indexer is critical to the success of Filecoin network: +Primary reasons why Indexer is critical to the success of Filecoin network: 1. **Content Discovery and Retrieval**: Running Indexer is required for efficient content discovery and retrieval for Filecoin network, since Filecoin needs a CID to Storage Provider mapping to enable content retrieval. It is also an important step toward retrieval across IPFS and Filecoin, as Indexer will index both IPFS and Filecoin data and support efficient and universal retrieval across IPFS and Filecoin network as the end state. 2. **Better usage and growth of the network**: -By making data more accessible to the network, we could increase Filecoin data usage, which helps drive up adoption of the network to onboard more data, and then make more data discoverable and retrievable - it is a positive growth flywheel. +Making data more accessible to the network increases Filecoin data usage. This drives adoption of the network to onboard more data, resulting in more data being discoverable and retrievable - it is a positive growth flywheel. -## **Terminology** +## Terminology Before we dive into the design details, here are a list of concepts we need to cover for Indexer: @@ -36,148 +36,156 @@ Before we dive into the design details, here are a list of concepts we need to c - **Indexer**: A network node that keeps a mappings of multihashes to provider records. - **Metadata**: Provider-specific data that a retrieval client gets from an indexer query and passed to the provider when retrieving content. This metadata is used by the provider to identify and find specific content and deliver that content via the protocol (e.g. graphsync) specified in the metadata. - **Provider**: Also called a Storage Provider, this is the entity from which content can be retrieved by a retrieval client. 
When multihashes are looked up on an indexer, the responses contain provider that provide the content referenced by the multihashes. A provider is identified by a libp2p peer ID. -- **Publisher**: This is an entity that publishes advertisements and index data to an indexer. It is usually, but not always, the same as the data provider. A publisher is identified by a libp2p peer ID. +- **Publisher**: Also referred to as _Index Provider_, this is an entity that publishes advertisements and index data to an indexer. It is usually, but not always, the same as the data provider. A publisher is identified by a libp2p peer ID. - **Retrieval Addresses:** A list of addresses included in each advertisement that points to where you can retrieve from the advertised content. - **Retrieval Client**: A client that queries an indexer to find where content is available, and retrieves that content from a provider. - **Sync** (indexer with publisher): Operation that synchronizes the content indexed by an indexer with the content published by a publisher. A sync is initiated when an indexer receives and announces message, by an administrative command to sync with a publisher, or by the indexer when there have been no updates for a provider for some period of time (24 hours by default). -## **Specification** +## Actors -**Data providers** have local indices of their content. They want to advertise the availability of this content so that any consumers can easily find it. Data providers will also want to revoke advertisements of content that they no longer provide. +**Data Providers** store data and make it available for retrieval clients to retrieve. They want to advertise the availability of this content so that any consumers can easily find it. Data providers will also want to revoke advertisements of content that they no longer provide and update their internal location of content. -**Indexer nodes** want to discover data providers and to track updates provided by the data providers.
They also want to handle incoming content requests and resolve them to providers who have that content. Indexers may reroute client requests to other indexers if they do not handle that content. +**Index Providers (Publishers)** maintain a history of changes to content stored by a Data Provider, and present the sequence of changes to Indexers as advertisements. Index Providers send notifications to Indexers to announce that new advertisements are available. Usually a Data Provider is also its own Index Provider, but these can be different entities. -**Clients** want to issue query style requests for content to an indexer node and receive a set of providers that have that content. The response should include any other information about how to choose between providers as well as information that the provider may have requested to present to them to authenticate the request? (e.g. deal ID) +**Indexer nodes** receive Advertisement announcements published by Index Providers, allowing Indexers to discover Data Providers and to be notified of updates to the Data Provider content. The announcements let Indexers know that new advertisements are available. Indexers retrieve Advertisements from the Index Providers, to get Data Provider information and associated index multihash data. The Indexer decides if and when to fetch index data from an Index Provider based on the policies the Indexer is configured with. Indexers may reroute client requests to other indexers if they do not handle that content. -Indexer nodes discover new data providers and receive notification of content updates, by receiving notifications published on a known gossip pub-sub topic. Indexers can also discover this by looking at on-chain storage provider activity. The notifications let indexers know that new data is *available* and that index entries can be fetched from an existing or a new provider.
The indexer decides if and when to pull the actual index data, and which kind of index, from a provider, based on its own policies pertaining to the data provider. +**Clients** want to issue query style requests for content to an Indexer node and receive a set of Data Provider records that inform the client how and where to retrieve that content. The response from an Indexer contains a set of Data Provider records, each having the Provider's ID and addresses. Each record contains the protocol that the client must use to retrieve the data (e.g. graphsync) as well as other information that the client presents to the Provider. This additional data is used by the Provider to retrieve the content, and may consist of a deal ID or other lookup keys specific to the Provider. If multiple Providers provide the same content, the client may choose based on input from a reputation system, network response time, location, or any other information available to the client. -**Overview design:** +## Specification -![](https://s3.us-west-2.amazonaws.com/secure.notion-static.com/44c88146-a25c-4732-92df-1c98b814bf63/Untitled.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIAT73L2G45EIPT3X45%2F20220419%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20220419T012946Z&X-Amz-Expires=86400&X-Amz-Signature=0f8321e348f5a6a8a275b069d179460f51ef208a7c1ef0987e149e6eda34055e&X-Amz-SignedHeaders=host&response-content-disposition=filename%20%3D%22Untitled.png%22&x-id=GetObject) +### Data Provider Interface: -**Data Provier Interface:** +Data providers maintain local records of the CIDs of the content they store and the changes to this content. Providers must be able to present this as an ordered series of changes to sets of multihashes over time. For indexing, a multihash extracted from a CID is used to identify content, since indexed content does not track how the content is encoded. The source CID is the CID of a merkle-tree of content hosted by the provider.
+ +Each addition of content is represented as a set of multihashes accompanied by context information (metadata), that is within the domain of interpretation of the Provider, and a unique identifier (context ID) that identifies this context. Therefore a context ID links a set of multihashes to metadata that pertains to that set of multihashes. Removal of content is done by deletion of a context ID, which represents removing a set of multihashes and metadata identified by that context ID. Metadata identified by context ID may also be replaced by new metadata, as the information pertaining to the associated set of multihashes changes (change of location, storage deal, etc.). -Data providers will have local indices of their content. Abstractly, we can think of the resources exposed by a data provider as a tree of the following: +Each of these changes is presented as a separate record, that is linked to the previous record, forming an ordered log of changes to the Provider's content. Indexers track these changes to keep their view of the Provider's content in sync with the Provider. -```jsx -Catalog (List of Indexes) -Catalog/ (Individual Index) -Catalog//multihashes (list of multihashes in the Index) +### Advertisements: -Catalog//TBD (semantic for selection) -``` - -A provider has a Catalog of one or more indices, where the current index is the set of all multihashes known to the provider, and the other indices are past versions of it, each representing the multihashes at some previous time. Semantically, changes to the catalog of indexes by the provider are seen as happening in a globally ordered log of additions and revocations, each referencing the previous action. lndexers are able to track these changes to keep their view of a provider's Index current. +An Advertisement is a data structure that packages information about a change to Provider content.
The Advertisement contains the Provider ID and addresses, content metadata, context ID, a link to a chain of multihash blocks, and a link to the previous Advertisement. The Advertisement is also signed by the Provider or publisher of the Advertisement, using a signature computed over all of these fields. -**Advertisements:** +Each Advertisement is uniquely identified by a content ID (CID) that is used to retrieve that Advertisement from the Index Provider. This makes the advertisement an immutable record. The link to the chain of multihash blocks and each link in the chain is also a CID, making the chain of multihash blocks immutable as well. The Advertisement is what is communicated from an Index Provider to an Indexer to supply the Indexer with index data. -Index providers will publish advertisements on new indices on the `ContentIndex` gossipsub pub-sub channel, which indexer nodes can subscribe to. The format of these advertisement messages of catalog entries looks like this: +Advertisement as IPLD schema: +``` +# EntryChunk captures a chunk in a chain of entries that collectively contain the multihashes +# advertised by an Advertisement. +type EntryChunk struct { + # Entries represent the list of multihashes in this chunk. + Entries [Bytes] + # Next is an optional link to the next entry chunk. + Next optional Link +} -```jsx -{ Index: , - Previous: , - Provider: , - Signature: +# Advertisement signals availability of content to the indexer nodes in form of a chunked list of +# multihashes, where to retrieve them from, and over protocol they are retrievable. +type Advertisement struct { + # PreviousID is an optional link to the previous advertisement. + PreviousID optional Link + # Provider is the peer ID of the host that provides this advertisement. + Provider String + # Addresses is the list of multiaddrs as strings from which the advertised content is retrievable. + Addresses [String] + # Signature is the signature of this advertisement. 
+ Signature Bytes + # Entries is a link to the chained EntryChunk instances that contain the multihashes advertised. + Entries Link + # ContextID is the unique identifier for the collection of advertised multihashes. + ContextID Bytes + # Metadata captures contextual information about how to retrieve the advertised content. + Metadata Bytes + # IsRm, when true, specifies that the content identified by ContextID is no longer retrievable from the provider. + IsRm Bool } ``` -When an indexer sees an advertisement, the indexer checks to see if it already has the latest Index ID. If it does, then the advertisement is ignored. Otherwise, the indexer sends a request to the data provider to get the set of changes starting from the previous Index ID in the advertisement up to the current index ID in the advertisement. The indexer internally resolves the libp2p.PeerID to current libp2p endpoints. +- The ContextID is limited to a maximum of 64 bytes. +- The Metadata is limited to a maximum of 1024 bytes (1KiB). + +The metadata field contains provider-specific data that pertains to the set of multihashes being added. The metadata is prefixed with a protocol ID number followed by data encoded as per the protocol. The content of the meta data is up to the provider, but if more then the limited size is needed, then the metadata should contain an ID identifying mode complex data stored by the provider. -Besides receiving advertisements over pub-sub, the indexer can request the latest advertisement directly from a data provider. This is useful when indexing for a discovered data provider that is not publishing advertisements. -Note: The signature is computed over the current and the previous ID in each entry. This guarantees that the order is correct according to the signing provider. +### Announcement -An index ID is always a multihash, extracted from a CID (since indexed content does not track how the content is encoded).
This should be the multihash from the CID of a merkle-tree of content hosted by the provider. +Index Providers announce the availability of new Advertisements by publishing a notification on a known gossipsub pub-sub channel, `/indexer/ingest/mainnet`, that indexer nodes subscribe to. Alternatively, the Index Provider may post an announcement message directly to an indexer over HTTP. Typically, gossip pub-sub is used and is preferred because the Index Providers will not necessarily know how to contact the Indexers, but will know the Filecoin chain nodes that will relay gossip pub-sub to Indexers. -**Advertisement Chain:** +After receiving an announcement, the Indexer checks if it has already retrieved the announced Advertisement, and if so, ignores the announcement. If the Advertisement has not yet been retrieved, the Indexer contacts the Index Provider, at an address supplied in the announcement, to retrieve the announced Advertisement. The addresses in the announcement can be libp2p addresses or an HTTP address, and the Indexer uses the addresses to retrieve the Advertisement over graphsync or HTTP respectively. + +Announcement message as Go struct: +```go +type Message struct { + Cid cid.Cid + Addrs [][]byte + ExtraData []byte `json:",omitempty"` +} +``` -The indexer keeps a record of the advertised Index multihashes it has received data for. The indexer continues walking the chain of Index multihashes backward until it receives a response containing an Advertisement with the `Previous` multihash being the last one the indexer has already seen, or when the end of the chain is reached (no Previous). When all the missing links in the chain of Indexes have been received, these are applied, in order, to update the indexer's records of multihashes and providers. +The extra data is used by Index Providers to pass identity data to Filecoin nodes so that the Filecoin nodes allow the announcement to be forwarded over gossip pub-sub.
-**Index Request-Index Provider Response:** +### Advertisement Chain +Once the Indexer has received an Advertisement, it checks if the previous Advertisement has already been retrieved, and if not, retrieves it. This continues until all previously unseen Advertisements are retrieved or until there are no more Advertisements to retrieve, i.e. the end of the chain is reached. -Over the Provider-Indexer protocol, an indexer requests the following: +After the entire chain of unprocessed Advertisements has been retrieved, the Indexer walks the chain in order from oldest to newest and retrieves the chain of multihash blocks linked to by each advertisement. A multihash block is a chunk of the multihashes in the change set with a link to the next block. Splitting the multihashes into blocks enables block-based data transfer mechanisms to fetch the multihash data and serves as a pagination mechanism for other transports. ``` -{ "prev_index_id": , - "index_id": , - "start": n, // optional - start at this record (for pagination) - "count": n, // optional - maximum number of records to fetch -} + Oldest Newest ++----------+ +--------+ +--------+ +| Ad A | | Ad B | | Ad C | +| prev=nil |<----| prev=A |<----| prev=B | ++----+-----+ +---+----+ +---+----+ + | | | + V V V +[multihashes] [multihashes] [multihashes] + | | | + V V V +[multihashes] [multihashes] [multihashes] ``` -And indexer receives a possibly paginated response that has a subset of the change logs and the associated advertisement message: +### Index Data Storage + +All of the multihashes in the multihash blocks are read and stored in the indexer as a mapping of multihashes to a list of providerID-contextID in the Advertisement, and each providerID-contextID is mapped to its metadata record. This allows a multihash to resolve to multiple provider, context ID, and metadata records. It also allows a providerID-contextID to be used to identify metadata records to update and delete.
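
The two-level mapping in the paragraph above might be modeled in memory along these lines. This is only a sketch under assumptions: `Store`, `valueKey`, `Result`, and the method names are hypothetical, context IDs and multihashes are represented as strings so they can be used as map keys, and a real indexer would use persistent storage rather than Go maps.

```go
package main

import "fmt"

// valueKey pairs a provider with one of its context IDs.
type valueKey struct {
	ProviderID string
	ContextID  string
}

// Result is one provider record returned for a multihash.
type Result struct {
	ProviderID string
	ContextID  string
	Metadata   []byte
}

// Store is an in-memory sketch of multihash -> providerID-contextID -> metadata.
type Store struct {
	keysByMultihash map[string][]valueKey
	metadataByKey   map[valueKey][]byte
}

func NewStore() *Store {
	return &Store{
		keysByMultihash: map[string][]valueKey{},
		metadataByKey:   map[valueKey][]byte{},
	}
}

// Put records (or replaces) the metadata for provider+context and links each
// given multihash to that pair.
func (s *Store) Put(provider, contextID string, metadata []byte, multihashes ...string) {
	key := valueKey{ProviderID: provider, ContextID: contextID}
	s.metadataByKey[key] = metadata
	for _, mh := range multihashes {
		linked := false
		for _, have := range s.keysByMultihash[mh] {
			if have == key {
				linked = true
				break
			}
		}
		if !linked {
			s.keysByMultihash[mh] = append(s.keysByMultihash[mh], key)
		}
	}
}

// Remove deletes the metadata for provider+context; multihashes that pointed
// to it stop resolving to that record.
func (s *Store) Remove(provider, contextID string) {
	delete(s.metadataByKey, valueKey{ProviderID: provider, ContextID: contextID})
}

// Find returns every live provider record for the multihash.
func (s *Store) Find(mh string) []Result {
	var out []Result
	for _, key := range s.keysByMultihash[mh] {
		if md, ok := s.metadataByKey[key]; ok {
			out = append(out, Result{key.ProviderID, key.ContextID, md})
		}
	}
	return out
}

func main() {
	s := NewStore()
	s.Put("providerA", "deal-1", []byte("v1"), "mh1", "mh2")
	s.Put("providerB", "deal-9", []byte("x"), "mh1")
	s.Put("providerA", "deal-1", []byte("v2")) // metadata update by context ID only
	fmt.Println(len(s.Find("mh1")), string(s.Find("mh2")[0].Metadata))
	s.Remove("providerA", "deal-1")
	fmt.Println(len(s.Find("mh1")), len(s.Find("mh2")))
}
```

Keeping metadata behind the provider and context ID pair means an update or removal touches one record, regardless of how many multihashes refer to it.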
``` -{ "totalEntries": n, // size of provider's catalog - "error": "", // optional. indicate the request could not be served - "advertisement": { - Index: n3 , - Previous: n2 , - Provider: , Signature: - }, - "start": n, // starting entry number - "entries": [ - {"add_multihashes": [, ...], "metadata": "???", - "del_multihashes": [, ...]}, - {"add_multihashes": [, ...], "metadata": "???"}, -} +Multihash ---+ +-----------------------+ + | | ProviderID-ContextID--|----> Metadata +Multihash ---+---> | ProviderID-ContextID--|----> Metadata + | +-----------------------+ +Multihash ---+ ``` -This is a set of changes that are applied to the index data, to go from the `prev_multihash` to the multihash in the request. Each entry contains a set of new multihashes that the provider provides, and possibly a set of multihashes to remove that are no longer provided. A limited-size (≤ 100 bytes) metadata field contains provider-specific data that pertains to the set of multihashes being added. The metadata is prefixed with a protocol ID number followed by data encoded data as per the protocol. The content of the meta data is up to the provider, but if more then the limited size is needed, then the metadata should contain an ID identifying mode complex data stored by the provider. +The Data Provider addresses from the Advertisement are stored separately, and are updated with each advertisement that has a different retrieval address for the Data Provider. When the Indexer responds to a client query, it adds the current Data Provider addresses to each data Provider record in the response. -It is worth noting that the same multihash may appear in entries of several different advertisements of a provider. For instance, if a Filecoin Storage Provider receives a new deal that includes a CID, that has a multihash, of some content it is already indexing, it may notify this update to the indexer node by including an entry for that multihash with new metadata (e.g. 
the new expiration date of the deal) in the next advertisement. -When the indexer node comes across an advertisement that includes a multihash that it is already indexing for the provider, it will simply update the metadata with that of the latest advertisement. Finally, if all the deals that include that multihash in a Storage Provider expire, it can notify the indexer of this by simply adding that multihash in the list of `del_multihashes` for the next advertisement. +When an Advertisement is received that has a ProviderID-ContextID that is already stored in the indexer but different metadata, the indexer updates the metadata that the ProviderID-ContextID maps to. -The response may be paginated either by the client requesting a maximum number of entries or by the server delivering up to a configured maximum number of entries. To get the remaining entries, a subsequent request is made with its `start` set to the number of entries from the previous response. So, if `totalEntries` equals `100` and the response contained 50 entries, the follow-up request should specify `50` for `start` to get the remaining entries. This continues until the indexer has the complete response. - -**Client Interface:** +### Client Interface The client interface is what indexer clients use to ask an indexer which providers are able to provide content identified by multihashes. Queries for a multihash should locate the providers within 10s of milliseconds. A client sends a request containing a set of multihashes. The indexer responds with a list of provider_ids for each multihash that was requested. -Response data: - -``` +Find Response data: +```json { - "multihash_providers": { - : [ - {"provider_id": , "metadata": {???} }, ..], - : [ - {"provider_id": , "metadata": {???} }, ..], - : null, - ... - }, - "providers": [ - { "ID": , - "Addrs": [, ..] - }, - ... 
- ], - "providers_signature": + "MultihashResults": [ + { + "Multihash": "multihash-string", + "ProviderResults": [ + { + "ContextID": "context-id-bytes", + "Metadata": "metadata-bytes", + "Provider": { + "ID": "peer-id-string", + "Addrs": ["multiaddr-string", "multiaddr-string"] + } + } + ] + } + ] } ``` -Response data contains encoded values returned from the in-memory cache. Responses may be paginated if there are a large number of values. - -## **Design Rationale** - -Key design considerations are: - -- **Indexer uses multihashes, not entire CIDs, to refer to indexed content** -The codec in a CID is independent of the content that is indexed or how the content is retrieved from a provider. -- **Indexer keeps an in-memory cache of multihashes that have been asked for by clients, unless configured for in-memory only operation** -Most multihashes in an index will not be requested, since clients will generally only ask for the top-level multihash for any object. Therefore only the set that is actually requested by clients is kept in memory. The remaining will be kept in space-efficient secondary storage. A generation cache eviction mechanism for the in-memory cache prevents unbounded growth. -- **[storethehash](https://github.com/ipld/go-storethehash) will serve as persistent storage** -Designed to efficiently store hash data, requires only two disk reads to get any data stored for a multihash. This is being evaluated against more mature systems because of a high confidence assumption that it will save significant space. However, that savings needs to be very significant to outweigh the risk and development cost of a new storage facility. The indexer will be constructed to allow different implementations to be chosen as appropriate for different deployments. Having an embedded storage implementation may reduce admin work and allow the indexer to be used more easily as a library. 
-- **Indexer will track multihash distribution and usage** -The indexer will keep statistics on multihashes that can be used to predict the growth rate of data stored and cached as well as the distribution of multihashes over different providers. The usage of multihashes will also be tracked and will show the demand (by clients) for various multihahses. -- **Indexer will not index `IDENTITY` Multihashes** -The Indexer will not index `IDENTITY` multihashes and filters out any such multihash from the received advertisements. See: [Appendix C, Handling of `IDENTITY` Multihashes](https://www.notion.so/Indexer-Node-Design-fbd1e7d3110c4b1fb154b31f2585e6ff). -- **Load minimization for Storage Provider** - - The indexer updates over graphsync, meaning once a day indexer connects and ask for delta of new CIDs found available on the nodes since last day, and store them to multiple providers, while indexer only need to get it from one of them. This saves bandwidth by deduping when CIDs are provided from multiple providers, so it will only take relatively small amount of bandwidth compared to actual data (Storage Provider can expect minimal bandwidth impact), with at most one or two additional connections that are short-lived for indexing. - +A Find result has a list of MultihashResults. Each element of that list contains a Multihash and a list of ProviderResults for that multihash. Each Provider result has a ContextID, Metadata, and Provider. The Provider has an ID and a list of Addrs. ## **Backwards Compatibility** @@ -224,4 +232,4 @@ Index Provider: ## **Copyright** -Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/). \ No newline at end of file +Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/). 
From f0c3ccd40f4e8eb82d44110679202bc56e6245e9 Mon Sep 17 00:00:00 2001 From: Honghao Qiu Date: Tue, 26 Apr 2022 15:05:20 -1000 Subject: [PATCH 03/15] Update IndexerFIP.md --- FIPS/IndexerFIP.md | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/FIPS/IndexerFIP.md b/FIPS/IndexerFIP.md index b17bfb46d..1391570be 100644 --- a/FIPS/IndexerFIP.md +++ b/FIPS/IndexerFIP.md @@ -1,6 +1,12 @@ -# Indexer FIP - -[Indexer FIP](https://www.notion.so/7d156be0cd5643269241cf2d2e13f25e) +--- +fip: "0034" +title: Indexer +author: willscott, gammazero, honghao +status: Final +type: Technical (Core) +created: 2022-04-26 +spec-pr: https://github.com/filecoin-project/FIPs/pull/365/files +--- ## Simple Summary From 6ad18a13415589247a3b16337c3aac857b03bb1b Mon Sep 17 00:00:00 2001 From: Honghao Qiu Date: Tue, 26 Apr 2022 15:06:02 -1000 Subject: [PATCH 04/15] Update IndexerFIP.md --- FIPS/IndexerFIP.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/FIPS/IndexerFIP.md b/FIPS/IndexerFIP.md index 1391570be..f8e09f3ad 100644 --- a/FIPS/IndexerFIP.md +++ b/FIPS/IndexerFIP.md @@ -1,7 +1,7 @@ --- fip: "0034" title: Indexer -author: willscott, gammazero, honghao +author: willscott, gammazero, honghaoq status: Final type: Technical (Core) created: 2022-04-26 From 1f6ed4fd504ba3b0a4785f9e0676de7d57183723 Mon Sep 17 00:00:00 2001 From: Honghao Qiu Date: Sat, 30 Apr 2022 20:59:59 -0700 Subject: [PATCH 05/15] Update IndexerFIP.md --- FIPS/IndexerFIP.md | 31 ++++--------------------------- 1 file changed, 4 insertions(+), 27 deletions(-) diff --git a/FIPS/IndexerFIP.md b/FIPS/IndexerFIP.md index f8e09f3ad..c53e3ec7a 100644 --- a/FIPS/IndexerFIP.md +++ b/FIPS/IndexerFIP.md @@ -111,6 +111,7 @@ type Advertisement struct { The metadata field contains provider-specific data that pertains to the set of multihashes being added. The metadata is prefixed with a protocol ID number followed by data encoded as per the protocol. 
The content of the meta data is up to the provider, but if more then the limited size is needed, then the metadata should contain an ID identifying mode complex data stored by the provider. +An index entry is always a multihash, extracted from a CID (since indexed content does not track how the content is encoded). This should be the multihash from the CID of a merkle-tree of content hosted by the provider. ### Announcement @@ -164,33 +165,6 @@ The Data Provider addresses from the Advertisement are stored separately, and ar When an Advertisement is received that has a ProviderID-ContextID that is already stored in the indexer but different metadata, the indexer updates the metadata that the ProviderID-ContextID maps to. -### Client Interface - -The client interface is what indexer clients use to ask an indexer which providers are able to provide content identified by multihashes. Queries for a multihash should locate the providers within 10s of milliseconds. - -A client sends a request containing a set of multihashes. The indexer responds with a list of provider_ids for each multihash that was requested. - -Find Response data: -```json -{ - "MultihashResults": [ - { - "Multihash": "multihash-string", - "ProviderResults": [ - { - "ContextID": "context-id-bytes", - "Metadata": "metadata-bytes", - "Provider": { - "ID": "peer-id-string", - "Addrs": ["multiaddr-string", "multiaddr-string"] - } - } - ] - } - ] -} -``` - A Find result has a list of MultihashResults. Each element of that list contains a Multihash and a list of ProviderResults for that multihash. Each Provider result has a ContextID, Metadata, and Provider. The Provider has an ID and a list of Addrs. 
## **Backwards Compatibility** @@ -236,6 +210,9 @@ Index Provider: [https://github.com/filecoin-project/index-provider](https://github.com/filecoin-project/index-provider) +Indexer Design: +[https://github.com/filecoin-project/storetheindex/blob/main/doc/indexer_ecosys.png](https://github.com/filecoin-project/storetheindex/blob/main/doc/indexer_ecosys.png) + ## **Copyright** Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/). From 48664bfab3addfe17a3ca73aad85d38a3f99fb48 Mon Sep 17 00:00:00 2001 From: Honghao Qiu Date: Sat, 30 Apr 2022 21:05:29 -0700 Subject: [PATCH 06/15] Update IndexerFIP.md --- FIPS/IndexerFIP.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/FIPS/IndexerFIP.md b/FIPS/IndexerFIP.md index c53e3ec7a..f99e7d28c 100644 --- a/FIPS/IndexerFIP.md +++ b/FIPS/IndexerFIP.md @@ -1,10 +1,11 @@ --- fip: "0034" -title: Indexer +title: Indexer Protocol for Filecoin Content Discovery author: willscott, gammazero, honghaoq status: Final type: Technical (Core) created: 2022-04-26 +discussion: https://github.com/filecoin-project/FIPs/discussions/337 spec-pr: https://github.com/filecoin-project/FIPs/pull/365/files --- @@ -211,6 +212,7 @@ Index Provider: [https://github.com/filecoin-project/index-provider](https://github.com/filecoin-project/index-provider) Indexer Design: + [https://github.com/filecoin-project/storetheindex/blob/main/doc/indexer_ecosys.png](https://github.com/filecoin-project/storetheindex/blob/main/doc/indexer_ecosys.png) ## **Copyright** From 6b2f2e9d7e37d421aeaf6599f7a31c405469b7ac Mon Sep 17 00:00:00 2001 From: Honghao Qiu Date: Tue, 3 May 2022 16:41:04 -0700 Subject: [PATCH 07/15] Update IndexerFIP.md --- FIPS/IndexerFIP.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/FIPS/IndexerFIP.md b/FIPS/IndexerFIP.md index f99e7d28c..6c9b6393a 100644 --- a/FIPS/IndexerFIP.md +++ b/FIPS/IndexerFIP.md @@ -2,7 +2,7 @@ fip: "0034" title: Indexer Protocol for 
Filecoin Content Discovery author: willscott, gammazero, honghaoq -status: Final +status: Draft type: Technical (Core) created: 2022-04-26 discussion: https://github.com/filecoin-project/FIPs/discussions/337 @@ -12,7 +12,6 @@ spec-pr: https://github.com/filecoin-project/FIPs/pull/365/files ## Simple Summary Indexers store mappings of content multi-hashes to provider data records. A client that wants to know where a piece of information is stored can query the indexer, using the CID or multi-hash of the content, and receive a provider record that tells where the client can retrieve the content and via which data transfer protocols. -Content can be queried at [cid.contact](http://cid.contact). ## Abstract @@ -48,7 +47,7 @@ Before we dive into the design details, here are a list of concepts we need to c - **Retrieval Client**: A client that queries an indexer to find where content is available, and retrieves that content from a provider. - **Sync** (indexer with publisher): Operation that synchronizes the content indexed by an indexer with the content published by a publisher. A sync is initiated when an indexer receives and announces message, by an administrative command to sync with a publisher, or by the indexer when there have been no updates for a provider for some period of time (24 hours by default). -## Actors +## Parties **Data Providers** store data and make it available for retrieval clients to retrieve. They want to advertise the availability of this content so that any consumers can easily find it. Data providers will also want to revoke advertisements of content that they no longer provide and update their internal location of content. 
@@ -185,7 +184,7 @@ The following test cases should be covered: - Responsiveness to indexer advertisement queries - Latency test: - Indexer query response latency: <300ms p95 - - Indexer ingestion latency: <60 sec p95 + - Indexing latency: <60 sec p95 ## **Security Considerations** From 9060472ed94551fc1e810c2e7c25e5c6d897d6c5 Mon Sep 17 00:00:00 2001 From: Kaitlin Beegle <46908964+kaitlin-beegle@users.noreply.github.com> Date: Mon, 23 May 2022 17:15:34 -0400 Subject: [PATCH 08/15] Update FIPS/IndexerFIP.md Co-authored-by: Will --- FIPS/IndexerFIP.md | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/FIPS/IndexerFIP.md b/FIPS/IndexerFIP.md index 6c9b6393a..0af19698c 100644 --- a/FIPS/IndexerFIP.md +++ b/FIPS/IndexerFIP.md @@ -109,7 +109,14 @@ type Advertisement struct { - The ContextID is limited to a maximum of 64 bytes. - The Metadata is limited to a maximum of 1024 bytes (1KiB). -The metadata field contains provider-specific data that pertains to the set of multihashes being added. The metadata is prefixed with a protocol ID number followed by data encoded as per the protocol. The content of the meta data is up to the provider, but if more then the limited size is needed, then the metadata should contain an ID identifying mode complex data stored by the provider. +The metadata field contains provider-specific data that pertains to the set of multihashes being added. The metadata is prefixed with a protocol ID number followed by data encoded as per the protocol. The content of the metadata is up to the provider, but if more than the limited size is needed, then the metadata should contain an ID identifying more complex data stored by the provider. + +Graphsync is the most commonly used protocol for retrieving content from a content provider.
The filecoin-graphsync transport metadata is currently defined as follows: + +Uvarint protocol `0x0910` (TransportGraphsyncFilecoinv1 in the [multicodec table](https://github.com/multiformats/multicodec/blob/master/table.csv#L134)). This is followed by a CBOR-encoded struct of: +- PieceCID, a link +- VerifiedDeal, boolean +- FastRetrieval, boolean An index entry is always a multihash, extracted from a CID (since indexed content does not track how the content is encoded). This should be the multihash from the CID of a merkle-tree of content hosted by the provider. From fe05fd965bb5485b73bbfba044684adf3ede21aa Mon Sep 17 00:00:00 2001 From: Kaitlin Beegle <46908964+kaitlin-beegle@users.noreply.github.com> Date: Thu, 23 Jun 2022 16:56:44 -0700 Subject: [PATCH 09/15] Update FIPS/IndexerFIP.md Co-authored-by: Andrew Gillis --- FIPS/IndexerFIP.md | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/FIPS/IndexerFIP.md b/FIPS/IndexerFIP.md index 0af19698c..61fd9026b 100644 --- a/FIPS/IndexerFIP.md +++ b/FIPS/IndexerFIP.md @@ -114,9 +114,15 @@ The metadata field contains provider-specific data that pertains to the set of m Graphsync is the most commonly used protocol for retrieving content from a content provider. The filecoin-graphsync transport metadata is currently defined as follows: Uvarint protocol `0x0910` (TransportGraphsyncFilecoinv1 in the [multicodec table](https://github.com/multiformats/multicodec/blob/master/table.csv#L134)).
This is followed by a CBOR-encoded struct of: -- PieceCID, a link -- VerifiedDeal, boolean -- FastRetrieval, boolean +```go +type GraphsyncFilecoinV1 struct { + // PieceCID identifies the piece this data can be found in + PieceCID cid.Cid + // VerifiedDeal indicates if the deal is verified + VerifiedDeal bool + // FastRetrieval indicates whether the provider claims there is an unsealed copy + FastRetrieval bool +} An index entry is always a multihash, extracted from a CID (since indexed content does not track how the content is encoded). This should be the multihash from the CID of a merkle-tree of content hosted by the provider. From e269e7e8609b8139d003cec019e56413912f19e3 Mon Sep 17 00:00:00 2001 From: Honghao Qiu Date: Fri, 24 Jun 2022 14:40:07 -0700 Subject: [PATCH 10/15] Update IndexerFIP.md --- FIPS/IndexerFIP.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/FIPS/IndexerFIP.md b/FIPS/IndexerFIP.md index 61fd9026b..e40dcecdb 100644 --- a/FIPS/IndexerFIP.md +++ b/FIPS/IndexerFIP.md @@ -1,9 +1,9 @@ --- -fip: "0034" +fip: "0038" title: Indexer Protocol for Filecoin Content Discovery author: willscott, gammazero, honghaoq status: Draft -type: Technical (Core) +type: FRC created: 2022-04-26 discussion: https://github.com/filecoin-project/FIPs/discussions/337 spec-pr: https://github.com/filecoin-project/FIPs/pull/365/files From 6b8bb9b3c2903f31c7ca08ac12039fdd920627e6 Mon Sep 17 00:00:00 2001 From: Honghao Qiu Date: Fri, 24 Jun 2022 16:59:36 -0700 Subject: [PATCH 11/15] Update IndexerFIP.md --- FIPS/IndexerFIP.md | 15 --------------- 1 file changed, 15 deletions(-) diff --git a/FIPS/IndexerFIP.md b/FIPS/IndexerFIP.md index e40dcecdb..453e1e7ee 100644 --- a/FIPS/IndexerFIP.md +++ b/FIPS/IndexerFIP.md @@ -184,21 +184,6 @@ A Find result has a list of MultihashResults. Each element of that list contain This FIP does not change actors behavior so it does not require any Filecoin network update. 
-## **Test Cases** - -The following test cases should be covered: - -- Index Provider functionality test: - - Ability to index storage deals - - Storage Provider able to publish index to an Indexer node -- Scalability test: indexer need to be able to handle index storage at scale (hundreds of billions of index) -- Availability test: - - 3-9s node uptime - - Responsiveness to indexer advertisement queries -- Latency test: - - Indexer query response latency: <300ms p95 - - Indexing latency: <60 sec p95 - ## **Security Considerations** This FIP does not touch underlying proofs or security. From 331b138d240ec010e4e4d896dc4b1c1fe0394996 Mon Sep 17 00:00:00 2001 From: Honghao Qiu Date: Fri, 24 Jun 2022 17:02:26 -0700 Subject: [PATCH 12/15] Update IndexerFIP.md --- FIPS/IndexerFIP.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/FIPS/IndexerFIP.md b/FIPS/IndexerFIP.md index 453e1e7ee..b2ae7bbcf 100644 --- a/FIPS/IndexerFIP.md +++ b/FIPS/IndexerFIP.md @@ -30,7 +30,10 @@ Running Indexer is required for efficient content discovery and retrieval for Fi 2. **Better usage and growth of the network**: Making data more accessible to the network increases Filecoin data usage. This drives adoption of the network to onboard more data, resulting in more data being discoverable and retrievable - it is a positive growth flywheel. -## Terminology + +## Specification + +### Terminology Before we dive into the design details, here are a list of concepts we need to cover for Indexer: @@ -47,7 +50,7 @@ Before we dive into the design details, here are a list of concepts we need to c - **Retrieval Client**: A client that queries an indexer to find where content is available, and retrieves that content from a provider. - **Sync** (indexer with publisher): Operation that synchronizes the content indexed by an indexer with the content published by a publisher. 
A sync is initiated when an indexer receives and announces message, by an administrative command to sync with a publisher, or by the indexer when there have been no updates for a provider for some period of time (24 hours by default). -## Parties +### Parties **Data Providers** store data and make it available for retrieval clients to retrieve. They want to advertise the availability of this content so that any consumers can easily find it. Data providers will also want to revoke advertisements of content that they no longer provide and update their internal location of content. @@ -57,7 +60,6 @@ Before we dive into the design details, here are a list of concepts we need to c **Clients** want to issue query style requests for content to an Indexer node and receive a set of Data Provider records that inform the client how and where to retrieve that content. The response from an Indexer contains a set of Data Provider records, each having the Provider's ID and addresses. Each record contains the protocol that the client must use to retrieve the data (e.g. graphsync) as well as other information that the client presents to the Provider. This additiona data is used by the Provider to retrieve the content, and may consist of a deal ID or other lookup keys specific to the Provider. If multiple Providers provide the same content, the client may choose based on input from a reputation system, network response time, location, or any other information available to the client. 
-## Specification ### Data Provier Interface: From e4d92c30b252eab7eb09dae082b6a94e95ad4154 Mon Sep 17 00:00:00 2001 From: honghaoq Date: Fri, 24 Jun 2022 18:52:00 -0700 Subject: [PATCH 13/15] adding indexer frc --- FRCs/frc-indexer.md | 219 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 219 insertions(+) create mode 100644 FRCs/frc-indexer.md diff --git a/FRCs/frc-indexer.md b/FRCs/frc-indexer.md new file mode 100644 index 000000000..faf677cb6 --- /dev/null +++ b/FRCs/frc-indexer.md @@ -0,0 +1,219 @@ +--- +fip: "0038" +title: Indexer Protocol for Filecoin Content Discovery +author: willscott, gammazero, honghaoq +status: Draft +type: FRC +created: 2022-06-24 +discussion: https://github.com/filecoin-project/FIPs/discussions/337 +spec-pr: https://github.com/filecoin-project/FIPs/pull/365/files +--- + +## Simple Summary + +Indexers store mappings of content multi-hashes to provider data records. A client that wants to know where a piece of information is stored can query the indexer, using the CID or multi-hash of the content, and receive a provider record that tells where the client can retrieve the content and via which data transfer protocols. + +## Abstract + +Filecoin stores a great deal of data, but without proper indexing of that data, clients cannot perform efficient retrieval. To improve Filecoin content discoverability, Indexer nodes are developed to store mappings of CID multi-hashes to content provider records, for content lookup upon retrieval request. + +To provide index data to indexer nodes, there is the [Indexer Provider](https://lotus.filecoin.io/storage-providers/operate/index-provider/) that provides a software library for building an index data provider, and an index provider service that can run alongside with Storage Provider. Either of these may be used to tell the Indexer what content is retrievable from a storage provider. 
The Index Provider library or service sends updates to the Indexer via a series of [Advertisement](https://github.com/filecoin-project/storetheindex/blob/main/api/v0/ingest/schema/schema.ipldsch) messages. Each message references a previous advertisement so that as a whole it forms an advertisement chain. The indexer operates by consuming this chain from the oldest Advertisement not yet consumed, to the latest one. The Indexer fetches new advertisements periodically or in response to a notification from the Index Provider. + +Indexer enables content discovery within Filecoin network today, it is also a fundamental component toward the end goal of universal retrieval across IPFS/Filecoin, i.e. to be able to efficiently retrieve Filecoin content from IPFS. Combining Indexer work for content discovery with IPFS/Filecoin interoperability work for content transfer, we aim to reach the end state of efficient and universal retrieval across IPFS and Filecoin network. + +## Change Motivation + +Primary reasons why Indexer is critical to the success of Filecoin network: + +1. **Content Discovery and Retrieval**: +Running Indexer is required for efficient content discovery and retrieval for Filecoin network, since Filecoin needs a CID to Storage Provider mapping to enable content retrieval. It is also an important step toward retrieval across IPFS and Filecoin, as Indexer will index both IPFS and Filecoin data and support efficient and universal retrieval across IPFS and Filecoin network as the end state. +2. **Better usage and growth of the network**: +Making data more accessible to the network increases Filecoin data usage. This drives adoption of the network to onboard more data, resulting in more data being discoverable and retrievable - it is a positive growth flywheel. 
+ + +## Specification + +### Terminology + +Before we dive into the design details, here is a list of concepts we need to cover for Indexer: + +- **Advertisement**: A record available from a publisher that contains a link to a chain of multihash blocks, the CID of the previous advertisement, and provider-specific content metadata that is referenced by all the multihashes in the linked multihash blocks. The provider data is identified by a key called a context ID. +- **Announce Message**: A message that informs indexers about the availability of an advertisement. This is usually sent via gossip pubsub, but can also be sent via HTTP. An announce message contains the advertisement CID it is announcing, which allows indexers to ignore the announce if they have already indexed the advertisement. The publisher's address is included in the announce to tell indexers where to retrieve the advertisement from. +- **Context ID**: A key that, for a provider, uniquely identifies content metadata. This allows content metadata to be updated or deleted on the indexer without having to refer to it using the multihashes that map to it. +- **Entries:** Represents the list of multihashes of the content hosted by the storage provider. The entries are represented as an IPLD DAG, divided across a set of interlinked IPLD nodes, each referred to as an Entries Chunk. +- **Gossip PubSub**: Publish/subscribe communications over a libp2p gossip mesh. This is used by publishers to broadcast Announce Messages to all indexers that are subscribed to the topic that the announce message is sent on. For production publishers and indexers, this topic is `"/indexer/ingest/mainnet"`. +- **Indexer**: A network node that keeps a mapping of multihashes to provider records. +- **Metadata**: Provider-specific data that a retrieval client gets from an indexer query and passes to the provider when retrieving content.
This metadata is used by the provider to identify and find specific content and deliver that content via the protocol (e.g. graphsync) specified in the metadata. +- **Provider**: Also called a Storage Provider, this is the entity from which content can be retrieved by a retrieval client. When multihashes are looked up on an indexer, the responses contain the providers that provide the content referenced by the multihashes. A provider is identified by a libp2p peer ID. +- **Publisher**: Also referred to as _Index Provider_, this is an entity that publishes advertisements and index data to an indexer. It is usually, but not always, the same as the data provider. A publisher is identified by a libp2p peer ID. +- **Retrieval Addresses:** A list of addresses included in each advertisement that point to where the advertised content can be retrieved. +- **Retrieval Client**: A client that queries an indexer to find where content is available, and retrieves that content from a provider. +- **Sync** (indexer with publisher): Operation that synchronizes the content indexed by an indexer with the content published by a publisher. A sync is initiated when an indexer receives an announce message, by an administrative command to sync with a publisher, or by the indexer when there have been no updates for a provider for some period of time (24 hours by default). + +### Parties + +**Data Providers** store data and make it available for retrieval clients to retrieve. They want to advertise the availability of this content so that any consumers can easily find it. Data providers will also want to revoke advertisements of content that they no longer provide and update their internal location of content. + +**Index Providers (Publishers)** maintain a history of changes to content stored by a Data Provider, and present the sequence of changes to Indexers as advertisements. Index Providers send notifications to Indexers to announce that new advertisements are available.
Usually a Data Provider is also its own Index Provider, but these can be different entities. + +**Indexer nodes** receive Advertisement announcements published by Index Providers, allowing Indexers to discover Data Providers and to be notified of updates to the Data Provider content. The announcements let Indexers know that new advertisements are available. Indexers retrieve Advertisements from the Index Providers, to get Data Provider information and associated index multihash data. The Indexer decides if and when to fetch index data from an Index Provider based on the policies the Indexer is configured with. Indexers may reroute client requests to other indexers if they do not handle that content. + +**Clients** want to issue query style requests for content to an Indexer node and receive a set of Data Provider records that inform the client how and where to retrieve that content. The response from an Indexer contains a set of Data Provider records, each having the Provider's ID and addresses. Each record contains the protocol that the client must use to retrieve the data (e.g. graphsync) as well as other information that the client presents to the Provider. This additional data is used by the Provider to retrieve the content, and may consist of a deal ID or other lookup keys specific to the Provider. If multiple Providers provide the same content, the client may choose based on input from a reputation system, network response time, location, or any other information available to the client. + + +### Data Provider Interface: + +Data providers maintain local records of the CIDs of the content they store and the changes to this content. Providers must be able to present this as an ordered series of changes to sets of multihashes over time. For indexing, a multihash extracted from a CID is used to identify content, since indexed content does not track how the content is encoded. The source CID is the CID of a merkle-tree of content hosted by the provider.
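The point that a multihash, rather than the full CID, identifies indexed content can be illustrated with a minimal Go sketch. It assumes binary CIDs in the standard layout (a CIDv0 is a bare sha2-256 multihash; a CIDv1 is a varint version, a varint codec, then the multihash). The helper names `multihashFromCID` and `buildCIDv1` are hypothetical and exist only for this illustration; a real provider would more likely call `Hash()` on a go-cid `Cid`.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"errors"
	"fmt"
)

// multihashFromCID returns the multihash embedded in a binary CID.
// Hypothetical helper for illustration only.
func multihashFromCID(cid []byte) ([]byte, error) {
	// CIDv0 is a bare sha2-256 multihash: code 0x12, length 0x20, 32-byte digest.
	if len(cid) == 34 && cid[0] == 0x12 && cid[1] == 0x20 {
		return cid, nil
	}
	// CIDv1 is <varint version><varint codec><multihash>.
	version, n := binary.Uvarint(cid)
	if n <= 0 || version != 1 {
		return nil, errors.New("malformed or unsupported CID")
	}
	rest := cid[n:]
	_, n = binary.Uvarint(rest) // the codec is read and discarded
	if n <= 0 || len(rest) == n {
		return nil, errors.New("malformed CID: missing codec or multihash")
	}
	return rest[n:], nil
}

// buildCIDv1 assembles a binary CIDv1 carrying a sha2-256 multihash of data.
// Hypothetical helper used only to produce sample input.
func buildCIDv1(codec uint64, data []byte) []byte {
	digest := sha256.Sum256(data)
	buf := make([]byte, binary.MaxVarintLen64)
	out := append([]byte{}, buf[:binary.PutUvarint(buf, 1)]...)
	out = append(out, buf[:binary.PutUvarint(buf, codec)]...)
	out = append(out, 0x12, 0x20) // sha2-256 code and digest length
	return append(out, digest[:]...)
}

func main() {
	data := []byte("hello filecoin")
	raw, _ := multihashFromCID(buildCIDv1(0x55, data))   // raw codec
	dagPB, _ := multihashFromCID(buildCIDv1(0x70, data)) // dag-pb codec
	// Different codecs, same bytes hashed: the index key is identical.
	fmt.Println(string(raw) == string(dagPB))
}
```

Two CIDs that differ only in their codec yield the same index key, which is why the encoding is not tracked by the indexer.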
+ +Each addition of content is represented as a set of multihashes accompanied by context information (metadata), that is within the domain of interpretation of the Provider, and a unique identifier (context ID) that identifies this context. Therefore a context ID links a set of multihashes to metadata that pertains to that set of multihashes. Removal of content is done by deletion of a context ID, which represents removing a set of multihashes and metadata identified by that context ID. Metadata identified by context ID may also be replaced by new metadata, as the information pertaining to the associated set of multihashes changes (change of location, storage deal, etc.). + +Each of these changes is presented as a separate record that is linked to the previous record, forming an ordered log of changes to the Provider's content. Indexers track these changes to keep their view of the Provider's content in sync with the Provider. + +### Advertisements: + +An Advertisement is a data structure that packages information about a change to Provider content. The Advertisement contains the Provider ID and addresses, content metadata, context ID, a link to a chain of multihash blocks, and a link to the previous Advertisement. The Advertisement is also signed by the Provider or publisher of the Advertisement, using a signature computed over all of these fields. + +Each Advertisement is uniquely identified by a content ID (CID) that is used to retrieve that Advertisement from the Index Provider. This makes the advertisement an immutable record. The link to the chain of multihash blocks and each link in the chain is also a CID, making the chain of multihash blocks immutable as well. The Advertisement is what is communicated from an Index Provider to an Indexer to supply the Indexer with index data. + +Advertisement as IPLD schema: +``` +# EntryChunk captures a chunk in a chain of entries that collectively contain the multihashes +# advertised by an Advertisement.
+type EntryChunk struct { + # Entries represent the list of multihashes in this chunk. + Entries [Bytes] + # Next is an optional link to the next entry chunk. + Next optional Link +} + +# Advertisement signals availability of content to the indexer nodes in form of a chunked list of +# multihashes, where to retrieve them from, and over which protocol they are retrievable. +type Advertisement struct { + # PreviousID is an optional link to the previous advertisement. + PreviousID optional Link + # Provider is the peer ID of the host that provides this advertisement. + Provider String + # Addresses is the list of multiaddrs as strings from which the advertised content is retrievable. + Addresses [String] + # Signature is the signature of this advertisement. + Signature Bytes + # Entries is a link to the chained EntryChunk instances that contain the multihashes advertised. + Entries Link + # ContextID is the unique identifier for the collection of advertised multihashes. + ContextID Bytes + # Metadata captures contextual information about how to retrieve the advertised content. + Metadata Bytes + # IsRm, when true, specifies that the content identified by ContextID is no longer retrievable from the provider. + IsRm Bool +} +``` + +- The ContextID is limited to a maximum of 64 bytes. +- The Metadata is limited to a maximum of 1024 bytes (1KiB). + +The metadata field contains provider-specific data that pertains to the set of multihashes being added. The metadata is prefixed with a protocol ID number followed by data encoded as per the protocol. The content of the metadata is up to the provider, but if more than the limited size is needed, then the metadata should contain an ID identifying more complex data stored by the provider. + +Graphsync is the most commonly used protocol for retrieving content from a content provider.
The filecoin-graphsync transport metadata is currently defined as follows: + +Uvarint protocol `0x0910` (TransportGraphsyncFilecoinv1 in the [multicodec table](https://github.com/multiformats/multicodec/blob/master/table.csv#L134)). This is followed by a CBOR-encoded struct of: +```go +type GraphsyncFilecoinV1 struct { + // PieceCID identifies the piece this data can be found in + PieceCID cid.Cid + // VerifiedDeal indicates if the deal is verified + VerifiedDeal bool + // FastRetrieval indicates whether the provider claims there is an unsealed copy + FastRetrieval bool +} +``` + +An index entry is always a multihash, extracted from a CID (since indexed content does not track how the content is encoded). This should be the multihash from the CID of a merkle-tree of content hosted by the provider. + +### Announcement + +Index Providers announce the availability of new Advertisements by publishing a notification on a known gossipsub pub-sub channel, `/indexer/ingest/mainnet`, that indexer nodes subscribe to. Alternatively, the Index Provider may post an announcement message directly to an indexer over HTTP. Typically, gossip pub-sub is used and is preferred because the Index Providers will not necessarily know how to contact the Indexers, but will know the filecoin chain nodes that will relay gossip pub-sub to Indexers. + +After receiving an announcement, the Indexer checks if it has already retrieved the announced Advertisement, and if so, ignores the announcement. If the Advertisement has not yet been retrieved, the Indexer contacts the Index Provider, at an address supplied in the announcement, to retrieve the announced Advertisement. The addresses in the announcement can be libp2p addresses or an HTTP address, and the Indexer uses the addresses to retrieve the Advertisement over graphsync or HTTP respectively.
+ +Announcement message as golang struct: +```go +{ + Cid cid.Cid + Addrs [][]byte + ExtraData []byte `json:",omitempty"` +} +``` + +The extra data is used by Index Providers to pass identity data to filecoin nodes in order for the filecoin nodes to allow the announcement to be forwarded over gossip pub-sub. + +### Advertisement Chain +Once the Indexer has received an Advertisement it checks if the previous Advertisement has already been retrieved, and if not, retrieves it. This continues until all previously unseen Advertisements are retrieved or until there are no more Advertisements to retrieve, i.e. the end of the chain is reached. + +After the entire chain of unprocessed Advertisements has been retrieved, the Indexer walks the chain in order from oldest to newest and retrieves the chain of multihash blocks linked to by each advertisement. A multihash block is a chunk of the multihashes in the change set with a link to the next block. Splitting the multihashes into blocks enables block-based data transfer mechanisms to fetch the multihash data and serves as a pagination mechanism for other transports. + +``` + Oldest Newest ++----------+ +--------+ +--------+ +| Ad A | | Ad B | | Ad C | +| prev=nil |<----| prev=A |<----| prev=B | ++----+-----+ +---+----+ +---+----+ + | | | + V V V +[multihashes] [multihashes] [multihashes] + | | | + V V V +[multihashes] [multihashes] [multihashes] +``` + +### Index Data Storage + +All of the multihashes in the multihash blocks are read and stored in the indexer as a mapping of multihashes to a list of providerID-contextID in the Advertisement, and each providerID-contextID is mapped to its metadata record. This allows a multihash to resolve to multiple provider, context ID, and metadata records. It also allows a providerID-contextID to be used to identify metadata records to update and delete.
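The two-level mapping described above can be sketched as a small in-memory model in Go. This is only an illustration under stated assumptions, not how storetheindex stores data: the `Index`, `valueKey`, and `Result` types and their methods are hypothetical, and plain strings stand in for multihash, provider ID, and context ID bytes.

```go
package main

import "fmt"

// valueKey identifies one metadata record: a provider plus a context ID.
type valueKey struct {
	ProviderID string
	ContextID  string
}

// Result is one provider record returned for a multihash lookup.
type Result struct {
	ProviderID string
	ContextID  string
	Metadata   []byte
}

// Index is a hypothetical in-memory model of the two-level mapping.
type Index struct {
	byMultihash map[string]map[valueKey]struct{} // multihash -> set of keys
	metadata    map[valueKey][]byte              // key -> metadata
}

func NewIndex() *Index {
	return &Index{
		byMultihash: make(map[string]map[valueKey]struct{}),
		metadata:    make(map[valueKey][]byte),
	}
}

// Put maps the multihashes to (provider, contextID) and sets or replaces
// the metadata for that key.
func (ix *Index) Put(provider, contextID string, md []byte, mhs ...string) {
	k := valueKey{ProviderID: provider, ContextID: contextID}
	ix.metadata[k] = md
	for _, mh := range mhs {
		if ix.byMultihash[mh] == nil {
			ix.byMultihash[mh] = make(map[valueKey]struct{})
		}
		ix.byMultihash[mh][k] = struct{}{}
	}
}

// Remove drops the key and its metadata, as for an advertisement with IsRm set.
func (ix *Index) Remove(provider, contextID string) {
	k := valueKey{ProviderID: provider, ContextID: contextID}
	delete(ix.metadata, k)
	for mh, keys := range ix.byMultihash {
		delete(keys, k)
		if len(keys) == 0 {
			delete(ix.byMultihash, mh)
		}
	}
}

// Find returns every provider record the multihash resolves to.
func (ix *Index) Find(mh string) []Result {
	var out []Result
	for k := range ix.byMultihash[mh] {
		out = append(out, Result{k.ProviderID, k.ContextID, ix.metadata[k]})
	}
	return out
}

func main() {
	ix := NewIndex()
	ix.Put("providerA", "deal-1", []byte("md-v1"), "mh1", "mh2")
	ix.Put("providerB", "deal-9", []byte("md-b"), "mh1")
	fmt.Println(len(ix.Find("mh1"))) // two providers hold mh1

	ix.Put("providerA", "deal-1", []byte("md-v2")) // metadata update only
	ix.Remove("providerB", "deal-9")
	for _, r := range ix.Find("mh1") {
		fmt.Printf("%s %s %s\n", r.ProviderID, r.ContextID, r.Metadata)
	}
}
```

Because metadata hangs off the providerID-contextID key rather than off each multihash, an update touches a single record however many multihashes refer to it. The linear scan in `Remove` is a simplification of this sketch.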
+ +``` +Multihash ---+ +-----------------------+ + | | ProviderID-ContextID--|----> Metadata +Multihash ---+---> | ProviderID-ContextID--|----> Metadata + | +-----------------------+ +Multihash ---+ +``` + +The Data Provider addresses from the Advertisement are stored separately, and are updated with each advertisement that has a different retrieval address for the Data Provider. When the Indexer responds to a client query, it adds the current Data Provider addresses to each data Provider record in the response. + +When an Advertisement is received that has a ProviderID-ContextID that is already stored in the indexer but different metadata, the indexer updates the metadata that the ProviderID-ContextID maps to. + +A Find result has a list of MultihashResults. Each element of that list contains a Multihash and a list of ProviderResults for that multihash. Each Provider result has a ContextID, Metadata, and Provider. The Provider has an ID and a list of Addrs. + +## **Backwards Compatibility** + +This FIP does not change actors behavior so it does not require any Filecoin network update. + +## **Security Considerations** + +This FIP does not touch underlying proofs or security. + +## **Incentive Considerations** + +No change to incentives. In the future this could support retrieval incentive. + +## **Product Considerations** + +No change to product considerations, except that increased content discoverability and retrieval capability is a net improvement to the Filecoin network. + +In the near future, the product should include controls on what contents can be retrieved, so Storage Providers would have the ability to turn off content that they don’t want to be accessible for retrieval. 
+ +## **Implementation** + +Indexer: + +[https://github.com/filecoin-project/storetheindex](https://github.com/filecoin-project/storetheindex) + +Index Provider: + +[https://github.com/filecoin-project/index-provider](https://github.com/filecoin-project/index-provider) + +Indexer Design: + +[https://github.com/filecoin-project/storetheindex/blob/main/doc/indexer_ecosys.png](https://github.com/filecoin-project/storetheindex/blob/main/doc/indexer_ecosys.png) + +## **Copyright** + +Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/). From f4f8ea07adcf954a2b145ba0c26d309f4cfb53b2 Mon Sep 17 00:00:00 2001 From: honghaoq Date: Fri, 24 Jun 2022 18:56:20 -0700 Subject: [PATCH 14/15] adding indexer frc --- FIPS/IndexerFIP.md | 219 --------------------------------------------- 1 file changed, 219 deletions(-) delete mode 100644 FIPS/IndexerFIP.md diff --git a/FIPS/IndexerFIP.md b/FIPS/IndexerFIP.md deleted file mode 100644 index b2ae7bbcf..000000000 --- a/FIPS/IndexerFIP.md +++ /dev/null @@ -1,219 +0,0 @@ ---- -fip: "0038" -title: Indexer Protocol for Filecoin Content Discovery -author: willscott, gammazero, honghaoq -status: Draft -type: FRC -created: 2022-04-26 -discussion: https://github.com/filecoin-project/FIPs/discussions/337 -spec-pr: https://github.com/filecoin-project/FIPs/pull/365/files ---- - -## Simple Summary - -Indexers store mappings of content multi-hashes to provider data records. A client that wants to know where a piece of information is stored can query the indexer, using the CID or multi-hash of the content, and receive a provider record that tells where the client can retrieve the content and via which data transfer protocols. - -## Abstract - -Filecoin stores a great deal of data, but without proper indexing of that data, clients cannot perform efficient retrieval. 
To improve Filecoin content discoverability, Indexer nodes are developed to store mappings of CID multi-hashes to content provider records, for content lookup upon retrieval request. - -To provide index data to indexer nodes, there is the [Indexer Provider](https://lotus.filecoin.io/storage-providers/operate/index-provider/) that provides a software library for building an index data provider, and an index provider service that can run alongside with Storage Provider. Either of these may be used to tell the Indexer what content is retrievable from a storage provider. The Index Provider library or service sends updates to the Indexer via a series of [Advertisement](https://github.com/filecoin-project/storetheindex/blob/main/api/v0/ingest/schema/schema.ipldsch) messages. Each message references a previous advertisement so that as a whole it forms an advertisement chain. The indexer operates by consuming this chain from the oldest Advertisement not yet consumed, to the latest one. The Indexer fetches new advertisements periodically or in response to a notification from the Index Provider. - -Indexer enables content discovery within Filecoin network today, it is also a fundamental component toward the end goal of universal retrieval across IPFS/Filecoin, i.e. to be able to efficiently retrieve Filecoin content from IPFS. Combining Indexer work for content discovery with IPFS/Filecoin interoperability work for content transfer, we aim to reach the end state of efficient and universal retrieval across IPFS and Filecoin network. - -## Change Motivation - -Primary reasons why Indexer is critical to the success of Filecoin network: - -1. **Content Discovery and Retrieval**: -Running Indexer is required for efficient content discovery and retrieval for Filecoin network, since Filecoin needs a CID to Storage Provider mapping to enable content retrieval. 
It is also an important step toward retrieval across IPFS and Filecoin, as Indexer will index both IPFS and Filecoin data and support efficient and universal retrieval across IPFS and Filecoin network as the end state. -2. **Better usage and growth of the network**: -Making data more accessible to the network increases Filecoin data usage. This drives adoption of the network to onboard more data, resulting in more data being discoverable and retrievable - it is a positive growth flywheel. - - -## Specification - -### Terminology - -Before we dive into the design details, here are a list of concepts we need to cover for Indexer: - -- **Advertisement**: A record available from a publisher that contains, a link to a chain of multihash blocks, the CID of the previous advertisement, and provider-specific content metadata that is referenced by all the multihashes in the linked multihash blocks. The provider data is identified by a key called a context ID. -- **Announce Message**: A message that informs indexers about the availability of an advertisement. This is usually sent via gossip pubsub, but can also be sent via HTTP. An announce message contains the advertisement CID it is announcing, which allows indexers to ignore the announce if they have already indexed the advertisement. The publisher's address is included in the announce to tell indexers where to retrieve the advertisement from. -- **Context ID**: A key that, for a provider, uniquely identifies content metadata. This allows content metadata to be updated or delete on the indexer without having to refer to it using the multihashes that map to it. -- **Entries:** Represents the list of multihashes the content hosted by the storage provider. The entries are represented as an IPLD DAG, divided across a set of interlinked IPLD nodes, referred to as Entries Chunk. -- **Gossip PubSub**: Publish/subscribe communications over a libp2p gossip mesh. 
This is used by publishers to broadcast Announce Messages to all indexers that are subscribed to the topic that the announce message is sent on. For production publishers and indexers, this topic is `"/indexer/ingest/mainnet"`. -- **Indexer**: A network node that keeps a mappings of multihashes to provider records. -- **Metadata**: Provider-specific data that a retrieval client gets from an indexer query and passed to the provider when retrieving content. This metadata is used by the provider to identify and find specific content and deliver that content via the protocol (e.g. graphsync) specified in the metadata. -- **Provider**: Also called a Storage Provider, this is the entity from which content can be retrieved by a retrieval client. When multihashes are looked up on an indexer, the responses contain provider that provide the content referenced by the multihashes. A provider is identified by a libp2p peer ID. -- **Publisher**: Also referred to as _Index Provider_, This is an entity that publishes advertisements and index data to an indexer. It is usually, but not always, the same as the data provider. A publisher is identified by a libp2p peer ID. -- **Retrieval Addresses:** A list of addresses included in each advertisement that points to where you can retrieve from the advertised content. -- **Retrieval Client**: A client that queries an indexer to find where content is available, and retrieves that content from a provider. -- **Sync** (indexer with publisher): Operation that synchronizes the content indexed by an indexer with the content published by a publisher. A sync is initiated when an indexer receives and announces message, by an administrative command to sync with a publisher, or by the indexer when there have been no updates for a provider for some period of time (24 hours by default). - -### Parties - -**Data Providers** store data and make it available for retrieval clients to retrieve. 
They want to advertise the availability of this content so that any consumers can easily find it. Data providers will also want to revoke advertisements of content that they no longer provide and update their internal location of content. - -**Index Providers (Publishers)** maintain a history of changes to content stored by a Data Provider, and present the sequence of changes to Indexers as advertisements. Index Providers sent notifications to Indexers to announce that new advertisements are available. Usually a Data Provider is also its own Index Provider, but these can be different entities. - -**Indexer nodes** receive Advertisement announcements published by Index Providers, allowing Indexers to discover Data Providers and to be notified of updates to the Data Provider content. The announcements let Indexers know that new advertisements are available. Indexers retrieve Advertisements from the Index Providers, to get Data Provider information and associated index multihash data. The Indexer decides if and when to fetch index data from an Index Provider based on the policies the Indexer is configured with. Indexers may reroute client requests to other indexers if they do not handle that content. - -**Clients** want to issue query style requests for content to an Indexer node and receive a set of Data Provider records that inform the client how and where to retrieve that content. The response from an Indexer contains a set of Data Provider records, each having the Provider's ID and addresses. Each record contains the protocol that the client must use to retrieve the data (e.g. graphsync) as well as other information that the client presents to the Provider. This additiona data is used by the Provider to retrieve the content, and may consist of a deal ID or other lookup keys specific to the Provider. 
If multiple Providers provide the same content, the client may choose based on input from a reputation system, network response time, location, or any other information available to the client. - - -### Data Provier Interface: - -Data providers maintain local records of the CIDs of the content they store and the changes to this content. Providers must be able to present this as an ordered series of changes to sets of multihashes over time. For indexing, a multihash extracted from a CID is used identify content, since indexed content does not track how the content is encoded. The source CID is the CID of a merkle-tree of content hosted by the provider. - -Each addition of content is represented a set of multihashes accompanied by context information (metadata), that is within the domain of interpretation of the Provider, and an unique identifier (context ID) that identifies this context. Therefore a context ID links a set of multihashes to metadata that pertains to that set of multihashes. Removal of content is done by deletion of a context ID, which represents removing a set of multihashes and metadata identified by that context ID. Metadata identified by context ID may also be replaced by new metadata, as the information pertaining to the associated set of multihashes changes (change of location, storage deal, etc.). - -Each of these changes is a presented as a separate record, that is linked to the previous record, forming an ordered log changes to the Provider's content. Indexers track these changes to keep their view of the Provider's content in sync with the Provider. - -### Advertisements: - -An Advertisement is a data structure that packages information about a change to Provider content. The Advertisement contains the Provider ID and addresses, content metadata, context ID, a link to a chain of multihash blocks, and a link to the previous Advertisement. 
The Advertisement is also signed by the Provider or publisher of the Advertisement, using a signature computed over all of these fields. - -Each Advertisement is uniquely identified by a content ID (CID) that is used to retrieve that Advertisement from the Index Provider. This makes the advertisement an immutable record. The link to the chain of multihash blocks and each link in the chain is also a CID, making the chain of multihash blocks immutable as well. The Advertisement is what is communicated from an Index Provider to an Indexer to supply the Indexer with index data. - -Advertisement as IPLD schema: -``` -# EntryChunk captures a chunk in a chain of entries that collectively contain the multihashes -# advertised by an Advertisement. -type EntryChunk struct { - # Entries represent the list of multihashes in this chunk. - Entries [Bytes] - # Next is an optional link to the next entry chunk. - Next optional Link -} - -# Advertisement signals availability of content to the indexer nodes in form of a chunked list of -# multihashes, where to retrieve them from, and over protocol they are retrievable. -type Advertisement struct { - # PreviousID is an optional link to the previous advertisement. - PreviousID optional Link - # Provider is the peer ID of the host that provides this advertisement. - Provider String - # Addresses is the list of multiaddrs as strings from which the advertised content is retrievable. - Addresses [String] - # Signature is the signature of this advertisement. - Signature Bytes - # Entries is a link to the chained EntryChunk instances that contain the multihashes advertised. - Entries Link - # ContextID is the unique identifier for the collection of advertised multihashes. - ContextID Bytes - # Metadata captures contextual information about how to retrieve the advertised content. - Metadata Bytes - # IsRm, when true, specifies that the content identified by ContextID is no longer retrievalbe from the provider. 
- IsRm Bool -} -``` - -- The ContextID is limited to a maximum of 64 bytes. -- The Metadata is limited to a maximum of 1024 bytes (1KiB). - -The metadata field contains provider-specific data that pertains to the set of multihashes being added. The metadata is prefixed with a protocol ID number followed by data encoded as per the protocol. The content of the metadata is up to the provider, but if more than the limited size is needed, then the metadata should contain an ID identifying mode complex data stored by the provider. - -Graphsync is the most commonly used protocol for retrieving content from a content provider. The filecoin-graphsync transport metadata is currently defined as follows: - -Uvarint protcol `0x0910` (TransportGraphsyncFilecoinv1 in the [multicodec table](https://github.com/multiformats/multicodec/blob/master/table.csv#L134)). This is followed by a CBOR-encoded struct of: -```go -type GraphsyncFilecoinV1 struct { - // PieceCID identifies the piece this data can be found in - PieceCID cid.Cid - // VerifiedDeal indicates if the deal is verified - VerifiedDeal bool - // FastRetrieval indicates whether the provider claims there is an unsealed copy - FastRetrieval bool -} - -An index entry is always a multihash, extracted from a CID (since indexed content does not track how the content is encoded). This should be the multihash from the CID of a merkle-tree of content hosted by the provider. - -### Announcement - -Index Providers announce the availability of new Advertisements by publishing a notification on a known gossipsub pub-sub channel, `/indexer/ingest/mainnet`, that indexer nodes subscribe to. Alternatively, the Index Provider may post an announcement message directly to an indexer over HTTP. Typically, gossip pub-sub is used and is preferred because the Index Providers will not necessarily know how to contact the Indexers, but will know the filecoin chain nodes that will relay gossip pub-sub to Indexers. 
-
-After receiving an announcement, the Indexer checks whether it has already retrieved the announced Advertisement, and if so, ignores the announcement. If the Advertisement has not yet been retrieved, the Indexer contacts the Index Provider, at an address supplied in the announcement, to retrieve the announced Advertisement. The addresses in the announcement can be libp2p addresses or an HTTP address, and the Indexer uses them to retrieve the Advertisement over graphsync or HTTP respectively.
-
-Announcement message as golang struct:
-```go
-type Message struct {
-	// Cid is the CID of the announced Advertisement.
-	Cid cid.Cid
-	// Addrs holds the addresses from which the Advertisement can be retrieved.
-	Addrs [][]byte
-	// ExtraData optionally carries identity data for Filecoin nodes.
-	ExtraData []byte `json:",omitempty"`
-}
-```
-
-The extra data is used by Index Providers to pass identity data to Filecoin nodes so that the Filecoin nodes allow the announcement to be forwarded over gossip pub-sub.
-
-### Advertisement Chain
-Once the Indexer has received an Advertisement, it checks whether the previous Advertisement has already been retrieved, and if not, retrieves it. This continues until all previously unseen Advertisements are retrieved or until there are no more Advertisements to retrieve, i.e. the end of the chain is reached.
-
-After the entire chain of unprocessed Advertisements has been retrieved, the Indexer walks the chain in order from oldest to newest and retrieves the chain of multihash blocks linked to by each Advertisement. A multihash block is a chunk of the multihashes in the change set with a link to the next block. Splitting the multihashes into blocks enables block-based data transfer mechanisms to fetch the multihash data and serves as a pagination mechanism for other transports.
-
-```
-   Oldest                         Newest
-+----------+     +--------+     +--------+
-|   Ad A   |     |  Ad B  |     |  Ad C  |
-| prev=nil |<----| prev=A |<----| prev=B |
-+----+-----+     +---+----+     +---+----+
-     |               |              |
-     V               V              V
-[multihashes]  [multihashes]  [multihashes]
-     |               |              |
-     V               V              V
-[multihashes]  [multihashes]  [multihashes]
-```
-
-### Index Data Storage
-
-All of the multihashes in the multihash blocks are read and stored in the indexer as a mapping of multihashes to a list of providerID-contextID keys from the Advertisement, and each providerID-contextID is mapped to its metadata record. This allows a multihash to resolve to multiple provider, context ID, and metadata records. It also allows a providerID-contextID to be used to identify metadata records to update and delete.
-
-```
-Multihash ---+     +-----------------------+
-             |     | ProviderID-ContextID--|----> Metadata
-Multihash ---+---> | ProviderID-ContextID--|----> Metadata
-             |     +-----------------------+
-Multihash ---+
-```
-
-The Data Provider addresses from the Advertisement are stored separately, and are updated with each Advertisement that has a different retrieval address for the Data Provider. When the Indexer responds to a client query, it adds the current Data Provider addresses to each provider record in the response.
-
-When an Advertisement is received that has a ProviderID-ContextID that is already stored in the indexer but has different metadata, the indexer updates the metadata that the ProviderID-ContextID maps to.
-
-A Find result has a list of MultihashResults. Each element of that list contains a Multihash and a list of ProviderResults for that multihash. Each ProviderResult has a ContextID, Metadata, and Provider. The Provider has an ID and a list of Addrs.
-
-## **Backwards Compatibility**
-
-This FIP does not change actor behavior, so it does not require any Filecoin network update.
-
-## **Security Considerations**
-
-This FIP does not touch underlying proofs or security.
-
-## **Incentive Considerations**
-
-No change to incentives.
In the future this could support retrieval incentives.
-
-## **Product Considerations**
-
-No change to product considerations, except that increased content discoverability and retrieval capability is a net improvement to the Filecoin network.
-
-In the near future, the product should include controls over which content can be retrieved, so that Storage Providers can turn off retrieval for content they don’t want to be accessible.
-
-## **Implementation**
-
-Indexer:
-
-[https://github.com/filecoin-project/storetheindex](https://github.com/filecoin-project/storetheindex)
-
-Index Provider:
-
-[https://github.com/filecoin-project/index-provider](https://github.com/filecoin-project/index-provider)
-
-Indexer Design:
-
-[https://github.com/filecoin-project/storetheindex/blob/main/doc/indexer_ecosys.png](https://github.com/filecoin-project/storetheindex/blob/main/doc/indexer_ecosys.png)
-
-## **Copyright**
-
-Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).

From 2b69abab20f3829d39c64aa6e1c60d324ec8df89 Mon Sep 17 00:00:00 2001
From: Honghao Qiu
Date: Fri, 24 Jun 2022 18:59:21 -0700
Subject: [PATCH 15/15] Update frc-indexer.md

---
 FRCs/frc-indexer.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/FRCs/frc-indexer.md b/FRCs/frc-indexer.md
index faf677cb6..5d3616ed2 100644
--- a/FRCs/frc-indexer.md
+++ b/FRCs/frc-indexer.md
@@ -6,7 +6,7 @@ status: Draft
 type: FRC
 created: 2022-06-24
 discussion: https://github.com/filecoin-project/FIPs/discussions/337
-spec-pr: https://github.com/filecoin-project/FIPs/pull/365/files
+spec-pr: https://github.com/filecoin-project/FIPs/pull/393/files
 ---
 
 ## Simple Summary