Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Piece Indexer Missing CIDs #30

Closed
NikolasHaimerl opened this issue Jan 27, 2025 · 12 comments
Closed

Piece Indexer Missing CIDs #30

NikolasHaimerl opened this issue Jan 27, 2025 · 12 comments

Comments

@NikolasHaimerl
Copy link
Contributor

NikolasHaimerl commented Jan 27, 2025

Many miners don't advertise their deals to IPNI at all. This leads to the piece CID indexer not keeping record of payloads for all piece CID and minerID combinations.

It is possible that miners do publish to IPNI but it takes an unexpectedly long amount of time.
To summarize three cases need to be accounted for:

  1. Miner does not publish to IPNI -> Payload is not retrievable
  2. Miner does publish to IPNI but it is not yet updated -> Wait for update
  3. Miner does publish to IPNI and it has been updated -> Fetch payload.
@NikolasHaimerl NikolasHaimerl changed the title Piece Indexer MIssing CIDs Piece Indexer Missing CIDs Jan 27, 2025
@bajtos
Copy link
Member

bajtos commented Jan 28, 2025

Many miners don't advertise their deals to IPNI at all. This leads to the piece CID indexer not keeping record of payloads for all piece CID and minerID combinations.

Could you please explain why this is an issue?

In my mind, when deal-observer asks PieceIndexer for a payload CID and receives back "not found" response, it should retry the request after some time to take into account IPNI ingestion delays, and eventually give up. E.g. retry in 2 hours, 8 hours, 24 hours and then give up.

This seems like a simpler solution to me:

  • We already need to implement "wait for update" inside deal-observer, so my suggestion does not make deal-observer any more complex.
  • We don't need to query the list of providers - less code to write & maintain.

@NikolasHaimerl
Copy link
Contributor Author

Many miners don't advertise their deals to IPNI at all. This leads to the piece CID indexer not keeping record of payloads for all piece CID and minerID combinations.

Could you please explain why this is an issue?

In my mind, when deal-observer asks PieceIndexer for a payload CID and receives back "not found" response, it should retry the request after some time to take into account IPNI ingestion delays, and eventually give up. E.g. retry in 2 hours, 8 hours, 24 hours and then give up.

This seems like a simpler solution to me:

* We already need to implement "wait for update" inside deal-observer, so my suggestion does not make deal-observer any more complex.

* We don't need to query the list of providers - less code to write & maintain.

It is an issue in sofar, as the current implementation expects all piece CIDs to be present in the IPNI piece indexer.
I had a similar solution in mind. One thing that we need to think of is whether we want deals to where there is simply no Payload published on IPNI to be retried for at a later point in time, or whether we should flag them. That latter would guarantee that when we restart the piece indexer part of the deal-observer it does not try to fetch deals it had already given up on before.

@bajtos
Copy link
Member

bajtos commented Jan 29, 2025

One thing that we need to think of is whether we want deals to where there is simply no Payload published on IPNI to be retried for at a later point in time, or whether we should flag them. That latter would guarantee that when we restart the piece indexer part of the deal-observer it does not try to fetch deals it had already given up on before.

Yes, we need to flag deals where we already contacted PieceIndexer/IPNI and could not find any payload CID.

The current DB of all active f05 deals has over 20 million entries. From what I see in Spark measurements, around 65% of retrievals fail because the miner did not advertise the deal to IPNI. That gives us around 13 million deals not advertised to IPNI. We don't want to retry 13 million IPNI/PieceIndexer queries.

when we restart the piece indexer part of the deal-observer

I may be misreading your comment, but I want to be explicit that we must store the status of piece->payload resolution in the database, so that when we restart the Node.js process (e.g. after deploying a new version), the loop continues the work where it left.

I also want to mention that it's okay to keep the initial version simpler as long as the reduced design stays reasonable.

What we need to ship ASAP: When an SP starts accepting FIL+ deals, advertises them correctly to IPNI/PieceIndexer and serves retrievals, Spark should report a high RSR score for them. (This is kind of an happy path in the context of your work.)

Where I see an opportunity to simplify the initial version:

  • For each deal, make only one request to PieceIndexer.
  • PayloadCID found -> we are done.
  • PayloadCID not found -> flag the deal in the database and ignore it for now.

In subsequent iterations, we can implement a retry mechanism for deals where the first PieceIndexer request did not find any payload.

Even later, we may want to allow SPs to manually trigger re-running of deal->payload mapping for their existing deals, e.g. after they fix their configuration and start announcing deals to IPNI.

@bajtos
Copy link
Member

bajtos commented Jan 29, 2025

It was initially not clear to me how did you designed the logic to determine which deals to skip. After re-reading your comments and reading through the code, I think I understand it better now.

See #31 (comment)

TL;DR:

I think your current design is good enough for the first iteration, but we need to fix the initialisation at service (re)start:

  • Find the first epoch where all deals are missing payload CID, and start from that.

@NikolasHaimerl
Copy link
Contributor Author

NikolasHaimerl commented Feb 3, 2025

Implementation Plan:

Database Changes

The proposed solution adds two more fields to the database:

  • Payload Unretrievable: Bool
  • Last Payload Retrieval: number
  • Index for fast querying of SELECT * FROM active_deals WHERE payload_cid IS NULL AND payload_unretrievable IS DISTINCT FROM TRUE AND (last_payload_retrieval IS NULL OR last_payload_retrieval < $1) ORDER BY activated_at_epoch ASC LIMIT $2'

Program Flow

Payload Unretrievable marks whether there has been a retry of fetching the payload for a given deal. It is left as NULL if there has not been a retry, if the retry was successful and it is set to TRUE if the retry was unsuccessful too.

Last Payload Retrieval marks the timestamp at which the payload was tried to be fetched the last time. It is set to NULL if the payload has not been attempted to be fetched yet and it is set to the unix timestamp of the last time the piece indexer tried to fetch the payload for a given deal.

The piece indexer will fetch all deals where the payload ID is missing, the column payload_unretrievable is not set to TRUE and the column last_payload_retrieval is either NULL or is set to a timestamp that is more than 3 days or (24606010003) milliseconds ago. For simplicity we only retry fetching payloads once and a conservative estimate for when deals that are published are updated in the IPNI is 3 days.

If the retry is successful we record this in the database by setting payload_unretrievable to FALSE. Should the retry also fail then we set payload_unretrievable to TRUE and the deal should thus be excluded from further retries. In both cases we update last_payload_retrieval to the appropriate timestamp.

Image

@juliangruber
Copy link
Member

Implementation Plan:

Database Changes

The proposed solution adds two more fields to the database:

  • Payload Unretrievable: Bool

Looking at the query below, this can be TRUE, FALSE or NULL, NULL specifying that we haven't decided yet whether the payload is unretrievable. Is that correct?

  • Last Payload Retrieval: number

Can we use a date type instead of a number? https://www.postgresql.org/docs/current/datatype-datetime.html timestamptz should be the way to go. This way the DB helps us with data correctness, and certain queries. Otherwise, we could store any value as type JSON.

  • Index for fast querying of SELECT * FROM active_deals WHERE payload_cid IS NULL AND payload_unretrievable IS DISTINCT FROM TRUE AND (last_payload_retrieval IS NULL OR last_payload_retrieval < $1) ORDER BY activated_at_epoch ASC LIMIT $2'

Can address this later in code review, or now:

  • Which properties will the index be built upon?
  • IS DISTINCT FROM TRUE -> IS NOT TRUE

Program Flow

The first one marks

I'm sorry, the first one of what? Do you mean the flow chart entry "Fetch deals with..."?

whether there has been a retry of fetching the payload for a given deal. It is left as NULL if there has not been a retry, if the retry was successful and it is set to TRUE if the retry was unsuccessful too.

I'm not sure I understand, let me paraphrase how I read this: We mark "retry of fetching the payload" to NULL if there has not been a retry, or if the retry was successful. We set "retry of fetching the payload" to TRUE if the retry was unsuccessful too.

Maybe this logic could also be clarified through the flow diagram?

The second field marks the timestamp at which the payload was tried to be fetched the last time. It is set to NULL if the payload has not been attempted to be fetched yet and it is set to the unix timestamp of the last time the piece indexer tried to fetch the payload for a given deal.

Oh I think I understand, by "the first one" and "the second field" you mean the DB columns "Payload Unretrievable" and "Last Payload Retrieval" from above, right?

Could you please clarify the questions from this comment, before I move on with the implementation plan review?

Tip: If you want to codify the flow chart, GitHub supports Mermaidjs:

Ex:

flowchart TD
  Start
Loading

@bajtos
Copy link
Member

bajtos commented Feb 4, 2025

The piece indexer will fetch all deals where the payload ID is missing, the column payload_unretrievable is not set to TRUE and the column last_payload_retrieval is either NULL or is set to a timestamp that is more than 3 days or (24606010003) milliseconds ago. For simplicity we only retry fetching payloads once and a conservative estimate for when deals that are published are updated in the IPNI is 3 days.

👏🏻

Last Payload Retrieval: number

I agree with Julian's suggestion to use timestamptz data type. IIRC, node-postgres will use Date on the Node.js side, giving us more type information compared to just a "number".

The first one marks whether there has been a retry of fetching the payload for a given deal. It is left as NULL if there has not been a retry, if the retry was successful and it is set to TRUE if the retry was unsuccessful too.

It makes sense to me to have a second column to store this bit of information 👍🏻

I find the currently proposed representation of "Payload Unretrievable" a bit difficult to reason about. Can we improve it for more simplicity and easier understanding?

I'd say the primary cause of confusion is that a data type allowing three values is not a boolean. Some ideas to consider:

  • Can we use an enum type instead? E.g. PAYLOAD_CID_NOT_QUERIED_YET, PAYLOAD_CID_NOT_FOUND, PAYLOAD_CID_RESOLVED.
  • Maybe we can use a counter to store the number of attempts performed? We can use DEFAULT 0 to initialise this column with a non-null value for new deals.
  • Maybe we don't need to store all three states in this field, because the state "payload CID was retrieved" is already encoded as payload_cid IS NOT NULL.

Also: let's find a different term than "retrieve" to clarify the distinction between retrieving the payload bytes (usually from the Storage Provider) and discovering the CID of those payload bytes (via IPNI or PieceIndexer query).

@pyropy
Copy link
Contributor

pyropy commented Feb 4, 2025

Last Payload Retrieval: number

I side with @juliangruber and @bajtos on using timestamps for this field.

Also, are we trying to resolve Payload CID when we first find the deal or are we waiting for three days before doing so? If latter is the case we could set this value to NOW() instead of it being null.

Index for fast querying

What are the fields you were thinking of indexing? I guess indexing over last_payload_retrieval makes sense.

Maybe we can use a counter to store the number of attempts performed? We can use DEFAULT 0 to initialise this column with a non-null value for new deals.

I think this is a good idea. That way we can try stop retrying after some set number of attempts or have some different rules for payloads that we have attempted to fetch some N number of times.

Maybe we don't need to store all three states in this field, because the state "payload CID was retrieved" is already encoded as payload_cid IS NOT NULL.

This combined with the counter would be a good idea.

@NikolasHaimerl
Copy link
Contributor Author

Looking at the query below, this can be TRUE, FALSE or NULL, NULL specifying that we haven't decided yet whether the payload is unretrievable. Is that correct?
@juliangruber Payload Unretrievable is NULL if we have not retried fetching the deal. It is set to False as soon as we know for sure it is retrievable and it is set to True once we have retried and it was unsuccessful.

Can we use a date type instead of a number? https://www.postgresql.org/docs/current/datatype-datetime.html timestamptz should be the way to go. This way the DB helps us with data correctness, and certain queries. Otherwise, we could store any value as type JSON.

Sure we can also use timestamp as a data type.

Can address this later in code review, or now:

Which properties will the index be built upon?
IS DISTINCT FROM TRUE -> IS NOT TRUE

@pyropy
At the time of writing this plan I was still experimenting with what index makes sense using the EXPLAIN functionality of SQL. I went with the following:

CREATE INDEX CONCURRENTLY IF NOT EXISTS missing_payloads_idx
ON active_deals (activated_at_epoch)
WHERE payload_cid IS NULL
  AND payload_unretrievable IS DISTINCT FROM TRUE;

I'm not sure I understand, let me paraphrase how I read this: We mark "retry of fetching the payload" to NULL if there has not been a retry, or if the retry was successful. We set "retry of fetching the payload" to TRUE if the retry was unsuccessful too.

Maybe this logic could also be clarified through the flow diagram?

By default the field payload_unretrievable is null which means there has not been a retry, it is set to FALSE if the retry was successful and it is set to TRUE if it was not successful.

Also: let's find a different term than "retrieve" to clarify the distinction between retrieving the payload bytes (usually from the Storage Provider) and discovering the CID of those payload bytes (via IPNI or PieceIndexer query).

@bajtos Do we ever store the payload bytest in the database?

I'd say the primary cause of confusion is that a data type allowing three values is not a boolean. Some ideas to consider:

I am fine with using an enum instead. The program logic does not change.

Maybe we can use a counter to store the number of attempts performed? We can use DEFAULT 0 to initialise this column with a non-null value for new deals.

Is there an advantage of having multiple attempts over trying once more at the end of the period where the payload is expected to be published by?

@NikolasHaimerl
Copy link
Contributor Author

Also, are we trying to resolve Payload CID when we first find the deal or are we waiting for three days before doing so? If latter is the case we could set this value to NOW() instead of it being null.

@pyropy we are retrieving the deal from the RPC provider, this deal does not yet include the payload. After we fetched the deal we try to retrieve the payload CID. It is at this moment that we set the value for the last payload retrieval. Before that it is set to NULL because we have not yet attempted to retrieve it.

@bajtos
Copy link
Member

bajtos commented Feb 5, 2025

Also: let's find a different term than "retrieve" to clarify the distinction between retrieving the payload bytes (usually from the Storage Provider) and discovering the CID of those payload bytes (via IPNI or PieceIndexer query).

Do we ever store the payload bytest in the database?

No, we don't. When a checker node executes a task (a retrieval attempt), it downloads the payload bytes from the Storage Provider, verifies that the hash of the bytes matches the CID requested, and reports the outcome only, not the payload bytes.

Maybe we can use a counter to store the number of attempts performed? We can use DEFAULT 0 to initialise this column with a non-null value for new deals.

Is there an advantage of having multiple attempts over trying once more at the end of the period where the payload is expected to be published by?

Great question!

IIUC, our primary motivation in this task is to handle ingestion delays on the IPNI side. I think your proposal - try one more time after 3 days - is the right solution for that. It's simple and elegant 👏🏻

Being able to retry more than once would become important if our requests to IPNI/PieceIndexer would fail. I.e. if the outcome of the operation "ask IPNI for payload CID" is neither "payload CID" nor "this deal was not indexed" but an error instead. Plus if these outages were too long to be handled by wrapping the IPNI/PieceIndexer requests with pRetry. That sounds very speculative to me, so I would say YAGNI - don't worry about this case until we actually experience it.

@juliangruber WDYT?

@bajtos
Copy link
Member

bajtos commented Feb 5, 2025

I'd say the primary cause of confusion is that a data type allowing three values is not a boolean. Some ideas to consider:

I am fine with using an enum instead. The program logic does not change.

The important insight here is that finding a correct solution (program logic) is only the first step — it is necessary but not sufficient. The second step is to improve the solution so that it's easy for others to understand it, too, even without the full context you built while working on this problem. In many cases, this other person will be future you trying to understand code you wrote months ago.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants