Piece Indexer Missing CIDs #30
Comments
Could you please explain why this is an issue? In my mind, when deal-observer asks PieceIndexer for a payload CID and receives back "not found" response, it should retry the request after some time to take into account IPNI ingestion delays, and eventually give up. E.g. retry in 2 hours, 8 hours, 24 hours and then give up. This seems like a simpler solution to me:
|
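For illustration, the retry schedule suggested above (2 hours, 8 hours, 24 hours, then give up) could be sketched roughly like this. This is only a minimal sketch: the names `RETRY_DELAYS_MS` and `nextRetryAt` are hypothetical and not part of deal-observer, and it assumes each delay is measured from the first lookup.

```typescript
// Hypothetical schedule from the comment above: 2h, 8h, 24h, then give up.
const HOUR_MS = 60 * 60 * 1000;
const RETRY_DELAYS_MS: readonly number[] = [2 * HOUR_MS, 8 * HOUR_MS, 24 * HOUR_MS];

/**
 * Returns when the next PieceIndexer lookup should run, or null when the
 * schedule is exhausted and the deal should be given up on.
 *
 * Assumption: delays are measured from the first lookup, and
 * `failedAttempts` counts the "not found" responses received so far (1, 2, 3, ...).
 */
function nextRetryAt(firstAttemptAt: Date, failedAttempts: number): Date | null {
  const delay = RETRY_DELAYS_MS[failedAttempts - 1];
  // No entry in the schedule means we have run out of retries.
  if (delay === undefined) return null;
  return new Date(firstAttemptAt.getTime() + delay);
}
```

With this shape, the loop only needs the time of the first lookup and a failure counter per deal to decide what to do next.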
It is an issue insofar as the current implementation expects all piece CIDs to be present in the IPNI piece indexer.
Yes, we need to flag deals where we already contacted PieceIndexer/IPNI and could not find any payload CID. The current DB of all active f05 deals has over 20 million entries. From what I see in Spark measurements, around 65% of retrievals fail because the miner did not advertise the deal to IPNI. That gives us around 13 million deals not advertised to IPNI. We don't want to retry 13 million IPNI/PieceIndexer queries.
I may be misreading your comment, but I want to be explicit that we must store the status of piece->payload resolution in the database, so that when we restart the Node.js process (e.g. after deploying a new version), the loop continues the work where it left off. I also want to mention that it's okay to keep the initial version simpler as long as the reduced design stays reasonable. What we need to ship ASAP: When an SP starts accepting FIL+ deals, advertises them correctly to IPNI/PieceIndexer and serves retrievals, Spark should report a high RSR score for them. (This is kind of a happy path in the context of your work.) Where I see an opportunity to simplify the initial version:
In subsequent iterations, we can implement a retry mechanism for deals where the first PieceIndexer request did not find any payload. Even later, we may want to allow SPs to manually trigger re-running of deal->payload mapping for their existing deals, e.g. after they fix their configuration and start announcing deals to IPNI. |
It was initially not clear to me how you designed the logic to determine which deals to skip. After re-reading your comments and reading through the code, I think I understand it better now. See #31 (comment) TL;DR: I think your current design is good enough for the first iteration, but we need to fix the initialisation at service (re)start:
|
Implementation Plan:
Database Changes
The proposed solution adds two more fields to the database:
Program Flow
Payload Unretrievable marks whether there has been a retry of fetching the payload for a given deal. It is left as NULL if there has not been a retry or if the retry was successful, and it is set to TRUE if the retry was unsuccessful too.
Last Payload Retrieval marks the time at which fetching the payload was last attempted. It is NULL if no attempt has been made yet; otherwise it holds the Unix timestamp of the last time the piece indexer tried to fetch the payload for a given deal.
The piece indexer will fetch all deals where the payload ID is missing, the column If the retry is successful we record this in the database by setting
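To make the intended semantics of the two fields easier to follow, here is a minimal sketch of how they might drive the loop. Everything here is an assumption for illustration: the `DealRow` shape, the function names, the use of `Date` rather than a Unix timestamp, and the three-day retry window (taken from the discussion further down the thread) are hypothetical and not the actual deal-observer schema.

```typescript
// Hypothetical in-memory view of a deal row; column names are illustrative.
interface DealRow {
  payloadCid: string | null;
  payloadUnretrievable: boolean | null; // "Payload Unretrievable"
  lastPayloadRetrieval: Date | null;    // "Last Payload Retrieval"
}

// Assumption: a single retry after three days.
const RETRY_AFTER_MS = 3 * 24 * 60 * 60 * 1000;

/** Should the loop ask the piece indexer for this deal's payload CID now? */
function needsLookup(deal: DealRow, now: Date): boolean {
  if (deal.payloadCid !== null) return false;             // already resolved
  if (deal.payloadUnretrievable === true) return false;   // retry already failed
  if (deal.lastPayloadRetrieval === null) return true;    // never attempted
  // One attempt was made: retry only once the waiting period has elapsed.
  return now.getTime() - deal.lastPayloadRetrieval.getTime() >= RETRY_AFTER_MS;
}

/** Returns the row as it would look after a lookup; null CID means "not found". */
function applyLookupResult(deal: DealRow, payloadCid: string | null, now: Date): DealRow {
  const isRetry = deal.lastPayloadRetrieval !== null;
  return {
    payloadCid,
    // Stays NULL after a first attempt or a successful retry; TRUE only when the retry fails.
    payloadUnretrievable: payloadCid === null && isRetry ? true : null,
    lastPayloadRetrieval: now,
  };
}
```

Because both fields live on the row, restarting the process does not lose track of which deals were already attempted.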
Looking at the query below, this can be
Can we use a date type instead of a number? https://www.postgresql.org/docs/current/datatype-datetime.html
Can address this later in code review, or now:
I'm sorry, the first one of what? Do you mean the flow chart entry "Fetch deals with..."?
I'm not sure I understand, let me paraphrase how I read this: We mark "retry of fetching the payload" to NULL if there has not been a retry, or if the retry was successful. We set "retry of fetching the payload" to TRUE if the retry was unsuccessful too. Maybe this logic could also be clarified through the flow diagram?
Oh I think I understand, by "the first one" and "the second field" you mean the DB columns "Payload Unretrievable" and "Last Payload Retrieval" from above, right? Could you please clarify the questions from this comment, before I move on with the implementation plan review? Tip: If you want to codify the flow chart, GitHub supports Mermaidjs:
Ex: flowchart TD
Start
|
👏🏻
I agree with Julian's suggestion to use
It makes sense to me to have a second column to store this bit of information 👍🏻 I find the currently proposed representation of "Payload Unretrievable" a bit difficult to reason about. Can we improve it for more simplicity and easier understanding? I'd say the primary cause of confusion is that a data type allowing three values is not a boolean. Some ideas to consider:
Also: let's find a different term than "retrieve" to clarify the distinction between retrieving the payload bytes (usually from the Storage Provider) and discovering the CID of those payload bytes (via IPNI or PieceIndexer query). |
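As one possible way to make the state explicit, the nullable boolean plus timestamp could be read as a single named status. The sketch below is only an illustration: the status names and the `toStatus` helper are hypothetical, and it uses "resolve" for discovering the payload CID to keep it distinct from retrieving the payload bytes.

```typescript
// Hypothetical status names; not an agreed-upon schema.
type PayloadResolutionStatus =
  | "NOT_ATTEMPTED"  // we never asked IPNI/PieceIndexer for this deal
  | "PENDING_RETRY"  // first lookup found nothing, one retry still pending
  | "RESOLVED"       // payload CID is known
  | "UNRESOLVABLE";  // the retry found nothing either

/** Maps the proposed column values onto one explicit status. */
function toStatus(
  payloadCid: string | null,
  payloadUnretrievable: boolean | null,
  lastAttemptAt: Date | null,
): PayloadResolutionStatus {
  if (payloadCid !== null) return "RESOLVED";
  if (payloadUnretrievable === true) return "UNRESOLVABLE";
  if (lastAttemptAt === null) return "NOT_ATTEMPTED";
  return "PENDING_RETRY";
}
```

A reader then sees four named states instead of having to decode what NULL means in combination with the timestamp.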
I side with @juliangruber and @bajtos on using timestamps for this field. Also, are we trying to resolve the Payload CID when we first find the deal, or are we waiting for three days before doing so? If the latter is the case, we could set this value to
What are the fields you were thinking of indexing? I guess indexing over
I think this is a good idea. That way we can stop retrying after some set number of attempts or have different rules for payloads that we have attempted to fetch some
This combined with the counter would be a good idea. |
Sure we can also use timestamp as a data type.
@pyropy
Maybe this logic could also be clarified through the flow diagram? By default the field
@bajtos Do we ever store the payload bytes in the database?
I am fine with using an enum instead. The program logic does not change.
Is there an advantage to having multiple attempts over trying once more at the end of the period by which the payload is expected to be published?
@pyropy we are retrieving the deal from the RPC provider, this deal does not yet include the payload. After we fetched the deal we try to retrieve the payload CID. It is at this moment that we set the value for the last payload retrieval. Before that it is set to NULL because we have not yet attempted to retrieve it. |
No, we don't. When a checker node executes a task (a retrieval attempt), it downloads the payload bytes from the Storage Provider, verifies that the hash of the bytes matches the CID requested, and reports the outcome only, not the payload bytes.
Great question! IIUC, our primary motivation in this task is to handle ingestion delays on the IPNI side. I think your proposal - try one more time after 3 days - is the right solution for that. It's simple and elegant 👏🏻 Being able to retry more than once would become important if our requests to IPNI/PieceIndexer were to fail, i.e. if the outcome of the operation "ask IPNI for payload CID" is neither "payload CID" nor "this deal was not indexed" but an error instead. Plus if these outages were too long to be handled by wrapping the IPNI/PieceIndexer requests with @juliangruber WDYT?
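The distinction between the three outcomes can be made explicit in code. The following is a rough sketch under stated assumptions: the `LookupOutcome` type, the action names, and the `nextAction` helper are hypothetical, and the policy shown (an error does not use up the single retry) is one possible reading of the comment above rather than an agreed design.

```typescript
// Hypothetical model of the three outcomes of "ask IPNI for payload CID".
type LookupOutcome =
  | { kind: "found"; payloadCid: string }
  | { kind: "not-indexed" }
  | { kind: "error"; message: string };

type NextAction = "store-cid" | "retry-after-delay" | "give-up" | "retry-soon";

/**
 * Decides what to do after a lookup.
 * `isRetry` is true when this was the second attempt for the deal.
 */
function nextAction(outcome: LookupOutcome, isRetry: boolean): NextAction {
  if (outcome.kind === "found") return "store-cid";
  if (outcome.kind === "error") {
    // Assumption: an IPNI/PieceIndexer outage says nothing about whether the
    // deal was advertised, so it should not count as an indexing attempt.
    return "retry-soon";
  }
  // "not-indexed": retry once after the ingestion window, then stop.
  return isRetry ? "give-up" : "retry-after-delay";
}
```

Keeping "not indexed" separate from "error" means the simple one-retry design handles ingestion delays, while outages can be handled by a different mechanism later.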
The important insight here is that finding a correct solution (program logic) is only the first step — it is necessary but not sufficient. The second step is to improve the solution so that it's easy for others to understand it, too, even without the full context you built while working on this problem. In many cases, this other person will be future you trying to understand code you wrote months ago. |
Many miners don't advertise their deals to IPNI at all. This leads to the piece CID indexer not keeping record of payloads for all piece CID and minerID combinations.
It is possible that miners do publish to IPNI but it takes an unexpectedly long amount of time.
To summarize, three cases need to be accounted for: