refactor: fetch activity for collection as opposed to individual nft #217
Summary
https://app.clickup.com/t/861nap53e
In this PR, we move away from fetching activity for each individual NFT and instead fetch activity for entire collections. This greatly helps with performance, as we suspect there are many, many NFTs with no recent activity; periodically running the job to fetch their activity wastes both server resources and API quota, since most of them will report no new activity.
By fetching the activity on the collection level, we avoid pushing many unnecessary requests and jobs to the queue. For example, if a collection has hundreds of NFTs but none of them has any activity, a single job/request is enough to confirm that no new activity has happened for all of those hundreds of NFTs.
We link NFTs with their respective activity by the collection ID and the token ID, instead of the internal NFT ID as before.
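As a rough sketch of the schema change this implies (this is not the exact migration from the diff; the column types and the index are assumptions):

```php
use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    public function up(): void
    {
        Schema::table('nft_activity', function (Blueprint $table) {
            // Drop the internal NFT reference...
            $table->dropColumn('nft_id');

            // ...and key activity on the collection plus the on-chain token ID,
            // so activity can be stored before the NFT itself exists locally.
            $table->foreignId('collection_id')->constrained();
            $table->string('token_number');
            $table->index(['collection_id', 'token_number']);
        });
    }
};
```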
As some collections have millions of NFTs and millions of activity items, we can prefetch the entire activity log for these collections and manually insert it into the database, avoiding unnecessary jobs/requests that would take hours to index everything on the production server. Once we have all the activity up to a certain point in time (in the near past), we can schedule the job to sync the latest activity. I don't expect the volume of recent activity to be more than we can handle, but we should fine-tune this as we get more actual data on how active collections are. For more details on data migration, see the comments on the ClickUp card.
Code changes
- Instead of the `nft_id` in the `nft_activity` table, we now use `collection_id` and `token_number`. This allows us to prefetch the activity for NFTs that do not yet exist in our database. We then use those values as keys for the `activities` relationship on the NFT model (diff).
- Newly created collections dispatch the `FetchCollectionActivity` job, so we index the activity for them as soon as they are added.
- Added an `is_fetching_activity` boolean column to collections. As fetching the entire activity log can take hours, we want the scheduler to ignore collections that are still getting indexed. In essence, let it index in peace.
- We fetch the activity on the collection level (without the `tokenId` parameter) and `UPSERT` the items. If we receive the limit (500 items), we dispatch another job in case there are more items to retrieve; once we get fewer than 500 items, we can assume there are no more activity items to fetch (see the sketch after this list).
- The `FetchCollectionActivity` job is resilient to errors. If the job is interrupted (for example, it fails mid-run), the next run will simply take the timestamp of the latest stored activity and start from that point.
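To make the flow above concrete, here is a condensed sketch of how such a job could look. This is not the actual implementation: the `Web3Api` facade, the column names, and the upsert uniqueness key are assumptions for illustration.

```php
use App\Models\Collection;
use App\Models\NftActivity;
use App\Support\Facades\Web3Api; // hypothetical API client facade
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;

class FetchCollectionActivity implements ShouldQueue
{
    use Dispatchable, Queueable;

    public function __construct(public Collection $collection) {}

    public function handle(): void
    {
        // Cursor: resume from the newest activity we have already stored.
        // If a previous run failed midway, this picks up where it left off.
        $from = NftActivity::query()
            ->where('collection_id', $this->collection->id)
            ->max('timestamp');

        // One request per page of up to 500 items.
        $items = Web3Api::collectionActivity($this->collection->address, $from, 500);

        NftActivity::upsert(
            $items,
            ['collection_id', 'token_number', 'tx_hash'], // assumed uniqueness key
            ['type', 'timestamp'],
        );

        // A full page means there may be more activity; fetch the next page.
        if (count($items) === 500) {
            self::dispatch($this->collection);
        }
    }
}
```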
Retrieving the latest activity only for a specific NFT

Since the job that fetches collection activity uses the timestamp of the newest activity item as a cursor, dispatching a job to fetch the activity for a single NFT would break that logic (the timestamp cursor would move) and leave the sorting ("Recently Received") out of date, as only some NFTs would have their activity updated. So I opted not to include that in this PR at all. The only way to update the activity for a specific NFT is to update the activity for the entire collection.
Disabling activity indexing
There are times when we just want to disable activity indexing altogether, as it takes a long time. This can be for the demo, since we already have a good data set (10M records) to test everything with, or while importing production data. In that case, you can set the `ACTIVITIES_ENABLED=false` environment variable.
Ignoring specific collections from having their activity indexed

If you want to exclude only some collections from having their activity indexed, add the address of the collection to the `activity_blacklist` array in the `config/dashbrd.php` config file.
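For example (the address below is a placeholder):

```php
// config/dashbrd.php
'activity_blacklist' => [
    // Collection (contract) addresses whose activity should not be indexed.
    '0x0000000000000000000000000000000000000000', // placeholder
],
```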
Setup

Our drive contains 2 files, `demo.sql.gz` and `prod.sql.gz`, which hold the activities data compressed into archives. Workflow for this PR:

1. Set the `ACTIVITIES_ENABLED=false` environment variable so the activities indexer doesn't trigger.
2. Truncate the `nft_activity` table. Make sure to restart the identity so the primary key sequence (ID) restarts; TablePlus has this option when truncating, which is how I did it (see the snippet after this list).
3. Copy the dump to the `/tmp` directory: `scp /path/to/demo.sql.gz forge@ip:/tmp`
4. Switch to the `postgres` user and unzip the archive: `sudo su postgres && cd /tmp && gzip -k -d demo.sql.gz`
5. Import the dump: `psql -U postgres -d database_name < demo.sql` (replace `database_name` with `forge` or whatever the database name is).
6. Remove the `demo.sql` and `demo.sql.gz` files.
7. Set `ACTIVITIES_ENABLED=true` and restart Horizon with `php artisan horizon:terminate`. Ideally, we'd leave this disabled for the demo, as there's no need to keep indexing activities and the SQL dump doesn't contain all activities.
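If you prefer doing step 2 without TablePlus, the same truncate-plus-identity-restart can be run from `php artisan tinker` (Postgres syntax):

```php
use Illuminate\Support\Facades\DB;

// Clears nft_activity and resets its primary key sequence in one statement.
DB::statement('TRUNCATE TABLE nft_activity RESTART IDENTITY;');
```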
If importing the dump into the demo fails, let me know. This will only happen if some of the records from the `collections` table have been removed, since we now match the `collection_id` from the dump to the `id` on the `collections` table. Our drive also contains the CSVs with collections for both demo and prod.

Testing
For testing the PR locally, you don't need anything special. Set `ACTIVITIES_ENABLED=true` and run `php artisan collections:fetch-activity --collection-id=1` (or whatever ID you want). Make sure to pass the ID, or it will run for every collection and take a while.

Checklist