Replies: 2 comments
-
The first thing you could try is to set up your crawl to use some form of de-duplication, where binary-identical content is recognised and stored as a WARC revisit record rather than a full copy. Having said that, I'm not sure how clear the documentation is. We may need to dig out some examples.
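One rough way to see whether that kind of de-duplication is actually kicking in is to compare how many full response records versus lightweight revisit records end up in a finished WARC. A minimal sketch with standard tools, assuming a gzipped WARC whose filename here is just a placeholder:

```sh
# Count full "response" records vs lightweight "revisit" records
# in a finished WARC; the filename is a placeholder.
zgrep -c "^WARC-Type: response" crawl.warc.gz
zgrep -c "^WARC-Type: revisit"  crawl.warc.gz
```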
-
I have now set up CDX-based de-duplication and it has worked out great. I am using wget to grab data now because I want single pages instead of long crawls, and it is saving me anywhere from 33-75% of disk space, depending on how static-heavy the pages are. I apologize for not having a direct answer to the Heritrix question, but CDX-based de-duplication may be possible in some form in Heritrix too, and I would encourage people who have this question to look into it.
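For reference, the wget setup is roughly the following. This is a minimal sketch, assuming a GNU wget build with WARC support; the URL and the WARC/CDX file prefixes are placeholders:

```sh
# First capture: write a WARC plus a CDX index of everything stored.
wget --page-requisites \
     --warc-file=forum-pass1 --warc-cdx \
     "https://forum.example.org/some-thread"

# Later captures: point --warc-dedup at the earlier CDX so payloads whose
# digest already appears there are written as small revisit records
# instead of full copies.
wget --page-requisites \
     --warc-file=forum-pass2 --warc-cdx \
     --warc-dedup=forum-pass1.cdx \
     "https://forum.example.org/some-thread"
```

The second pass stores unchanged payloads as short revisit records pointing back at the earlier capture instead of duplicating the bytes, which is where the savings come from.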
-
What are some strategies to reduce disk usage? I am crawling a specific forum, starting at its home page and re-starting crawls through the REST API once per hour. I had hoped this would cut down on how many unchanged, old pages and posts I was re-archiving, but I still want to keep exploring ways to reduce disk usage.
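For context, the hourly re-start looks roughly like this. A sketch only, assuming Heritrix 3.x on the default engine port, a job directory named forum-crawl, admin/admin credentials, and that the previous crawl has already finished; all of these names are placeholders for your own setup:

```sh
# Run from cron, e.g.: 0 * * * * /path/to/relaunch-forum-crawl.sh
# Tear down the finished job, rebuild it, and launch it again.
for action in teardown build launch; do
  curl -k -u admin:admin --anyauth --location \
       -d "action=$action" \
       "https://localhost:8443/engine/job/forum-crawl"
done
```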