Support for tag filtering #287
Comments
This is not possible for now. It should be possible, though maybe not that trivial. The main problem at the moment is that the scraper cannot scrape recent dumps, because Stack Exchange does not provide them anymore.
@natamox I renamed the ticket. What you'd want is to be able to supply a list of tags and get a ZIM with just the listed tags' content, right? As @kelson42 said, sotoki's future is unclear ATM because Stack Exchange stopped providing XML dumps.
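For illustration, here is a minimal sketch of what selecting questions by tag could look like on the dump side. It is not sotoki's code. It assumes the dump's Posts.xml holds one `row` element per post, with questions marked by `PostTypeId="1"` and tags stored in a `Tags` attribute written either as `<javascript><html>` or as `|javascript|html|`. The names `parse_tags` and `filter_questions` are made up for this example.

```python
import io
import re
import xml.etree.ElementTree as ET


def parse_tags(raw):
    """Split a Tags attribute ('<a><b>' or '|a|b|') into a set of tag names."""
    return {tag for tag in re.split(r"[<>|]+", raw or "") if tag}


def filter_questions(posts_xml, wanted):
    """Yield (Id, sorted tags) for question rows carrying at least one wanted tag.

    posts_xml is a path or a binary file object (assumed Posts.xml layout).
    """
    wanted = set(wanted)
    # iterparse streams the file, so a multi-GB dump is never loaded at once.
    for _, elem in ET.iterparse(posts_xml, events=("end",)):
        if elem.tag == "row" and elem.get("PostTypeId") == "1":
            tags = parse_tags(elem.get("Tags"))
            if tags & wanted:
                yield elem.get("Id"), sorted(tags)
        elem.clear()  # drop attributes of rows we are done with


if __name__ == "__main__":
    # Tiny hand-written sample standing in for a real Posts.xml.
    sample = (
        b'<posts>'
        b'<row Id="1" PostTypeId="1" Tags="&lt;javascript&gt;&lt;html&gt;"/>'
        b'<row Id="2" PostTypeId="1" Tags="|python|"/>'
        b'<row Id="3" PostTypeId="2"/>'
        b'</posts>'
    )
    print(list(filter_questions(io.BytesIO(sample), ["javascript"])))
```

Answers carry no tags of their own in this layout, so a real implementation would also have to keep the answers and comments attached to the selected questions.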
@natamox We never do that because this would imply URL rewriting in all HTML pages to avoid dead links.
Just to be clear, following my previous comment: we've seen that SE has published the dump and promised to continue, so the future of sotoki is not at stake anymore. As for extracting and repacking, it's possible, but since we only bundle the HTML version of questions without metadata, you'd have to parse the HTML of every entry to find the tags… You're better off implementing this ticket: way less work, and the outcome is clear and solid 😉
It's quite easy:
Don't you need something to fix broken links to questions that have been filtered out? You might, for instance, redirect these links to a static page which indicates that this question has not been scraped due to tag filters, like I did for iFixit. Or is this already handled in the scraper / not wished?
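To make the redirect idea concrete, here is a rough sketch, not taken from sotoki or the iFixit scraper. The `questions/<id>/<slug>` link layout, the `filtered.html` placeholder page and the function name `redirect_filtered_links` are all assumptions made for the example.

```python
import re

# Hypothetical link layout: href="questions/<id>/<slug>"; real URLs may differ.
QUESTION_LINK = re.compile(r'href="questions/(\d+)(?:/[^"]*)?"')


def redirect_filtered_links(html, kept_ids, placeholder="filtered.html"):
    """Point links to questions outside kept_ids at a static placeholder page."""

    def repl(match):
        if int(match.group(1)) in kept_ids:
            return match.group(0)  # question was scraped: keep the link as-is
        return f'href="{placeholder}"'

    return QUESTION_LINK.sub(repl, html)


if __name__ == "__main__":
    page = '<a href="questions/1/foo">a</a> <a href="questions/2/bar">b</a>'
    print(redirect_filtered_links(page, kept_ids={1}))
```

A regex is enough for a sketch; on real pages an HTML parser would be the safer choice.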
When I run env/bin/pip install -r requirements.txt, something goes wrong.
Ok, it has been solved, I first executed
Reported an error that zimscraperlib could not be found, I can do it after
@natamox please open another ticket
With your version of Python and your architecture, it looks like there's no libzim wheel.
Thanks, it seems to work now. I ran setup.py and it said zimscraperlib was missing; I downloaded it just fine, and then it worked.
We already remove links to questions that are not in the DB. What's not handled is the
The whole thing is really too big, more than 70 GB. For example, I only want to grab javascript or a few other tags. How can I do that? Thank you.