Support for tag filtering #287

natamox · 2023-07-15T15:18:13Z

The whole thing is really too big, more than 70 GB. For example, I only want to grab the javascript tag or a few other tags. How can I do that? Thank you.

kelson42 · 2023-07-16T07:53:48Z

This is not possible for now. It should be possible, though maybe not that trivial. The main problem at the moment is that the scraper cannot scrape recent dumps, because Stack Exchange does not provide them anymore.

rgaudin · 2023-07-17T11:34:29Z

@natamox I renamed the ticket. What you'd want is to be able to supply a list of tags and get a ZIM with just the listed tags' content, right?
It would be a great feature indeed, but it's not implemented yet.

As @kelson42 said, sotoki's future is unclear ATM because Stack Exchange stopped providing XML dumps.

natamox · 2023-07-19T12:39:35Z

@rgaudin @kelson42 Well, thanks for your replies. Now I have another idea. I already have a 70 GB Stack Overflow ZIM file. Can I use a tool to unpack the ZIM file, delete some content according to my needs, and then repack it?

kelson42 · 2023-07-19T12:42:27Z

@natamox We never do that because this would imply URL rewriting in all HTML pages to avoid dead links.

rgaudin · 2023-07-19T12:44:43Z

Just to be clear: since my previous comment, we've seen that SE has published the dump and promised to continue, so the future of sotoki is not at stake anymore.

As for extracting and repacking, it's possible but since we only bundle the HTML version of questions without metadata, you'd have to parse the HTML of every entry to find the tags…
Also, the tag-less navigation would need to be fixed and the related links would not work either.
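
To give an idea of what "parse the HTML of every entry" would involve, here is a minimal sketch using only the standard library. It assumes each tag is rendered as an `<a>` element with a `post-tag` CSS class; that class name is a guess for illustration, not something confirmed for sotoki's templates, so the real markup would have to be checked first.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the text of every <a> element carrying a given CSS class."""

    def __init__(self, css_class: str = "post-tag"):
        super().__init__()
        # Assumed class name: verify against the HTML actually stored in the ZIM.
        self.css_class = css_class
        self.tags: list[str] = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and self.css_class in classes:
            self._inside = True
            self.tags.append("")

    def handle_data(self, data):
        # Accumulate text only while inside a matching <a> element.
        if self._inside:
            self.tags[-1] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self._inside = False
            self.tags[-1] = self.tags[-1].strip()


def extract_tags(html: str) -> list[str]:
    """Return the tag names found in one question page."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return [t for t in collector.tags if t]
```

This would have to run over every question entry of the 70 GB file, and it only answers "which tags does this page have"; the navigation and related links would still need fixing afterwards.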

You're better off implementing this ticket: it's way less work and the outcome is clear and solid 😉

kelson42 · 2023-07-19T12:45:30Z

@rgaudin @benoit74 I think this is not only a valid feature request, but actually a pretty good one. We could for example do one ZIM per mainstream programming language. How complex would that be?

natamox · 2023-07-19T12:59:53Z

@rgaudin @kelson42 Sounds really good, I think I'll give it a try... although I may not be very good at this.

rgaudin · 2023-07-19T13:57:23Z

It's quite easy:

Add a CLI param to capture the wanted tags; parse it to a list.
In tags.py, skip if TagName is not in the list (-> tag not recorded to the DB).
In posts.py, in the parsers of both passes, the list of tags should be filtered to the requested ones instead of just being retrieved from the XML. A cleaner approach is to check if the tag is in DB.tags_ids.inverse.
The about template should mention that the content is restricted to this list.
And I think that should do it.
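
The steps above could be sketched roughly as follows. This is not sotoki code: the `--tags` flag name and all function names are hypothetical, and a plain `set` stands in for the tags recorded in the DB. Dropping a question that has no remaining tag is also an assumption about the desired behaviour, not something decided in this ticket.

```python
import argparse


def parse_tags(value: str) -> list[str]:
    """Split a comma-separated CLI value into a clean, de-duplicated list."""
    seen: dict[str, None] = {}
    for raw in value.split(","):
        tag = raw.strip().lower()
        if tag:
            seen.setdefault(tag)
    return list(seen)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="tag-filter-sketch")
    # "--tags" is a hypothetical flag name, not an existing sotoki option.
    parser.add_argument("--tags", type=parse_tags, default=[])
    return parser


def should_record_tag(tag_name: str, wanted: list[str]) -> bool:
    """tags.py step: with no filter keep everything, else only listed tags."""
    return not wanted or tag_name.lower() in wanted


def filter_post_tags(post_tags: list[str], recorded: set[str]) -> list[str]:
    """posts.py step: keep only tags that were actually recorded."""
    return [tag for tag in post_tags if tag in recorded]


def keep_post(post_tags: list[str], recorded: set[str]) -> bool:
    """Assumption: a question with no recorded tag left is skipped."""
    return bool(filter_post_tags(post_tags, recorded))
```

For example, `build_parser().parse_args(["--tags", "javascript,python"]).tags` yields the list that the two filtering functions then consume.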

benoit74 · 2023-07-19T15:28:21Z

Don't you need something to fix broken links to questions that have been filtered out? You might for instance redirect these links to a static page which indicates that this question has not been scraped due to tag filters, like I did for iFixit. Or is this already handled in the scraper / not wished?

natamox · 2023-07-19T15:50:50Z

When I run

env/bin/pip install -r requirements.txt

something goes wrong:

Collecting libzim<3.0,>=2.1.0 (from zimscraperlib<3.0,>=2.1->-r requirements.txt (line 3))
  Using cached libzim-2.1.0.tar.gz (8.3 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Preparing metadata (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [2 lines of output]
      [!] The libzim library cannot be found.
      Please verify it is correctly installed and can be found.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

OK, it has been solved. I first executed

./env/bin/python src/sotoki/dependencies.py

It reported an error that zimscraperlib could not be found. It worked after I ran

./env/bin/pip install zimscraperlib

kelson42 · 2023-07-19T16:05:27Z

@natamox please open another ticket

rgaudin · 2023-07-19T16:22:58Z

Looks like there's no libzim wheel for your version of Python and your architecture.

natamox · 2023-07-19T16:37:13Z

> With your version of python and your architecture ; looks like there's no libzim wheel

Thanks, it seems to work now, I ran setup.py and it said zimscraperlib was missing, downloaded it just fine, then it worked.
Python version is 3.11.4; architecture is arm64
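
For anyone reporting a similar "no wheel" problem, the values that matter can be printed with a few lines of standard-library Python. This is only a small convenience sketch; the function name is made up for illustration.

```python
import platform
import sys


def environment_summary() -> dict[str, str]:
    """Gather the interpreter and platform details relevant to wheel selection."""
    return {
        "python": platform.python_version(),
        "implementation": sys.implementation.name,
        "system": platform.system(),
        "machine": platform.machine(),
    }


if __name__ == "__main__":
    for key, value in environment_summary().items():
        print(f"{key}: {value}")
```

Run it with the same interpreter as the virtual environment (here `./env/bin/python`), since a different interpreter may report a different version.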

rgaudin · 2023-07-20T21:03:56Z

> Don't you need something to fix broken links to questions that have been filtered out? You might for instance redirect these links to a static page which indicates that this question has not been scraped due to tag filters

We already remove links to questions that are not in the DB. What's not handled is the a/{aId} shortcut, because we don't store answers in the DB. For valid links it's not a problem because we create a redirect elsewhere, but for a missing target this would lead to a dead link.

kelson42 added enhancement question labels Jul 16, 2023

rgaudin changed the title ~~How do I grab a part of the content on the stackoverflow website~~ Support for tag filtering Jul 17, 2023

rgaudin removed the question label Jul 17, 2023

natamox mentioned this issue Jul 19, 2023

Issue setting up a new developer machine #288

Closed
