Automatically extracting forum posts and their metadata is challenging because forums do not expose their content in a standardized structure. Harvest performs this task reliably for many web forums and offers an easy way to extract their data.
To install harvest with pip, run the following at the command line:
$ pip install harvest-webforum
To install from the latest sources instead:
$ git clone https://github.com/fhgr/harvest.git
$ cd harvest
$ python3 setup.py install
Embedding harvest into your code is easy, as outlined below:
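from urllib.request import urlopen, Request
from harvest import extract_data

# Some forums reject requests without a browser-like User-Agent header.
USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0"
url = "https://forum.videolan.org/viewtopic.php?f=14&t=145604"

# Download the forum page and decode it (this example assumes UTF-8).
req = Request(url, headers={'User-Agent': USER_AGENT})
html = urlopen(req).read().decode('utf-8')

# Extract the posts and their metadata from the downloaded HTML.
result = extract_data(html, url)
print(result)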
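The example above assumes that the forum page is encoded as UTF-8. Since some forums declare a different encoding, the download step could be made slightly more defensive, as in the following minimal sketch. The helpers `decode_body`, `fetch_html` and `extract_from_url` are hypothetical names introduced for illustration and are not part of harvest; only `extract_data(html, url)` is taken from the example above.

```python
from typing import Optional
from urllib.request import Request, urlopen

USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0"


def decode_body(raw: bytes, charset: Optional[str]) -> str:
    """Decode raw bytes with the declared charset (hypothetical helper).

    Falls back to UTF-8 with replacement characters if the charset is
    missing, unknown, or does not match the actual content.
    """
    try:
        return raw.decode(charset or "utf-8")
    except (LookupError, UnicodeDecodeError):
        return raw.decode("utf-8", errors="replace")


def fetch_html(url: str, timeout: float = 30.0) -> str:
    """Download a page and decode it using the charset from the HTTP headers."""
    req = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(req, timeout=timeout) as response:
        # get_content_charset() returns None if no charset was declared.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)


def extract_from_url(url: str):
    """Fetch a forum page and pass it to harvest (hypothetical wrapper)."""
    # Imported here so the helpers above work without harvest installed.
    from harvest import extract_data
    return extract_data(fetch_html(url), url)
```

Calling `extract_from_url("https://forum.videolan.org/viewtopic.php?f=14&t=145604")` should then behave like the example above, except that the page is decoded with the encoding announced by the server where one is given.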
The corpus currently contains gold standard documents from 52 different web forums. These documents are also used by harvest's integration tests.
- Weichselbraun, Albert, Brasoveanu, Adrian M. P., Waldvogel, Roger and Odoni, Fabian. (2020). “Harvest - An Open Source Toolkit for Extracting Posts and Post Metadata from Web Forums”. IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT 2020), Melbourne, Australia, Accepted 27 October 2020.