Data extraction

  • Regular expressions and XPath queries for data extraction from rtvslo.si, overstock.com and themoviedb.org.
  • Implementation of an automatic data extraction wrapper generator.

The report is available here.

Setup

The project depends on numpy and lxml. Both can be installed from the implementation-extraction/requirements.txt file.

pip install -r implementation-extraction/requirements.txt

# OR
pip install numpy lxml

Run extraction

Change into the implementation-extraction directory, then run one of the commands below.

Run regex extraction

python run-extraction.py A
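To illustrate what regex-based extraction can look like, here is a minimal sketch. It is not taken from run-extraction.py: the HTML fragment, the field names and the `extract_listings` helper are all hypothetical, and the patterns needed for the real pages will differ.

```python
import re

# Hypothetical HTML fragment (an assumption for illustration only);
# the real pages from the target sites are structured differently.
SAMPLE_HTML = """
<div class="listing">
  <a href="/item/1"><b>Diamond Ring</b></a>
  <s>$129.99</s> <span class="price"><b>$99.99</b></span>
</div>
<div class="listing">
  <a href="/item/2"><b>Gold Necklace</b></a>
  <s>$250.00</s> <span class="price"><b>$199.50</b></span>
</div>
"""

# One pattern per record. Named groups label the extracted fields,
# and re.DOTALL lets "." span line breaks inside a record.
LISTING_RE = re.compile(
    r'<a href="[^"]*"><b>(?P<title>.*?)</b></a>\s*'
    r'<s>(?P<list_price>\$[\d.,]+)</s>\s*'
    r'<span class="price"><b>(?P<price>\$[\d.,]+)</b></span>',
    re.DOTALL,
)


def extract_listings(page):
    """Return one dict per listing matched in the page source."""
    return [match.groupdict() for match in LISTING_RE.finditer(page)]


records = extract_listings(SAMPLE_HTML)
for record in records:
    print(record)
```

A pattern like this is tightly coupled to the exact markup, so it usually needs updating whenever the page layout changes.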

Run XPath extraction

python run-extraction.py B
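For comparison, XPath extraction with lxml might look roughly like the sketch below. The sample markup, the class names and the `extract_movies` helper are assumptions made for illustration; they are not the queries used by the project.

```python
from lxml import html

# Hypothetical page source (an assumption for illustration only).
SAMPLE_HTML = """
<html><body>
  <div class="movie">
    <h2 class="title">Example Movie</h2>
    <span class="year">1999</span>
  </div>
  <div class="movie">
    <h2 class="title">Another Film</h2>
    <span class="year">2004</span>
  </div>
</body></html>
"""


def extract_movies(page):
    """Select each record node, then query its fields relative to it."""
    tree = html.fromstring(page)
    movies = []
    for node in tree.xpath('//div[@class="movie"]'):
        # string(...) returns the text content of the first matching node.
        title = node.xpath('string(./h2[@class="title"])').strip()
        year = node.xpath('string(./span[@class="year"])').strip()
        movies.append({"title": title, "year": year})
    return movies


movies = extract_movies(SAMPLE_HTML)
for movie in movies:
    print(movie)
```

Querying fields relative to each record node keeps the fields of one record together, which is harder to guarantee with independent page-wide queries.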

Run automatic wrapper generation

python run-extraction.py C
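The idea behind automatic wrapper generation can be sketched by aligning two pages built from the same template and replacing the parts that differ with a placeholder. The sketch below is a heavily simplified, RoadRunner-style illustration under that assumption. It is not the project's algorithm: `tokenize`, `generalize` and the `#PCDATA` / `#MISMATCH` markers are hypothetical names, and optional or repeated elements are not handled.

```python
import re
from difflib import SequenceMatcher

# A token is either a whole tag or a run of text between tags.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")


def tokenize(page):
    """Split a page into tag and text tokens, dropping blank text."""
    return [tok.strip() for tok in TOKEN_RE.findall(page) if tok.strip()]


def generalize(page_a, page_b):
    """Align two pages and replace differing text with #PCDATA."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    wrapper = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Shared tokens are assumed to belong to the template.
            wrapper.extend(a[i1:i2])
            continue
        differing = a[i1:i2] + b[j1:j2]
        if all(not tok.startswith("<") for tok in differing):
            # Only text differs: treat it as a data field.
            wrapper.append("#PCDATA")
        else:
            # Tag mismatches (optionals, iterators) are out of scope here.
            wrapper.append("#MISMATCH")
    return wrapper


# Two hypothetical pages that share a template.
page_a = "<html><body><h1>Title A</h1><p>Price: <b>10</b></p></body></html>"
page_b = "<html><body><h1>Title B</h1><p>Price: <b>25</b></p></body></html>"

wrapper = generalize(page_a, page_b)
print(" ".join(wrapper))
```

A full wrapper generator would also need to resolve tag mismatches into optional and repeated sections, which this sketch only flags.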

Generated wrappers

Generated XML wrappers are available for each of the sites.

XPath output for generalized nodes is also available here.