- Regular expressions and XPath queries for data extraction from rtvslo.si, overstock.com and themoviedb.org.
- Implementation of an automatic data extraction wrapper generator.
The report is available here.
The dependencies for the project are numpy and lxml. They can be installed automatically using the implementation-extraction/requirements.txt file.
pip install -r implementation-extraction/requirements.txt
# OR
pip install numpy lxml
Move into the implementation-extraction directory.
Run regex extraction
python run-extraction.py A
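For readers unfamiliar with the approach, regex-based extraction can be sketched roughly as follows. This is a minimal illustration, not the code in run-extraction.py: the HTML fragment, the class names, the pattern, and the `extract_items` helper are all hypothetical.

```python
import re

# Hypothetical HTML fragment; the markup of the real target pages differs.
SAMPLE_HTML = """
<div class="item">
  <h2 class="title">Example product</h2>
  <span class="price">$12.50</span>
</div>
<div class="item">
  <h2 class="title">Another product</h2>
  <span class="price">$7.99</span>
</div>
"""

# One pattern per record: named groups capture the fields of interest.
# re.DOTALL lets "." span newlines in case a field wraps across lines.
ITEM_PATTERN = re.compile(
    r'<h2 class="title">(?P<title>.*?)</h2>\s*'
    r'<span class="price">(?P<price>\$[\d.,]+)</span>',
    re.DOTALL,
)


def extract_items(html):
    """Return one dict per matched record (hypothetical helper)."""
    return [match.groupdict() for match in ITEM_PATTERN.finditer(html)]


if __name__ == "__main__":
    for item in extract_items(SAMPLE_HTML):
        print(item)
```

Patterns like this are tightly coupled to the exact markup, so they tend to break when a page layout changes.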
Run XPath extraction
python run-extraction.py B
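The XPath variant can be sketched in a similar way with lxml, one of the project's dependencies. Again, this is only an illustration under assumed markup: the sample page, the XPath expressions, and the `extract_items` helper are hypothetical and not taken from run-extraction.py.

```python
from lxml import html

# Hypothetical page; the markup of the real target pages differs.
SAMPLE_HTML = """
<html><body>
<div class="item">
  <h2 class="title">Example product</h2>
  <span class="price">$12.50</span>
</div>
<div class="item">
  <h2 class="title">Another product</h2>
  <span class="price">$7.99</span>
</div>
</body></html>
"""


def extract_items(page_source):
    """Return one dict per record node (hypothetical helper)."""
    tree = html.fromstring(page_source)
    items = []
    # Select each record node first, then query its fields relative to it.
    for node in tree.xpath('//div[@class="item"]'):
        # string(...) returns the text content of the first matching node.
        title = node.xpath('string(./h2[@class="title"])').strip()
        price = node.xpath('string(./span[@class="price"])').strip()
        items.append({"title": title, "price": price})
    return items


if __name__ == "__main__":
    for item in extract_items(SAMPLE_HTML):
        print(item)
```

Querying fields relative to each record node keeps titles and prices paired even when a record is missing one of the fields.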
Run automatic wrapper generation
python run-extraction.py C
Generated XML wrappers are available for each of the sites.
XPath output for generalized nodes is also available here.