T2K Match [1] is a matching algorithm optimised to match millions of web tables against a central knowledge base.
Many web sites provide data in the form of HTML tables. Millions of such data tables have been extracted from the CommonCrawl web corpus by the Web Data Commons project [3]. Data from these tables can be used to fill missing values in large cross-domain knowledge bases such as DBpedia [2]. T2K Match is an example of how pre-defined building blocks from the WInte.r framework are combined into an advanced, use-case specific integration method. The central knowledge base in this setting describes millions of instances belonging to hundreds of different classes (such as people or locations) [2].
To run T2K Match, use the run_t2k_match script in the scripts directory.
- Copy the compiled T2K Match jar file to the lib/ directory in your home directory, or change the JAR path in the script file
JAR="$HOME/lib/t2kmatch-2.0-jar-with-dependencies.jar"
- Unzip the files in the data directory
gunzip data/dbpedia/*
gunzip data/*.gz
- Run the script
./scripts/run_t2k_match
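The setup steps above can be checked before starting a run. The following is a minimal sketch, not part of the project: the function name t2k_preflight is hypothetical, and it assumes that a fully unzipped data directory contains no remaining .gz files and that the script is run from the project root unless another root is passed.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for the steps above (not shipped with T2K Match).
# Usage: t2k_preflight [path-to-jar] [project-root]
# Prints each problem found and returns non-zero if any check fails.
t2k_preflight() {
  local jar="${1:-$HOME/lib/t2kmatch-2.0-jar-with-dependencies.jar}"
  local root="${2:-.}"
  local status=0

  # Step 1: the compiled jar must be where the run script expects it.
  if [ ! -f "$jar" ]; then
    echo "missing jar: $jar"
    status=1
  fi

  # Step 2: the data must be present and unzipped.
  if [ ! -d "$root/data/dbpedia" ]; then
    echo "missing directory: $root/data/dbpedia"
    status=1
  fi
  # Assumption: gunzip removes the .gz files, so none should remain.
  if find "$root/data" -name '*.gz' 2>/dev/null | grep -q .; then
    echo "compressed files remain under $root/data; run gunzip first"
    status=1
  fi

  # Step 3: the run script must exist and be executable.
  if [ ! -x "$root/scripts/run_t2k_match" ]; then
    echo "script not executable: $root/scripts/run_t2k_match"
    status=1
  fi

  return "$status"
}
```

After sourcing this function, `t2k_preflight && ./scripts/run_t2k_match` would start the run only when all checks pass. If the JAR path in the script file was changed, pass the same path as the first argument.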
This project re-implements, using the WInte.r framework, the original T2K Match algorithm developed at the Data and Web Science Group at the University of Mannheim.
T2K Match can be used under the Apache 2.0 License.
[1] Ritze, D., Lehmberg, O., & Bizer, C. (2015, July). Matching HTML tables to DBpedia. In Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics (p. 10). ACM.
[2] Ritze, D., Lehmberg, O., Oulabi, Y., & Bizer, C. (2016, April). Profiling the potential of web tables for augmenting cross-domain knowledge bases. In Proceedings of the 25th International Conference on World Wide Web (pp. 251-261). International World Wide Web Conferences Steering Committee.
[3] Lehmberg, O., Ritze, D., Meusel, R., & Bizer, C. (2016, April). A large public corpus of web tables containing time and context metadata. In Proceedings of the 25th International Conference Companion on World Wide Web (pp. 75-76). International World Wide Web Conferences Steering Committee.