MBI-KG: A knowledge graph of structured and linked economic research data extracted from the book "Die Maschinen-Industrie im Deutschen Reich" written by Herbert Patschan in 1937
MBI-KG/
├── docs/
│ ├── talks/
│ │ ├── README_talks.md
│ │ ├── 2023.05.05_EURHISFIRM-Workshop-Kamlah-Shigapov.pdf
│ │ └── 2022.11.23_NFDI-Workshop-Research-Data-Maschinenindustrie-EN.pdf
│ ├── sparql_examples/
│ │ └── README_sparql_examples.md
│ └── README_docs.md
├── data/
│ ├── structured_data/
│ │ ├── README_structured_data.md
│ │ └── MBI_1937_structured.csv
│ ├── scanned_images/
│ │ └── README_scanned_images.md
│ ├── ocr_output/
│ │ └── README_ocr_output.md
│ ├── models/
│ │ ├── mbi-1937_print.mlmodel
│ │ ├── mbi-1937_layout.mlmodel
│ │ └── README_models.md
│ ├── kg_dataset/
│ │ ├── README_kg_dataset.md
│ │ ├── MBI_KG_bulk_cli_v1.0.ttl
│ │ ├── MBI_KG_bulk_cli_v1.0.json
│ │ ├── MBI_KG_bulk_api_v1.0.ndjson
│ │ └── MBI_KG_bulk_api_v1.0.csv
│ └── README_data.md
├── code/
│ ├── semantify.py
│ ├── requirements.txt
│ ├── entities2kg.py
│ ├── create_bulk_files_cli.sh
│ ├── create_bulk_files_api.py
│ ├── book2entities.py
│ └── README_code.md
├── README.md
├── LICENSE.md
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
└── CITATION.cff
The folder data contains the data used in this project:
- structured_data contains the structured data in CSV, JSON and RDF formats, representing various entities such as companies, individuals, and administrative entities.
- scanned_images contains the scanned images of the original book pages in JPEG format at 400 dpi.
- ocr_output contains the raw text output of the Optical Character Recognition (OCR) process, saved as plain text files.
- models contains the OCR models.
- kg_dataset contains bulk data exported via the Wikibase API (in CSV and NDJSON formats) and via command-line PHP scripts (in Turtle and JSON formats).
Data availability statement: Data used in this project are freely available under the CC BY license.
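The NDJSON bulk export listed above stores one JSON document per line, so it can be read as a stream without loading the whole file into memory. The following is a minimal sketch using only the Python standard library. The helper name `iter_ndjson` is hypothetical and not part of this repository, and the sketch assumes that every non-empty line holds a single JSON object; the fields inside each record are not described here and are left untouched.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any, Iterator


def iter_ndjson(path: str | Path) -> Iterator[dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of an NDJSON file.

    Hypothetical helper for illustration; assumes one JSON object per line.
    """
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                # Report the offending line so a damaged export is easy to locate.
                raise ValueError(
                    f"{path}: invalid JSON on line {line_number}"
                ) from err
```

For example, `next(iter_ndjson("data/kg_dataset/MBI_KG_bulk_api_v1.0.ndjson"))` would return the first record of the API export, assuming the script is run from the repository root. The CSV files can be read in the same streaming fashion with `csv.DictReader`.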
The folder docs contains documentation for this project, including:
- talks
  - Extracting research data from historical documents with eScriptorium and Python by Jan Kamlah, Thomas Schmidt and Renat Shigapov at the NFDI Focused Tutorial on Capturing, Enriching, Disseminating Research Data Objects. The presentations in English and German are available at https://doi.org/10.5281/zenodo.7373134.
  - Kamlah, Jan, & Shigapov, Renat. (2023, May 5). The German Production Pipeline: Mannheim - OCR & Knowledge Graphs. Zenodo. https://doi.org/10.5281/zenodo.7900133
- sparql_examples contains SPARQL query examples for the MBI-KG.
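To try the SPARQL examples from a script, a query can be sent to a SPARQL endpoint over HTTP as described in the SPARQL 1.1 Protocol. The sketch below only prepares such a request with the Python standard library. The endpoint URL is not stated in this README, so it is passed in as a parameter, and the function name `build_sparql_request` and the generic query are illustrative assumptions rather than material from the sparql_examples folder. The endpoint URL is assumed to carry no query string of its own.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Generic query for illustration only: it returns any five triples.
EXAMPLE_QUERY = """
SELECT ?subject ?predicate ?object
WHERE { ?subject ?predicate ?object }
LIMIT 5
""".strip()


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Prepare a GET request that sends `query` to a SPARQL endpoint.

    Hypothetical helper; `endpoint` must be supplied by the caller.
    """
    # The SPARQL 1.1 Protocol passes the query text in the `query` parameter.
    url = f"{endpoint}?{urlencode({'query': query})}"
    # Ask for the standard JSON serialization of SPARQL results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})
```

Passing the returned object to `urllib.request.urlopen` and decoding the response with `json.load` would yield the result bindings, provided the endpoint supports JSON results.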
The folder code contains the code used in this project:
- book2entities.py
Code availability statement: The code used in this project is openly available under the MIT license.
Thank you for your interest in contributing to the MBI knowledge graph. All contributions are welcome.
To get started, please follow these steps:
- Fork the repository or clone it to your local machine.
- Create a new branch for your changes.
- Make your changes and commit them with clear commit messages.
- Push your changes to your forked repository.
- Submit a pull request to the main repository.
More information is available in CONTRIBUTING.md.
This work is licensed under the MIT license (code) and Creative Commons Attribution 4.0 International license (for everything else). You are free to share and adapt the material for any purpose, even commercially, as long as you provide attribution (see Attribution).
Dataset (replication package):
- Shigapov, R., Schmidt, T., Kamlah, J., Schumm, I., Streb, J., & Lehmann-Hasemeyer, S. (2024). MBI-KG: Replication package for a knowledge graph of structured and linked economic research data extracted from the 1937 book "Die Maschinen-Industrie im Deutschen Reich". MADATA, [Dataset]. https://doi.org/10.7801/467.
Paper:
- Shigapov, R., Schmidt, T., Kamlah, J., Schumm, I., Streb, J., & Lehmann-Hasemeyer, S. (2024). MBI-KG: A knowledge graph of structured and linked economic research data extracted from the 1937 book "Die Maschinen-Industrie im Deutschen Reich". Data in Brief, 111238. https://doi.org/10.1016/j.dib.2024.111238