This project automatically downloads the newest Wiktionary blob and extracts all German word types into an SQLite database.
The file must be named dewiktionary-latest-pages-articles-multistream.xml.bz2 and located in the root folder. In addition, a file named dewiktionary-latest-pages-articles-multistream.xml.bz2_lastmodified must be placed next to it, containing a timestamp in the format "Tue, 03 Dec 2019 07:02:38 GMT". The correct content is the last-modified information of the Wiktionary blob. If the application has already downloaded the file itself, it will not download it again. If it detects a newer version of the blob on the Wiktionary site, however, it will download it and update.
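The freshness check described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names `read_local_stamp` and `needs_download` are hypothetical, and the sketch assumes the remote timestamp arrives in the same format as the `_lastmodified` file (for example from an HTTP `Last-Modified` header).

```python
from email.utils import parsedate_to_datetime
from pathlib import Path
from typing import Optional


def read_local_stamp(path: Path) -> Optional[str]:
    """Return the stored timestamp string, or None if the file is missing.

    Hypothetical helper; `path` would point at the *_lastmodified file.
    """
    if not path.exists():
        return None
    return path.read_text(encoding="utf-8").strip()


def needs_download(local_stamp: Optional[str], remote_stamp: str) -> bool:
    """Return True if no local stamp exists or the remote blob is newer.

    Both stamps are expected in the form "Tue, 03 Dec 2019 07:02:38 GMT".
    """
    if not local_stamp:
        return True
    local = parsedate_to_datetime(local_stamp)
    remote = parsedate_to_datetime(remote_stamp)
    return remote > local
```

With identical local and remote timestamps, `needs_download` returns `False`, which corresponds to the "will not download the file again" case.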
If a local.db already exists, new entries are inserted only if they do not exist yet. If an entry already exists but with a different value, it is still inserted into the database with the new value. If an entry exists with the same value, no additional entry is inserted (to avoid duplicate entries). In general, the console output only lists changes made to the database.
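The insert behaviour described above might look like the following sketch. It is an illustration under assumptions: the table `words` with columns `word` and `word_type` is hypothetical (the real schema may differ), and the standard-library `sqlite3` module is used here only to keep the example self-contained.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical `words` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT, word_type TEXT)")
    return conn


def insert_if_new(conn: sqlite3.Connection, word: str, word_type: str) -> bool:
    """Insert (word, word_type) unless the exact same pair already exists.

    A pair with the same word but a different value is still inserted.
    Returns True if a row was added, False if it was a duplicate.
    """
    cur = conn.execute(
        "SELECT 1 FROM words WHERE word = ? AND word_type = ?",
        (word, word_type),
    )
    if cur.fetchone() is not None:
        return False  # identical entry exists, nothing to report
    conn.execute(
        "INSERT INTO words (word, word_type) VALUES (?, ?)",
        (word, word_type),
    )
    conn.commit()
    # Only actual changes are printed, mirroring the console behaviour.
    print(f"inserted: {word} ({word_type})")
    return True
```

Calling `insert_if_new` twice with the same pair adds one row only, while the same word with a different value adds a second row.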
This project is based on https://github.com/gambolputty/german-nouns and https://github.com/gambolputty/wiktionary_de_parser.
Compiled from WiktionaryDE. License: Creative Commons Attribution-ShareAlike 3.0 Unported.
A full run takes about 2 days.
- Install Anaconda
- Open the Anaconda shell and change to the folder where README.md is located
- Create environment by: conda create --name german-words-wiktionary-de-extract
- Activate environment by: conda activate german-words-wiktionary-de-extract
- Install requests: conda install requests
- Check the Python version: python --version (should be at least Python 3.8.0)
- Install bz2file: conda install bz2file
- Install lxml: conda install lxml
- Install pyphen: python -m pip install pyphen
- Call: python ./create_db/main.py
- Database result is in local.db in root folder
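After a run, the resulting local.db can be inspected without knowing its schema in advance. The following is a minimal sketch using the standard-library `sqlite3` module; the function name `table_row_counts` is hypothetical and not part of the project.

```python
import sqlite3
from pathlib import Path


def table_row_counts(db_path: str) -> dict:
    """Return {table_name: row_count} for every table in the database.

    The file is opened read-only, so a missing local.db raises an error
    instead of silently creating an empty database.
    """
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        counts = {}
        for name in names:
            # Quote the identifier, since table names cannot be bound as parameters.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        return counts
    finally:
        conn.close()
```

For example, `table_row_counts("local.db")` run from the root folder would list each table together with its number of rows.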
To use a different database, simply change the pyodbc connection string. The SQL commands used are very basic and should also be supported by MSSQL and PostgreSQL, though this has not been tested yet.
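As a rough illustration of what changing the pyodbc string could involve, the sketch below assembles ODBC connection strings for different backends. Everything here is an assumption: the helper `build_connection_string` is hypothetical, and the driver names, server, and database values are placeholders that depend on which ODBC drivers are installed on the machine.

```python
def build_connection_string(driver: str, **params: str) -> str:
    """Assemble an ODBC connection string of the form DRIVER={...};KEY=value.

    Values containing ';' or braces would need extra escaping, which this
    sketch does not handle.
    """
    parts = [f"DRIVER={{{driver}}}"]
    parts += [f"{key}={value}" for key, value in params.items()]
    return ";".join(parts)


# Placeholder examples; driver names must match the locally installed drivers.
SQLITE_EXAMPLE = build_connection_string("SQLite3 ODBC Driver", Database="local.db")
MSSQL_EXAMPLE = build_connection_string(
    "ODBC Driver 17 for SQL Server",
    SERVER="localhost",
    DATABASE="words",
    Trusted_Connection="yes",
)
POSTGRES_EXAMPLE = build_connection_string(
    "PostgreSQL Unicode",
    SERVER="localhost",
    PORT="5432",
    DATABASE="words",
)

# The resulting string would then be passed to pyodbc, e.g.:
#   import pyodbc
#   conn = pyodbc.connect(SQLITE_EXAMPLE)
```

Since other backends are untested, some SQL statements or column types may still need adjusting after switching the connection string.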