Check out Scribe's architecture diagrams for an overview of the organization including our applications, services and processes. They depict the projects that Scribe is developing as well as the relationships between them and the external systems with which they interact. Also check out the Wikidata and Scribe Guide for an overview of Wikidata and getting language data from it.

Process `⇧`

The CLI commands defined within scribe_data/cli and the notebooks within the various scribe_data directories are used to update all data for Scribe-iOS, with this functionality later being expanded to update Scribe-Android and Scribe-Desktop once they're active.

The main data update process triggers language-based SPARQL queries to retrieve language data from Wikidata using SPARQLWrapper as a URI. The autosuggestion process derives popular words from Wikipedia, as well as the words that normally follow them, for an effective baseline feature until natural language processing methods are employed. Functions to generate autosuggestions are run in gen_autosuggestions.py. Emojis are further sourced from Unicode CLDR, with this process being run via the scribe-data get -lang LANGUAGE -dt emoji-keywords command.
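As a rough illustration of the autosuggestion idea described above (popular words paired with the words that most often follow them), here is a minimal sketch. It is not the code in gen_autosuggestions.py: the function name `sketch_autosuggestions`, its parameters and the sample text are all assumptions made for this example, and a real run would work on a Wikipedia corpus rather than a short string.

```python
import re
from collections import Counter, defaultdict


def sketch_autosuggestions(text, num_words=2, num_suggestions=2):
    """Map each of the most frequent words to the words that most often follow it.

    Hypothetical helper for illustration only.
    """
    # Lowercase the text and keep alphabetic tokens only.
    words = re.findall(r"[^\W\d_]+", text.lower())

    # Count how often each word occurs overall.
    word_counts = Counter(words)

    # Count, for every word, which words directly follow it.
    followers = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        followers[current_word][next_word] += 1

    # Keep the most popular words and their most frequent followers.
    return {
        word: [w for w, _ in followers[word].most_common(num_suggestions)]
        for word, _ in word_counts.most_common(num_words)
    }


sample_text = "the cat sat on the mat and the cat ran"
suggestions = sketch_autosuggestions(sample_text)
print(suggestions)
```

Counting adjacent word pairs like this is a simple frequency baseline, in line with the description above of a baseline feature used until natural language processing methods are employed.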

Installation `⇧`

Scribe-Data is available for installation via pip:

pip install scribe-data

# For a development build:
git clone https://github.com/scribe-org/Scribe-Data.git  # or ideally your fork
cd Scribe-Data
pip install -e .

CLI Usage `⇧`

Scribe-Data provides a command-line interface (CLI) for efficient interaction with its language data functionality. Please see the usage guide or the official documentation for detailed instructions.

Basic Usage

To utilize the Scribe-Data CLI, you can execute variations of the following command in your terminal:

scribe-data -h  # view the cli options
scribe-data [command] [arguments]

Available Commands

list (l): Enumerate available languages, data types and their combinations.
get (g): Retrieve data from Wikidata for specified languages and data types.
total (t): Display the total available data for given languages and data types.
convert (c): Transform data returned by Scribe-Data into different file formats.

Command Examples

# Commands used in the above GIF:
scribe-data list --language
scribe-data list --data-type
scribe-data get --language English --data-type verbs -od ./scribe-data
scribe-data total --language English

# Commands used in the above GIF:
scribe-data get -i
scribe-data total -i
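For scripted use, the same commands can be assembled from Python. The sketch below only builds the argument list from the flags shown in the examples above (`--language`, `--data-type` and `-od`). The helper name `build_get_command` is an assumption made for this example and is not part of Scribe-Data.

```python
import shlex


def build_get_command(language, data_type, output_dir=None):
    """Assemble the argument list for a `scribe-data get` call.

    Hypothetical helper: it only uses flags that appear in the examples above.
    """
    command = ["scribe-data", "get", "--language", language, "--data-type", data_type]
    if output_dir is not None:
        command += ["-od", output_dir]
    return command


command = build_get_command("English", "verbs", "./scribe-data")
print(shlex.join(command))

# To actually run the command (requires scribe-data to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems if a language name or path contains spaces.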

Data Contracts `⇧`

Wikidata has lots of language data available, but not all of it is useful for all applications. In order to make the functionality of the Scribe-Data get requests as simple as possible, we made the decision to always return all data for the given languages and data types. Adding the ability to pass desired forms to the commands seemed cumbersome, and larger Scribe-Data requests should be parsing Wikidata lexeme dumps as the data source.

To keep get requests returning all data while preserving the ability to get specific forms, Scribe allows users to filter the resulting data by contracts. The data contracts for Scribe's client applications can be found in the scribe_data_contracts directory. Data contracts are JSON objects where the values that are used in end applications are the keys and the resulting data identifiers based on Wikidata lexeme forms are the values. If the forms for a lexeme change, then the values would also change, but all that's needed is to update the contract for the application to function again.
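To make the key and value roles concrete, here is a small sketch of how an application might read a form through a contract. The contract layout, the form identifiers and the `resolve_form` helper are all assumptions made for this example; the actual contract structure is defined by the files in scribe_data_contracts.

```python
import json

# Hypothetical contract: keys are the names the application uses,
# values are the form identifiers found in the exported data.
contract_json = """
{
  "verbs": {
    "presentFirstPersonSingular": "indicativePresentFirstPersonSingular",
    "pastParticiple": "pastParticiple"
  }
}
"""
contract = json.loads(contract_json)


def resolve_form(contract, data_type, app_key, entry):
    """Look up the value of an application key in one data entry via the contract."""
    identifier = contract[data_type][app_key]
    return entry.get(identifier)


# Hypothetical data entry for one lexeme.
entry = {
    "lexemeID": "L1",
    "indicativePresentFirstPersonSingular": "gehe",
    "pastParticiple": "gegangen",
}
print(resolve_form(contract, "verbs", "presentFirstPersonSingular", entry))
```

Because the application only ever refers to the contract keys, a changed form identifier in the data means editing one contract value rather than the application code.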

Efficient client application data updates using Scribe-Data proceed as follows:

1. New data is derived via the Scribe-Data CLI
2. Contracts are written to map the data values to keys that are used in the application
3. Scribe-Data is run again to get new data in the future
4. The contracts are checked to make sure that all contract values still exist within the resulting data
    - The question is whether a form was added or removed from a data point such that its identifier has changed
    - This is done via the following command:

scribe-data cc -cd DATA_CONTRACTS_DIRECTORY  # default data path is used

5. If the check above passes, then new data can be added to the client applications
6. If the check fails, then the contract values should be updated given the directions from the CLI and then new data can be loaded
7. Getting just the data that's in the client application is done via the following command:

scribe-data fd -cd DATA_CONTRACTS_DIRECTORY  # default data paths are used

Scribe-Data users should rarely need to update contracts if they're using stable data from Wikidata. We provide this functionality given the wiki nature of the underlying data so that the Scribe community and others can easily react to potential changes in the lexeme data.
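The check and filter steps in the process above can be sketched in a few lines of Python. This is an illustration of the idea only, not how the cc and fd commands are implemented. The function names, the contract layout and the sample data are assumptions made for this example.

```python
def check_contract(contract, data):
    """Return the contract values that no longer appear in the data, per data type."""
    missing = {}
    for data_type, mapping in contract.items():
        # Collect every identifier that occurs in any entry of this data type.
        available = set()
        for entry in data.get(data_type, []):
            available.update(entry)
        absent = sorted(v for v in mapping.values() if v not in available)
        if absent:
            missing[data_type] = absent
    return missing


def filter_by_contract(contract, data):
    """Keep only the identifiers that the contract refers to."""
    filtered = {}
    for data_type, mapping in contract.items():
        wanted = set(mapping.values())
        filtered[data_type] = [
            {key: value for key, value in entry.items() if key in wanted}
            for entry in data.get(data_type, [])
        ]
    return filtered


# Hypothetical contract and data used only for this example.
contract = {"verbs": {"base": "infinitive", "participle": "pastParticiple"}}
data = {
    "verbs": [
        {"lexemeID": "L1", "infinitive": "go", "pastParticiple": "gone", "simplePast": "went"}
    ]
}

print(check_contract(contract, data))      # an empty dict means the check passes
print(filter_by_contract(contract, data))  # only contract values are kept
```

If an identifier in the data changed, `check_contract` would list the stale contract value under its data type, which is the point at which the contract would be updated before loading new data.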

Note

You can learn more about contracts and the process around them in DATA_CONTRACTS.md.

Contributing `⇧`

Scribe uses Matrix for communications. You're more than welcome to join us in our public chat rooms to share ideas, ask questions or just say hi to the team :) We'd suggest that you use the Element client and Element X for a mobile app.

Please see the contribution guidelines and Wikidata and Scribe Guide if you are interested in contributing to Scribe-Data. Work that is in progress or could be implemented is tracked in the issues and projects.

Note

Just because an issue is assigned on GitHub doesn't mean the team isn't open to your contribution! Feel free to write in the issues and we can potentially reassign it to you.

Those interested can further check the -next release- and -priority- labels in the issues for those that are most important, as well as those marked good first issue that are tailored for first-time contributors.

After your first few pull requests, organization members would be happy to discuss granting you further rights as a contributor, with a maintainer role then being possible after continued interest in the project. Scribe seeks to be an inclusive and supportive organization. We'd love to have you on the team!

Ways to Help `⇧`

Reporting bugs as they're found 🐞
Working on new features ✨
Documentation for onboarding and project cohesion 📝
Adding language data to Scribe-Data via Wikidata! 🗃️

Road Map `⇧`

The Scribe road map can be followed in the organization's project board where we list the most important issues along with their priority, status and an indication of which sub projects they're included in (if applicable).

Note

Consider joining our bi-weekly developer syncs!

Data Edits `⇧`

Note

Please see the Wikidata and Scribe Guide for an overview of Wikidata and how Scribe uses it.

Scribe does not accept direct edits to the grammar JSON files as they are sourced from Wikidata. Edits can be discussed, and the queries themselves will be changed and run before an update. If there is a problem with one of the files, then the fix should be made on Wikidata and not on Scribe. Feel free to let us know that edits have been made by opening a data issue and we'll be happy to integrate them!

Environment Setup `⇧`

Important

Suggested IDE extensions

VS Code

The development environment for Scribe-Data can be installed via the following steps:

Fork the Scribe-Data repo, clone your fork, and configure the remotes:

Note

Consider using SSH

As an alternative to the HTTPS URLs used in the instructions below, consider using SSH to interact with GitHub from the terminal. SSH allows you to connect without a user-pass authentication flow.

To run git commands with SSH, remember then to substitute the HTTPS URL, https://github.com/..., with the SSH one, git@github.com:....

e.g. Cloning now becomes git clone git@github.com:<your-username>/Scribe-Data.git

GitHub also has their documentation on how to Generate a new SSH key 🔑

# Clone your fork of the repo into the current directory.
git clone https://github.com/<your-username>/Scribe-Data.git
# Navigate to the newly cloned directory.
cd Scribe-Data
# Assign the original repo to a remote called "upstream".
git remote add upstream https://github.com/scribe-org/Scribe-Data.git

Now, if you run git remote -v you should see two remote repositories named:
- origin (forked repository)
- upstream (Scribe-Data repository)

Use Python venv to create the local development environment within your Scribe-Data directory:

On Unix or MacOS, run:

python3 -m venv venv  # make an environment named venv
source venv/bin/activate # activate the environment

On Windows (using Command Prompt), run:

python -m venv venv
venv\Scripts\activate.bat

On Windows (using PowerShell), run:

python -m venv venv
venv\Scripts\activate.ps1

After activating the virtual environment, install the required dependencies and set up pre-commit by running:

pip install --upgrade pip  # make sure that pip is at the latest version
pip install -r requirements-dev.txt  # install development dependencies
pip install -e .  # install the local version of Scribe-Data
pre-commit install  # install pre-commit hooks
# pre-commit run --all-files  # lint and fix common problems in the codebase
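If the installation steps appear to target the wrong Python, a quick check from within Python can confirm whether a virtual environment is active. This is a generic sketch that is not part of Scribe-Data, and the function name `in_virtualenv` is an assumption made for this example.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Virtual environment active:", in_virtualenv())
print("Interpreter:", sys.executable)
```

When the environment named venv from the steps above is active, the interpreter path printed here should sit inside the venv directory of your Scribe-Data clone.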

See the contribution guidelines for a more detailed explanation and troubleshooting.

Note

Feel free to contact the team in the Data room on Matrix if you're having problems getting your environment set up!

Featured By `⇧`

Please see the blog posts page on our website for a list of articles on Scribe, and feel free to open a pull request to add one that you've written at scribe-org/scri.be!

Organizations

The following organizations have supported the development of Scribe projects through various programs. Thank you all! 💙

Powered By `⇧`

Contributors

Many thanks to all the Scribe-Data contributors! 🚀
