VELD registry

This is a living collection of VELD repositories and their contained velds.

The technical concept for the VELD design can be found here: https://zenodo.org/records/13318651
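
Each entry in the sections below summarizes the metadata of one veld yaml file in the respective repository. As an orientation aid, the following sketch shows the descriptive fields this registry aggregates, expressed in YAML. The field names and example values are taken from the entries below; the exact nesting and any further keys are defined by the VELD technical concept linked above, so this is not the authoritative schema.

```yaml
# Illustrative only: field names and example values as they appear in
# this registry. The authoritative veld yaml schema is defined in the
# VELD technical concept linked above; nesting and additional keys may
# differ.
description: short human-readable summary of what this veld does
topic:                  # values are collected in the "topic vocab" section below
  - NLP
  - Machine Learning
input:
  - description: optional note on the expected input
    file_type: txt      # values are collected in the "file_type vocab" section below
    content: raw text   # values are collected in the "content vocab" section below
output:
  - description: optional note on the produced output
    file_type: conllu
    content: linguistic data
```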

sections in this README:

  • data velds
  • code velds
  • chain velds
  • topic vocab
  • content vocab
  • file_type vocab

data velds

code velds

  • https://github.com/veldhub/veld_code__analyse_conllu
    • veld.yaml
      • valid: True
      • metadata:
        • topic: NLP, Machine Learning, Tokenization, Lemmatization, Part Of Speech, Dependency Parsing, Universal Dependencies, Grammatical Annotation
        • input:
          • 1:
            • file_type: conllu
        • output:
          • 1:
            • file_type: json
            • content: statistics, NLP statistics
  • https://github.com/veldhub/veld_code__apis_ner_evaluate_old_models
    • veld_evaluate.yaml
      • valid: True
      • metadata:
        • description: hard-coded evaluation of several spaCy 2.2.4 models.
        • topic: NLP, Machine Learning, Named Entity Recognition
        • input:
          • 1:
            • description: This input is hard-wired to the apis spacy-ner repo and not made for generic usage.
            • file_type: pickle, txt, json, spaCy model
            • content: NER gold data, Machine Learning model, NLP model
        • output:
          • 1:
            • description: evaluation report of the models from the apis spacy-ner repo.
            • file_type: md
            • content: evaluation report
  • https://github.com/veldhub/veld_code__apis_ner_transform_to_gold
    • veld.yaml
      • valid: True
      • metadata:
        • description: hard-coded conversion of apis ner models to custom json format.
        • topic: ETL, Data Cleaning
        • input:
          • 1:
            • description: This input is hard-wired to the apis spacy-ner repo and not made for generic usage.
            • file_type: pickle, txt, json
            • content: NER gold data
        • output:
          • 1:
            • description: raw and uncleaned, as originally provided; only transformed to json.
            • file_type: json
            • content: NER gold data
          • 2:
            • description: removed empty entity annotations and fixed border issues.
            • file_type: json
            • content: NER gold data
          • 3:
            • description: in addition to cleaning, this data is slimmed down by removing superfluous entity ids in favor of simplified entity classes.
            • file_type: json
            • content: NER gold data
          • 4:
            • file_type: txt
  • https://github.com/veldhub/veld_code__bert_embeddings
  • https://github.com/veldhub/veld_code__downloader
    • veld.yaml
      • valid: True
      • metadata:
        • description: A very simple curl call. Since many veld chains need to download data, it makes sense to encapsulate the download functionality into a dedicated downloader veld code
        • topic: ETL
        • output:
          • 1:
            • description: optional. If out_file is unset, this script will fetch the file name from the resource.
  • https://github.com/veldhub/veld_code__fasttext
    • veld_jupyter_notebook.yaml
      • valid: True
      • metadata:
        • description: a fasttext training and inference jupyter notebook.
        • topic: NLP, Machine Learning, Word Embeddings
    • veld_train.yaml
      • valid: True
      • metadata:
        • description: a fasttext training and inference jupyter notebook.
        • topic: NLP, Machine Learning, Word Embeddings
        • input:
          • 1:
            • description: training data must be expressed as one sentence per line.
            • file_type: txt
            • content: raw text
        • output:
          • 1:
            • file_type: fastText model
            • content: Word Embeddings
  • https://github.com/veldhub/veld_code__glove
    • veld_jupyter_notebook.yaml
      • valid: True
      • metadata:
        • description: A jupyter notebook that loads GloVe vectors and provides some convenient functions to use them.
        • topic: NLP, Machine Learning, Word Embeddings
    • veld_train.yaml
      • valid: True
      • metadata:
        • description: This code repo encapsulates the original code from https://github.com/stanfordnlp/GloVe/tree/master
        • topic: NLP, Machine Learning, Word Embeddings
        • input:
          • 1:
            • description: In the txt file, each line must be one sentence
            • file_type: txt
            • content: natural text
        • output:
          • 1:
            • file_type: GloVe model
            • content: NLP model, Word Embeddings model
          • 2:
            • file_type: GloVe model
            • content: NLP model, Word Embeddings model
          • 3:
            • file_type: GloVe model
            • content: NLP model, Word Embeddings model
          • 4:
            • file_type: GloVe model
            • content: NLP model, Word Embeddings model
  • https://github.com/veldhub/veld_code__jupyter_notebook_base
    • veld.yaml
      • valid: True
      • metadata:
        • description: template veld code repo for a jupyter notebook
  • https://github.com/veldhub/veld_code__simple_docker_test
    • veld.yaml
      • valid: True
      • metadata:
        • description: prints information about the python interpreter within the docker container.
        • topic: Testing
  • https://github.com/veldhub/veld_code__spacy
    • veld_convert.yaml
      • valid: True
      • metadata:
        • description: prepares data for spaCy NER training, since spaCy expects entity annotation indices to lie precisely at the beginning and end of words and does not allow overlapping entity annotations. The data is then converted to spaCy docbin and prepared for training by splitting it into train, dev, and eval subsets and shuffling them randomly.
        • topic: ETL, NLP, Machine Learning
        • input:
          • 1:
            • description: name of the csv file containing NER gold data
            • file_type: json
            • content: NER gold data
        • output:
          • 1:
            • description: path to folder where spacy docbin files will be stored with file names train.spacy, dev.spacy, eval.spacy
            • file_type: spaCy docbin
            • content: NER gold data
          • 2:
            • description: log file of conversion
            • file_type: txt
            • content: log
    • veld_create_config.yaml
    • veld_publish_to_hf.yaml
      • valid: True
      • metadata:
        • description: simple service to push spacy models to huggingface. IMPORTANT: Only works from spacy v3.* onwards!
        • topic: NLP, ETL
        • input:
          • 1:
            • file_type: spaCy model
            • content: NLP model
    • veld_train.yaml
      • valid: True
      • metadata:
        • description: A spaCy training setup, utilizing spaCy v3's config system.
        • topic: NLP, Machine Learning
        • input:
          • 1:
            • file_type: spaCy docbin
            • content: NLP gold data, ML gold data, gold data
          • 2:
            • file_type: spaCy docbin
            • content: NLP gold data, ML gold data, gold data
          • 3:
            • file_type: spaCy docbin
            • content: NLP gold data, ML gold data, gold data
          • 4:
        • output:
          • 1:
            • file_type: spaCy model
            • content: NLP model
          • 2:
            • description: training log file
            • file_type: txt
            • content: log
          • 3:
            • description: evaluation log file
            • file_type: txt
            • content: log
  • https://github.com/veldhub/veld_code__teitok-tools
  • https://github.com/veldhub/veld_code__udpipe
    • veld_infer.yaml
      • valid: True
      • metadata:
        • description: udpipe inference setup
        • topic: NLP, Machine Learning, Tokenization, Lemmatization, Part Of Speech, Dependency Parsing, Universal Dependencies, Grammatical Annotation
        • input:
          • 1:
            • description: txt files to run inference on. Note that the environment var in_txt_file is optional; if it is not set, the entire input folder will be processed recursively
            • file_type: txt
            • content: raw text
          • 2:
            • file_type: udpipe model
            • content: NLP model, tokenizer, lemmatizer
        • output:
          • 1:
            • description: The file name of the output conllu is derived from the corresponding input txt file, since recursive processing requires such automatic naming
            • file_type: conllu, tsv
            • content: inferenced NLP data, tokenized text, lemmatized text, Part Of Speech of text, Universal Dependencies of text, grammatically annotated text, linguistic data
    • veld_train.yaml
      • valid: True
      • metadata:
        • description: udpipe training setup
        • topic: NLP, Machine Learning, Tokenization, Lemmatization, Part Of Speech, Dependency Parsing, Universal Dependencies, Grammatical Annotation
        • input:
          • 1:
            • file_type: conllu
            • content: tokenized text, enriched text, linguistic data
        • output:
          • 1:
            • file_type: udpipe model
            • content: NLP model, tokenizer, lemmatizer
  • https://github.com/veldhub/veld_code__wikipedia_nlp_preprocessing
    • veld_download_and_extract.yaml
      • valid: True
      • metadata:
        • description: downloading wikipedia archive and extracting each article to a json file.
        • topic: NLP, Machine Learning, ETL
        • output:
          • 1:
            • description: a folder containing json files, where each file contains the content of a wikipedia article
            • file_type: json
            • content: NLP training data, raw text
    • veld_transform_wiki_json_to_txt.yaml
      • valid: True
      • metadata:
        • description: transforming wikipedia raw jsons to a single txt file.
        • topic: NLP, Machine Learning, ETL
        • input:
          • 1:
            • description: a folder containing json files, where each file contains the contents of a wikipedia article
            • file_type: json
            • content: NLP training data, raw text
        • output:
          • 1:
            • description: single txt file, containing only the raw content of wikipedia pages, split into one sentence or one article per line, possibly only a sampled subset for testing.
            • file_type: txt
            • content: NLP training data, Word Embeddings training data, raw text
  • https://github.com/veldhub/veld_code__word2vec
    • veld_jupyter_notebook.yaml
      • valid: True
      • metadata:
        • description: a word2vec jupyter notebook, for quick experiments
        • topic: NLP, Machine Learning, Word Embeddings
        • input:
          • 1:
            • description: arbitrary storage for word2vec experiments
            • file_type: word2vec model, txt
            • content: NLP model, Word Embeddings model, model metadata, NLP training data, Word Embeddings training data, raw text
        • output:
          • 1:
            • description: arbitrary storage for word2vec experiments
            • file_type: word2vec model, txt
            • content: NLP model, Word Embeddings model, model metadata, NLP training data, Word Embeddings training data, raw text
    • veld_train.yaml
      • valid: True
      • metadata:
        • description: word2vec training setup
        • topic: NLP, Machine Learning, Word Embeddings
        • input:
          • 1:
            • description: training data. Must be one single txt file, one sentence per line.
            • file_type: txt
            • content: NLP training data, Word Embeddings training data, raw text
        • output:
          • 1:
            • description: self trained Word Embeddings word2vec model
            • file_type: word2vec model
            • content: NLP model, Word Embeddings model
  • https://github.com/veldhub/veld_code__wordembeddings_evaluation
    • veld_analyse_evaluation.yaml
      • valid: True
      • metadata:
        • description: data visualization of all evaluation data, in a jupyter notebook.
        • topic: NLP, Word Embeddings, Data Visualization
        • input:
          • 1:
            • description: summary of the custom evaluation logic on word embeddings
            • file_type: yaml
            • content: Evaluation data
        • output:
          • 1:
            • description: data visualization of all evaluation data, expressed as interactive html
            • file_type: html
            • content: data visualization
          • 2:
            • description: data visualization of all evaluation data, expressed as png
            • file_type: png
            • content: data visualization
    • veld_analyse_evaluation_non_interactive.yaml
      • valid: True
      • metadata:
        • description: data visualization of all evaluation data; non-interactive version of the jupyter code.
        • topic: NLP, Word Embeddings, Data Visualization
        • input:
          • 1:
            • description: summary of the custom evaluation logic on word embeddings
            • file_type: yaml
            • content: evaluation data
        • output:
          • 1:
            • description: data visualization of all evaluation data, expressed as interactive html
            • file_type: html
            • content: data visualization
          • 2:
            • description: data visualization of all evaluation data, expressed as png
            • file_type: png
            • content: data visualization
    • veld_eval_fasttext.yaml
      • valid: True
      • metadata:
        • description: custom evaluation logic on fasttext word embeddings.
        • topic: NLP, Machine Learning, Evaluation
        • input:
          • 1:
            • file_type: fastText model
            • content: NLP model, Word Embeddings model
          • 2:
            • file_type: yaml
            • content: metadata
          • 3:
            • file_type: yaml
            • content: NLP gold data
        • output:
          • 1:
            • file_type: yaml
          • 2:
            • file_type: txt
            • content: log
    • veld_eval_glove.yaml
      • valid: True
      • metadata:
        • description: custom evaluation logic on GloVe word embeddings.
        • topic: NLP, Machine Learning, Evaluation
        • input:
          • 1:
            • file_type: GloVe model
            • content: NLP model, Word Embeddings model
          • 2:
            • file_type: yaml
            • content: metadata
          • 3:
            • file_type: yaml
            • content: NLP gold data
        • output:
          • 1:
            • file_type: yaml
          • 2:
            • file_type: txt
            • content: log
    • veld_eval_word2vec.yaml
      • valid: True
      • metadata:
        • description: custom evaluation logic on word2vec word embeddings.
        • topic: NLP, Machine Learning, Evaluation
        • input:
          • 1:
            • description: word2vec model file to be evaluated
            • file_type: word2vec model
            • content: NLP model, word embeddings model
          • 2:
            • description: word2vec model metadata
            • file_type: yaml
            • content: metadata
          • 3:
            • file_type: yaml
            • content: NLP gold data
        • output:
          • 1:
            • file_type: yaml
          • 2:
            • file_type: txt
            • content: log
  • https://github.com/veldhub/veld_code__wordembeddings_preprocessing
    • veld_preprocess_clean.yaml
      • valid: True
      • metadata:
        • description: Removes lines that don't reach a threshold for the ratio of textual to non-textual content (numbers, special characters). Splits the output into a clean and a dirty file.
        • topic: NLP, Preprocessing, ETL
        • input:
          • 1:
            • file_type: txt
            • content: raw text
        • output:
          • 1:
            • description: clean lines, where each line's ratio is above the configured threshold
            • file_type: txt
            • content: raw text
          • 2:
            • description: dirty lines, where each line's ratio is below the configured threshold
            • file_type: txt
            • content: raw text
    • veld_preprocess_lowercase.yaml
      • valid: True
      • metadata:
        • description: makes entire text lowercase
        • topic: NLP, Preprocessing, ETL
        • input:
          • 1:
            • file_type: txt
            • content: raw text
        • output:
          • 1:
            • file_type: txt
            • content: raw text
    • veld_preprocess_remove_punctuation.yaml
      • valid: True
      • metadata:
        • description: removes punctuation from text with spaCy pretrained models
        • topic: NLP, Preprocessing, ETL
        • input:
          • 1:
            • file_type: txt
            • content: raw text
        • output:
          • 1:
            • file_type: txt
            • content: raw text
          • 2:
            • file_type: txt
            • content: raw text
    • veld_preprocess_sample.yaml
      • valid: True
      • metadata:
        • description: takes a random sample of lines from a txt file. The randomness can be made reproducible with a seed
        • topic: NLP, Preprocessing, ETL
        • input:
          • 1:
            • file_type: txt
            • content: raw text
        • output:
          • 1:
            • file_type: txt
            • content: raw text
    • veld_preprocess_strip.yaml
      • valid: True
      • metadata:
        • description: removes all lines before and after given line numbers
        • topic: NLP, Preprocessing, ETL
        • input:
          • 1:
            • file_type: txt
            • content: raw text
        • output:
          • 1:
            • file_type: txt
            • content: raw text
  • https://github.com/veldhub/veld_code__xmlanntools
  • https://github.com/veldhub/veld_code__xml_xslt_transformer
    • veld.yaml
      • valid: True
      • metadata:
        • description: generic xml / xslt transformation setup.
        • topic: ETL, Preprocessing
        • input:
          • 1:
            • description: the input xml file or folder containing xml. Note that if var in_xml_file is set, this script will only transform that file. If it's not set, it will go through the input folder recursively and create an equivalent output data structure.
            • file_type: xml
          • 2:
            • description: the input xsl file or folder containing xsl
            • file_type: xslt
        • output:
          • 1:
            • description: output file or folder for converted txt. Note that the var 'out_txt_file' is only respected when the input is a single xml file. If the input is a folder, the output will be an equivalent data structure and the var 'out_txt_file' is ignored.
            • file_type: xml, txt
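
To make the mapping concrete, the sketch below shows how the optional variables described in the last entry above (veld_code__xml_xslt_transformer) might be set in a veld yaml. The docker compose style layout (services, volumes, environment), the service name, and the mount paths are assumptions made for illustration; only the variable names in_xml_file and out_txt_file and their described behaviour come from the registry entry itself. Consult the repository and the VELD technical concept for the actual file.

```yaml
# Hypothetical sketch, not the actual content of the repository.
# The compose-style services/volumes/environment layout, the service
# name, and the mount paths are assumptions for illustration; only the
# variables in_xml_file and out_txt_file and their behaviour are taken
# from the registry entry above.
services:
  veld_xml_xslt_transformer:        # hypothetical service name
    build: .
    volumes:
      - ./data/in/:/veld/input/     # hypothetical mount points
      - ./data/out/:/veld/output/
    environment:
      in_xml_file: example.xml      # optional: if unset, the input folder is processed recursively
      out_txt_file: example.txt     # only respected when the input is a single xml file
```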

chain velds

topic vocab

  • Bible Studies
  • Data Cleaning
  • Data Visualization
  • Dependency Parsing
  • ETL
  • Evaluation
  • Grammatical Annotation
  • Lemmatization
  • Machine Learning
  • NLP
  • Named Entity Recognition
  • Part Of Speech
  • Preprocessing
  • Testing
  • Tokenization
  • Universal Dependencies
  • Word Embeddings

content vocab

  • Evaluation data
  • ML gold data
  • Machine Learning model
  • NER data
  • NER gold data
  • NLP gold data
  • NLP model
  • NLP statistics
  • NLP training data
  • Part Of Speech of text
  • TEI
  • Universal Dependencies of text
  • Word Embeddings
  • Word Embeddings model
  • Word Embeddings training data
  • annotated literature
  • data visualization
  • enriched text
  • evaluation data
  • evaluation report
  • gold data
  • grammatically annotated text
  • inferenced NLP data
  • lemmatized text
  • lemmatizer
  • linguistic data
  • linguistically enriched text
  • log
  • metadata
  • model metadata
  • natural text
  • newspaper texts
  • raw text
  • spacy model
  • spacy training config
  • statistics
  • tokenized text
  • tokenizer
  • word embeddings model

file_type vocab

  • GloVe model
  • bin
  • cfg
  • conllu
  • csv
  • fastText model
  • html
  • ini
  • json
  • md
  • pickle
  • png
  • spaCy docbin
  • spaCy model
  • tsv
  • txt
  • udpipe model
  • word2vec model
  • xml
  • xslt
  • yaml