Skip to content

Latest commit

 

History

History
52 lines (45 loc) · 8.92 KB

v1.md

File metadata and controls

52 lines (45 loc) · 8.92 KB

Baseline Version 1.0 Release Notes

This was designed to make the concepts in Baseline simpler, easier to extend, more flexible, and easier to use than the previous version (v0.5.2)

Consideration went into making Baseline a better tool for API users -- people that might wish to develop outside of the mead driver and would prefer to use the API stand-alone. Also we have reduced the complexity of MEAD extension. The v1.0 MEAD extension capabilities are documented here. Additionally, v1.0 simplifies the user experience for inference. Here are some of the key changes that are in development for v1.0.

The underlying changes have simplified mead considerably, making it easier to define custom model architectures purely in the JSON or YAML configuration files

  • hpctl: The release of v1.0 of baseline includes this a new component. This is a program built on mead but adds support for hyper-parameter searching to try and find optimal parameters using a template. It can be run across multiple GPUs. Multiple search methods and backends are supported.

  • Stronger Baselines

  • API simplifications

    • Vectorizers: the cumbersome featurizer concept is gone and has been replaced by a much simpler vectorizer

      • vectorizers build initial dictionaries over tokens, and ultimately do the featurization, using a finalized vocabulary provided to them
      • These vectorizers can be serialized during training and reused for test
      • The vocabularies have serialization code, and are automatically serialized when using mead.
      • For tagging, dictionary-based vectorizers are supported which allows compositional features to be constructed from multiple columns of the CONLL file
    • Embedding Representations: The embedding concept has been updated to be more inline with recent NLP advances, particularly "contextual embeddings"

      • It is now possible to define sub-graphs for execution in each of the frameworks to represent an embedding
      • It was observed that the term embeddings is now used in literature to describe a token representation or multi-part sub-graph
        • typically this consists of a lookup table (what most frameworks call an Embedding) and optionally, a transformation of the lookup table outputs, for example, a BiLM used to produce a word representation
      • The previous baseline.w2v models still exist, and are adapted in a framework-specific way to generate the relevant sub-graph, facilitating a new feature for user-defined embeddings in each framework
        • This will make it easy to use custom embeddings that require LSTMs or Transformers or Convolutions in the sub-graph
    • Model Simplifications

      • Due to the embeddings improvements, the models have been greatly simplified. The inputs to all default models have been simplified. The embeddings sub-graph concept allows models to delegate their sub-graphs for embeddings, and stores the sub-graph components in a dictionary
        • This also means that model methods like make_input are drastically simplified, as well as operations like TF exporting, which now can simply cycle the embedding keys. It also means that models inherently can support multiple embeddings as long as they are time-aligned
        • Additionally, the models have been abstracted for an embed phase, which allows complex composition of multiple embeddings through extension of a single method
      • The vocabularies and labels have been removed from the models. This creates better Separation of Concerns (SoC) where an external object (the vectorizer) uses another external object (the vocabulary) to produce tensor representations for the model. There are services that add support for a full bundle containing all 3 components, but this simplifies the models drastically
      • The encoder-decoder (seq2seq) task has been overhauled to split the encoders and decoders and allow them to be configurable from mead. Its possible to create your own encoders and decoders. For RNN-based decoders, which need to support some sort of policy of how/if to transfer hidden-state, we have added the concept of an arc state policy which is also extensible. We also now enforce a tensor ordering on inputs and outputs of batch as first dimension, and temporal length as second dimension
    • Services: The API support a user-friendly concept of a service (vs a component like the ClassifierModel) that has access to all 3 components required for inference (the vocabularies, the vectorizers and the models). It delegates deserialization to each component and can load any backend framework model with a simple XXXService.load(model) API. Inference is done using the predict method on the service class. Full examples can be found for classify, tagging, encoder-decoders and language modeling

      • Services can also provide an abstraction to TF serving remote models (in production). These can be accessed by passing a remote. The vectorizer and vocab handling works exactly as for local services, but the actual model is executed on the server instead.
        • Both HTTP/REST and gRPC are supported. HTTP/REST requires the requests package. To use gRPC, grpc package is required, and there are stubs included in Baseline to support
    • Data/Reader Simplifications: The readers have been simplified to use the vectorizers and the DataFeed and underlying support components have been greatly simplifed

      • The readers delegate counting to the vectorizers as well as featurization. This is more DRY than in the previous releases. This means that readers can handle much more complex features without special casing
      • The Examples and DataFeed objects are largely reused between tasks except when this is not possible
        • The batching operation on the examples is now completely generalized whih makes adding custom features simple
    • Easier Extension Points: We have removed the complexity of addon registration, preferring instead simple decorators to the previous method of convention-based plugins. Documentation can be found here

    • Training Simplifications: A design goal was that a user should easily be able to train a model without using mead. It should be easier use the Baseline API to train directly or to use external software to train a Baseline model

      • Multi-GPU support is consistent, defaults to all CUDA_VISIBLE_DEVICES
    • More Documentation: There is more code documentation, as well as API examples that show how to use the Baseline API directly. These are also used to self-verify that the API is as simple to use as possible. There is forthcoming documentation on the way that addons work under the hood, as this has been a point of confusion for some users

    • Standardized Abstractions: We have attempted to unify a set of patterns for each model/task and to try and ensure that the routines making up execution share a common naming convention and flow across each framework

  • mead: mead has been simplified to have better underlying (and reusable) methods to reduce code. It also has a new style of configuration file that is more cohesive, and more powerful than before. Here is an example of a tagger configuration using multiple embeddings and different vectorizers without adding any custom components. These models tend to perform better than single embedding models but require no custom code.

  • More utilities and models in the core library for each framework

    • Many new Encoder utilities that make it easy to build your own models
    • BLEU score added on validation and test for seq2seq tasks
  • xpctl: The mongodb backend has changed to allow a simpler command syntax where a query can be made without having to know the task name