Please note that this project is under active development and is not yet stable. Use at your own discretion.
unleash your annotation superpowers \o/
Dualtext is an annotation tool for textual data specialized in sentence similarity annotations. Some of its features include:
- interactive annotation mode / find similar sentences through search using elasticsearch and BERT SentenceEmbeddings
- review and inter-rater workflow / configure and automate creation of review and inter-rater reliability tasks
- live statistics / always know the current state of your project, check progress, label distributions and timing estimations
- autobalanced datasets / balance your dataset by informing annotators about labels currently underrepresented
- API client / configure projects and corpora programmatically
- CLI / create projects from the CLI
Dualtext is a Django application using a Vue 3 SPA frontend. Search functionality is provided through elasticsearch or custom search integrations.
1. installing elasticsearch
Dualtext uses elasticsearch. Go to: https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html and choose the appropriate installation method for your system.
Start elasticsearch: $ sudo systemctl start elasticsearch.service
(more methods at https://www.elastic.co/guide/en/elasticsearch/reference/current/starting-elasticsearch.html)
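You can check that elasticsearch is reachable before continuing (on a default local setup without security enabled, this returns a JSON blob with cluster information):
$ curl localhost:9200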
2. get dualtext
$ git clone git@github.com:mathislucka/dualtext.git
$ cd dualtext
Dualtext is split into 3 distinct modules. Under the root directory you will find:
/dualtext_client
-> contains all API client and CLI related code
/dualtext_server
-> contains all backend related code
/frontend
-> contains all frontend related code
3. getting the server running
Go to settings.py in dualtext_server/dualtext/ and point the ELASTICSEARCH_DSL entry to your elasticsearch host (default localhost:9200).
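The setting name suggests the django-elasticsearch-dsl package, whose configuration format looks like this; a sketch for a local default host:
ELASTICSEARCH_DSL = {
    'default': {
        'hosts': 'localhost:9200'
    },
}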
In settings.py, configure the DATABASES entry according to your local DB setup. If you'd like to use SQLite:
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': BASE_DIR / 'db.sqlite3',
    }
}
Now create a virtual environment if you like.
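For example, using Python's built-in venv module:
$ python3 -m venv .venv
$ source .venv/bin/activate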
Then:
$ pip install -r requirements.txt
$ cd dualtext_server
$ python manage.py makemigrations
$ python manage.py migrate
$ python manage.py createsuperuser
$ python manage.py test
$ python manage.py runserver
Your server should now be running at localhost:8000. Note that a SentenceEmbedding model will be downloaded from the Huggingface model hub when you first run the tests or start the server. If you'd like to use a custom SentenceEmbedding model:
Go to dualtext_server/dualtext_api/feature_builders/sentence_embedding.py:ln7 and change the model to a local file path or another SentenceEmbedding model from Huggingface.
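The exact line depends on the version you have checked out, but with the sentence-transformers library the model is typically selected like this (the model name below is only an illustration):

from sentence_transformers import SentenceTransformer

# swap in any Hugging Face model id or a local file path
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
# model = SentenceTransformer('/path/to/your/local/model')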
4. getting the frontend running
Install node and npm (https://www.npmjs.com/get-npm).
Then go to dualtext/frontend/ and run:
$ npm install
$ npm run serve
Your local development server should now be running at localhost:8080. If you'd like to build your assets for production, use npm run build instead.
5. installing the CLI
Go to dualtext/dualtext_client/, then run:
$ dualtext
You can now use the CLI.
This aims to be a pragmatic guide to the most essential parts of dualtext. It covers working with the API from the CLI or the API client, using the dualtext frontend and implementing custom search methods or feature types in the dualtext backend.
Dualtext was built with automated management for annotation projects and corpora in mind.
The API client and the CLI enable developers and data scientists to interact with the API from their own Python programs or from the command line. The focus of interacting with the API lies in project and corpus management. You can create corpora, initiate projects and download or discover data resulting from ongoing annotation. The full API schema can be discovered at <host>/api/v1/docs/ when the development server is running. It is served in the form of a Swagger UI page informing the user about the basic structure of dualtext's API. You can get a JSON representation of the schema at <host>/openapi.
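If you want to inspect the schema programmatically, something like the following works, assuming the development server is running on localhost:8000 (depending on your setup, the endpoint may require authentication):

import requests

# fetch the OpenAPI schema as JSON from the dev server
schema = requests.get('http://localhost:8000/openapi').json()
print(list(schema.keys()))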
To use the API client, simply import the required modules from dualtext_client/. Each entity exposed through the public API has a class containing all methods to interact with that entity.
As an example, if you would like to create a corpus and corresponding documents you would do this:
from dualtext_client.corpus import Corpus
from dualtext_client.session import Session
from dualtext_client.document import Document

# first establish a session
s = Session(username='your username', password='your password')

# create a corpus instance using the established session
c = Corpus(session=s)

# now create a corpus
payload = {
    'name': '<name>',                     # a unique name for your corpus
    'corpus_meta': {},                    # a json field accepting any meta information
    'allowed_groups': ['<int>', '<int>'], # a list of group ids that shall be allowed to access the corpus
}
c = c.create(payload)

# now create some documents
# we are using the batch creation route which supports batches of up to 200 documents
d = Document(session=s, corpus=c.id)
documents = []
with open('some_file_path') as f:
    for line in f:
        documents.append({'content': line})
d.batch_create(documents)
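Since the batch route caps a single request at 200 documents, larger corpora have to be split up; a minimal sketch replacing the last call above:

chunk_size = 200
for i in range(0, len(documents), chunk_size):
    d.batch_create(documents[i:i + chunk_size])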
You can find json schemas for most of these resources at dualtext_client/schemas/.
Using the CLI is a bit simpler. If you would like to create a new project from a corpus of existing documents you would run:
$ dualtext mkproj --project-data /some/file/path/file.json
The mkproj command accepts a file path to a json file containing all the information for your project as an argument. You can find an example of the file's structure at dualtext_client/examples/create_from_scratch/. The schema that the file is expected to follow can be found at dualtext_client/schemas/project_from_scratch.schema.json.
Dualtext is extensible. In its basic version it provides two search methods for searching inside corpora and one feature that can be attached to each document in a corpus. A feature is a different representation of a document's content. It can be a vector, a list of tokens, a tag or anything else that takes time to compute and that you would like to permanently attach to a document. The basic concept is this:
A Corpus has one or more Features. Each feature contains a unique feature_key, which is used to retrieve the methods that build feature values from a feature builder class. As an example: Corpus A has the feature sentence_embedding. A SentenceEmbedding class was created, and the sentence_embedding key is linked in the Builder class (dualtext_server/dualtext_api/feature_builders/builder.py). When a document is added to Corpus A, the corresponding sentence embedding is automatically computed according to the implementation inside the SentenceEmbedding class.
Let's build a custom feature to illustrate this:
# /dualtext_server/dualtext_api/feature_builders/document_length.py
from .abstract_feature import AbstractFeature
from dualtext_api.models import Feature, FeatureValue
import pickle

# all feature builders should inherit from AbstractFeature
# all necessary methods are documented in the AbstractFeature class
class DocumentLength(AbstractFeature):
    def create_feature(self, documents):
        # this method receives a list of Document model instances
        feature = Feature.objects.get(key='document_length')
        for doc in documents:
            # pickle the length so it can be stored as a binary feature value
            val = pickle.dumps(len(doc.content))
            fv = FeatureValue(feature=feature, document=doc, value=val)
            fv.save()

    def update_features(self, documents):
        pass

    def remove_feature(self, documents):
        pass

    def process_query(self, query):
        return query
Now reference your newly built feature inside the Builder class:
# /dualtext_server/dualtext_api/feature_builders/builder.py
# ...
from .document_length import DocumentLength

class Builder():
    def __init__(self):
        self.features = {
            'sentence_embedding': SentenceEmbedding(),
            'elastic': Elastic(),
            'document_length': DocumentLength(),
        }
    # ...
Now you are done. When you assign a feature containing the feature key document_length to a corpus, the length of each document will be automagically computed and saved alongside the document in your DB.
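One caveat: the Feature record carrying the document_length key has to exist before documents are added. The required fields depend on the Feature model definition, but assuming the key alone is sufficient, creating it could look like this:

from dualtext_api.models import Feature

# assumption: Feature can be created from its key alone;
# check the model definition for other required fields
Feature.objects.get_or_create(key='document_length')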
Let's build a custom search method that will retrieve all documents below a certain content length:
# /dualtext_server/dualtext_api/search/document_length_search.py
from django.db.models import Q
from dualtext_api.models import FeatureValue
from .abstract_search import AbstractSearch
import pickle

class DocumentLengthSearch(AbstractSearch):
    def __init__(self):
        self.feature_key = 'document_length'

    def search(self, corpora, excluded_documents, query):
        # fetch all stored document_length values from the given corpora,
        # skipping any explicitly excluded documents
        feature_values = FeatureValue.objects.filter(
            Q(feature__key=self.feature_key) &
            Q(document__corpus__id__in=corpora) &
            ~Q(document__id__in=excluded_documents)
        ).all()
        found = []
        for fv in feature_values:
            length = pickle.loads(fv.value)
            if length < query:
                found.append((fv.document.id, length, self.feature_key))
        return found
DocumentLengthSearch inherits from AbstractSearch. It has to implement a search method, which will be run if the user decides to search for documents using their length. After implementing the custom search module, you need to reference the class in the global search class as follows:
# /dualtext_server/dualtext_api/search.py
# ...
from .document_length_search import DocumentLengthSearch

class Search():
    # ...
    @staticmethod
    def get_available_methods():
        return {
            'elastic': ElasticSearch,
            'sentence_embedding': SentenceEmbeddingSearch,
            'document_length': DocumentLengthSearch
        }
The new search method can now be used.
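Based on the search signature above, a direct invocation would look roughly like this (the corpus id and length threshold are placeholders):

searcher = DocumentLengthSearch()
# find documents in corpus 1 shorter than 50 characters, excluding none
results = searcher.search(corpora=[1], excluded_documents=[], query=50)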
In practice, you might not want to actually store feature values in the database, and you might want to avoid using the DB for search requests in order to increase performance. You can look at feature and search implementations using elasticsearch in /dualtext_server/dualtext_api/feature_builders/sentence_embedding.py and /dualtext_server/dualtext_api/search/sentence_embedding_search.py.