Python modules and scripts for working with Concrete, a data serialization format for NLP

hltcoe/concrete-python

Tutorial

Concrete-python is the Python interface to Concrete, a natural language processing data format and set of service protocols that work across different operating systems and programming languages via Apache Thrift. Concrete-python contains generated Python classes, utility classes and functions, and scripts. It does not contain the Thrift schema for Concrete, which can be found in the Concrete GitHub repository.

This document provides a quick tutorial of concrete-python installation and usage. For more information, including an API reference and development information, please see the online documentation.

License

Copyright 2012-2019 Johns Hopkins University HLTCOE. All rights reserved. This software is released under the 2-clause BSD license. Please see LICENSE for more information.

Requirements

concrete-python is tested on Python 3.5 and requires the Thrift Python library, among other Python libraries. These are installed automatically by setup.py or pip. The Thrift compiler is not required.

Note: The accelerated protocol offers a (de)serialization speedup of 10x or more; if you would like to use it, ensure a C++ compiler is available on your system before installing concrete-python. (If a compiler is not available, concrete-python will fall back to the unaccelerated protocol automatically.) If you are on Linux, a suitable C++ compiler will be listed as g++ or gcc-c++ in your package manager.

If you are using macOS Mojave with the Homebrew package manager (https://brew.sh), you can install the accelerated protocol using the script install-mojave-homebrew-accelerated-thrift.sh.

Installation

You can install concrete-python using the pip package manager:

pip install concrete

or by cloning the repository and running setup.py:

git clone https://github.com/hltcoe/concrete-python.git
cd concrete-python
python setup.py install

Basic usage

Here and in the following sections we make use of an example Concrete Communication file included in the concrete-python source distribution. The Communication type represents an article, book, post, Tweet, or any other kind of document that we might want to store and analyze. Copy it from tests/testdata/serif_dog-bites-man.concrete if you have the concrete-python source distribution or download it separately here: serif_dog-bites-man.concrete.

First we use the concrete-inspect.py tool (explained in more detail in the following section) to inspect some of the contents of the Communication:

concrete-inspect.py --text serif_dog-bites-man.concrete

This command prints the text of the Communication to the console. In our case the text is a short article formatted in SGML:

<DOC id="dog-bites-man" type="other">
<HEADLINE>
Dog Bites Man
</HEADLINE>
<TEXT>
<P>
John Smith, manager of ACMÉ INC, was bit by a dog on March 10th, 2013.
</P>
<P>
He died!
</P>
<P>
John's daughter Mary expressed sorrow.
</P>
</TEXT>
</DOC>

Now run the following command to inspect some of the annotations stored in that Communication:

concrete-inspect.py --ner --pos --dependency serif_dog-bites-man.concrete

This command shows a tokenization, part-of-speech tagging, named entity tagging, and dependency parse in a CoNLL-like columnar format:

INDEX  TOKEN      POS       NER  HEAD  DEPREL
-----  -----      ---       ---  ----  ------
1      John       NNP       PER  2     compound
2      Smith      NNP       PER  10    nsubjpass
3      ,          ,
4      manager    NN             2     appos
5      of         IN             7     case
6      ACMÉ       NNP       ORG  7     compound
7      INC        NNP       ORG  4     nmod
8      ,          ,
9      was        VBD            10    auxpass
10     bit        NN             0     ROOT
11     by         IN             13    case
12     a          DT             13    det
13     dog        NN             10    nmod
14     on         IN             15    case
15     March      DATE-NNP       13    nmod
16     10th       JJ             15    amod
17     ,          ,
18     2013       CD             13    amod
19     .          .

1      He         PRP            2     nsubj
2      died       VBD            0     ROOT
3      !          .

1      John       NNP       PER  3     nmod:poss
2      's         POS            1     case
3      daughter   NN             5     dep
4      Mary       NNP       PER  5     nsubj
5      expressed  VBD            0     ROOT
6      sorrow     NN             5     dobj
7      .          .

Reading Concrete

There are even more annotations stored in this Communication, but for now we move on to demonstrate handling of the Communication in Python. The example file contains a single Communication, but many (if not most) files contain several. The same code can be used to read Communications in a regular file, tar archive, or zip archive:

from concrete.util import CommunicationReader

for (comm, filename) in CommunicationReader('serif_dog-bites-man.concrete'):
    print(comm.id)
    print()
    print(comm.text)

This loop prints the unique ID and text (the same text we saw before) of our one Communication:

tests/testdata/serif_dog-bites-man.xml

<DOC id="dog-bites-man" type="other">
<HEADLINE>
Dog Bites Man
</HEADLINE>
<TEXT>
<P>
John Smith, manager of ACMÉ INC, was bit by a dog on March 10th, 2013.
</P>
<P>
He died!
</P>
<P>
John's daughter Mary expressed sorrow.
</P>
</TEXT>
</DOC>

In addition to the general-purpose CommunicationReader there is a convenience function for reading a single Communication from a regular file:

from concrete.util import read_communication_from_file

comm = read_communication_from_file('serif_dog-bites-man.concrete')

Communications are broken into Sections, which are in turn broken into Sentences, which are in turn broken into Tokens (and that's only scratching the surface). To traverse this decomposition:

from concrete.util import lun, get_tokens

for section in lun(comm.sectionList):
    print('* section')
    for sentence in lun(section.sentenceList):
        print('  + sentence')
        for token in get_tokens(sentence.tokenization):
            print('    - ' + token.text)

The output is:

* section
* section
  + sentence
    - John
    - Smith
    - ,
    - manager
    - of
    - ACMÉ
    - INC
    - ,
    - was
    - bit
    - by
    - a
    - dog
    - on
    - March
    - 10th
    - ,
    - 2013
    - .
* section
  + sentence
    - He
    - died
    - !
* section
  + sentence
    - John
    - 's
    - daughter
    - Mary
    - expressed
    - sorrow
    - .

Here we used get_tokens, which abstracts the process of extracting a sequence of Tokens from a Tokenization, and lun ("list un-none"), which returns its argument unchanged or, if the argument is None, an empty list. Many fields in Concrete are optional, including Communication.sectionList and Section.sentenceList, so checking for None by hand quickly becomes tedious.
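
To make the behavior of lun concrete, here is a minimal stand-in written only from the description above. The name lun_sketch is hypothetical and this is not the library's source; in real code, import lun from concrete.util as shown earlier.

```python
def lun_sketch(lst):
    """Return lst unchanged, or an empty list if lst is None ("list un-none")."""
    return [] if lst is None else lst


# Stand-in for an optional field (such as Section.sentenceList) that was never set.
sentence_list = None

# Iterating over None directly would raise TypeError; wrapped, the loop is a no-op.
count = 0
for sentence in lun_sketch(sentence_list):
    count += 1

print(count)  # prints 0
```

A list that is already set passes through untouched, so wrapping every optional list field in lun costs nothing when the field is present.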

In this Communication the tokens have been annotated with part-of-speech tags, as we saw previously using concrete-inspect.py. We can print them with the following code:

from concrete.util import lun, get_tagged_tokens

for section in lun(comm.sectionList):
    print('* section')
    for sentence in lun(section.sentenceList):
        print('  + sentence')
        for token_tag in get_tagged_tokens(sentence.tokenization, 'POS'):
            print('    - ' + token_tag.tag)

The output is:

* section
* section
  + sentence
    - NNP
    - NNP
    - ,
    - NN
    - IN
    - NNP
    - NNP
    - ,
    - VBD
    - NN
    - IN
    - DT
    - NN
    - IN
    - DATE-NNP
    - JJ
    - ,
    - CD
    - .
* section
  + sentence
    - PRP
    - VBD
    - .
* section
  + sentence
    - NNP
    - POS
    - NN
    - NNP
    - VBD
    - NN
    - .

Writing Concrete

We can add a new part-of-speech tagging to the Communication as well. Let's add a simplified version of the current tagging:

from concrete.util import AnalyticUUIDGeneratorFactory, now_timestamp
from concrete import TokenTagging, TaggedToken, AnnotationMetadata

augf = AnalyticUUIDGeneratorFactory(comm)
aug = augf.create()

for section in lun(comm.sectionList):
    for sentence in lun(section.sentenceList):
        sentence.tokenization.tokenTaggingList.append(TokenTagging(
            uuid=next(aug),
            metadata=AnnotationMetadata(
                tool='Simple POS',
                timestamp=now_timestamp(),
                kBest=1
            ),
            taggingType='POS',
            taggedTokenList=[
                TaggedToken(
                    tokenIndex=original.tokenIndex,
                    tag=original.tag.split('-')[-1][:2],
                )
                for original
                in get_tagged_tokens(sentence.tokenization, 'POS')
            ]
        ))

Here we used AnalyticUUIDGeneratorFactory, which creates generators of Concrete UUID objects (see Working with UUIDs for more information). We also used now_timestamp, which returns a Concrete timestamp representing the current time. But now how do we know which tagging is ours? Each annotation's metadata contains a tool name, and we can use it to distinguish between competing annotations:

from concrete.util import lun, get_tagged_tokens

for section in lun(comm.sectionList):
    print('* section')
    for sentence in lun(section.sentenceList):
        print('  + sentence')
        token_tag_pairs = zip(
            get_tagged_tokens(sentence.tokenization, 'POS', tool='Serif: part-of-speech'),
            get_tagged_tokens(sentence.tokenization, 'POS', tool='Simple POS')
        )
        for (old_tag, new_tag) in token_tag_pairs:
            print('    - ' + old_tag.tag + ' -> ' + new_tag.tag)

The output shows our new part-of-speech tagging has a smaller, simpler set of possible values:

* section
* section
  + sentence
    - NNP -> NN
    - NNP -> NN
    - , -> ,
    - NN -> NN
    - IN -> IN
    - NNP -> NN
    - NNP -> NN
    - , -> ,
    - VBD -> VB
    - NN -> NN
    - IN -> IN
    - DT -> DT
    - NN -> NN
    - IN -> IN
    - DATE-NNP -> NN
    - JJ -> JJ
    - , -> ,
    - CD -> CD
    - . -> .
* section
  + sentence
    - PRP -> PR
    - VBD -> VB
    - . -> .
* section
  + sentence
    - NNP -> NN
    - POS -> PO
    - NN -> NN
    - NNP -> NN
    - VBD -> VB
    - NN -> NN
    - . -> .
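
The simplification applied when we built the new tagging boils down to one string expression, which can be checked on its own without a Communication. The helper name simplify_tag is hypothetical, introduced here only for illustration:

```python
def simplify_tag(tag):
    """Keep the part after the last hyphen, then truncate to two characters."""
    return tag.split('-')[-1][:2]


# A few of the original tags from the example Communication.
for tag in ['NNP', 'DATE-NNP', 'VBD', 'PRP', ',']:
    print(tag, '->', simplify_tag(tag))
```

The truncation is deliberately crude: it collapses NNP and NN into one class, but it also yields values such as PR and PO that are not standard tags, as the output above shows.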

Finally, let's write our newly annotated Communication back to disk:

from concrete.util import CommunicationWriter

with CommunicationWriter('serif_dog-bites-man.concrete') as writer:
    writer.write(comm)

Note there are many other useful classes and functions in the concrete.util library. See the API reference in the online documentation for details.

concrete-inspect.py

Use concrete-inspect.py to quickly explore the contents of a Communication from the command line. concrete-inspect.py and other scripts are installed to the path along with the concrete-python library.

--id

Run the following command to print the unique ID of our modified example Communication:

concrete-inspect.py --id serif_dog-bites-man.concrete

Output:

tests/testdata/serif_dog-bites-man.xml

--metadata

Use --metadata to print the stored annotations along with their tool names:

concrete-inspect.py --metadata serif_dog-bites-man.concrete

Output:

Communication:  concrete_serif v3.10.1pre

  Tokenization:  Serif: tokens

    Dependency Parse:  Stanford

    Parse:  Serif: parse

    TokenTagging:  Serif: names
    TokenTagging:  Serif: part-of-speech
    TokenTagging:  Simple POS

  EntityMentionSet #0:  Serif: names
  EntityMentionSet #1:  Serif: values
  EntityMentionSet #2:  Serif: mentions

  EntitySet #0:  Serif: doc-entities
  EntitySet #1:  Serif: doc-values

  SituationMentionSet #0:  Serif: relations
  SituationMentionSet #1:  Serif: events

  SituationSet #0:  Serif: relations
  SituationSet #1:  Serif: events

  CommunicationTagging:  lda
  CommunicationTagging:  urgency

--sections

Use --sections to print the text of the Communication, broken out by section:

concrete-inspect.py --sections serif_dog-bites-man.concrete

Output:

Section 0 (0ab68635-c83d-4b02-b8c3-288626968e05)[kind: SectionKind.PASSAGE], from 81 to 82:



Section 1 (54902d75-1841-4d8d-b4c5-390d4ef1a47a)[kind: SectionKind.PASSAGE], from 85 to 162:

John Smith, manager of ACMÉ INC, was bit by a dog on March 10th, 2013.
</P>


Section 2 (7ec8b7d9-6be0-4c62-af57-3c6c48bad711)[kind: SectionKind.PASSAGE], from 165 to 180:

He died!
</P>


Section 3 (68da91a1-5beb-4129-943d-170c40c7d0f7)[kind: SectionKind.PASSAGE], from 183 to 228:

John's daughter Mary expressed sorrow.
</P>

--entities

Use --entities to print the named entities detected in the Communication:

concrete-inspect.py --entities serif_dog-bites-man.concrete

Output:

Entity Set 0 (Serif: doc-entities):
  Entity 0-0:
      EntityMention 0-0-0:
          tokens:     John Smith
          text:       John Smith
          entityType: PER
          phraseType: PhraseType.NAME
      EntityMention 0-0-1:
          tokens:     John Smith , manager of ACMÉ INC ,
          text:       John Smith, manager of ACMÉ INC,
          entityType: PER
          phraseType: PhraseType.APPOSITIVE
          child EntityMention #0:
              tokens:     John Smith
              text:       John Smith
              entityType: PER
              phraseType: PhraseType.NAME
          child EntityMention #1:
              tokens:     manager of ACMÉ INC
              text:       manager of ACMÉ INC
              entityType: PER
              phraseType: PhraseType.COMMON_NOUN
      EntityMention 0-0-2:
          tokens:     manager of ACMÉ INC
          text:       manager of ACMÉ INC
          entityType: PER
          phraseType: PhraseType.COMMON_NOUN
      EntityMention 0-0-3:
          tokens:     He
          text:       He
          entityType: PER
          phraseType: PhraseType.PRONOUN
      EntityMention 0-0-4:
          tokens:     John
          text:       John
          entityType: PER.Individual
          phraseType: PhraseType.NAME

  Entity 0-1:
      EntityMention 0-1-0:
          tokens:     ACMÉ INC
          text:       ACMÉ INC
          entityType: ORG
          phraseType: PhraseType.NAME

  Entity 0-2:
      EntityMention 0-2-0:
          tokens:     John 's daughter Mary
          text:       John's daughter Mary
          entityType: PER.Individual
          phraseType: PhraseType.NAME
          child EntityMention #0:
              tokens:     Mary
              text:       Mary
              entityType: PER
              phraseType: PhraseType.OTHER
      EntityMention 0-2-1:
          tokens:     daughter
          text:       daughter
          entityType: PER
          phraseType: PhraseType.COMMON_NOUN


Entity Set 1 (Serif: doc-values):
  Entity 1-0:
      EntityMention 1-0-0:
          tokens:     March 10th , 2013
          text:       March 10th, 2013
          entityType: TIMEX2.TIME
          phraseType: PhraseType.OTHER

--mentions

Use --mentions to show the named entity mentions in the Communication, annotated on the text:

concrete-inspect.py --mentions serif_dog-bites-man.concrete

Output:

<ENTITY ID=0><ENTITY ID=0>John Smith</ENTITY> , <ENTITY ID=0>manager of <ENTITY ID=1>ACMÉ INC</ENTITY></ENTITY> ,</ENTITY> was bit by a dog on <ENTITY ID=3>March 10th , 2013</ENTITY> .

<ENTITY ID=0>He</ENTITY> died !

<ENTITY ID=2><ENTITY ID=0>John</ENTITY> 's <ENTITY ID=2>daughter</ENTITY> Mary</ENTITY> expressed sorrow .
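
If you want to post-process this inline markup, the nesting can be unwound with a small stack. The following is a sketch that assumes the tag syntax is exactly what appears in the output above; the function name extract_mentions is hypothetical and not part of concrete-python.

```python
import re

# Matches an opening tag (capturing the entity ID) or a closing tag.
TAG = re.compile(r'<ENTITY ID=(\d+)>|</ENTITY>')


def extract_mentions(line):
    """Return (entity_id, text) pairs in the order the mentions close."""
    mentions = []
    stack = []  # currently open mentions: (entity_id, list of text pieces)
    pos = 0
    for match in TAG.finditer(line):
        text = line[pos:match.start()]
        # Text between two tags belongs to every mention that is still open.
        for _, pieces in stack:
            pieces.append(text)
        pos = match.end()
        if match.group(1) is not None:
            stack.append((int(match.group(1)), []))
        else:
            entity_id, pieces = stack.pop()
            mentions.append((entity_id, ''.join(pieces).strip()))
    return mentions


line = ("<ENTITY ID=2><ENTITY ID=0>John</ENTITY> 's "
        "<ENTITY ID=2>daughter</ENTITY> Mary</ENTITY> expressed sorrow .")
for entity_id, text in extract_mentions(line):
    print(entity_id, repr(text))
```

Inner mentions close first, so they are reported before the mention that contains them.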

--situations

Use --situations to show the situations detected in the Communication:

concrete-inspect.py --situations serif_dog-bites-man.concrete

Output:

Situation Set 0 (Serif: relations):

Situation Set 1 (Serif: events):
  Situation 1-0:
      situationType:    Life.Die

--treebank

Use --treebank to show constituency parse trees of the sentences in the Communication:

concrete-inspect.py --treebank serif_dog-bites-man.concrete

Output:

(S (NP (NPP (NNP john)
            (NNP smith))
       (, ,)
       (NP (NPA (NN manager))
           (PP (IN of)
               (NPP (NNP acme)
                    (NNP inc))))
       (, ,))
   (VP (VBD was)
       (NP (NPA (NN bit))
           (PP (IN by)
               (NP (NPA (DT a)
                        (NN dog))
                   (PP (IN on)
                       (NP (DATE (DATE-NNP march)
                                 (JJ 10th))
                           (, ,)
                           (NPA (CD 2013))))))))
   (. .))


(S (NPA (PRP he))
   (VP (VBD died))
   (. !))


(S (NPA (NPPOS (NPP (NNP john))
               (POS 's))
        (NN daughter)
        (NPP (NNP mary)))
   (VP (VBD expressed)
       (NPA (NN sorrow)))
   (. .))
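
The bracketed trees are plain text, so they are easy to load into other tools. Below is a minimal sketch of a recursive-descent reader for one such tree, using only the standard library. The names parse_tree and leaves are hypothetical, and the reader assumes tokens never contain parentheses or whitespace.

```python
import re

# A token is an opening bracket, a closing bracket, or a run of other characters.
TOKEN = re.compile(r'\(|\)|[^\s()]+')


def parse_tree(text):
    """Parse one bracketed tree into nested lists of the form [label, child, ...]."""
    tokens = TOKEN.findall(text)
    pos = 0

    def parse():
        nonlocal pos
        if tokens[pos] == '(':
            pos += 1                  # consume '('
            node = [tokens[pos]]      # constituent or part-of-speech label
            pos += 1
            while tokens[pos] != ')':
                node.append(parse())
            pos += 1                  # consume ')'
            return node
        word = tokens[pos]            # a leaf: the token itself
        pos += 1
        return word

    return parse()


def leaves(node):
    """Collect the tokens at the leaves of a parsed tree, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1:] for word in leaves(child)]


tree = parse_tree("(S (NPA (PRP he)) (VP (VBD died)) (. !))")
print(leaves(tree))  # prints ['he', 'died', '!']
```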

Other options

Use --ner, --pos, --lemmas, and --dependency (together or independently) to show the respective token-level information in a CoNLL-like format, and use --text to print the text of the Communication, as described in a previous section.

Run concrete-inspect.py --help to show a detailed help message explaining the options discussed above and others. All concrete-python scripts have such help messages.

create-comm.py

Use create-comm.py to generate a simple Communication from a text file. For example, create a file called history-of-the-world.txt containing the following text:

The dog ran .
The cat jumped .

The dolphin teleported .

Then run the following command to convert it to a Concrete Communication, creating Sections, Sentences, and Tokens based on whitespace:

create-comm.py --annotation-level token history-of-the-world.txt history-of-the-world.concrete

Use concrete-inspect.py as shown previously to verify the structure of the Communication:

concrete-inspect.py --sections history-of-the-world.concrete

Output:

Section 0 (a188dcdd-1ade-be5d-41c4-fd4d81f71685)[kind: passage], from 0 to 30:
The dog ran .
The cat jumped .

Section 1 (a188dcdd-1ade-be5d-41c4-fd4d81f7168a)[kind: passage], from 32 to 57:
The dolphin teleported .

Other scripts

concrete-python provides a number of other scripts, including but not limited to:

  • concrete2json.py: reads in a Concrete Communication and prints a JSON version of the Communication to stdout. The JSON is "pretty printed" with indentation and whitespace, which makes the JSON easier to read and to use for diffs.
  • create-comm-tarball.py: like create-comm.py but for multiple files. Reads in a tar.gz archive of text files, parses them into sections and sentences based on whitespace, and writes them back out as Concrete Communications in another tar.gz archive.
  • fetch-client.py: connects to a FetchCommunicationService, retrieves one or more Communications (as specified on the command line), and writes them to disk.
  • fetch-server.py: implements FetchCommunicationService, serving Communications to clients from a file or directory of Communications on disk.
  • search-client.py: connects to a SearchService, reading queries from the console and printing out results as Communication ids in a loop.
  • validate-communication.py: reads in a Concrete Communication file and prints out information about any invalid fields. This script is a command-line wrapper around the functionality in the concrete.validate library.

Use the --help flag for details about the scripts' command line arguments.

Working with UUIDs

Each UUID object contains a single string, uuidString, which can be used as a universally unique identifier for the object the UUID is attached to. The AnalyticUUIDGeneratorFactory produces UUID generators for a Communication, one for each analytic (tool) used to process the Communication. In contrast to the Python uuid library, the AnalyticUUIDGeneratorFactory yields UUIDs that have common prefixes within a Communication and within annotations produced by the same analytic, which lets common compression algorithms store the UUIDs in each Communication much more efficiently. See the AnalyticUUIDGeneratorFactory class in the API reference in the online documentation for more information.

Note that uuidString is generated by a random process, so running the same code twice will result in two completely different sets of identifiers. Concretely, if you run a parser to produce a part-of-speech TokenTagging for each Tokenization in a Communication, save the modified Communication, then run the parser again on the same original Communication, you will get two different identifiers for each TokenTagging, even though the contents of each pair of TokenTaggings (the part-of-speech tags) may be identical.
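
The compression benefit of shared prefixes can be seen with the standard library alone. The two section UUIDs in the create-comm.py output earlier differ only in their final characters; the sketch below imitates that with a made-up scheme (a fixed 24-character prefix plus 12 random hex digits) and compares it against fully random UUIDs. The prefix layout and the helper names are assumptions for illustration, not the library's actual algorithm.

```python
import random
import uuid
import zlib

rng = random.Random(0)  # fixed seed so the comparison is repeatable


def random_uuid():
    """A fully random UUID string, similar in spirit to uuid.uuid4()."""
    return str(uuid.UUID(int=rng.getrandbits(128), version=4))


def prefixed_uuid(prefix):
    """Hypothetical scheme: shared 24-character prefix, random last 12 hex digits."""
    return prefix + '%012x' % rng.getrandbits(48)


n = 1000
independent = '\n'.join(random_uuid() for _ in range(n))
shared_prefix = random_uuid()[:24]
prefixed = '\n'.join(prefixed_uuid(shared_prefix) for _ in range(n))

# Compress both lists and compare the resulting sizes in bytes.
independent_size = len(zlib.compress(independent.encode('ascii')))
prefixed_size = len(zlib.compress(prefixed.encode('ascii')))
print(independent_size, prefixed_size)
```

The second number should come out clearly smaller, because the compressor can replace each repeated prefix with a short back-reference.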

Validating Concrete Communications

The Python version of the Thrift library does not perform any validation of Thrift objects. You should use the validate_communication() function after reading and before writing a Concrete Communication:

from concrete.util import read_communication_from_file
from concrete.validate import validate_communication

comm = read_communication_from_file('tests/testdata/serif_dog-bites-man.concrete')

# Returns True|False, logs details using Python stdlib 'logging' module
validate_communication(comm)

Thrift fields have three levels of requiredness:

  • explicitly labeled as required
  • explicitly labeled as optional
  • no requiredness label given ("default required")

Other Concrete tools will raise an exception if a required field is missing on deserialization or serialization, and will raise an exception if a "default required" field is missing on serialization. By default, concrete-python does not perform any validation of Thrift objects on serialization or deserialization. The Python Thrift classes do provide shallow validate() methods, but they only check for explicitly required fields (not "default required" fields) and do not validate nested objects.

The validate_communication() function recursively checks a Communication object for required fields and performs additional checks for UUID mismatches.
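
The difference between a shallow check and a recursive one can be illustrated without Thrift. In the sketch below every class and function is a hypothetical stand-in (none of them come from concrete-python), and "required" is simply a tuple of field names that must not be None.

```python
class Node:
    """Hypothetical stand-in for a generated Thrift struct."""
    required = ()  # names of fields that must not be None

    def __init__(self, **fields):
        self.__dict__.update(fields)


class FakeUUID(Node):
    required = ('uuidString',)


class FakeSection(Node):
    required = ('uuid', 'kind')


class FakeComm(Node):
    required = ('id', 'uuid')


def shallow_missing(obj):
    """Names of required fields that are missing on obj itself."""
    return [name for name in obj.required if getattr(obj, name, None) is None]


def deep_missing(obj, path='comm'):
    """Like shallow_missing, but also descends into nested structs and lists."""
    problems = ['%s.%s' % (path, name) for name in shallow_missing(obj)]
    for name, value in sorted(vars(obj).items()):
        is_list = isinstance(value, list)
        for i, child in enumerate(value if is_list else [value]):
            if isinstance(child, Node):
                suffix = '[%d]' % i if is_list else ''
                problems.extend(deep_missing(child, '%s.%s%s' % (path, name, suffix)))
    return problems


comm = FakeComm(
    id='doc-1',
    uuid=FakeUUID(uuidString='a188dcdd-1ade-be5d-41c4-fd4d81f71685'),
    # The nested section's UUID has no uuidString set.
    sectionList=[FakeSection(uuid=FakeUUID(uuidString=None), kind='passage')],
)

print(shallow_missing(comm))  # prints []
print(deep_missing(comm))     # prints ['comm.sectionList[0].uuid.uuidString']
```

The shallow check passes because the top-level object has all of its own required fields. Only the recursive walk finds the problem buried two levels down, which is the kind of gap validate_communication() is meant to close.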
