ami section


Splits a document in a CTree into sections, based on:

  • tags from JATS, etc.
  • text labels in document
  • @class attributes

It also extracts tables and figures.


  • fulltext.xml (in JATS)
  • scholarly.html (if semantic and structured).


  • Sections are difficult to extract from PDF or from many raw HTML files.



By default, ami section uses built-in heuristics and common section names:

 ami -p <cproject> section

This produces the output shown below.


Splits XML files into sections using XPath.
Creates names from the titles of sections (or 'elem<num>.xml' if it cannot).
Optionally writes HTML (slow) using a specified stylesheet:
    --sections ALL --html nlm2html
    --sections ABSTRACT ACK_FUND --write false      // not sure this works

    --forcemake --extract table fig --summary figure table         // this seems to create sections OK, use this?
      --boldsections      convert paras with bold first sentence/phrase into subsections.
                          e.g. <sec id='s2.1'><p><bold>Extraction of Oils.</bold>. more text...</p></sec>
                          =>  <sec id='s2.1'><sec id='s2.1.1'><title>Extraction of Oils.</title>. <p>more text...

      --extract           extract float elements to subdirectory (default: table, fig, supplementary)
                            Default: [table, fig, supplementary]
  -h, --help              Show this help message and exit.
      --html=<xsltName>   convert sections to HTML using stylesheet (convention as in --transform).
                            Recommended: nlm2html; if omitted, no HTML is written. As of 201909 this is
                            very slow since XSLT seems to be slow; it seems to be size related
                            (references can take 1 sec)
      --sections          sections to extract (uses JATSSectionTagger)
                          if none, lists Tagger tags
                          ALL selects all tags in Tagger
                          AUTO creates hierarchical tree based on JATS and heuristics (default)
                            Default: [AUTO]
      --sectiontype       Type of section (XML or HTML), default XML. Probably only used in development
                            Default: XML
      --summary           create summary files for sections
                            Default: []
  -V, --version           Print version information and exit.
      --write             write section files (may be customised later)


 tree -h -L 2 PMC6808808/sections/
├── [ 128]  0_front
│   ├── [ 256]  0_journal-meta
│   └── [ 736]  1_article-meta
├── [  96]  1_back
│   └── [ 197]  0_notes.xml
└── [  64]  2_floats-group


  • All sections are pre-numbered to avoid name collisions (e.g. article-meta below contains three pub-date records). The numbers reflect the reading (document) order. This document has no body (it is an abstract).
  • The bracketed [ddd] values show sizes in bytes.

front section

├── [ 256]  0_journal-meta
│   ├── [ 113]  0_journal-id.xml
│   ├── [ 117]  1_journal-id.xml
│   ├── [ 102]  2_journal-id.xml
│   ├── [ 151]  3_journal-title-group.xml
│   ├── [  80]  4_issn.xml
│   └── [ 162]  5_publisher.xml
└── [ 736]  1_article-meta
    ├── [  94]  0_article-id.xml
    ├── [ 164]  10_pub-date.xml
    ├── [ 144]  11_pub-date.xml
    ├── [  60]  12_volume.xml
    ├── [  64]  13_issue.xml
    ├── [  90]  14_issue-title.xml
    ├── [  61]  15_fpage.xml
    ├── [  61]  16_lpage.xml
    ├── [1005]  17_permissions.xml
    ├── [ 125]  18_self-uri.xml
    ├── [2.8K]  19_abstract.xml
    ├── [ 109]  1_article-id.xml
    ├── [  87]  20_counts.xml
    ├── [ 105]  2_article-id.xml
    ├── [ 286]  3_article-categories.xml
    ├── [ 209]  4_title-group.xml
    ├── [1.3K]  5_contrib-group.xml
    ├── [ 159]  6_aff.xml
    ├── [ 180]  7_aff.xml
    ├── [ 182]  8_aff.xml
    └── [ 127]  9_pub-date.xml

These are all the tagged sections in the front partition. 0_front has two children:

  • 0_journal-meta: metadata about the journal (its id, publisher, journal title, etc.).
  • 1_article-meta: metadata about the article (its dates, findability, ids, abstract, etc.). Authors are in contrib-group.

Note that repeated tags (e.g. journal-id) have unique pre-numbers.

body section

Using a later article:

tree -h -L 2 PMC6994851/sections/
├── [ 128]  0_front
│   ├── [ 224]  0_journal-meta
│   └── [ 736]  1_article-meta
├── [ 256]  1_body
│   ├── [ 256]  0_introduction
│   ├── [ 256]  1_methods_and_methodology
│   ├── [ 224]  2_results
│   ├── [ 320]  3_discussion
│   ├── [ 192]  4_conclusion_and_recommenda
│   └── [ 224]  5_declarations
├── [ 128]  2_back
│   ├── [1.1K]  0_ref-list
│   └── [ 403]  1_ack.xml
└── [ 288]  3_floats-group
    ├── [1.8K]  0_table 1.xml
    ├── [ 323]  1_figure 1.xml
    ├── [ 316]  2_figure 2.xml
    ├── [ 317]  3_figure 3.xml
    ├── [1.3K]  4_table 2.xml
    ├── [ 314]  5_figure 4.xml
    └── [ 316]  6_figure 5.xml

The body has non-standard section titles, but they clearly map onto our proposed standard sections:

  • introduction
  • methods
  • results
  • discussion

There is another section ("conclusions"), which is often conflated with discussion. The final section is complex:

    └── [ 224]  5_declarations
│       ├── [  69]  0_title.xml
│       ├── [ 224]  1_author_contribution_state
│       │   ├── [  86]  0_title.xml
│       │   ├── [ 146]  1_p.xml
│       │   ├── [ 153]  2_p.xml
│       │   ├── [ 161]  3_p.xml
│       │   └── [ 196]  4_p.xml
│       ├── [ 128]  2_funding_statement
│       │   ├── [  74]  0_title.xml
│       │   └── [ 184]  1_p.xml
│       ├── [ 128]  3_competing_interest_statem
│       │   ├── [  85]  0_title.xml
│       │   └── [ 104]  1_p.xml
│       └── [ 128]  4_additional_information
│           ├── [  79]  0_title.xml
│           └── [ 114]  1_p.xml

Some are clearly classifiable ("funding statement"); others are not ("additional information").
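
The directory names in these listings look as if they are derived from section titles by lower-casing, replacing each non-alphanumeric character with an underscore and truncating to 25 characters (e.g. 4_conclusion_and_recommenda). The Python sketch below reproduces that pattern. The rule is inferred from the listings, not taken from the ami source, and the function name section_dirname is hypothetical.

```python
def section_dirname(index: int, title: str, max_len: int = 25) -> str:
    """Build a directory name such as '4_conclusion_and_recommenda'.

    Assumption: the 25-character limit and the underscore substitution
    are inferred from the example listings, not from the ami code.
    """
    # Lower-case, then replace anything that is not a letter or digit.
    cleaned = "".join(c if c.isalnum() else "_" for c in title.strip().lower())
    return f"{index}_{cleaned[:max_len]}"


if __name__ == "__main__":
    print(section_dirname(4, "Conclusion and Recommendations"))
    print(section_dirname(1, "Methods and methodology"))
```

A helper like this may be useful for predicting where a given section title will end up when scripting over many CTrees.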


├── [ 128]  2_back
│   ├── [1.1K]  0_ref-list
│   │   ├── [  67]  0_title.xml
│   │   ├── [ 997]  10_ref.xml
│   │   ├── [1.1K]  11_ref.xml
│   │   ├── [ 642]  6_ref.xml
│   │   ├── [1.0K]  7_ref.xml
│   │   ├── [1.1K]  8_ref.xml
│   │   └── [ 786]  9_ref.xml
│   └── [ 403]  1_ack.xml

(Note that tree sorts entries in lexical order, so 10_ref.xml appears before 6_ref.xml.) The ack section contains acknowledgements and may include funder information.
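
Because the numeric prefixes carry the document order, a script that needs the sections in reading order can sort on the prefix instead of the whole name. This is a minimal sketch; the function names are hypothetical and it assumes every entry follows the <number>_<name> pattern seen in the listings.

```python
import re
from pathlib import Path


def numeric_prefix(name: str) -> int:
    """Return the leading integer of names such as '10_ref.xml'."""
    match = re.match(r"(\d+)_", name)
    # Entries without a numeric prefix sort last (hypothetical fallback).
    return int(match.group(1)) if match else 10**9


def in_document_order(names):
    """Sort sibling section names by their numeric prefix."""
    return sorted(names, key=numeric_prefix)


def list_sections(directory: str):
    """List the children of one sections directory in document order."""
    return in_document_order(p.name for p in Path(directory).iterdir())


if __name__ == "__main__":
    names = ["0_title.xml", "10_ref.xml", "11_ref.xml", "6_ref.xml", "9_ref.xml"]
    print(in_document_order(names))
```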


Floats are chunks that do not fit into the reading order, normally tables and figures. ami section moves floats to a special area, floats-group, although this is often already provided by JATS.

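The floats-group listing shown earlier for PMC6994851 uses names such as "0_table 1.xml" and "1_figure 1.xml". The sketch below groups such names by kind. The name pattern is an assumption based on that single listing, and group_floats is a hypothetical helper, not part of ami.

```python
import re
from collections import defaultdict

# Assumed pattern: <order>_<kind> <number>.xml, e.g. "0_table 1.xml".
FLOAT_NAME = re.compile(r"^(\d+)_([A-Za-z]+) (\d+)\.xml$")


def group_floats(filenames):
    """Group float file names by kind ('table', 'figure', ...)."""
    groups = defaultdict(list)
    for name in filenames:
        match = FLOAT_NAME.match(name)
        if match is None:
            continue  # skip names that do not follow the assumed pattern
        groups[match.group(2)].append(name)
    return dict(groups)


if __name__ == "__main__":
    print(group_floats(["0_table 1.xml", "1_figure 1.xml", "4_table 2.xml"]))
```
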
Tester 1: Vaishali Arora

  • Sectioning of the dataset is usually done for greater precision.

  • To download a corpus of 50 articles in XML format into the directory project, open the Command Prompt and run:

    getpapers -q "Viral epidemics" -o project -f mycorpus/log.txt -k 50 -x -p

  • To divide the content of your papers into sections, open the Command Prompt again and run:

    ami -p project section

  • This creates a sections subfolder in the folder of each paper in your project directory.

  • Open the 'sections' folder to find subfolders such as front, body, back and floats-group.

  • This completes the sectioning part of your CProject.


  • Make sure that there are no spaces in your directory name, as they will break the path in your command, e.g. use My_project, not My project.

Test 2

Beta tester: Ambreen H

An attempt was made to split all full-length papers (in XML format) within a directory into sections.

  1. 20 papers were downloaded into an output directory in XML format using getpapers : getpapers -q "viral epidemics" -o sectioning\project -k 20 -x -p

  2. 17 full-length articles could be retrieved using this query

  3. The directory project was next used as an input directory for ami section : ami -p sectioning\project section

  4. The command ran successfully with a few warnings. These warnings were generated for all papers where clear element tags were unavailable, e.g. papers with no subsections for introduction, methodology, etc.

Generic values (AMISectionTool)
-v to see generic values

Specific values (AMISectionTool)
xslt                    null
boldSections            false
extract                 [table, fig, supplementary]
sectionList             [AUTO]
sectiontype             XML
summaryList             []
write                   true

AMISectionTool cTree: PMC3561042
AMISectionTool cTree: PMC6517453

no class for: journal-subtitle
0    [main] WARN  org.contentmine.norma.sections.JATSFactory  - Unknown JATS Span journal-subtitle
0 [main] WARN org.contentmine.norma.sections.JATSFactory  - Unknown JATS Span journal-subtitle
JATSElement untagged element: isbn
JATSElement untagged element: isbn
AMISectionTool cTree: PMC7120695
JATSElement untagged element: isbn
AMISectionTool cTree: PMC7300792 ...

  5. Results:
  • Each full-length research paper was successfully sectioned, as indicated above.
  • Review articles were sectioned according to the subheadings within the article.
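
The warnings in the log excerpt above can be tallied per CTree when checking a large run. The sketch below assumes that each "untagged element" message belongs to the most recent "AMISectionTool cTree:" line, which matches the excerpt but is not guaranteed by ami; the function name is hypothetical.

```python
import re
from collections import Counter, defaultdict

CTREE_LINE = re.compile(r"^AMISectionTool cTree: (\S+)")
UNTAGGED_LINE = re.compile(r"^JATSElement untagged element: (\S+)")


def untagged_by_ctree(log_lines):
    """Count 'untagged element' messages under the preceding cTree line."""
    counts = defaultdict(Counter)
    current = None
    for raw in log_lines:
        line = raw.strip()
        ctree = CTREE_LINE.match(line)
        if ctree:
            current = ctree.group(1)
            continue
        untagged = UNTAGGED_LINE.match(line)
        if untagged and current is not None:
            counts[current][untagged.group(1)] += 1
    # Convert to plain dicts for easy printing or serialisation.
    return {name: dict(c) for name, c in counts.items()}


if __name__ == "__main__":
    sample = [
        "AMISectionTool cTree: PMC6517453",
        "JATSElement untagged element: isbn",
        "JATSElement untagged element: isbn",
        "AMISectionTool cTree: PMC7120695",
        "JATSElement untagged element: isbn",
    ]
    print(untagged_by_ctree(sample))
```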


ami -p <project> section --hypertree mincount=2

generates the files hypertree.html and hypertree.xml as children of the CProject. This aggregates all sections with common titles and adds counts. Preliminary results for 950 "viral epidemics" CTrees (without the documents) are:

(some common titles - see link for better display)

cTree 950
eupmc_result.json 950
results 950
search 950
country 950
drugs 950
funders 950
word 806
frequencies 806
sections 806
floats-group 806
front 806
article-meta 806
UNKspan 4
journal-meta 806
body 782
introduction 337
protein__and_peptide_base 4
discussion 206
limitations 4
results 184
contribution_of_all_virus 3
deficiency_of_hif_1α_in_a 3
phylogenetic_analysis_of 3
symbolic_transfer_entropy 3
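
The kind of aggregation shown above can be approximated outside ami by counting section directory names across CTrees. This is only a sketch under stated assumptions: it strips the numeric prefix, counts each name once per CTree and drops names below mincount. Whether ami counts per CTree or per occurrence is not stated here, and title_counts is a hypothetical name.

```python
import re
from collections import Counter
from pathlib import Path

PREFIX = re.compile(r"^\d+_")


def title_counts(project_dir, mincount=2):
    """Count how many cTrees contain each section name under sections/."""
    counts = Counter()
    for ctree in Path(project_dir).iterdir():
        sections = ctree / "sections"
        if not sections.is_dir():
            continue  # skip files and cTrees that were not sectioned
        # Use a set so that each name is counted once per cTree.
        titles = {PREFIX.sub("", p.name) for p in sections.rglob("*") if p.is_dir()}
        counts.update(titles)
    return {title: n for title, n in counts.items() if n >= mincount}


if __name__ == "__main__":
    import sys

    # Usage: python title_counts.py <project_dir>
    for title, n in sorted(title_counts(sys.argv[1]).items(), key=lambda kv: -kv[1]):
        print(title, n)
```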

Tester: Lakshmi Devi Priya

Hypertree creation was tested on the disease corpus using the command:

ami -v -p disease/1-part section --summary all --hypertree mincount=2

The files hypertree.html and hypertree.xml were created.

