petermr edited this page Aug 18, 2020 · 11 revisions

ami summary

Aggregates files in the CProject hypertree into a toplevel tree. Useful for collecting all sections of a given type in one place.

picocli

ami summary --help
Usage: ami summary [OPTIONS]
Description
===========
Summarizes the CTree files into a single toplevel CProject directory tree.Used to be hardcoded , but now can be
controlled by glob
Options
=======
      --dictionary=<dictionaryList>...
                  dictionaries to summarize. Probably OBSOLETE
                    Default: []
      --gene[=<geneList>...]
                  genes to summarize. OBSOLETE
                    Default: human
      --glob=<globList>[,<globList>...]
                  files to summarize (as glob)
                    Default: []
  -h, --help      Show this help message and exit.
      --output-types=<outputTypes>...
                  output type/s. Not sure how useful this is. `table` creates a CSV table
                    Default: []
      --species[=<speciesList>...]
                  species to summarize. OBSOLETE
                    Default: binomial
  -V, --version   Print version information and exit.
      --word      analyze word frequencies. Probably OBSOLETE.

examples

example CTree

A typical ami CTree is shown below (from src/test/resources/ami/battery10); it has many subtrees created by ami commands. The listing is "snipped" with === markers into separate chunks, any of which could be summarized. ami iterates over many CTrees, each of which differs in the details of hierarchy, numbering and content. Note that the leading digits (as in 1_body) keep the sections in order, as otherwise they may be scrambled by the directory listing.

PMC3893646/
├── eupmc_result.json
===
├── fulltext.pdf
===
├── fulltext.xml
===
├── pdfimages
│   ├── image.2.2.304_553.69_273
│   │   ├── image.2.2.304_553
│   │   │   ├── image.2.2.304_553
│   │   │   │   └── hocr
===................................. OCR from image
│   │   │   │       ├── hocr.html
===................................. OCR from image (probably best)
│   │   │   │       └── hocr.svg
===................................. raw image
│   │   │   └── image.2.2.304_553.png 
│   │   ├── images.html
│   │   ├── octree
│   │   │   ├── binary.png
===................................. colour channels for image
│   │   │   ├── channel.07a957.png
...
│   │   │   ├── channel.fcfcfc.png
│   │   │   ├── channels.html.       ... browse
│   │   │   ├── histogram.svg
│   │   │   └── octree.png
│   │   ├── raw
│   │   │   └── hocr
│   │   ├── raw.annot.html
│   │   ├── raw.png
│   │   └── raw_o8.png
...
│   └── images.html
├── results
│   ├── search
│   │   ├── country
===................................. ami search results (entity in context)
│   │   │   └── results.xml
│   │   ├── elements
│   │   │   └── results.xml
│   │   └── funders
│   │       └── results.xml
===................................ word frequencies
│   └── word
│       └── frequencies
│           ├── results.html
│           └── results.xml
===................................ fulltext in HTML
├── scholarly.html
├── search.country.count.xml       .... probably obsolete
├── search.country.snippets.xml
├── search.elements.count.xml
├── search.elements.snippets.xml
├── search.funders.count.xml
├── search.funders.snippets.xml
├── sections
===................................ bibliography (the directory names are controlled) 
│   ├── 0_front
│   │   ├── 0_journal-meta
│   │   │   ├── 0_journal-id.xml
│   │   │   ├── 1_journal-id.xml
│   │   │   ├── 2_journal-title-group.xml
│   │   │   ├── 3_issn.xml
│   │   │   └── 4_publisher.xml
│   │   └── 1_article-meta
│   │       ├── 0_article-id.xml
│   │       ├── 10_elocation-id.xml
│   │       ├── 11_history.xml
│   │       ├── 12_permissions.xml
│   │       ├── 13_abstract.xml
│   │       ├── 1_article-id.xml
│   │       ├── 2_article-id.xml
│   │       ├── 3_article-categories.xml
│   │       ├── 4_title-group.xml
│   │       ├── 5_contrib-group.xml
│   │       ├── 6_author-notes.xml
│   │       ├── 7_pub-date.xml
│   │       ├── 8_pub-date.xml
│   │       └── 9_volume.xml
===................................ body (sections mainly as HTML) can be snipped anywhere
│   ├── 1_body
│   │   ├── 0_p.xml
│   │   ├── 1_p.xml
│   │   ├── 2_p.xml
===
│   │   ├── 3_results
│   │   │   ├── 0_title.xml
│   │   │   ├── 1_p.xml
│   │   │   ├── 2_p.xml
│   │   │   └── 3_p.xml
===
│   │   ├── 4_discussion
│   │   │   ├── 0_title.xml
│   │   │   ├── 1_p.xml
│   │   │   ├── 2_p.xml
│   │   │   ├── 3_p.xml
│   │   │   ├── 4_p.xml
│   │   │   ├── 5_p.xml
│   │   │   └── 6_p.xml
===................................ lower levels not consistently named
│   │   ├── 5_methods
│   │   │   ├── 0_title.xml
│   │   │   ├── 1_material_and_synthesis
│   │   │   │   ├── 0_title.xml
│   │   │   │   └── 1_p.xml
│   │   │   ├── 2_material_characterization
│   │   │   │   ├── 0_title.xml
│   │   │   │   └── 1_p.xml
│   │   │   └── 3_electrochemical_measureme
│   │   │       ├── 0_title.xml
│   │   │       └── 1_p.xml
===
│   │   ├── 6_author_contributions
│   │   │   ├── 0_title.xml
│   │   │   └── 1_p.xml
===
│   │   └── 7_supplementary_material
│   │       └── 0_title.xml
===................................ backmatter 
│   ├── 2_back
===................................ acknowledgements
│   │   ├── 0_ack.xml
===................................ references
│   │   └── 1_ref-list
│   │       ├── 0_ref.xml
│   │       ├── 10_ref.xml
...
│   │       └── 9_ref.xml
===................................ original container of tables, figures, supplementary
│   ├── 3_floats-group
│   │   └── 6_supplementary-material.xml
===................................ figure captions
│   ├── figures
│   │   ├── figure_1.html
│   │   ├── figure_1.xml
...
│   │   ├── figure_6.html
│   │   ├── figure_6.xml
│   │   └── summary.html
│   └── supplementary
│       ├── summary.html
│       ├── supplementary_6.html
│       └── supplementary_6.xml
===................................ text and images from PDF
├── svg
│   ├── fulltext-page.0.svg
...
│   └── fulltext-page.6.svg
├── word.frequencies.count.xml      ... OBSOLETE?
└── word.frequencies.snippets.xml

67 directories, 237 files
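
The numeric prefixes in the listing above sort lexicographically (10_elocation-id.xml comes before 1_article-id.xml), so code that reads the sections back may want to order them by the leading number instead. A minimal sketch of that idea follows; the class and method names are hypothetical and are not part of ami.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper, not part of ami: orders section names by their numeric prefix. */
class SectionOrderSketch {

    /** Returns the integer before the first underscore, or Integer.MAX_VALUE if there is none. */
    static int leadingNumber(String name) {
        int underscore = name.indexOf('_');
        if (underscore <= 0) {
            return Integer.MAX_VALUE;
        }
        try {
            return Integer.parseInt(name.substring(0, underscore));
        } catch (NumberFormatException e) {
            // Names such as "figure_1.xml" have no numeric prefix; sort them last.
            return Integer.MAX_VALUE;
        }
    }

    /** Sorts by numeric prefix first, then alphabetically for ties. */
    static List<String> inDocumentOrder(List<String> names) {
        Comparator<String> byPrefix = Comparator.comparingInt(SectionOrderSketch::leadingNumber);
        return names.stream()
                .sorted(byPrefix.thenComparing(Comparator.naturalOrder()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> listed = Arrays.asList(
                "0_article-id.xml", "10_elocation-id.xml", "11_history.xml",
                "1_article-id.xml", "9_volume.xml");
        inDocumentOrder(listed).forEach(System.out::println);
    }
}
```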

code

in https://github.com/petermr/ami3/blob/master/src/test/java/org/contentmine/ami/tools/AMISummaryTest.java

illustrative example - flattening

based on extracting methods subtrees

	@Test
	public void summarizeMethods() {
		String root = "methods";
		String project = "summarizeProject/";
		File targetDir = new File("target/"+project);
		CMineTestFixtures.cleanAndCopyDir(TEST_BATTERY10, targetDir);
		String cmd = "-vvv"
				+ " -p "+targetDir
				+ " --output " + "/sections/body/"+root
				+ " summary "
				+ " --glob **/PMC*/sections/*_body/*_methods/**/*_p.xml"
			;
		AMI.execute(cmd);
		AbstractAMITest.compareDirectories(targetDir, expectedDir);
	}

On the commandline this is:

ami -p <targetDir> --output /sections/body/methods \
     summary \
     --glob **/PMC*/sections/*_body/*_methods/**/*_p.xml

This extracts from:

│   ├── 1_body
│   │   ├── 0_p.xml
...
===................................ lower levels not consistently named
│   │   ├── 5_methods
│   │   │   ├── 0_title.xml
│   │   │   ├── 1_material_and_synthesis
│   │   │   │   ├── 0_title.xml
│   │   │   │   └── 1_p.xml
│   │   │   ├── 2_material_characterization
│   │   │   │   ├── 0_title.xml
│   │   │   │   └── 1_p.xml
│   │   │   └── 3_electrochemical_measureme
│   │   │       ├── 0_title.xml
│   │   │       └── 1_p.xml

Generally we split at divs so there is only <title> and <p> content. <p> may contain any normal HTML (<ul>, <span>, various style/format elements). Tables are normally in <floats-group>.

methods is generic (and may include "materials and methods" and other related concepts). But the lower-level titles are unlikely to be general, and are probably not even consistent within the discipline (electrochemistry). So here we simply ignore them and look for any content under methods.

globbing and flattening

 --glob **/PMC*/sections/*_body/*_methods/**/*_p.xml

The glob is relative to the CProject. It works by traversing the whole of the tree under targetDir and matching each file against the glob, which is a filter rather than a template. At each file or directory the globber asks whether it matches. Because the globber doesn't know the context, we need the leading ** ("match any"). Note: this cannot go outside the CProject so it is "safe", but don't choose a CProject of "/" (any more than you would run rm -rf /, which I and many others have done :-)).
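
The "filter rather than a template" idea can be sketched with the standard java.nio.file API: walk every file under the project directory and keep those that a PathMatcher accepts. This is only an illustration under the assumption that java.nio glob semantics are close to those of ami's own globber; the class name, method names and the tiny demo tree are made up.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch, not ami code: a glob used as a filter over a directory walk. */
class GlobFilterSketch {

    /** Visits everything under projectDir and keeps the regular files whose full path matches the glob. */
    static List<Path> filterByGlob(Path projectDir, String glob) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        try (Stream<Path> paths = Files.walk(projectDir)) {
            return paths.filter(Files::isRegularFile)
                    .filter(matcher::matches)
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a small throwaway tree that only imitates the CProject layout (invented names). */
    static Path makeDemoProject() {
        String[] files = {
            "PMC1/sections/1_body/5_methods/1_synthesis/1_p.xml",
            "PMC1/sections/1_body/5_methods/1_synthesis/0_title.xml",
            "PMC1/sections/1_body/3_results/1_p.xml",
            "PMC2/sections/1_body/2_methods/1_cells/1_p.xml",
        };
        try {
            Path root = Files.createTempDirectory("cproject");
            for (String file : files) {
                Path path = root.resolve(file);
                Files.createDirectories(path.getParent());
                Files.createFile(path);
            }
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path project = makeDemoProject();
        String glob = "**/PMC*/sections/*_body/*_methods/**/*_p.xml";
        // Only the *_p.xml files below a *_methods directory are printed.
        filterByGlob(project, glob).forEach(System.out::println);
    }
}
```

Because the matcher sees the whole path of each visited file, the leading ** is what absorbs the directories above the PMC* level.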

The levels of the glob mean:

  • ** ancestors up to the disk root.
  • PMC* all files somewhere under the CProject whose names start with PMC. This avoids accessing other toplevel directories. It works with EPMC but is not optimal; we may change this later.
  • sections a child directory of every PMC* directory that is named exactly sections.
  • *_body any child directory of sections whose name ends with _body (there is normally only one).
  • *_methods any child directory of any *_body directory whose name ends with _methods. Normally only one.
  • ** any number of directories, or none. There are no conventions for the names or the number of levels, so we have to do this.
  • *_p.xml any leaf node whose name ends in _p.xml.
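
To see how the individual levels behave, the glob can be tried against a few sample paths with java.nio's PathMatcher. This is a sketch only: ami's globber may not follow java.nio semantics exactly, the class name is hypothetical and the sample paths are invented to resemble the tree above. One difference worth knowing: in java.nio a /**/ in the middle of a glob needs at least one intermediate directory, so a *_p.xml file directly under *_methods is not matched there.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

/** Hypothetical sketch, not ami code: tries the glob against sample paths. */
class GlobLevelsSketch {

    static final String GLOB = "**/PMC*/sections/*_body/*_methods/**/*_p.xml";

    /** True if the path matches the glob under java.nio's glob rules. */
    static boolean matches(String glob, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        String base = "target/proj/PMC3893646/sections/1_body/";
        String[] samples = {
            base + "5_methods/1_material_and_synthesis/1_p.xml",   // paragraph below methods
            base + "5_methods/1_material_and_synthesis/0_title.xml", // not *_p.xml
            base + "3_results/1_p.xml",                             // not under *_methods
            "target/proj/other/sections/1_body/5_methods/x/1_p.xml", // not under PMC*
            base + "5_methods/1_p.xml",                             // no directory for the inner **
        };
        for (String sample : samples) {
            System.out.println(matches(GLOB, sample) + "  " + sample);
        }
    }
}
```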

This will retrieve just the leaf nodes and "flatten" the subtree. Because there may be many files named 1_p.xml, we prepend another counter to each name. The result is a directory containing 27 files:

target/summarizeProject/_summary/sections/body/methods/
├── 10_3_p.xml
├── 11_1_p.xml
├── 12_2_p.xml
...
├── 25_1_p.xml
├── 26_2_p.xml
├── 27_1_p.xml
...
└── 9_1_p.xml 
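
The renaming step can be sketched as follows. This is not ami's implementation; the class name is hypothetical, and the 1-based counter and the order in which files are numbered are assumptions made for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch, not ami code: gives each matched file a unique flat name. */
class FlattenSketch {

    /** Returns "<counter>_<original file name>" for each path, in the order given. */
    static List<String> flatNames(List<Path> matched) {
        List<String> names = new ArrayList<>();
        int counter = 1; // assumption: counting starts at 1
        for (Path path : matched) {
            names.add(counter + "_" + path.getFileName());
            counter++;
        }
        return names;
    }

    public static void main(String[] args) {
        // Invented paths that share the same leaf name.
        List<Path> matched = Arrays.asList(
                Paths.get("PMC1/sections/1_body/5_methods/1_synthesis/1_p.xml"),
                Paths.get("PMC1/sections/1_body/5_methods/2_characterization/1_p.xml"),
                Paths.get("PMC2/sections/1_body/2_methods/1_cells/2_p.xml"));
        flatNames(matched).forEach(System.out::println);
    }
}
```

The counter keeps the names unique, but the flat name alone no longer says which CTree or subsection a paragraph came from.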

(These have lost the knowledge of where they came from, but I'll deal with that soon). They contain XML tags which will probably need to be removed before doing textual analysis.
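
One possible way to remove the tags before textual analysis is to parse each file with the JDK's DOM parser and take the text content of the root element. The sketch below is an assumption about how this could be done, not something ami provides; the class name and sample paragraph are invented, and it assumes each *_p.xml file is a small well-formed XML fragment without a DOCTYPE.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical sketch, not ami code: reduces an XML paragraph to plain text. */
class StripTagsSketch {

    /** Parses the XML string and returns its text with runs of whitespace collapsed. */
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            String text = doc.getDocumentElement().getTextContent();
            return text.replaceAll("\\s+", " ").trim();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<p>LiFePO<sub>4</sub> was <italic>ball-milled</italic> for 2 h.</p>";
        System.out.println(textOf(xml)); // LiFePO4 was ball-milled for 2 h.
    }
}
```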

illustrative example - tree preservation

Not Yet Implemented

	@Test
	public void summarizeResults() {
		String root = "methods";
		String project = "summarizeProject/";
		File targetDir = new File("target/"+project);
		CMineTestFixtures.cleanAndCopyDir(TEST_BATTERY10, targetDir);
		String cmd = "-vvv"
				+ " -p "+targetDir
				+ " --output " + "/sections/body/"+root
				+ " summary "
				+ " --glob **/PMC*/sections/*_body/*_methods/**/*_p.xml"
			;
		AMI.execute(cmd);
		AbstractAMITest.compareDirectories(targetDir, expectedDir);
	}


### Example with flattening

This allows you to collect all files of a given type, using glob. This can be used for sections, such as methods and abstract. The output is either a tree of files, or a CSV file with filenames and content.
Typical test:
/** extracts the flattened subtree of abstracts
 * and a summary.csv 
 * 
 * */
@Test
public void testSummarizeAbstracts() {
	String root = "abstract";
	String project = "battery10/";
	File expectedDir = new File(TEST_BATTERY10+"."+"expected", project);
	File targetDir = new File(TARGET_SUMMARY, project);
	CMineTestFixtures.cleanAndCopyDir(TEST_BATTERY10, targetDir);
	String cmd = "-vvv"
			+ " -p "+targetDir
			+ " --output " + "/"+root
			+ " summary "
			+ " --flatten"
			+ " --outtype tab"
			+ " --glob **/PMC*/sections/*_front/*_article-meta/*_abstract.xml"
		;
	AMI.execute(cmd);
	AbstractAMITest.compareDirectories(targetDir, expectedDir);
}

On commandline:

ami -vvv -p myProject --output myoutput summary --flatten --outtype tab \
     --glob **/PMC*/sections/*_front/*_article-meta/*_abstract.xml