Refactoring `search`

petermr edited this page Sep 16, 2020 · 1 revision

refactoring ami search

overview 2020-09-16

Currently ami search is a monolithic command which carries out:

  • norma-transformation if required
  • word frequencies
  • searches by dictionary
  • specialist syntactic searches (gene, species, acronyms, regex, identifiers). These have all stopped working and should be restored.
  • datatables analysis which includes
    • bibliography
    • links to sources (e.g. EPMC)
    • analysis of results/ folder
    • analysis of words folder
    • links to Wikipedia (not yet Wikidata)
  • cooccurrence and frequency plots

This was all done through an AMIArgProcessor command line, which is now gradually being phased out.

proposed design

separate normalization

An increasing number of transformations are needed, so these should be under picocli control. Maybe re-institute norma / ami transform.

separate ami words

This has been done but not customised (i.e. it is not easy to change stopwords or remove junk). Also, linking to Wikipedia is very crude and misleading.
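
As a rough illustration of what "customised" could mean here, the sketch below counts words with a caller-supplied stopword set and a simple junk filter. It is written in Python for brevity and is not ami code; `word_frequencies`, `DEFAULT_STOPWORDS` and `min_length` are hypothetical names, and the stopword list is a placeholder.

```python
import re
from collections import Counter

# Hypothetical default list; ami's real stopword files are not shown on this page.
DEFAULT_STOPWORDS = frozenset({"the", "and", "of", "in", "a", "to", "is"})


def word_frequencies(text, stopwords=DEFAULT_STOPWORDS, min_length=3):
    """Count lowercased alphabetic words, skipping stopwords and short junk."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return Counter(
        t for t in tokens
        if t not in stopwords and len(t) >= min_length
    )


freqs = word_frequencies("The virus and the viral proteins of the virus")
print(freqs.most_common(2))
```

Passing a different `stopwords` set or `min_length` per run is the kind of control that a separate ami words command could expose as options.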

separate ami search

The legacy search must be kept, but should have no triggering of ami words or datatables or cooccurrence. It should simply build the results/ folders. The search includes the following:

  • lowercasing
  • stopwords
  • stemming
  • phrases ("trailing words")

These steps need mending.
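
The four steps listed above can be sketched as a small pipeline. This is a Python illustration under stated assumptions, not ami's implementation: the stemmer is a naive suffix stripper standing in for a real one, the stopword list is a placeholder, and "trailing words" is read here as "match the first word of a multi-word term, then check the words that follow it". All function names are hypothetical.

```python
import re

STOPWORDS = frozenset({"the", "of", "and", "in", "a"})  # hypothetical list


def stem(word):
    """Naive suffix stripping; a stand-in for a real stemmer."""
    if word.endswith(("ss", "us")):
        return word  # leave words like "class" and "virus" alone
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, tokenize, drop stopwords, then stem each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


def find_term(term, text):
    """Return indexes (into the normalized tokens) where the term starts.

    The first word of the term is matched, then its trailing words
    are compared with the tokens that immediately follow.
    """
    term_tokens = normalize(term)
    tokens = normalize(text)
    if not term_tokens:
        return []
    first, trailing = term_tokens[0], term_tokens[1:]
    hits = []
    for i, token in enumerate(tokens):
        if token == first and tokens[i + 1 : i + 1 + len(trailing)] == trailing:
            hits.append(i)
    return hits


print(find_term("infected cell", "Testing infected cells in the lab"))
```

Because the term and the text go through the same `normalize` function, a fix to any one step (for example a better stemmer) applies to both sides of the match.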

separate datatables

This has been partially done. It needs better control over Wikidata, icons, etc.

separate cooccurrence

This has been partially done but not customised.
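
For orientation, a minimal co-occurrence count might look like the sketch below: for each document, take the set of terms found and count every unordered pair once. This is a Python illustration, not ami code; `cooccurrence` is a hypothetical name and the document identifiers and terms are made-up sample data.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(hits_by_doc):
    """Count how many documents mention each unordered pair of terms."""
    counts = Counter()
    for terms in hits_by_doc.values():
        # Deduplicate within a document and sort so each pair has one key.
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts


# Made-up sample data: document id -> terms found by a search.
hits = {
    "doc1": ["zika", "aedes", "zika"],
    "doc2": ["zika", "aedes", "brazil"],
    "doc3": ["brazil"],
}
pairs = cooccurrence(hits)
print(pairs[("aedes", "zika")])
```

Customisation could then mean choosing which term lists feed `hits_by_doc` and applying a minimum count before plotting.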