TranskribusDU_SPM

Hervé Déjean edited this page Jul 12, 2017 · 5 revisions

Use of Sequential Pattern Mining (SPM) techniques for Document Understanding

Sequential Pattern Mining

  • uses the PrefixSpan algorithm to mine document objects
  • see this blog entry for some explanation
  • code: github
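
As a rough illustration of the PrefixSpan idea (grow a frequent prefix, then mine only the suffixes that follow it), here is a minimal sketch. It is a simplified variant written for this page, not the code used by the project: each sequence is a flat list of single items rather than a list of feature sets, and the function name `prefixspan` and the toy sequences are assumptions.

```python
from collections import defaultdict


def prefixspan(sequences, min_support):
    """Return (pattern, support) pairs for all frequent sequential patterns.

    Simplified sketch: every sequence is a list of single items (no itemsets).
    """
    results = []

    def mine(prefix, projected):
        # projected: list of (sequence index, start offset) pairs, i.e. the
        # suffixes that remain after matching `prefix` in each sequence.
        counts = defaultdict(int)
        for seq_id, start in projected:
            # Count each item at most once per sequence.
            for item in set(sequences[seq_id][start:]):
                counts[item] += 1
        for item, support in sorted(counts.items()):
            if support < min_support:
                continue
            pattern = prefix + [item]
            results.append((pattern, support))
            # Project: keep what follows the first occurrence of `item`.
            new_projected = []
            for seq_id, start in projected:
                seq = sequences[seq_id]
                try:
                    pos = seq.index(item, start)
                except ValueError:
                    continue
                new_projected.append((seq_id, pos + 1))
            mine(pattern, new_projected)

    mine([], [(i, 0) for i in range(len(sequences))])
    return results


if __name__ == "__main__":
    # Hypothetical toy input, not taken from the collection.
    toy = [["a", "b", "c"], ["a", "c"], ["b", "c"]]
    for pattern, support in prefixspan(toy, min_support=2):
        print(pattern, support)
```

Restricting each recursive call to the projected suffixes is what keeps the search from re-scanning whole sequences for every candidate pattern.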

Example 1: Line Mining

This toy example illustrates how SPM can be used to hierarchically structure a set of contiguous lines. See also the blog entry.

Transkribus collection ID 4453, document ID 12033 (send me your Transkribus username and I will give you access to the collection).

Download collection:

    python TranskribusPyClient/src/TranskribusCommands/Transkribus_downloader.py 4453

Convert into internal data structure:

    python TranskribusDU/src/xml_formats/Page2DS.py -i trnskrbs_4453/col/12033.mpxml -o trnskrbs_4453/xml/12033.ds_xml

SPM Line Miner:

    python ../spmLine.py -i trnskrbs_4453/xml/12033.ds_xml -o trnskrbs_4453/out/12033.ds_xml

Output of the method:

Sequence of elements and their features:

    TEXT ['x=244.0']
    TEXT ['x=132.0', 'x2=450.0', 'f=293.0']
    TEXT ['x=132.0', 'x2=450.0', 'f=293.0']
    TEXT ['x=132.0', 'x2=329.0', 'f=coniugum filius legitimus.', 'f=231.0']
    TEXT ['x=132.0', 'x2=450.0', 'f=293.0']
    TEXT ['x=132.0', 'x2=450.0', 'f=282.0']
    TEXT ['x=132.0', 'x2=329.0', 'f=231.0']
    TEXT ['x=132.0', 'x2=450.0', 'f=282.0']
    TEXT ['x=244.0', 'x2=329.0', 'f=Testes', 'f=282.0']
    TEXT ['x=132.0', 'x2=450.0', 'f=282.0']
    TEXT ['x=132.0', 'x2=450.0', 'f=282.0']
    TEXT ['x=132.0']
    TEXT ['x=244.0', 'x2=375.0']
    TEXT ['x=132.0', 'x2=450.0', 'f=282.0']
    TEXT ['x=132.0', 'x2=450.0', 'f=282.0']
    TEXT ['x=132.0', 'x2=350.0', 'f=coniugum filius legitimus.', 'f=231.0']
    TEXT ['x=132.0', 'x2=433.0', 'f=282.0']
    TEXT ['x=132.0', 'x2=433.0', 'f=282.0']
    TEXT ['x=132.0', 'x2=375.0', 'f=252.0']
    TEXT ['x=132.0', 'x2=450.0', 'f=282.0']
    TEXT ['x=132.0', 'x2=433.0', 'f=282.0']
    TEXT ['x=132.0', 'x2=433.0', 'f=282.0']
    TEXT ['x=244.0', 'f=Testes', 'f=282.0']
    TEXT ['x=132.0', 'x2=433.0', 'f=282.0']
    TEXT ['x=132.0', 'x2=450.0', 'f=282.0']
    TEXT ['x=132.0', 'x2=375.0', 'f=252.0']
    TEXT ['x2=350.0', 'f=282.0']
    TEXT ['x=132.0', 'x2=450.0', 'f=282.0']
    TEXT ['x=132.0', 'x2=433.0', 'f=282.0']

List of patterns and their support:

    [['x=132.0']] 24.0
    [['f=282.0']] 18.0
    [['x=132.0', 'f=282.0']] 15.0
    [['x=132.0', 'x2=450.0']] 12.0
    [['x2=450.0']] 12.0
    [['x=132.0', 'x2=450.0', 'f=282.0']] 9.0
    [['x2=450.0', 'f=282.0']] 9.0
    [['x2=433.0']] 6.0
    [['x2=433.0', 'f=282.0']] 6.0
    [['x=132.0', 'x2=433.0']] 6.0
    [['x=132.0', 'x2=433.0', 'f=282.0']] 6.0

Selected pattern: ['x=132.0']
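
This page does not describe how the selected pattern is turned into structure. One plausible reading of the second iteration's features (`f=[['x=132.0']]`) and of the final tree is that each run of contiguous lines matching the pattern is collapsed into a single node, which then acts as one element in the next iteration. The sketch below shows that grouping step under this assumption; `group_runs` and the tuple representation are hypothetical names chosen for the example.

```python
def group_runs(elements, pattern):
    """Collapse each maximal run of elements containing `pattern` into a node.

    Returns a list of ("Node", [feature lists]) and ("TEXT", feature list).
    """
    wanted = set(pattern)
    structured, run = [], []
    for features in elements:
        if wanted <= set(features):
            # Element matches the pattern: extend the current run.
            run.append(features)
        else:
            # A non-matching element closes the run, if there is one.
            if run:
                structured.append(("Node", run))
                run = []
            structured.append(("TEXT", features))
    if run:
        structured.append(("Node", run))
    return structured


if __name__ == "__main__":
    lines = [
        ["x=244.0"],
        ["x=132.0", "x2=450.0", "f=293.0"],
        ["x=132.0", "x2=329.0", "f=231.0"],
        ["x=244.0", "x2=329.0", "f=Testes", "f=282.0"],
        ["x=132.0", "x2=450.0", "f=282.0"],
    ]
    for kind, content in group_runs(lines, ["x=132.0"]):
        print(kind, content)
```

Applied to the lines above, the indented lines starting at `x=244.0` stay as plain `TEXT` elements, while the runs at `x=132.0` become nodes, which matches the alternation seen in the patterns of the second iteration.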

Iteration 2: List of patterns and their support:

    [['x=244.0'], ['f=[['x=132.0']]']] 5.0
    [['f=[['x=132.0']]'], ['x=244.0']] 4.0
    [['f=[['x=132.0']]'], ['f=282.0']] 4.0
    [['f=282.0'], ['f=[['x=132.0']]']] 4.0
    [['x=244.0', 'f=Testes', 'f=282.0'], ['f=[['x=132.0']]']] 3.0
    [['x=244.0', 'f=Testes'], ['f=[['x=132.0']]']] 3.0
    [['x=244.0', 'f=282.0'], ['f=[['x=132.0']]']] 3.0
    [['x=244.0'], ['f=[['x=132.0']]'], ['x=244.0']] 3.0
    [['x=244.0'], ['f=[['x=132.0']]'], ['f=282.0']] 3.0
    [['f=[['x=132.0']]'], ['x=244.0', 'f=Testes']] 3.0
    [['f=[['x=132.0']]'], ['x=244.0', 'f=Testes', 'f=282.0']] 3.0
    [['f=[['x=132.0']]'], ['x=244.0', 'f=282.0']] 3.0
    [['f=[['x=132.0']]'], ['x=244.0'], ['f=[['x=132.0']]']] 3.0
    [['f=[['x=132.0']]']] 3.0
    [['f=[['x=132.0']]'], ['f=[['x=132.0']]']] 3.0
    [['f=[['x=132.0']]'], ['f=Testes']] 3.0
    [['f=[['x=132.0']]'], ['f=Testes', 'f=282.0']] 3.0
    [['f=[['x=132.0']]'], ['f=282.0'], ['f=[['x=132.0']]']] 3.0
    [['f=Testes', 'f=282.0'], ['f=[['x=132.0']]']] 3.0
    [['f=Testes'], ['f=[['x=132.0']]']] 3.0
    [['f=282.0']] 3.0
    [['x=244.0']] 2.0
    [['x=244.0', 'f=Testes']] 2.0
    [['x=244.0', 'f=Testes', 'f=282.0']] 2.0
    [['x=244.0', 'f=282.0']] 2.0
    [['x=244.0'], ['x=244.0']] 2.0
    [['x=244.0'], ['f=[['x=132.0']]'], ['x=244.0', 'f=Testes']] 2.0
    [['x=244.0'], ['f=[['x=132.0']]'], ['x=244.0', 'f=Testes', 'f=282.0']] 2.0
    [['x=244.0'], ['f=[['x=132.0']]'], ['x=244.0', 'f=282.0']] 2.0
    [['x=244.0'], ['f=[['x=132.0']]'], ['f=Testes']] 2.0
    [['x=244.0'], ['f=[['x=132.0']]'], ['f=Testes', 'f=282.0']] 2.0
    [['x=244.0'], ['f=282.0']] 2.0
    [['f=[['x=132.0']]'], ['x=244.0', 'f=Testes', 'f=282.0'], ['f=[['x=132.0']]']] 2.0
    [['f=[['x=132.0']]'], ['x=244.0', 'f=Testes'], ['f=[['x=132.0']]']] 2.0
    [['f=[['x=132.0']]'], ['x=244.0', 'f=282.0'], ['f=[['x=132.0']]']] 2.0
    [['f=[['x=132.0']]'], ['f=Testes', 'f=282.0'], ['f=[['x=132.0']]']] 2.0
    [['f=[['x=132.0']]'], ['f=Testes'], ['f=[['x=132.0']]']] 2.0
    [['f=Testes']] 2.0
    [['f=Testes', 'f=282.0']] 2.0

Selected pattern: ['x=244.0'], ['x=132.0']

Final hierarchical structure:

     Node
       TEXT Die 7. februarij.
       Node
         TEXT sponsus carolus Rudolphi stockinger co:
         TEXT =loni in spelterback et ursula amb: p: m:
         TEXT coniugum filius legitimus.
         TEXT sponsa agatha caspari kiblbock defuneti
         TEXT coloni in wincklbrun et mari?? piriter? p:
         TEXT m: uxoris filia legitima.
         TEXT assistens R. D. coop: godefridus schimd.
       TEXT Testes
       Node
         TEXT D: thomas schranck civis et pistor loci.
         TEXT ac josephus pindrer colonus in spelten
         TEXT back
       TEXT Die 14 Huius
       Node
         TEXT sponsus simon, andrea goz domune??darij
         TEXT in perlesedt et mario amb: viventium
         TEXT coniugum filisu legitimus.
         TEXT sponsa anna maria, joannis wagner
         TEXT coloni in perlesedt et viventis et eva
         TEXT defuncta uxoris filia legitima.
         TEXT nb? prosati sponsi fuerunt in terio gradu co:
         TEXT sanguinitatis clementissime dispensati
         TEXT assistens R. D. coop pancratuis tellner
       TEXT Testes
       Node
         TEXT josephus wurm alimentatius in
         TEXT perlesedt et joannis georgius civi:
         TEXT =gaissinger, sivivus loci sartor
     TEXT Die 15 huius
     Node
       TEXT sponsus simon, joannis klinenger ope:
       TEXT =rarij in hinterschmiding et rosina

Execution time:

    real    0m1.529s
    user    0m0.000s
    sys     0m0.046s

Example 2: Vertical Regions Mining

  • Objective: find the vertical regions (columns) of a page
  • Input: a document and, for each page, its line regions
  • Output: the vertical regions of the pages
  • Tested on StaZH/MM_1_001 (colID=;docId=; version = ...)