TranskribusDU_SPM
Hervé Déjean edited this page Jul 12, 2017
Use of Sequential Pattern Mining (SPM) techniques for Document Understanding
- use of the PrefixSpan algorithm to mine document objects
- see this blog entry for some explanation
- code: github
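As a rough illustration of the mining step, here is a minimal PrefixSpan sketch in Python. It is not the code from the repository: the function name `prefixspan`, the toy sequences and the support threshold are assumptions made for this example, and each sequence element is a single item rather than a set of features.

```python
from collections import defaultdict


def prefixspan(sequences, min_support):
    """Return (pattern, support) pairs for all frequent subsequences.

    Simplified variant: every element of a sequence is a single item.
    """
    results = []

    def mine(prefix, projected):
        # projected: list of (sequence index, position just after the last match)
        occurrences = defaultdict(list)
        for seq_id, start in projected:
            seen = set()
            for pos in range(start, len(sequences[seq_id])):
                item = sequences[seq_id][pos]
                if item not in seen:
                    # Only the first occurrence per sequence defines the projection.
                    seen.add(item)
                    occurrences[item].append((seq_id, pos + 1))
        for item, new_projected in sorted(occurrences.items()):
            if len(new_projected) >= min_support:
                pattern = prefix + [item]
                results.append((pattern, len(new_projected)))
                mine(pattern, new_projected)  # grow the prefix recursively

    mine([], [(i, 0) for i in range(len(sequences))])
    return results


# Hypothetical toy input: three short sequences of line indentations.
toy = [
    ["x=244.0", "x=132.0", "x=132.0"],
    ["x=244.0", "x=132.0"],
    ["x=132.0", "x=244.0"],
]
patterns = prefixspan(toy, min_support=2)
for pattern, support in patterns:
    print(pattern, support)
```

With these toy sequences, the only frequent pattern of length two is an indented line (`x=244.0`) followed by a flush-left line (`x=132.0`), which is the kind of regularity the miner is meant to surface.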
This toy example illustrates how SPM can be used to hierarchically structure a set of contiguous lines. See also the blog entry.
Transkribus collection id: 4453, document id: 12033 (send me your Transkribus username and I will give you access to the collection)
Download collection:
python TranskribusPyClient/src/TranskribusCommands/Transkribus_downloader.py 4453
Convert into internal data structure:
python TranskribusDU/src/xml_formats/Page2DS.py -i trnskrbs_4453/col/12033.mpxml -o trnskrbs_4453/xml/12033.ds_xml
SPM Line Miner:
python ../spmLine.py -i trnskrbs_4453/xml/12033.ds_xml -o trnskrbs_4453/out/12033.ds_xml
Output of the method:
Sequence of elements and their features:
TEXT ['x=244.0']
TEXT ['x=132.0', 'x2=450.0', 'f=293.0']
TEXT ['x=132.0', 'x2=450.0', 'f=293.0']
TEXT ['x=132.0', 'x2=329.0', 'f=coniugum filius legitimus.', 'f=231.0']
TEXT ['x=132.0', 'x2=450.0', 'f=293.0']
TEXT ['x=132.0', 'x2=450.0', 'f=282.0']
TEXT ['x=132.0', 'x2=329.0', 'f=231.0']
TEXT ['x=132.0', 'x2=450.0', 'f=282.0']
TEXT ['x=244.0', 'x2=329.0', 'f=Testes', 'f=282.0']
TEXT ['x=132.0', 'x2=450.0', 'f=282.0']
TEXT ['x=132.0', 'x2=450.0', 'f=282.0']
TEXT ['x=132.0']
TEXT ['x=244.0', 'x2=375.0']
TEXT ['x=132.0', 'x2=450.0', 'f=282.0']
TEXT ['x=132.0', 'x2=450.0', 'f=282.0']
TEXT ['x=132.0', 'x2=350.0', 'f=coniugum filius legitimus.', 'f=231.0']
TEXT ['x=132.0', 'x2=433.0', 'f=282.0']
TEXT ['x=132.0', 'x2=433.0', 'f=282.0']
TEXT ['x=132.0', 'x2=375.0', 'f=252.0']
TEXT ['x=132.0', 'x2=450.0', 'f=282.0']
TEXT ['x=132.0', 'x2=433.0', 'f=282.0']
TEXT ['x=132.0', 'x2=433.0', 'f=282.0']
TEXT ['x=244.0', 'f=Testes', 'f=282.0']
TEXT ['x=132.0', 'x2=433.0', 'f=282.0']
TEXT ['x=132.0', 'x2=450.0', 'f=282.0']
TEXT ['x=132.0', 'x2=375.0', 'f=252.0']
TEXT ['x2=350.0', 'f=282.0']
TEXT ['x=132.0', 'x2=450.0', 'f=282.0']
TEXT ['x=132.0', 'x2=433.0', 'f=282.0']
List of patterns and their support:
[['x=132.0']] 24.0
[['f=282.0']] 18.0
[['x=132.0', 'f=282.0']] 15.0
[['x=132.0', 'x2=450.0']] 12.0
[['x2=450.0']] 12.0
[['x=132.0', 'x2=450.0', 'f=282.0']] 9.0
[['x2=450.0', 'f=282.0']] 9.0
[['x2=433.0']] 6.0
[['x2=433.0', 'f=282.0']] 6.0
[['x=132.0', 'x2=433.0']] 6.0
[['x=132.0', 'x2=433.0', 'f=282.0']] 6.0
Selected pattern : ['x=132.0']
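The support values listed above are consistent with a simple reading: the support of a feature set is the number of lines whose features include all of them. The sketch below recomputes the counts under that reading, using the feature lists copied from the sequence above. The helper name `itemset_support` and the threshold `min_support=6` are assumptions chosen for this illustration; they are not taken from spmLine.py, which may compute and filter its patterns differently.

```python
from collections import Counter
from itertools import combinations

# Feature sets copied from the element sequence above (one set per TEXT line).
A = {"x=132.0", "x2=450.0", "f=282.0"}
B = {"x=132.0", "x2=433.0", "f=282.0"}
C = {"x=132.0", "x2=450.0", "f=293.0"}
D = {"x=132.0", "x2=375.0", "f=252.0"}
LINES = [
    {"x=244.0"}, C, C,
    {"x=132.0", "x2=329.0", "f=coniugum filius legitimus.", "f=231.0"},
    C, A,
    {"x=132.0", "x2=329.0", "f=231.0"},
    A,
    {"x=244.0", "x2=329.0", "f=Testes", "f=282.0"},
    A, A,
    {"x=132.0"},
    {"x=244.0", "x2=375.0"},
    A, A,
    {"x=132.0", "x2=350.0", "f=coniugum filius legitimus.", "f=231.0"},
    B, B, D, A, B, B,
    {"x=244.0", "f=Testes", "f=282.0"},
    B, A, D,
    {"x2=350.0", "f=282.0"},
    A, B,
]


def itemset_support(lines, min_support):
    """Count, for every feature combination, the lines that contain it."""
    counts = Counter()
    for features in lines:
        ordered = sorted(features)
        for size in range(1, len(ordered) + 1):
            for combo in combinations(ordered, size):
                counts[frozenset(combo)] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


# min_support=6 is an assumed threshold, picked for this illustration.
frequent = itemset_support(LINES, min_support=6)
for combo, n in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(combo), n)
```

Under these assumptions the counts agree with the list above, for instance 24 lines carry `x=132.0` and 15 carry both `x=132.0` and `f=282.0`.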
Iteration 2: List of patterns and their support:
[['x=244.0'], ['f=[['x=132.0']]']] 5.0
[['f=[['x=132.0']]'], ['x=244.0']] 4.0
[['f=[['x=132.0']]'], ['f=282.0']] 4.0
[['f=282.0'], ['f=[['x=132.0']]']] 4.0
[['x=244.0', 'f=Testes', 'f=282.0'], ['f=[['x=132.0']]']] 3.0
[['x=244.0', 'f=Testes'], ['f=[['x=132.0']]']] 3.0
[['x=244.0', 'f=282.0'], ['f=[['x=132.0']]']] 3.0
[['x=244.0'], ['f=[['x=132.0']]'], ['x=244.0']] 3.0
[['x=244.0'], ['f=[['x=132.0']]'], ['f=282.0']] 3.0
[['f=[['x=132.0']]'], ['x=244.0', 'f=Testes']] 3.0
[['f=[['x=132.0']]'], ['x=244.0', 'f=Testes', 'f=282.0']] 3.0
[['f=[['x=132.0']]'], ['x=244.0', 'f=282.0']] 3.0
[['f=[['x=132.0']]'], ['x=244.0'], ['f=[['x=132.0']]']] 3.0
[['f=[['x=132.0']]']] 3.0
[['f=[['x=132.0']]'], ['f=[['x=132.0']]']] 3.0
[['f=[['x=132.0']]'], ['f=Testes']] 3.0
[['f=[['x=132.0']]'], ['f=Testes', 'f=282.0']] 3.0
[['f=[['x=132.0']]'], ['f=282.0'], ['f=[['x=132.0']]']] 3.0
[['f=Testes', 'f=282.0'], ['f=[['x=132.0']]']] 3.0
[['f=Testes'], ['f=[['x=132.0']]']] 3.0
[['f=282.0']] 3.0
[['x=244.0']] 2.0
[['x=244.0', 'f=Testes']] 2.0
[['x=244.0', 'f=Testes', 'f=282.0']] 2.0
[['x=244.0', 'f=282.0']] 2.0
[['x=244.0'], ['x=244.0']] 2.0
[['x=244.0'], ['f=[['x=132.0']]'], ['x=244.0', 'f=Testes']] 2.0
[['x=244.0'], ['f=[['x=132.0']]'], ['x=244.0', 'f=Testes', 'f=282.0']] 2.0
[['x=244.0'], ['f=[['x=132.0']]'], ['x=244.0', 'f=282.0']] 2.0
[['x=244.0'], ['f=[['x=132.0']]'], ['f=Testes']] 2.0
[['x=244.0'], ['f=[['x=132.0']]'], ['f=Testes', 'f=282.0']] 2.0
[['x=244.0'], ['f=282.0']] 2.0
[['f=[['x=132.0']]'], ['x=244.0', 'f=Testes', 'f=282.0'], ['f=[['x=132.0']]']] 2.0
[['f=[['x=132.0']]'], ['x=244.0', 'f=Testes'], ['f=[['x=132.0']]']] 2.0
[['f=[['x=132.0']]'], ['x=244.0', 'f=282.0'], ['f=[['x=132.0']]']] 2.0
[['f=[['x=132.0']]'], ['f=Testes', 'f=282.0'], ['f=[['x=132.0']]']] 2.0
[['f=[['x=132.0']]'], ['f=Testes'], ['f=[['x=132.0']]']] 2.0
[['f=Testes']] 2.0
[['f=Testes', 'f=282.0']] 2.0
Selected pattern: ['x=244.0'], ['x=132.0']
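The feature `f=[['x=132.0']]` in the second iteration suggests that lines matching the pattern selected in the first iteration are grouped into a single element before mining again. A minimal sketch of such a grouping step follows. The function `collapse_runs`, the dictionary layout and the toy input are hypothetical, and the rule used here (merge each maximal run of consecutive matching lines) is an assumption about how the grouping could work, not a description of spmLine.py.

```python
from itertools import groupby


def collapse_runs(lines, feature):
    """Merge each maximal run of consecutive lines carrying `feature` into one node."""
    merged = []
    for has_feature, run in groupby(lines, key=lambda feats: feature in feats):
        run = list(run)
        if has_feature:
            # The new element is tagged with a feature naming the matched pattern
            # and keeps the original lines as its children.
            merged.append({"features": {f"f=[['{feature}']]"}, "children": run})
        else:
            merged.extend({"features": set(feats), "children": []} for feats in run)
    return merged


# Hypothetical toy input: feature sets of five consecutive lines.
toy = [
    {"x=244.0"},
    {"x=132.0"},
    {"x=132.0"},
    {"x=244.0", "f=Testes"},
    {"x=132.0"},
]
merged = collapse_runs(toy, "x=132.0")
for element in merged:
    print(sorted(element["features"]), len(element["children"]))
```

Running the miner again on the collapsed sequence can then yield patterns such as an indented line followed by a grouped block, which gives one more level of nesting per iteration.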
Final hierarchical structure:
Node
TEXT Die 7. februarij.
Node
TEXT sponsus carolus Rudolphi stockinger co:
TEXT =loni in spelterback et ursula amb: p: m:
TEXT coniugum filius legitimus.
TEXT sponsa agatha caspari kiblbock defuneti
TEXT coloni in wincklbrun et mari?? piriter? p:
TEXT m: uxoris filia legitima.
TEXT assistens R. D. coop: godefridus schimd.
TEXT Testes
Node
TEXT D: thomas schranck civis et pistor loci.
TEXT ac josephus pindrer colonus in spelten
TEXT back
TEXT Die 14 Huius
Node
TEXT sponsus simon, andrea goz domune??darij
TEXT in perlesedt et mario amb: viventium
TEXT coniugum filisu legitimus.
TEXT sponsa anna maria, joannis wagner
TEXT coloni in perlesedt et viventis et eva
TEXT defuncta uxoris filia legitima.
TEXT nb? prosati sponsi fuerunt in terio gradu co:
TEXT sanguinitatis clementissime dispensati
TEXT assistens R. D. coop pancratuis tellner
TEXT Testes
Node
TEXT josephus wurm alimentatius in
TEXT perlesedt et joannis georgius civi:
TEXT =gaissinger, sivivus loci sartor
TEXT Die 15 huius
Node
TEXT sponsus simon, joannis klinenger ope:
TEXT =rarij in hinterschmiding et rosina
real 0m1.529s
user 0m0.000s
sys 0m0.046s
- Objective: find the vertical regions (columns) of a page
- Input: A document, and for each page the line regions.
- Output: the vertical regions of the pages
- Tested on StaZH/MM_1_001 (colID=;docId=; version = ...)