This repository has been archived by the owner on Nov 11, 2022. It is now read-only.

IndexError: string index out of range during segmentation #8

Open
cboulanger opened this issue Feb 18, 2022 · 1 comment

@cboulanger
Owner

  File "/app/run-main.py", line 174, in <module>
    call_segmentation_training(sys.argv[2])
  File "/app/run-main.py", line 125, in call_segmentation_training
    os.path.join(model_dir, get_version(), model_name))
  File "/app/EXparser/Training_Seg.py", line 55, in train_segmentation
    train_feat[len(train_feat) - 1].extend([word2feat(a, stopw, 2, len(ln), b1, b2, b3, b4, b5, b6)])
  File "/app/EXparser/src/gle_fun_seg.py", line 378, in word2feat
    feat.update(get_last(w))
  File "/app/EXparser/src/gle_fun_seg.py", line 281, in get_last
    c = w[-1] * 2
IndexError: string index out of range
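
For readers who want to reproduce the failure in isolation: `w[-1]` on a `str` can only raise this error when `w` is the empty string, so `word2feat` must have received an empty token. The sketch below is not the code from `gle_fun_seg.py`. Only the expression `w[-1] * 2` comes from the traceback; the function names and the `default` parameter are hypothetical.

```python
def get_last_unsafe(w: str) -> str:
    # Same expression as the failing line in the traceback.
    # Raises IndexError when w is the empty string.
    return w[-1] * 2


def get_last_safe(w: str, default: str = "") -> str:
    # Hypothetical guarded variant: return a default for empty tokens
    # instead of indexing into them.
    return w[-1] * 2 if w else default


if __name__ == "__main__":
    print(repr(get_last_safe("ab")))  # last character doubled
    print(repr(get_last_safe("")))    # falls back to the default
    try:
        get_last_unsafe("")
    except IndexError as err:
        print("IndexError:", err)
```

A guard like this only hides the symptom. Whether an empty token should yield a default feature or cause the whole line to be rejected depends on how the features are consumed later on.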
cboulanger added a commit that referenced this issue Feb 19, 2022
@cboulanger
Owner Author

I added a try/except to work around this issue. The resulting log shows that the bug is caused by malformed annotations (see below). The fix simply skips the malformed lines, which may be the only appropriate solution.

Segmentation training [###.............................] 35/320: 0:00:24 remaining...
16563.xml: problem parsing <author><surname>Weber <author><given-names>Max </surname></author></given-names></author>(</author><year>1988</year><author>c/ Orig. </author><year>1920</year><author>) <title>Gesammelte Aufsätze zur Religionssoziologie I</author>. <other>Tübingen</title>.</other>
Segmentation training [#######.........................] 71/320: 0:00:23 remaining...
20786.xml: problem parsing <author><surname>Schnell</surname>,<given-names> R.</given-names></author>, <year>1997</year>: <title>Nonresponse in Bevölkerungsumfragen. Ausmaß, Entwicklung und Ursachen</title>. <other>Opladen<other>: <publisher>Leske + Budrich.</publisher></other></other>
Segmentation training [#######.........................] 77/320: 0:00:14 remaining...
21690.xml: problem parsing <source>Working Brief</source> <volume>15</volume>: <author><given-names>Diego</given-names> <surname>Compagna / <author><given-names>Stefan</surname> <surname>Derpmann</surname></author></given-names></author> / <author><given-names>Kathrin</given-names> <surname>Mauz</surname></author> / <author><given-names>Karen</given-names> <surname>Shire</surname></author> (<year>2009</year>): <title>Förderung des Wissenstransfers für eine aktive Mitgestaltung des Pflegesektors durch Mikrosystemtechnik (WiMi-Care)</title>, <source>Working Brief</source> <volume>15</volume>: <title>Die Einstellung von Pflegekräften gegenüber technischen Neuerungen</title>. In: <url>http://www.wimi-care.de/outputs.html#Briefs</url> (letzter Abruf: <other>02.12.2009</other>).
Segmentation training [##################..............] 188/320: 0:00:13 remaining...
36684.xml: problem parsing <title>Stellungnahmen geladener Sachverständiger vor dem Bundestag zum Thema Fiskalpakt und ESM</title>, <other>7.5.</other><year>2012</year>: <url><www. bundestag.de/bundestag/ausschuesse17/a08/anhoerungen/fiskalpakt_und_esm/stellungnahmen/index.html/></url>.
Segmentation training [######################..........] 225/320: 0:00:10 remaining...
40723.xml: problem parsing <author><surname>Koskinas</surname></author>, <author><given-names>Ioannis </given-names></author>(<year>2014</year>),<title> The Only Choice Left for Afghanistan</title>, online: <url>htp://southasia.foreign-policy.com/posts/2014/09/11/the_only_choice_ left_for_afghanistan></url> (<other>27 October 2014</other>).
Segmentation training [##########################......] 260/320: 0:00:05 remaining...
45841.xml: problem parsing <editor>Folha Online</editor> (<year>2012</year>), <url><www1.folha.uol.com.br/fsp/brasil/></url> (<other>12. November 2012</other>).
45841.xml: problem parsing <author><surname>Patarra</surname>, <given-names>Ivo</given-names></author> (<year>2010</year>), <title>O chefe</title>, online: <url><www.escandalodomensalao.com.br></url> (<other>2. November 2012</other>).
45841.xml: problem parsing <editor>Veja</editor> (<year>2012</year>), <title>O Julgamento do Mensalão. A hora da Sentença</title>, online: <url><htp://veja.abril.com.br/o-jul - gamento-do-mensalao/hora-da-sentenca/></url> (<other>13. November 2012</other>).
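
The workaround described above, catching the error per line and logging the offending annotation, could look roughly like the following. This is a minimal sketch and not the committed fix: `featurize_lines` and `toy_featurize` are hypothetical names, and `toy_featurize` is only a stand-in that fails on empty tokens the same way `w[-1]` does.

```python
import sys
from typing import Callable, List, Tuple


def featurize_lines(
    lines: List[str],
    featurize: Callable[[str], list],
    source: str = "<unknown>",
) -> Tuple[List[list], List[str]]:
    """Apply `featurize` to each line; skip and report lines that raise IndexError."""
    features: List[list] = []
    skipped: List[str] = []
    for ln in lines:
        try:
            features.append(featurize(ln))
        except IndexError:
            # Same message shape as the log above: file name, then the bad line.
            print(f"{source}: problem parsing {ln}", file=sys.stderr)
            skipped.append(ln)
    return features, skipped


def toy_featurize(line: str) -> list:
    # Hypothetical stand-in for the real feature extraction: takes the last
    # character of every space-separated token, so an empty token raises IndexError.
    return [tok[-1] for tok in line.split(" ")]


if __name__ == "__main__":
    feats, bad = featurize_lines(["ab cd", "a  b"], toy_featurize, "example.xml")
    print(feats, bad)
```

Returning the skipped lines alongside the features makes it easy to count how much training data is lost and to hand the list of malformed annotations back for correction.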

cboulanger added a commit that referenced this issue Feb 19, 2022