tokenization issues: ' followed by s, m, t, etc #1

AngledLuffa · 2022-12-11T06:52:38Z

"it's" gets tokenized into three tokens: it, ', s

that should be fixed

same with 'm 't etc

SecroLoL · 2024-07-29T08:02:03Z

Are you saying that it should be tokenized as it, 's?

AngledLuffa · 2024-07-29T15:07:23Z

yes, those should be it, 's and i, 'm, etc

AngledLuffa · 2024-07-29T15:08:05Z

lmk if you need or want some assistance scripting changes like that

SecroLoL · 2024-07-30T06:15:34Z

I think I've got this, thanks! Will let you know if I need help though

SecroLoL · 2024-08-09T06:03:55Z

How about cases where a noun is followed by 's? Are these annotated properly?
Example:

The	O
opposition	O
's	O
poor	O
election	O
results	O

SecroLoL · 2024-08-09T06:07:46Z

Here's what I'm seeing when inspecting some processed data:

national	O
truth	O
-	O
telling	O
process	O
would	O
have	O
on	O
Australia	B-Location
,	O
it	O
's	O
remarkable	O
.	O

"	O
One	O
of	O
the	O
things	O
that	O
we	O
're	O
thinking	O
about	O

I	O
'm	O
a	O
non	O
-	O
conformist	O
politician	O
.	O
I	O
'm	O
a	O
revolutionary	O
,	O
'	O
'	O
Bouteflika	B-Person
told	O
The	B-Organization
Associated	I-Organization
Press	I-Organization

Can't find the cases you're talking about. Was that perhaps only for the raw annotated data?

AngledLuffa · 2024-08-09T15:00:02Z

the possessive 's and the contraction 'm are correct

when i was going through the data myself, i'd occasionally fix them when i came across such errors

cd processed_annotated
grep -P '^s\tO$' * | less        # \t matches the tab character between s and O (-P needs GNU grep)

af_afrol_16.txt.tsv:s   O
af_afrol_18.txt.tsv:s   O
af_allaf_15.txt.tsv:s   O
af_allaf_24.txt.tsv:s   O
af_allaf_24.txt.tsv:s   O
af_ips_10.txt.tsv:s     O
af_ips_10.txt.tsv:s     O
af_ips_10.txt.tsv:s     O
etc etc

AngledLuffa · 2024-08-09T15:27:13Z

i'm fairly certain most of those can be cleaned up via a script...

just look for s on a line by itself, especially after a ' or a curly apostrophe, check that the labels are the same, combine the rows

again, i can take that on ... maybe i should just go ahead and do that
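
A minimal sketch of the cleanup described above, not the script that was actually run on the data. It assumes each sentence is already loaded as a list of (token, label) pairs. The function name `merge_split_clitics` and the `SUFFIXES` set are hypothetical; the thread only names s, m, t, and d. Rows are combined only when both labels are identical, so anything where the labels differ (for example a B-/I- pair) is left alone for manual review.

```python
# Straight apostrophe and the curly apostrophe (U+2019).
APOSTROPHES = {"'", "\u2019"}
# Assumed suffix list; the thread mentions s, m, t, d and "etc".
SUFFIXES = {"s", "m", "t", "d", "re", "ve", "ll"}


def merge_split_clitics(rows):
    """Combine an apostrophe row with the suffix row that follows it.

    rows is a list of (token, label) pairs for one sentence.
    Returns a new list; the input is not modified.
    """
    merged = []
    i = 0
    while i < len(rows):
        token, label = rows[i]
        if token in APOSTROPHES and i + 1 < len(rows):
            next_token, next_label = rows[i + 1]
            # Only combine when the suffix is a known one and the labels agree.
            if next_token.lower() in SUFFIXES and next_label == label:
                merged.append((token + next_token, label))
                i += 2
                continue
        merged.append((token, label))
        i += 1
    return merged


sentence = [("it", "O"), ("'", "O"), ("s", "O"), ("remarkable", "O")]
print(merge_split_clitics(sentence))
```

Two consecutive apostrophes (a closing quote split over two rows) are not touched, since the second apostrophe is not in the suffix set.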

SecroLoL · 2024-08-15T05:10:21Z

If you could, that would be great. If you have time, of course.

AngledLuffa · 2024-08-16T19:41:06Z

I'm about half done with checking incorrect ', but am uncovering a whole bunch of other random tokenization errors in the process.

’ (the fancy apostrophe) on a line by itself, followed by s, d, t, etc

Ms .

Jr .

' ' and backticks or curly apostrophes

. . . instead of as a single token

46-41 or other scores / votes

U . S .

and in one file, cuba_diariodecuba_5.txt.tsv, César got cut off many times. I suspect there will be other words like that which need to be cleaned up
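
A rough sketch of how some of the patterns listed above could be flagged for review, assuming a sentence is available as a plain list of token strings. The function name `find_suspicious` and the `TITLES` set are hypothetical (the thread only names Ms and Jr), and this only reports positions, it does not rewrite anything. Scores like 46-41 and truncated words such as César are left out, because the thread does not say how those end up split.

```python
# Assumed title list; only "Ms" and "Jr" are named in the thread.
TITLES = {"Ms", "Mr", "Mrs", "Dr", "Jr", "Sr"}
QUOTE_CHARS = {"'", "`", "\u2019"}


def find_suspicious(tokens):
    """Return (index, kind) pairs for token sequences worth a manual look."""
    hits = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1:i + 4]  # up to three tokens of lookahead
        if tok in TITLES and nxt[:1] == ["."]:
            hits.append((i, "title"))
        elif tok == "." and nxt[:2] == [".", "."] and (i == 0 or tokens[i - 1] != "."):
            # Report only the first period of a run of three or more.
            hits.append((i, "ellipsis"))
        elif tok == "U" and nxt == [".", "S", "."]:
            hits.append((i, "U.S."))
        elif tok in QUOTE_CHARS and nxt[:1] == [tok]:
            hits.append((i, "split-quote"))
    return hits


tokens = ["Ms", ".", "Smith", "said", ".", ".", ".", "U", ".", "S", ".", "'", "'"]
print(find_suspicious(tokens))
```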

AngledLuffa · 2024-08-17T03:42:12Z

alright, i have taken on the ’ tokenizations and the ' tokenizations

the others are still TODO

AngledLuffa · 2024-08-17T08:33:07Z

US, titles, and ellipses are now cleared up. Would still like to look for decade+s

AngledLuffa · 2024-08-17T08:40:06Z

did the decades as well

maybe still need to look for ' ' on two separate lines
