TICCL-LDcalc output of frequency draw word pairs #42

Open
martinreynaert opened this issue May 27, 2020 · 9 comments
@martinreynaert
Collaborator

In TICCL-LDcalc it may happen that the frequencies of words in a retrieved pair are the same.

In the case of such a draw, it is actually more likely (for diverse reasons) that the word form having the larger anagram value is the 'variant' and the one having the lower one the 'correction candidate'. Please output these accordingly.

Thank you!
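
The requested tie-break can be sketched as follows. This is only an illustration in Python, not the TICCL-LDcalc code: the `Word` tuple and the function name `order_pair` are hypothetical, and the non-draw branch assumes that the less frequent form is treated as the variant. The example values are the anagram values for 'walvisbenen' and 'walvisbeenen' quoted later in this thread.

```python
from typing import NamedTuple, Tuple


class Word(NamedTuple):
    form: str
    freq: int
    anagram_value: int


def order_pair(a: Word, b: Word) -> Tuple[Word, Word]:
    """Return (variant, correction_candidate) for a retrieved word pair."""
    if a.freq != b.freq:
        # Assumption: outside a draw, the less frequent form is the variant.
        return (a, b) if a.freq < b.freq else (b, a)
    # Frequency draw: the form with the larger anagram value is the variant.
    return (a, b) if a.anagram_value > b.anagram_value else (b, a)


variant, cc = order_pair(
    Word("walvisbenen", 1, 187591370533),
    Word("walvisbeenen", 1, 199184111276),
)
print(variant.form, "->", cc.form)
```

Under these assumptions 'walvisbeenen' comes out as the variant and 'walvisbenen' as the correction candidate.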

@kosloot kosloot self-assigned this May 27, 2020
@kosloot
Collaborator

kosloot commented May 27, 2020

I checked the code and it is not as easy as I thought.
The anagram value is at the moment only available for one of the 2 words, so it would have to be added for the other word too. That makes me wonder whether there is a mistake in the current code, as we do swap the variant and the CC when the frequency is smaller. Shouldn't we swap the hashes too, then?
(The anagram hash value for the variant is stored in the LDcalc output file as field 7.
After swapping it is in fact the hash for the CC.)

@martinreynaert
Collaborator Author

martinreynaert commented May 28, 2020

Hi Ko,

The value in field 7 is in fact the numerical difference between the Anagram Values of the pair. It is a value from the character confusion list that TICCL-lexstat produces on the basis of the alphabet, and it stands for a difference of (usually) at most two characters. TICCL-indexer(NT) attaches to each of these character confusion values the lower value of any word pair (in fact: set of word anagrams) identified.

So TICCL-LDcalc reads in these character confusion values (column 1 in the TICCL-indexer output). For each of them it takes the attached values (which are the lower ones) and retrieves from the anahash the set of word anagrams, i.e. the word(s), associated with each value. It pairs these with the other set of word(s), also retrieved from the anahash, which it finds by adding the character confusion value to the associated (lower) word anagram value. The result of this addition is the higher AV.

At least at the start of LDcalc, you therefore have both the lower and higher values at hand.
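
The lookup described above could be sketched roughly like this. It is an illustration only, not the actual LDcalc code: the in-memory shapes (`index` as a mapping from character confusion value to the attached lower anagram values, `anahash` as a mapping from anagram value to word forms) and the function name `candidate_pairs` are hypothetical stand-ins for the real files. The numbers are those of the 'wereltkaart' / 'wereltskaert' pair quoted later in this thread.

```python
from typing import Dict, Iterator, List, Set, Tuple


def candidate_pairs(
    index: Dict[int, List[int]],
    anahash: Dict[int, Set[str]],
) -> Iterator[Tuple[str, int, str, int]]:
    """Yield (lower_word, lower_av, higher_word, higher_av) tuples."""
    for confusion_value, lower_avs in index.items():
        for lower_av in lower_avs:
            # The higher anagram value is the sum of the two known values.
            higher_av = confusion_value + lower_av
            lower_words = anahash.get(lower_av, set())
            higher_words = anahash.get(higher_av, set())
            for lower_word in sorted(lower_words):
                for higher_word in sorted(higher_words):
                    yield (lower_word, lower_av, higher_word, higher_av)


index = {12953462985: [168947636901]}
anahash = {
    168947636901: {"wereltkaart"},
    181901099886: {"wereltskaert"},
}
for pair in candidate_pairs(index, anahash):
    print(pair)
```

The point of the sketch is that both anagram values are known at pairing time, so they could be carried along to the stage where the frequencies are compared.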

LDcalc next proceeds to look at the associated words' frequencies etc.

Hope this helps!

Martin

kosloot pushed a commit that referenced this issue May 30, 2020
@kosloot
Collaborator

kosloot commented May 30, 2020

I tried to implement this and installed the fix on maize and violet.
I see no differences in the results though, so either I made a mistake or my test set is inadequate.
@martinreynaert would you please check it? And if it doesn't work, please provide me with an example of an entry that should be 'reversed'.

@martinreynaert
Collaborator Author

martinreynaert commented Jun 1, 2020

Hi Ko,

Thank you!

I see no difference on maize between the LDcalc-output of the previous version and the current one either:

reynaert@maize:/reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT$ diff GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.ldcalc GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.LDCALC.ldcalc
reynaert@maize:/reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT$

And I re-ran the same on violet: no difference with the new maize output there either:

reynaert@violet:~$ diff /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.ldcalc /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.VIOLET.ldcalc
reynaert@violet:~$

So something did not work as expected.

To try and help solve this, I extracted the hapaxes from the corpus frequency list and ran TICCL-LDcalc only on those. That artificially produces nothing but draws.
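
How the hapaxes were extracted is not shown here; a minimal Python sketch of one way to do it follows. It assumes a frequency list with one `word<TAB>frequency` entry per line, which is an assumption about the file layout, and the function name `hapax_lines` is hypothetical.

```python
from typing import Iterable, Iterator


def hapax_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield the entries whose frequency is exactly 1.

    Assumes 'word<TAB>frequency' per line (hypothetical layout).
    """
    for line in lines:
        entry = line.rstrip("\n")
        word, sep, freq = entry.rpartition("\t")
        if not sep:
            continue  # no tab found: not a word/frequency entry
        if freq.strip() == "1":
            yield entry


sample = ["zeestucxkien\t1\n", "zeestucken\t6\n", "walvisbenen\t1\n"]
for entry in hapax_lines(sample):
    print(entry)
```

Since every remaining word then has frequency 1, every pair LDcalc retrieves from such a list is a draw.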

This here is the command line:

reynaert@violet:~$ nohup /exp/sloot/usr/local/bin/TICCL-LDcalc --threads 124 --LD 2 --low=5 --high=35 --index /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.INDEX.index --hash=/reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.ANAHASH.anahash --clean /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.UNK.UnigramHapaxesOnly.clean --alph=/reddata/POLMASH/TRI/ALPH/nld.aspell.dict.clip20.lc.chars --artifrq 98765432 -o /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.VIOLET.UnigramHapaxesOnly > /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.20200531.VIOLET.UnigramHapaxesOnly.stdout 2> /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.20200531.VIOLET.UnigramHapaxesOnly.stderr &

We select 4 examples from the tail of the output:

reynaert@violet:~$ grep 'vryffsteen' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.VIOLET.UnigramHapaxesOnly.ldcalc
vryfsteen~1~1~vryffsteen~1~1~28153056843~1~9~0~1~1~0~0

reynaert@violet:~$ grep 'walvisbeenen' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.VIOLET.UnigramHapaxesOnly.ldcalc
walvisbenen~1~1~walvisbeenen~1~1~11592740743~1~11~0~1~1~0~0

reynaert@violet:~$ grep 'wereltskaert' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.VIOLET.UnigramHapaxesOnly.ldcalc
wereltkaart~1~1~wereltskaert~1~1~12953462985~2~10~0~1~1~0~0

reynaert@violet:~$ grep '^zeestucxke' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.VIOLET.UnigramHapaxesOnly.ldcalc
zeestucxken~1~1~zeestucxkien~1~1~14693280768~1~11~0~1~1~0~0

Their AVs:

reynaert@violet:~$ grep 'vryfsteen' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.ANAHASH.anahash
162731772280~vryfsteen
reynaert@violet:~$ grep 'vryffsteen' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.ANAHASH.anahash
190884829123~vryffsteen

reynaert@violet:~$ grep 'walvisbenen' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.ANAHASH.anahash
187591370533~walvisbenen
reynaert@violet:~$ grep 'walvisbeenen' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.ANAHASH.anahash
199184111276~walvisbeenen

reynaert@violet:~$ grep 'wereltkaart' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.ANAHASH.anahash
168947636901~wereltkaart
reynaert@violet:~$ grep 'wereltskaert' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.ANAHASH.anahash
181901099886~wereltskaert

reynaert@violet:~$ grep 'zeestucxken' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.ANAHASH.anahash
204409956510~Zeestucxken#zeestucxken
reynaert@violet:~$ grep 'zeestucxkien' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.ANAHASH.anahash
219103237278~zeestucxkien
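
As a cross-check, the lines above can be used to confirm that field 7 of each LDcalc line equals the difference between the two anagram values. The Python below is only a sketch: the helper names `load_anahash` and `check_field7` are hypothetical, and reading `#` as a separator between word forms that share one anagram value is inferred from the single 'Zeestucxken#zeestucxken' line, not from documentation.

```python
from typing import Dict, Iterable, List, Tuple

LDCALC = [
    "vryfsteen~1~1~vryffsteen~1~1~28153056843~1~9~0~1~1~0~0",
    "walvisbenen~1~1~walvisbeenen~1~1~11592740743~1~11~0~1~1~0~0",
    "wereltkaart~1~1~wereltskaert~1~1~12953462985~2~10~0~1~1~0~0",
    "zeestucxken~1~1~zeestucxkien~1~1~14693280768~1~11~0~1~1~0~0",
]

ANAHASH = [
    "162731772280~vryfsteen",
    "190884829123~vryffsteen",
    "187591370533~walvisbenen",
    "199184111276~walvisbeenen",
    "168947636901~wereltkaart",
    "181901099886~wereltskaert",
    "204409956510~Zeestucxken#zeestucxken",
    "219103237278~zeestucxkien",
]


def load_anahash(lines: Iterable[str]) -> Dict[str, int]:
    """Map each word form to its anagram value."""
    av_of: Dict[str, int] = {}
    for line in lines:
        av, forms = line.split("~", 1)
        for form in forms.split("#"):  # assumed separator for shared AVs
            av_of[form] = int(av)
    return av_of


def check_field7(
    ldcalc_lines: Iterable[str], av_of: Dict[str, int]
) -> List[Tuple[str, str, bool]]:
    """For each line, report whether field 7 equals the AV difference."""
    results = []
    for line in ldcalc_lines:
        fields = line.split("~")
        first, second, field7 = fields[0], fields[3], int(fields[6])
        diff = abs(av_of[first] - av_of[second])
        results.append((first, second, diff == field7))
    return results


av_of = load_anahash(ANAHASH)
for first, second, ok in check_field7(LDCALC, av_of):
    print(first, second, ok)
```

For all four pairs the check holds, and in each of these lines the first word is the one with the lower anagram value.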

The word forms nearer to the modern canonical form (if there is or would be such) are consistently the lower AV forms.

I would very much like to see the output reversed!

Thanks!

Martin

@kosloot
Collaborator

kosloot commented Jun 1, 2020

So I made a small change, ON MAIZE ONLY!
Now the results ARE reversed.
But I wonder whether this works out well in general for all word pairs.
Maybe we need to look a bit deeper into the code. But at least you can test this.

Happy testing

@martinreynaert
Collaborator Author

Many thanks, Ko!

Sure I will test this!!!

Starting it up right now ;0)

M.

@martinreynaert
Collaborator Author

martinreynaert commented Jun 1, 2020

Yes! They're all reversed now :0)

reynaert@maize:/reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT$ grep 'walvisbeenen' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.2.ldcalc
walvisbeenen~1~1~walrusbeenen~98765432~98765432~6864473168~2~10~1~1~1~0~0
walvisbeenen~1~1~walvis-beenen~98765432~98765432~35723051649~1~12~1~1~1~0~0
walvisbeenen~1~1~walvis_beene~98765433~98765433~1125720992~2~10~1~1~0~0~0
walvisbeenen~1~1~walvis_beenen~98765433~98765433~11040808032~1~12~1~1~1~0~1
walvisbeenen~1~1~walvisbeen~100000004~100000004~23759269767~2~10~1~1~1~0~2
walvisbeenen~1~1~walvisbeene~2~2~12166529024~1~11~0~1~0~0~0
walvisbeenen~1~1~walvisbeerden~98765432~98765432~18219703433~2~11~1~1~1~0~0
walvisbeenen~1~1~walvisbenen~1~1~11592740743~1~11~0~1~1~0~4
walvisbeenen~1~1~walvisbiene~2~2~9065988999~2~10~0~1~0~0~0
walvisbeenen~1~1~walvischbeenen~98765432~98765432~47760777568~2~12~1~1~1~0~0
reynaert@maize:/reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT$ grep 'vryffsteen' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.2.ldcalc
vryff_steen~1~1~vryffsteen~1~1~11040808032~1~10~0~1~1~0~0
vryffsteen~1~1~graffsteen~98765432~98765432~25245524877~2~8~1~0~1~0~0
vryffsteen~1~1~vrijfsteen~98765438~98765438~18190663819~2~8~1~1~1~0~0
vryffsteen~1~1~vryfsteen~1~1~28153056843~1~9~0~1~1~0~1
vryffsteen~1~1~wryffsteen~2~2~3378826023~1~9~0~0~1~0~1
vryffsteen~1~1~wryfsteen~98765432~98765432~24774230820~2~8~1~0~1~0~0
reynaert@maize:/reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT$ grep 'wereltskaert' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.2.ldcalc
wereltskaert~1~1~Werelt_kaert~98765433~98765456~4345431517~1~11~1~1~1~0~0
wereltskaert~1~1~wereldt_kaert~98765434~98765434~13277985315~2~11~1~1~1~0~1
wereltskaert~1~1~werelt_caert~98765439~98765439~1283622659~2~10~1~1~1~0~1
wereltskaert~1~1~werelt_kaert~98765455~98765456~4345431517~1~11~1~1~1~0~2
wereltskaert~1~1~wereltcaert~2~2~9757185373~2~10~0~1~1~0~0
wereltskaert~1~1~wereltkaart~1~1~12953462985~2~10~0~1~1~0~0
wereltskaert~1~1~werelts-kaerte~98765432~98765432~47315792392~2~12~1~1~0~0~0
wereltskaert~1~1~werelts_caert~98765434~98765434~16669862208~2~11~1~1~1~0~1
wereltskaert~1~1~werelts_kaart~98765433~98765433~13473584596~2~11~1~1~1~0~0
reynaert@maize:/reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT$ grep 'zeestucxkien' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.2.ldcalc
zeestucxkien~1~1~Zeestucxken~2~3~14693280768~1~11~0~1~1~0~0
zeestucxkien~1~1~seestuckien~1~1~48169707983~2~10~0~0~1~0~0
zeestucxkien~1~1~zee_stuckien~2~2~21997561375~2~10~0~1~1~0~0
zeestucxkien~1~1~zee_stucxken~3~3~3652472736~2~10~0~1~1~0~0
zeestucxkien~1~1~zeestucken~6~6~47731650175~2~10~0~1~1~0~0
zeestucxkien~1~1~zeestuckgen~2~2~29307298382~2~10~0~1~1~0~0
zeestucxkien~1~1~zeestuckie~4~4~45204898431~2~10~0~1~0~0~0
zeestucxkien~1~1~zeestuckien~4~4~33038369407~1~11~0~1~1~0~2
zeestucxkien~1~1~zeestuckiens~2~2~17652129858~2~10~0~1~0~0~0
zeestucxkien~1~1~zeestuckies~4~4~29818658882~2~10~0~1~0~0~0
zeestucxkien~1~1~zeestuckijen~3~3~6011287775~2~10~0~1~1~0~0
zeestucxkien~1~1~zeestucxken~1~3~14693280768~1~11~0~1~1~0~2
zeestucxkien~1~1~zeestucxkens~2~2~692958781~2~10~0~1~0~0~0

Btw, these are all words from Dutch 'Golden Age' notarial descriptions of house inventories about paintings. A 'zeestucxkien' would have been a small painting depicting a sea scene.

Will now run the full thing ;0)

@kosloot
Collaborator

kosloot commented Jun 2, 2020

Nice, this looks good.
My doubts arose after running some tests on a small dataset, where I see reversals like:

verllooren~1~1~verftooren~1~1~7579697257~2~8~0~1~1~0~0

to

verftooren~1~1~verllooren~1~1~7579697257~2~8~0~1~1~0~0

and:

ANSCHE~1~1~ganfche~1~1~29957217696~2~5~0~0~1~0~0

to

ganfche~1~1~ANSCHE~1~1~29957217696~2~5~0~0~1~0~0

Which doesn't look like progress to me, and the complete removal of:

C^sars~1~1~Cssars~1~1~4183180267~1~5~0~1~1~0~0
ai.der~1~1~aonder~1~1~2953462985~2~4~0~1~1~0~0

These are removed because the left-hand sides are 'out of the lexicon' and are deleted after reversal.
Which may be a good thing after all.

On a side note: shouldn't the --alph option be made mandatory for LDcalc?
It isn't at the moment, which allows LDcalc to create corrections to non-alphabet words.

@kosloot
Collaborator

kosloot commented Jun 3, 2020

Reminder for @martinreynaert:
On a side note: shouldn't the --alph option be made mandatory for LDcalc?
It isn't at the moment, which allows LDcalc to create corrections to non-alphabet words.
