You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I'm trying to store a large number of 4-grams in memory to speed up computation as part of a larger program I'm working on, and I think I found a reasonable way to do it using your DAWG library, but I ran into an issue because you currently don't allow byte string keys (I'm not sure if this is a limitation of your wrapper or the underlying library).
The idea I had was to have one DAWG that maps from words to unique numbers, and then store the 4-gram counts in another DAWG by converting the 4-grams to struct-packed byte strings of the format 'LLLL' where each packed item in the byte string is one of the unique numbers from the word DAWG. However, I can't currently do this because the keys are required to be unicode.
Is this something that would be feasible to add in the future?
The text was updated successfully, but these errors were encountered:
Underlying dawgdic C++ library doesn't allow zero bytes in keys (because it uses null-terminated strings) - that's why the wrapper disallows arbitrary bytestring keys. It stores utf8-encoded unicode and utf8 is guaranteed not to have zero bytes. Struct-packed 'LLLL' numbers will have zero bytes for sure.
But the wrapper supports binary values (in BytesDAWG and RecordDAWG). This is implemented using a hack: binary data is encoded/decoded to/from base64.
I don't know the best solution for N-Gram storage.
You may try https://github.com/kmike/marisa-trie - its base storage is not as efficient and fast as DAWG but it supports perfect hashing (each key is assigned an unique number "for free" - storing unique numbers in DAWG may be costly) and it also supports binary keys (including keys containing zero bytes).
Another option is to just store N-grams in DAWG unpacked (e.g. whitespace-separated); I'd not be surprised if this will be quite efficient.
I'm trying to store a large number of 4-grams in memory to speed up computation as part of a larger program I'm working on, and I think I found a reasonable way to do it using your DAWG library, but I ran into an issue because you currently don't allow byte string keys (I'm not sure if this is a limitation of your wrapper or the underlying library).
The idea I had was to have one DAWG that maps from words to unique numbers, and then store the 4-gram counts in another DAWG by converting the 4-grams to struct-packed byte strings of the format 'LLLL' where each packed item in the byte string is one of the unique numbers from the word DAWG. However, I can't currently do this because the keys are required to be unicode.
Is this something that would be feasible to add in the future?
The text was updated successfully, but these errors were encountered: