option to save as a dictionary instead of a list #233

Open
garfieldnate opened this issue Dec 27, 2020 · 7 comments

@garfieldnate

I have found that I always need to convert the data into a dictionary (instead of the default list) when I'm using it. Because of this, I decided to always store the file in dictionary format. My method for doing so is a bit hacky, and it would be great to have a --structure <dict|list> or even --dictionary parameter to do this within unihan_etl.

Here's my current code. It relies on the undocumented "python" format option, which makes export() return the data instead of writing a file:

from unihan_etl.process import Packager as unihan_packager
from unihan_etl.process import export_json

def unihan_download(unihan_file):
    # destination argument is required even though the packager will not write the file
    p = unihan_packager.from_cli(["-F", "json", "--destination", unihan_file])
    p.download()
    # instruct packager to return data instead of writing to file
    p.options["format"] = "python"
    unihan = p.export()

    # convert from list to dictionary
    unihan_dict = {entry["char"]: entry for entry in unihan}

    export_json(unihan_dict, unihan_file)
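The list-to-dictionary step above can be tried without downloading anything. This is a minimal sketch using only the standard library: the sample records and the `index_by_char` helper are made up for illustration and are not part of unihan_etl. Unlike the one-line comprehension, it raises on a repeated "char" key instead of silently keeping the last entry.

```python
import json


def index_by_char(records):
    """Turn a list of per-character records into a dict keyed by "char"."""
    indexed = {}
    for entry in records:
        key = entry["char"]
        if key in indexed:
            # A plain dict comprehension would silently keep the last entry.
            raise ValueError(f"duplicate entry for {key!r}")
        indexed[key] = entry
    return indexed


# Made-up sample records; real unihan_etl output carries many more fields.
sample = [
    {"char": "一", "kDefinition": "one"},
    {"char": "二", "kDefinition": "two"},
]

by_char = index_by_char(sample)

# ensure_ascii=False keeps the characters readable in the JSON text.
serialized = json.dumps(by_char, ensure_ascii=False, indent=2)
print(serialized)
```

With the dictionary form, a lookup is `by_char["二"]` rather than a scan over the whole list.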
garfieldnate added a commit to garfieldnate/uniunihan-db that referenced this issue Dec 27, 2020
This is a much more useful structure, and also unifies the file
structure with the augmentation file. I've opened a ticket with
unihan_etl asking to add dictionary structuring as an option:
cihai/unihan-etl#233.
@tony
Member

tony commented Jul 18, 2022

@garfieldnate I missed this message! Sorry about that!

Is there anything I can do at this time? Looks like you have stuff going on here https://github.com/garfieldnate/uniunihan-db

@garfieldnate
Author

Thanks for noticing :D I obviously have a workaround already, but I do still think that a --dictionary option would make unihan-etl more useful. No worries if you can't get to it, as my workaround is fine for me. Thanks for the great library!

@tony
Member

tony commented Jul 24, 2022

@garfieldnate We can add it, and also make it available via Python API
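For discussion, here is one way the requested option could look, both as a command-line flag and as a plain Python function. This is purely a sketch: unihan-etl does not ship a `--structure` flag, and `build_parser` and `apply_structure` are hypothetical names, not existing unihan_etl APIs.

```python
import argparse


def build_parser():
    # Hypothetical parser: unihan-etl itself has no --structure flag.
    parser = argparse.ArgumentParser(prog="structure-sketch")
    parser.add_argument(
        "--structure",
        choices=["list", "dict"],
        default="list",
        help="shape of the exported data",
    )
    return parser


def apply_structure(records, structure="list"):
    """Return the records as a list, or keyed by their "char" field."""
    if structure == "dict":
        return {entry["char"]: entry for entry in records}
    if structure == "list":
        return list(records)
    raise ValueError(f"unknown structure: {structure!r}")


# Command-line style use, with a made-up single record.
args = build_parser().parse_args(["--structure", "dict"])
records = [{"char": "亀"}]
result = apply_structure(records, args.structure)
print(result)
```

Keeping the reshaping in a standalone function means the same code path could serve the flag and direct Python callers.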

@garfieldnate
Author

In the most recent unihan_etl the code I pasted above fails with this error. Not sure if my usage of the API is wrong or if there's an issue in the library.

.venv/lib/python3.11/site-packages/unihan_etl/process.py:531: in export
    data = expand_delimiters(data)
.venv/lib/python3.11/site-packages/unihan_etl/process.py:406: in expand_delimiters
    char[field] = expansion.expand_field(field, char[field])
.venv/lib/python3.11/site-packages/unihan_etl/expansion.py:416: in expand_field
    return expansion_func(fvalue)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

value = [{'radical': 5, 'simplified': False, 'strokes': 10}, "213''.0"]

    def _expand_kRSGeneric(value):
        pattern = re.compile(
            r"""
            (?P<radical>[1-9][0-9]{0,2})
            (?P<simplified>\'?)\.
            (?P<strokes>-?[0-9]{1,2})
        """,
            re.X,
        )
    
        for i, v in enumerate(value):
>           m = pattern.match(v).groupdict()
E           AttributeError: 'NoneType' object has no attribute 'groupdict'

.venv/lib/python3.11/site-packages/unihan_etl/expansion.py:332: AttributeError

@tony
Member

tony commented Feb 26, 2024

@garfieldnate Thank you!

Does wiping cache and the DB file and rerunning change anything?

@garfieldnate
Author

That was a really fast response :D

This is actually my bad; the latest unihan_etl already has a fix for this in place, and I mistakenly thought I had updated.

The issue is a typo in the kRSUnicode field for 亀: https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E4%BA%80. It has two apostrophes, which does not follow the syntax specified in the standard. unihan_etl has already updated its parsing to allow the second apostrophe.

I did have to update my code for some unihan_etl changes, but nothing crazy.
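To illustrate the failure and the kind of change described above, the sketch below takes the pattern shown in the traceback and lets the apostrophe group match up to two apostrophes. It is not the actual unihan_etl fix, whose exact pattern is not shown in this thread. `parse_radical_stroke` is a hypothetical helper, and treating any apostrophe as "simplified" is an assumption made here for illustration.

```python
import re

# Same shape as the pattern in the traceback, except that the apostrophe
# group accepts up to two apostrophes instead of at most one.
RS_PATTERN = re.compile(
    r"""
    (?P<radical>[1-9][0-9]{0,2})
    (?P<simplified>'{0,2})\.
    (?P<strokes>-?[0-9]{1,2})
    """,
    re.X,
)


def parse_radical_stroke(value):
    """Parse one radical-stroke value such as "5.10" or "213''.0"."""
    m = RS_PATTERN.fullmatch(value)
    if m is None:
        # Fail with a clear message instead of an AttributeError on None.
        raise ValueError(f"unrecognized radical-stroke value: {value!r}")
    return {
        "radical": int(m["radical"]),
        # Assumption: any apostrophe marks the radical as simplified.
        "simplified": m["simplified"] != "",
        "strokes": int(m["strokes"]),
    }


# The two values visible in the traceback above.
for raw in ["5.10", "213''.0"]:
    print(raw, parse_radical_stroke(raw))
```

Checking the match for `None` before reading its groups turns an unexpected value into an error that names the offending string.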

tony mentioned this issue Feb 26, 2024
@tony
Member

tony commented Feb 26, 2024

@garfieldnate Thank you for the added information. I created an issue so that anyone who bumps into this error knows that updating fixes it!
