import takes ~30 seconds #280
I have no idea at the moment. If you have the time, could you check two things? Does it also happen with just the single import line `import emoji`? And maybe test some older versions of the module, to see if a recent change introduced this problem, for example a few earlier releases.
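A minimal sketch of such a timing check, assuming the emoji package (for example an older pinned release) is installed:

```python
# Sketch: time just the bare import, nothing else.
import time

start = time.perf_counter()
import emoji  # noqa: E402  (package under test, assumed installed)
print(f"import emoji took {time.perf_counter() - start:.2f} s")
```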
If you are interested in trying it, I could create a test version for you that only has the English language (or whatever languages you need), as suggested in the other issue. That would presumably reduce memory usage overall and reduce start-up time.
Yes. Regarding emoji versions: 2.9.0 => 34 sec

But I just found out something really strange... Like I already mentioned in the beginning, this only happens on Python 3.12. With Python 3.11 this does not happen: 2.9.0 => ~0.06 sec

I still think that it has something to do with VS Code. Just to complete the list, with Python 3.11 via VS Code: 2.9.0 => 0.03 sec

From the command line:
Wow, thanks for the details! I will look into it.
I can reproduce it with Python 3.12 on Windows 10 when VS Code runs such a file in the debugger. I have created an issue at debugpy, maybe they know why this happens: microsoft/debugpy#1496
The problem seems to be the large number of lines in the data file. The dictionary in the file can be compressed into a single line, resulting in a file with just 46 lines. With the compressed file, debugging runs as fast as in Python 3.11.
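A rough sketch of how such a one-line file could be generated, assuming emoji.EMOJI_DATA is importable; this is an illustration, not the actual script behind the change:

```python
# Sketch: write the big dictionary as a single repr() line so the
# generated module only has a handful of lines.
import emoji  # assumes the emoji package is installed

with open("data_dict_oneline.py", "w", encoding="utf-8") as f:
    f.write("EMOJI_DATA = ")
    f.write(repr(emoji.EMOJI_DATA))  # note: repr() inlines plain values
    f.write("\n")
```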
@cvzi maybe it's time to think about separating the languages into different files, like it was before. What do you think?
oh wow! Thank you very much for looking into it! :)
I agree. Not sure it will help enough regarding this problem though, because the dictionary would still be huge. It takes 4 minutes on my computer at the moment. Even if it cuts the time to 10%, it would still take about 25 seconds, far too long.

Putting the dictionary into a single line is obviously really ugly, but it would be a quick fix. I guess using a different file format, not Python code, could solve this problem with the debugging, for example storing the dictionary in a JSON file and then loading the JSON file when the module is imported.
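A minimal sketch of that JSON approach, assuming emoji.EMOJI_DATA is available; an illustration, not the library's actual code:

```python
import json
import emoji

# One-time conversion: dump the data to a JSON file.
with open("emoji_data.json", "w", encoding="utf-8") as f:
    json.dump(emoji.EMOJI_DATA, f, ensure_ascii=False)

# At import time the module would then read the JSON file instead of
# executing a huge Python literal:
with open("emoji_data.json", encoding="utf-8") as f:
    EMOJI_DATA = json.load(f)
```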
I guess right now there is no work-around when using Python 3.12?
... to prevent debugging overhead in Python 3.12 carpedm20#280
I guess so. I am not really happy with putting the dict into a single line, but there seems to be no other quick workaround. And VS Code is one of the most used editors at the moment, and there already seem to be about 2,000 downloads per day of this library from Python 3.12 (according to PyPI stats).

I have deployed that commit on my own apps, and it seems to work, i.e. in a release environment, not debugging. @lsmith77 any chance you could test whether it actually solves the problem with VS Code for you? It does solve it on my computer. You can install from my branch.
BTW for reference, this is the format with each entry on a single line:

```python
EMOJI_DATA = {
    '\U0001F947': {'en': ':1st_place_medal:','status': fully_qualified,'E': 3,'de': ':goldmedaille:','es': ':medalla_de_oro:','fr': ':médaille_d’or:','ja': ':金メダル:','ko': ':금메달:','pt': ':medalha_de_ouro:','it': ':medaglia_d’oro:','fa': ':مدال_طلا:','id': ':medali_emas:','zh': ':金牌:','ru': ':золотая_медаль:','tr': ':birincilik_madalyası:','ar': ':ميدالية_مركز_أول:'},
    '\U0001F948': {'en': ':2nd_place_medal:','status': fully_qualified,'E': 3,'de': ':silbermedaille:','es': ':medalla_de_plata:','fr': ':médaille_d’argent:','ja': ':銀メダル:','ko': ':은메달:','pt': ':medalha_de_prata:','it': ':medaglia_d’argento:','fa': ':مدال_نقره:','id': ':medali_perak:','zh': ':银牌:','ru': ':серебряная_медаль:','tr': ':ikincilik_madalyası:','ar': ':ميدالية_مركز_ثان:'},
    '\U0001F949': {'en': ':3rd_place_medal:','status': fully_qualified,'E': 3,'de': ':bronzemedaille:','es': ':medalla_de_bronce:','fr': ':médaille_de_bronze:','ja': ':銅メダル:','ko': ':동메달:','pt': ':medalha_de_bronze:','it': ':medaglia_di_bronzo:','fa': ':مدال_برنز:','id': ':medali_perunggu:','zh': ':铜牌:','ru': ':бронзовая_медаль:','tr': ':üçüncülük_madalyası:','ar': ':ميدالية_مركز_ثالث:'},
    # ...
}
```

That reduces the import time (as expected), but it still takes too long, about 15 seconds on my computer.
Sorry, I didn't get to it today. Will try to do it tomorrow morning.
@lsmith77 Thanks for checking!
Thank you for this package and for caring about reports such as this one!
I did some performance tests to check the feasibility of JSON compared to the Python dictionary literal. The command to test this uses 10 repeats (this should be much higher for good average results).
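One possible way to run such a test, spawning a fresh interpreter for every repeat; a sketch, not necessarily the exact command used for the numbers above:

```python
# Sketch: average the wall-clock time of "import emoji" over several
# fresh interpreter runs.
import subprocess
import sys
import time

REPEATS = 10  # should be much higher for good average results

start = time.perf_counter()
for _ in range(REPEATS):
    subprocess.run([sys.executable, "-c", "import emoji"], check=True)
print(f"average: {(time.perf_counter() - start) / REPEATS:.3f} s per import")
```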
I am going to continue in this thread with this JSON idea, please unsubscribe if you're not interested. Any feedback or suggestions are appreciated though :)

I am thinking about making a main JSON file that has the metadata and English/aliases, and a file for each language. Main file:

```json
{
    "🗺️": {
        "E": 0.7,
        "en": ":world_map:",
        "status": 2,
        "variant": true
    },
    "🗻": {
        "E": 0.6,
        "en": ":mount_fuji:",
        "status": 2
    },
    "🗼": {
        "E": 0.6,
        "alias": [
            ":tokyo_tower:"
        ],
        "en": ":Tokyo_tower:",
        "status": 2
    },
    ...
}
```

A language file would look similar, e.g. one for Spanish.
The main file would be loaded when importing the module. A language file would only be loaded when that language is actually used, as sketched after this comment. It does mean that the first use of a language loads an additional file at that point. This reduces memory usage by roughly half if only one language is used.

Advantages:
Disadvantages:
So this would be a breaking change, but I don't think this would affect many people. I searched on GitHub and I couldn't find a public repository that directly accesses the other languages in EMOJI_DATA like that.
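A minimal sketch of the proposed on-demand loading, with hypothetical file names (emoji_main.json, emoji_es.json and so on), not the library's actual layout:

```python
import json

EMOJI_DATA = {}           # filled from the main file at import time
_LOADED_LANGUAGES = set()

def _load_main():
    # Main file: metadata plus English names and aliases (hypothetical name).
    with open("emoji_main.json", encoding="utf-8") as f:
        EMOJI_DATA.update(json.load(f))

def _load_language(lang):
    # Load a per-language file on first use and merge it into EMOJI_DATA.
    if lang == "en" or lang in _LOADED_LANGUAGES:
        return
    with open(f"emoji_{lang}.json", encoding="utf-8") as f:  # hypothetical name
        for emj, name in json.load(f).items():
            EMOJI_DATA.setdefault(emj, {})[lang] = name
    _LOADED_LANGUAGES.add(lang)

_load_main()  # runs once, when the module is imported

def emojize(text, language="en"):
    _load_language(language)
    ...  # the actual replacement logic would follow here
```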
I think this makes sense, and I see no other option: there are simply a lot of languages and most applications need exactly one. Loading them on demand seems the right decision.

EDIT: A question about your performance test methodology: does your command not also include the start-up of the whole Python interpreter? This would only be relevant for someone who uses this lib standalone; for most projects it will just be one of many dependencies. I did a quick test with 2.11.0:

```python
def load():
    import emoji
    emoji.emojize(':lion:')

if __name__ == '__main__':
    import timeit
    res = timeit.timeit("load()",
                        setup="from __main__ import load",
                        number=1)
    print(res)
```

This gives me a load time for just the lib. As the import statement is only executed once, even on repeats, raising the number of repetitions does not yield interesting data.
Yes, my times include the loading of the Python interpreter. It doesn't really matter, because I am only interested in the relative changes. It is possible to load the module multiple times in Python, but it is a bit hacky:

```python
import sys

def load():
    import emoji
    emoji.emojize(':lion:')
    # remove the emoji modules from the loaded modules,
    # so the next call imports them from scratch again
    for name in [name for name in sys.modules if "emoji" in name]:
        del sys.modules[name]

if __name__ == '__main__':
    import timeit
    res = timeit.timeit("load()",
                        setup="from __main__ import load",
                        number=100)
    print(res)
```
FYI, compressing the dict into a single line has caused another issue.
Hi!
I just upgraded my application to Python 3.12.
Doing `import emoji` there takes around 30 seconds. Doing the same on Python 3.11 takes less than a second. Any clue?

It is reproducible with a simple two-line script like the sketch below.
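For illustration, a hypothetical two-line script of that shape (the :lion: shortcode is just an example taken from later in the thread):

```python
import emoji
print(emoji.emojize(':lion:'))
```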
Using Windows at the moment, with emoji 2.9.0.
Something similar was already reported here: #274