Change write_table to atomically write to the parsetab file. #184

jfly · 2018-09-21T19:00:52Z

The previous implementation had issues with multiple python processes
simultaneously invoking .yacc(...) with the same arguments. One of them
could be midway through generating a .py file that the other process
would attempt to import and then crash with a SyntaxError. This diff
avoids that problem by instead writing to a tempfile and then atomically
renaming it. This means processes starting up at the same time will both
do the same work, but that's a lot better than having one of them crash!
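
As a rough illustration of the failure and the fix described above (a minimal sketch, not the actual diff in this PR): the helper names `atomic_write_text` and `compiles`, and the `TABLE` contents, are made up for the example. It assumes a POSIX filesystem, where renaming within one directory replaces the target atomically.

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Write text to path so that readers see the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the destination directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
        os.replace(tmp_path, path)  # atomic replacement on POSIX
    except BaseException:
        # Do not leave a stray temp file behind if anything failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def compiles(source):
    """Return True if source is syntactically complete Python."""
    try:
        compile(source, "parsetab.py", "exec")
    except SyntaxError:
        return False
    return True


# Made-up stand-in for a generated table file.
TABLE = "_tabversion = '3.10'\n_lr_productions = [\n  ('a -> b', 'a', 1),\n]\n"
# What a concurrent reader could see if the writer were only partway done.
PARTIAL = TABLE[: TABLE.index("]")]
```

Here `compiles(PARTIAL)` is false because the list is never closed, which is the kind of half-written module a second process can end up importing. With `atomic_write_text`, the destination path only ever holds a complete file.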

flanker had this issue reported to them in
mailgun/flanker#168 and they attempted to work
around this by committing the parsetab files that ply generates
(mailgun/flanker#188), but this doesn't help if
the user is running a different version of ply than the version that
flanker generated their parsetab files with (because ply will go ahead
and regenerate those files).

To test this, I was able to fairly reliably reproduce this issue with the following command:

$ docker run $(docker build -q https://raw.githubusercontent.com/jfly/jfly.github.io/master/misc/ply-race/Dockerfile-race-demo)
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url' is unreachable
Symbol 'url' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url' is unreachable
Symbol 'url' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url' is unreachable
Symbol 'url' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url' is unreachable
Symbol 'url' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url' is unreachable
Symbol 'mailbox' is unreachable
Symbol 'url' is unreachable
Symbol 'angle_addr' is unreachable
Symbol 'name_addr' is unreachable
Symbol 'phrase' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url' is unreachable
Symbol 'mailbox' is unreachable
Symbol 'url' is unreachable
Symbol 'angle_addr' is unreachable
Symbol 'name_addr' is unreachable
Symbol 'phrase' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url' is unreachable
Symbol 'mailbox' is unreachable
Symbol 'url' is unreachable
Symbol 'angle_addr' is unreachable
Symbol 'name_addr' is unreachable
Symbol 'phrase' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url' is unreachable
Symbol 'mailbox' is unreachable
Symbol 'url' is unreachable
Symbol 'angle_addr' is unreachable
Symbol 'name_addr' is unreachable
Symbol 'phrase' is unreachable
Process Process-4:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "importing.py", line 5, in someFunc
    from flanker.addresslib import address
  File "/usr/local/lib/python3.7/site-packages/flanker/addresslib/address.py", line 49, in <module>
    from flanker.addresslib._parser.parser import (Mailbox, Url, mailbox_parser,
  File "/usr/local/lib/python3.7/site-packages/flanker/addresslib/_parser/parser.py", line 161, in <module>
    tabmodule='mailbox_parsetab')
  File "/usr/local/lib/python3.7/site-packages/ply/yacc.py", line 3293, in yacc
    read_signature = lr.read_table(tabmodule)
  File "/usr/local/lib/python3.7/site-packages/ply/yacc.py", line 1987, in read_table
    if parsetab._tabversion != __tabversion__:
AttributeError: module 'flanker.addresslib._parser.mailbox_parsetab' has no attribute '_tabversion'
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url' is unreachable
Symbol 'mailbox' is unreachable
Symbol 'addr_spec' is unreachable
Symbol 'angle_addr' is unreachable
Symbol 'name_addr' is unreachable
Symbol 'phrase' is unreachable
Symbol 'local_part' is unreachable
Symbol 'domain' is unreachable
Symbol 'quoted_string' is unreachable
Symbol 'domain_literal' is unreachable
Symbol 'quoted_string_text' is unreachable
Symbol 'domain_literal_text' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Process Process-10:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "importing.py", line 5, in someFunc
    from flanker.addresslib import address
  File "/usr/local/lib/python3.7/site-packages/flanker/addresslib/address.py", line 49, in <module>
    from flanker.addresslib._parser.parser import (Mailbox, Url, mailbox_parser,
  File "/usr/local/lib/python3.7/site-packages/flanker/addresslib/_parser/parser.py", line 166, in <module>
    tabmodule='addr_spec_parsetab')
  File "/usr/local/lib/python3.7/site-packages/ply/yacc.py", line 3293, in yacc
    read_signature = lr.read_table(tabmodule)
  File "/usr/local/lib/python3.7/site-packages/ply/yacc.py", line 1984, in read_table
    exec('import %s' % module)
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/flanker/addresslib/_parser/addr_spec_parsetab.py", line 68
    ('comment_text -> comment_text CTEXT','comment_text',2,'p_expression_comment_text','parser.py',112),
                                                                                                       ^
SyntaxError: unexpected EOF while parsing
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url' is unreachable
Symbol 'mailbox' is unreachable
Symbol 'addr_spec' is unreachable
Symbol 'angle_addr' is unreachable
Symbol 'name_addr' is unreachable
Symbol 'phrase' is unreachable
Symbol 'local_part' is unreachable
Symbol 'domain' is unreachable
Symbol 'quoted_string' is unreachable
Symbol 'domain_literal' is unreachable
Symbol 'quoted_string_text' is unreachable
Symbol 'domain_literal_text' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url' is unreachable
Symbol 'mailbox' is unreachable
Symbol 'addr_spec' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'angle_addr' is unreachable
Symbol 'name_addr' is unreachable
Symbol 'phrase' is unreachable
Symbol 'local_part' is unreachable
Symbol 'domain' is unreachable
Symbol 'quoted_string' is unreachable
Symbol 'domain_literal' is unreachable
Symbol 'quoted_string_text' is unreachable
Symbol 'mailbox_or_url' is unreachable
Symbol 'domain_literal_text' is unreachable
Symbol 'mailbox' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'addr_spec' is unreachable
Symbol 'angle_addr' is unreachable
Symbol 'name_addr' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url' is unreachable
Symbol 'mailbox' is unreachable
Symbol 'phrase' is unreachable
Symbol 'addr_spec' is unreachable
Symbol 'local_part' is unreachable
Symbol 'domain' is unreachable
Symbol 'quoted_string' is unreachable
Symbol 'domain_literal' is unreachable
Symbol 'quoted_string_text' is unreachable
Symbol 'domain_literal_text' is unreachable
Symbol 'angle_addr' is unreachable
Symbol 'name_addr' is unreachable
Symbol 'phrase' is unreachable
Symbol 'local_part' is unreachable
Symbol 'domain' is unreachable
Symbol 'quoted_string' is unreachable
Symbol 'domain_literal' is unreachable
Symbol 'quoted_string_text' is unreachable
Symbol 'domain_literal_text' is unreachable
Process Process-6:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "importing.py", line 5, in someFunc
    from flanker.addresslib import address
  File "/usr/local/lib/python3.7/site-packages/flanker/addresslib/address.py", line 49, in <module>
    from flanker.addresslib._parser.parser import (Mailbox, Url, mailbox_parser,
  File "/usr/local/lib/python3.7/site-packages/flanker/addresslib/_parser/parser.py", line 171, in <module>
    tabmodule='url_parsetab')
  File "/usr/local/lib/python3.7/site-packages/ply/yacc.py", line 3293, in yacc
    read_signature = lr.read_table(tabmodule)
  File "/usr/local/lib/python3.7/site-packages/ply/yacc.py", line 1987, in read_table
    if parsetab._tabversion != __tabversion__:
AttributeError: module 'flanker.addresslib._parser.url_parsetab' has no attribute '_tabversion'
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
<module 'flanker.addresslib.address' from '/usr/local/lib/python3.7/site-packages/flanker/addresslib/address.py'>
<module 'flanker.addresslib.address' from '/usr/local/lib/python3.7/site-packages/flanker/addresslib/address.py'>
<module 'flanker.addresslib.address' from '/usr/local/lib/python3.7/site-packages/flanker/addresslib/address.py'>
<module 'flanker.addresslib.address' from '/usr/local/lib/python3.7/site-packages/flanker/addresslib/address.py'>
<module 'flanker.addresslib.address' from '/usr/local/lib/python3.7/site-packages/flanker/addresslib/address.py'>
<module 'flanker.addresslib.address' from '/usr/local/lib/python3.7/site-packages/flanker/addresslib/address.py'>
<module 'flanker.addresslib.address' from '/usr/local/lib/python3.7/site-packages/flanker/addresslib/address.py'>

I then ran the following command to verify that the fix in this PR works. This command succeeded on 1000 consecutive runs.

$ docker run $(docker build -q https://raw.githubusercontent.com/jfly/jfly.github.io/master/misc/ply-race/Dockerfile-ply-fix)
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url' is unreachable
Symbol 'url' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url' is unreachable
Symbol 'url' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url' is unreachable
Symbol 'url' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url' is unreachable
Symbol 'mailbox' is unreachable
Symbol 'url' is unreachable
Symbol 'angle_addr' is unreachable
Symbol 'name_addr' is unreachable
Symbol 'phrase' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url' is unreachable
Symbol 'mailbox' is unreachable
Symbol 'url' is unreachable
Symbol 'angle_addr' is unreachable
Symbol 'name_addr' is unreachable
Symbol 'phrase' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url' is unreachable
Symbol 'mailbox_or_url' is unreachable
Symbol 'mailbox' is unreachable
Symbol 'mailbox' is unreachable
Symbol 'url' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url' is unreachable
Symbol 'mailbox' is unreachable
Symbol 'addr_spec' is unreachable
Symbol 'angle_addr' is unreachable
Symbol 'angle_addr' is unreachable
Symbol 'name_addr' is unreachable
Symbol 'phrase' is unreachable
Symbol 'name_addr' is unreachable
Symbol 'phrase' is unreachable
Symbol 'local_part' is unreachable
Symbol 'domain' is unreachable
Symbol 'quoted_string' is unreachable
Symbol 'domain_literal' is unreachable
Symbol 'quoted_string_text' is unreachable
Symbol 'domain_literal_text' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url' is unreachable
Symbol 'mailbox' is unreachable
Symbol 'addr_spec' is unreachable
Symbol 'angle_addr' is unreachable
Symbol 'name_addr' is unreachable
Symbol 'phrase' is unreachable
Symbol 'local_part' is unreachable
Symbol 'url' is unreachable
Symbol 'angle_addr' is unreachable
Symbol 'name_addr' is unreachable
Symbol 'phrase' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'domain' is unreachable
Symbol 'quoted_string' is unreachable
Symbol 'domain_literal' is unreachable
Symbol 'quoted_string_text' is unreachable
Symbol 'domain_literal_text' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url' is unreachable
Symbol 'mailbox' is unreachable
Symbol 'addr_spec' is unreachable
Symbol 'angle_addr' is unreachable
Symbol 'name_addr' is unreachable
Symbol 'phrase' is unreachable
Symbol 'local_part' is unreachable
Symbol 'domain' is unreachable
Symbol 'quoted_string' is unreachable
Symbol 'domain_literal' is unreachable
Symbol 'quoted_string_text' is unreachable
Symbol 'domain_literal_text' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
Symbol 'mailbox_or_url_list' is unreachable
Symbol 'delim' is unreachable
<module 'flanker.addresslib.address' from '/usr/local/lib/python3.7/site-packages/flanker/addresslib/address.py'>
<module 'flanker.addresslib.address' from '/usr/local/lib/python3.7/site-packages/flanker/addresslib/address.py'>
<module 'flanker.addresslib.address' from '/usr/local/lib/python3.7/site-packages/flanker/addresslib/address.py'>
<module 'flanker.addresslib.address' from '/usr/local/lib/python3.7/site-packages/flanker/addresslib/address.py'>
<module 'flanker.addresslib.address' from '/usr/local/lib/python3.7/site-packages/flanker/addresslib/address.py'>
<module 'flanker.addresslib.address' from '/usr/local/lib/python3.7/site-packages/flanker/addresslib/address.py'>
<module 'flanker.addresslib.address' from '/usr/local/lib/python3.7/site-packages/flanker/addresslib/address.py'>
<module 'flanker.addresslib.address' from '/usr/local/lib/python3.7/site-packages/flanker/addresslib/address.py'>
<module 'flanker.addresslib.address' from '/usr/local/lib/python3.7/site-packages/flanker/addresslib/address.py'>
<module 'flanker.addresslib.address' from '/usr/local/lib/python3.7/site-packages/flanker/addresslib/address.py'>

dabeaz · 2018-09-28T12:02:47Z

Over the years, there has been a lot of confusion surrounding the PLY parsetab.py file and its proper deployment. I'm going to reject this pull request because I feel that it doesn't really address the actual problem that underlies all of this. Specifically, the parsetab.py file is something that is meant to be generated only once during the deployment of a project. Ideally, it gets created once during package installation and that's it. If it's constantly being recreated on execution, or being recreated by hundreds of processes simultaneously, then something is wrong. Creating the file atomically doesn't address that--instead, it merely hides it.

If you are working on a major project that uses PLY, the best way to deal with these kinds of versioning problems is to directly incorporate PLY into your project--not as an external dependency, but by directly copying the 'ply' package directory into your project. PLY is not large--there are only two source files. PLY also changes very rarely--often going years between releases. I'm not adding new features to it. It's also not likely to break in future Python versions. Done properly, everything will just "work" and it won't matter if some user has installed a different version of PLY for some reason (you'll just have to pay careful attention to your import statements involving PLY to make sure they are coming from your package, but that's about it).

I would also recommend that any 'setup.py' file import the associated parser modules to force them to create the table files on installation of any package.

The only other option would be to use PLY and tell it to not create the table files at all. For small grammars, this is probably fine--table generation is quite fast on modern hardware.

jfly · 2018-09-30T19:29:47Z

I am sympathetic to your "we don't want to paper over the root cause" argument. However, the fact is that there is at least one library out there that is getting this wrong. Your comment that "there has been a lot of confusion surrounding the PLY parsetab.py file and its proper deployment" implies that there are probably more libraries that also get this wrong. Have you considered documentation and/or code changes to try to better educate people on how to use this library? Some partially baked ideas, in no particular order:

- Add a note to the README (and http://www.dabeaz.com/ply/) about this gotcha and give an example of the right way to ship a library using ply.
- Change the default behavior of ply to `write_tables=False`. People can still opt in to the one-time table generation, but they need to educate themselves on it.
- Get rid of the implicit table generation + writing to file. Change the code to crash if the parsetab files don't exist. Pull out the parsetab generation into an explicit, well-documented function that explains it's not safe to execute in multiple threads or processes at the same time.
- Remove this library from PyPI. If people are supposed to copy the code into their projects, then removing it from PyPI would encourage them to do so!

I'm happy to send in PRs implementing anything we agree upon doing to make this library easier to use correctly.

dabeaz · 2018-09-30T19:51:30Z

Packaging advice concerning table files is already found in the documentation at http://dabeaz.com/ply/ply.html#ply_nn49

In the big picture, it's not feasible for me to think of every possible scenario in which a library might be used and to come up with a fix for every possible problem. The writing of table files is a well-known facet of PLY. If it's being executed concurrently in threads or processes, then the user will have to take explicit steps to make sure it's doing what they want (by introducing locking, disabling table files, or taking some other step that's appropriate for that situation).
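
One of the steps mentioned above, introducing locking, could look roughly like the following. This is only a sketch under assumptions: `exclusive_lock` and `build_once` are hypothetical helpers, not part of PLY, and `fcntl.flock` is an advisory lock available on POSIX systems only.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path (POSIX only)."""
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until acquired
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def build_once(table_path, generate):
    """Call generate(table_path) under the lock, only if the file is missing."""
    with exclusive_lock(table_path + ".lock"):
        if not os.path.exists(table_path):
            generate(table_path)
```

Every process calls `build_once` before it uses the table, so one process generates the file while the others wait for the lock and then find it already present. In a PLY setting, the locked region would presumably need to wrap the whole call that reads and writes the table, since reading is where the crashes above occur.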

jfly · 2018-09-30T20:06:56Z

> Packaging advice concerning table files is already found in the documentation at http://dabeaz.com/ply/ply.html#ply_nn49

Oh cool, I had not seen that. Do you think that should have a note about your recommendation to copy ply into your code to avoid getting hit by the "a new version of ply will want to regenerate the tables" issue?

This fixes mailgun#206. Before this change, when multiple Python processes are simultaneously doing a `from flanker.addresslib import address`, it's possible for some of them to crash in `ply` code. See dabeaz/ply#184, where I attempted to work around this issue by changing ply. You can see in this comment: dabeaz/ply#184 (comment) that the author of ply suggests two workarounds for this issue: 1. Remove `ply` as a dependency in setup.py and copy the source code of `ply` into `flanker`. 2. Disable writing parsetab files to disk when invoking `yacc`. 2) seemed like the simpler solution to me, so that's what I've done here.

jfly mentioned this pull request Sep 21, 2018

Crashes in parsetab due to ply race condition mailgun/flanker#206

Open

jfly mentioned this pull request Sep 30, 2018

Disable writing parsetab files to disk. mailgun/flanker#207

Open

dabeaz closed this Nov 25, 2019
