
Why was this project so hard to work on, learn, and maintain?

R. Bernstein edited this page Feb 12, 2023 · 1 revision

Working with this code has been an incredible time sink. I doubt many have understood or will understand it in its entirety. More likely, programmers have worked on a small part of it without trying to understand the rest. That's okay, I guess; it's called modularity. However, that prevents wholesale code cleanup. And in some cases it means fixes are applied in one phase of the code when they would be better done in another phase.

So here are some of the challenges.

The first problem is that it is old, runs on lots of versions of Python, and is already in great need of code cleanup. That's what generally happens when you have a large piece of code with different people working on different little pieces of it. I've already split this project into three parts:

  • spark_parser: SPARK, which handles generic parsing along with some generic scanning and abstract syntax routines
  • Python-version-independent code disassembly and demarshaling
  • the rest: decompiling

Having one project to work on is of course easier in the sense that you don't have to fix two code bases. However, having separate code bases does enforce a little bit of modularity and forces APIs to be cleaner.

Some of the code duplication in earlier versions was a result of relying on Python's standard libraries. A lot of this code borrows from Python routines that disassemble code using that Python version's opcodes, or from routines that handle code marshaling, which change from time to time. So the easiest thing has been to copy each version of that code into this project. And that is what was initially done.

Of course, that causes a lot of code bloat. And you still need to ensure that you can run, say, a Python 3.5.1 disassembler on Python 2.7.11, so the standard Python code still has to be modified.
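To make the cross-version point concrete, here is a minimal sketch of decoding bytecode for a target Python version using explicit per-version tables rather than the opcode tables of the interpreter that happens to be running. The `OPNAMES` table and the `decode` function are hypothetical names made up for illustration, not this project's API. Only two opcodes are listed, and `EXTENDED_ARG` handling is left out.

```python
# Hypothetical, hand-written opcode tables keyed by (major, minor).
# Only two opcodes are listed; real tables have over a hundred entries.
OPNAMES = {
    (2, 7): {83: "RETURN_VALUE", 100: "LOAD_CONST"},
    (3, 5): {83: "RETURN_VALUE", 100: "LOAD_CONST"},
    (3, 6): {83: "RETURN_VALUE", 100: "LOAD_CONST"},
}

HAVE_ARGUMENT = 90  # in these versions, opcodes at or above this take an argument


def decode(code, version):
    """Decode raw bytecode for `version` without consulting the host's `dis`."""
    names = OPNAMES[version]
    wordcode = version >= (3, 6)  # 3.6+ uses fixed two-byte instructions
    code = bytearray(code)  # indexing yields ints on both Python 2 and 3
    result = []
    i = 0
    while i < len(code):
        op = code[i]
        arg = None
        if wordcode:
            # Every instruction is two bytes; the second byte is only
            # meaningful when the opcode takes an argument.
            if op >= HAVE_ARGUMENT:
                arg = code[i + 1]
            i += 2
        elif op >= HAVE_ARGUMENT:
            # Older layout: opcode followed by a 16-bit little-endian argument.
            arg = code[i + 1] | (code[i + 2] << 8)
            i += 3
        else:
            i += 1
        result.append((names.get(op, "<%d>" % op), arg))
    return result


old_style = bytearray([100, 0, 0, 83])  # LOAD_CONST 0; RETURN_VALUE, pre-3.6 layout
new_style = bytearray([100, 0, 83, 0])  # the same two instructions as 3.6 wordcode
print(decode(old_style, (3, 5)))
print(decode(new_style, (3, 6)))
```

Because nothing here depends on the running interpreter's own opcode tables, the same function can decode 3.5-style bytes while running under 2.7, which is the kind of independence the copied and modified standard-library code had to provide.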

But even with that addressed, we get to the difficulty that the way Python generates bytecode itself changes a bit from version to version, more so than for other VMs like the JVM, Ruby's VM, or Emacs Lisp's bytecode. Adding to this, many programmers are unfamiliar with how to write and debug grammar rules. See How does this code work?, Table-driven semantic actions, and Adding a new language-construct: Python 3.6 Formatted String Literals for more information on this.
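One way to see the version-to-version drift for yourself is to run the same small function under several interpreters and compare the opcode names each one produces. This is only a sketch using the standard `dis` module (Python 3.4 or later); `opcode_names` and `sample` are names made up for this illustration. No expected output is shown because the list generally differs between releases.

```python
import dis
import sys


def opcode_names(func):
    """Return the opcode names the running interpreter compiled `func` to."""
    return [ins.opname for ins in dis.get_instructions(func)]


def sample(a, b):
    # A conditional expression plus string formatting: small, but enough
    # for the generated instruction sequence to vary across releases.
    return "%s and %s" % (a, b) if a else b


print(sys.version_info[:2], opcode_names(sample))
```

Saving the output from each interpreter and diffing the results shows which instruction sequences a grammar has to account for in each version.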