how to zip similar files using python?

Is Pyston (a new JIT enabled Python interpreter from Dropbox) badly designed?

  • Recently, I have been trying to contribute to [Pyston][1], a JIT compiler for Python made open source by Dropbox. I was quite excited about the project and was looking forward to contribute, but it's been a wild, long ride. To put it delicately, the code is not very welcoming to someone trying to join the project. Several red flags have raised since I started studying the code: The parser is not really a parser. It "parses" Python code by invoking a python script (under `CPython`) that reads the code, recursively serializes its AST, and spits it out into a pipe that is plugged into Pyston's process. Pyston then reads it and builds the AST back on the C++ side. The code that does this is in [parser.cpp][2], and honestly, it's very bad code. Some examples:         AST_stmt* readASTStmt(BufferedReader *reader) {             uint8_t type = reader->readByte();             if (VERBOSITY("parsing") >= 2)                 printf("type = %d\n", type);             if (type == 0)                 return NULL;             uint8_t checkbyte = reader->readByte();             assert(checkbyte == 0xae);                 switch (type) {                 case AST_TYPE::Assert:                     return read_assert(reader);                 case AST_TYPE::Assign:                     return read_assign(reader);                 case AST_TYPE::AugAssign:                     return read_augassign(reader);                 case AST_TYPE::Break:                     return read_break(reader);                 case AST_TYPE::ClassDef:                     return read_classdef(reader);                 case AST_TYPE::Continue:                     return read_continue(reader);                 case AST_TYPE::Expr:                     return read_expr(reader);                 case AST_TYPE::For:                     return read_for(reader);                 case AST_TYPE::FunctionDef:                     return read_functiondef(reader);                 case AST_TYPE::Global:                     return read_global(reader);                 case AST_TYPE::If:                     return read_if(reader);                 case AST_TYPE::Import:                     return read_import(reader);                 case AST_TYPE::ImportFrom:                     return read_importfrom(reader);                 case AST_TYPE::Pass:                     return read_pass(reader);                 case AST_TYPE::Print:                     return read_print(reader);                 case AST_TYPE::Return:                     return read_return(reader);                 case AST_TYPE::While:                     return read_while(reader);                 case AST_TYPE::With:                     return read_with(reader);                 default:                     fprintf(stderr, "Unknown stmt node type (parser.cpp:" STRINGIFY(__LINE__) "): %d\n", type);                     exit(1);                     break;             }         } This would be the kind of code I'd use to teach someone how **NOT** to develop software. This gets worse, the constants in `AST_TYPE` are duplicated in the Python script that reads the AST, meaning that a change in `AST_TYPE` must be kept in sync with the other file. Duplicated information ... whoope! But this gets worse: the various `read_*` functions are expected to read each instruction's attributes in alphabetical order, because that's how the python scripts serializes nodes. And once in a  while, there appears to be a magic byte `0xae` (this is hardcoded all over the place!). It also bothers me the existence of a `BufferedReader` class that is kind of mimicking the operating system's functions. This seems utterly unnecessary. I won't even mention all the naked `new`'s and the astonishing amount of memory that the code leaks. Also, it's one of those C/C++ projects - it uses C++, but it's not real C++: files are manipulated using `FILE *`, output is done using `printf()`, **classes have public variables that are accessed directly by other components**, and there are numerous ugly casts all over the place, especially in the parser routines. There is a class `AST_Expr`, and another `AST_expr` (note the difference in the lower / upper case `e`). There are 2 header files with the name `types.h` placed in different directories, and one of them includes the other. There isn't a real Read-eval-print loop. If the interpreter is running in interactive mode, **each command is written to a temporary file, executed, and then the file is deleted** (this is because the python script that serializes the AST expects a file as an argument). I mean, this is a nightmare. At the beginning I thought "Ok, this is just one file, maybe it's going to be refactored or something), but **every source file I read is full of bad examples**. For example, I go have a look at `hooks.cpp`, and I see this:     // TODO terrible place for these!     const std::string SourceInfo::getName() {         assert(ast);         switch (ast->type) {             case AST_TYPE::FunctionDef:                 return ast_cast<AST_FunctionDef>(ast)->name;             case AST_TYPE::Module:                 return "module";             default:                 RELEASE_ASSERT(0, "%d", ast->type);         }     } And a ton of other similar functions with the same pattern: a `switch` with a case to test the type and do the appropriate action. The code is cluttered with this in every source file I read. Another example:     class GCVisitor {         public:             virtual ~GCVisitor() {}             virtual void visit(void* p) = 0;             virtual void visitRange(void** start, void** end) = 0;             virtual void visitPotential(void* p) = 0;             virtual void visitPotentialRange(void** start, void** end) = 0;     }; I didn't know there was still someone using raw `void *` pointers in C++ these days. Apparently, the developer never heard of templates. `LLVM` **state is kept in a global variable named `g`** (a global variable with a single-letter name!!!), and God knows which files use it... I could be here all day listing the major flaws and ill-formed design of this code. This is a Dropbox product, and honestly, seeing such horrible code has considerably brought down my vision of Dropbox as a software company, and it definitely lowered its reputation. I would never believe this was code developed by Dropbox, and **I would be ashamed of showing it to public**. I am thinking about giving up on this project, because I'm afraid I won't learn anything useful. I don't want to work on bad code, and I totally lost confidence on this project (the goal was to learn LLVM, but now I don't trust that Pyston uses it properly). It looks like Pyston was debugged into existence. Am I being too harsh / picky here? **Is real-world software development supposed to be like this**? What do you think about these code samples I posted? I find it hard to believe that Dropbox has gotten that far developing code like this. Maybe they have very good Python code, but C++ is definitely not their strongest skill. **Is Pyston badly designed?**   [1]: https://github.com/dropbox/pyston   [2]: https://github.com/dropbox/pyston/blob/master/src/codegen/parser.cpp

  • Answer:

    First of all, I'll emphatically agree that there are large swaths of the codebase that could be better in various ways.  I would love if we could have released code that was perfect, and also implemented all of Python, and also solved the problems of extension modules and multithreading, and also had it convert your code to Python 3 and autoparallelize it in the process.  Reality, of course, puts a damper on that dream. So is the Pyston codebase everything that I want it to be?  Of course not.  Do I think cleaning up any technical debt is the most important thing to the project?  Not really.  I think technical debt, like financial debt, is not in itself a bad thing if it lets you move faster; one could argue that Pyston has accrued too much debt but I'd disagree. Another important thing to realize about a nascent, fast-moving project like this, is the expected lifetime of any piece of code is pretty short, maybe 3-6 months.  This means that if we can live with something for a few months, then it will go away when a rewrite happens -- as would any investments made in improving the code.  So yes, the "parser" is definitely a hack, but the whole thing will be replaced some day hopefully soon, so is it really worth making an incremental improvement to it? But overall, the root point is that Pyston is a very young project, perhaps younger than you were expecting.  I can't blame you for disliking the stage of the project -- to each their own -- but it's a stage where "room for improvement" doesn't really imply anything bad, just that there's more work to be done.

Kevin Modzelewski at Quora Visit the source

Was this solution helpful to you?

Other answers

I'm not familiar with Pyston or CPython source code and neither I'm saying I'm a good enough programmer but anyway here are some of my thoughts: 1. In regarding to the CPython parser part, that makes perfect sense. Goal of Pyston is to build a good JIT-enabled Python and JITs work at the AST levels, so that's where Pyston should put effort into dealing with. If there is some code from CPython that could be used, and re-using that code introduces no significant drawbacks while saving a lot of code development and debugging, it's an OK solution, at least for now. 2. The "bad smells" in the code listed below are definitely considered out of the list of anyone's best practices. But there may or may not be a better solution to many of the problems, and while it is still at it's infancy stage, does it worth or not to make the code looks perfectly modern, or is it better to pick the best (or most familiar parts) from the languages and make the algorithms right? I would go for the second one. Pyston is still in it's early time, even it's JIT optimiser isn't good enough to support the full language, so there are much work left to be done.

Ryan Gao

Related Q & A:

Just Added Q & A:

Find solution

For every problem there is a solution! Proved by Solucija.

  • Got an issue and looking for advice?

  • Ask Solucija to search every corner of the Web for help.

  • Get workable solutions and helpful tips in a moment.

Just ask Solucija about an issue you face and immediately get a list of ready solutions, answers and tips from other Internet users. We always provide the most suitable and complete answer to your question at the top, along with a few good alternatives below.