I read Karolina Alexiou's excellent blog post, The Top Mistakes Developers Make When Using Python for Big Data Analytics, with great interest. I have made - and partially learned from - all of the mistakes she warned about. I was particularly eager to try out and extend some of the code snippets she provided to illustrate two of the mistakes:
- Mistake #1: Reinventing the wheel
- Mistake #2: Not tuning for performance
I started composing a rather lengthy comment on the blog post, highlighting some aspects I especially appreciated and seeking clarification on others. Whenever I notice a comment on someone else's blog growing voluminous, I typically write a separate post on my own blog (Gumption) and leave a link to it (with a brief summary) as my comment on the original post.
In this instance, it seemed more appropriate - and constructive - to create an IPython Notebook to illustrate and/or investigate some of the issues I was raising in that comment ... and thereby find some of the clarifications I was initially seeking.
I am sharing those investigations here in case they are of interest or use to others ... and because it's been a while since I created and shared an IPython Notebook about Python and data science.