The Pandas documentation can be dense, but it will make understanding snippets you see elsewhere a lot easier if you read through some of the guides.
In particular, Intro to Data Structures is really helpful for understanding the relationships between DataFrame and Series. Play around with constructing a few toy data structures.
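As a starting point for that kind of play, here is a minimal sketch. The column names and numbers are made up for illustration; the point is only that each column of a DataFrame is a Series, and so is each row.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
population = pd.Series([100, 250, 75], index=["a", "b", "c"], name="population")

# A DataFrame is a table whose columns are Series sharing one index.
# (Toy values, invented for this example.)
df = pd.DataFrame(
    {"population": [100, 250, 75], "area": [10.0, 5.0, 15.0]},
    index=["a", "b", "c"],
)

# Selecting a single column gives back a Series.
column = df["population"]
print(type(column).__name__)  # Series

# Selecting a single row by label also gives a Series,
# this time indexed by the column names.
row = df.loc["b"]
print(row["area"])
```

Try changing the index labels or adding a column with a different type to see how the two structures respond.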
Indexing and Selecting Data is another useful section of the docs to read.
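To make that section less abstract, the sketch below shows three common selection styles side by side: by label, by position, and by boolean mask. The table is toy data invented for the example, and these are only a few of the options the docs describe.

```python
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame(
    {"city": ["Oakland", "Fresno", "Chico"], "permits": [120, 45, 8]},
    index=["x", "y", "z"],
)

# Label-based: row label "y", column label "permits".
by_label = df.loc["y", "permits"]

# Position-based: first row, second column.
by_position = df.iloc[0, 1]

# Boolean mask: keep only the rows where the condition is True.
large = df[df["permits"] > 40]

# Mask and column selection combined in one .loc call.
subset = df.loc[df["permits"] > 40, ["city"]]

print(by_label, by_position)
print(large)
print(subset)
```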
Remember, there are lots of ways to do things with Pandas, so don't get overwhelmed. Find a way that works for you.
Whenever you have a question about how to do something in Pandas or Python, put it in a Markdown file or Jupyter notebook. You'll likely need to remember how to do it in the future, and writing it in your own words will help you understand the concept better.
Shout out to Cecilia Reyes for this tip.
Break your notebooks into separate pieces:
- Data loading and cleaning
- Validation
- One notebook for each high-level reporting question
Use DataFrame.to_pickle() to share data between notebooks.
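A minimal sketch of that handoff, assuming one notebook writes the file and another reads it back with pd.read_pickle(). The file name and the data are hypothetical, and a temporary directory stands in for a shared project folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Pretend this is the output of the loading-and-cleaning notebook.
# (Toy values, invented for this example.)
cleaned = pd.DataFrame({"id": [1, 2, 3], "amount": [9.5, 20.0, 3.25]})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name; in practice this would live in your project.
    path = Path(tmp) / "cleaned.pkl"

    # Notebook 1: save the cleaned data.
    cleaned.to_pickle(path)

    # Notebook 2: load it back, with dtypes and index intact.
    restored = pd.read_pickle(path)

print(restored.dtypes)
```

Unlike a round trip through CSV, a pickle keeps column types as they were. Only load pickle files from sources you trust, since unpickling can execute arbitrary code.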
Then use these small functions with DataFrame.apply().
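For instance, a sketch with two hypothetical helper functions, `clean_name` and `amount_per_item`, on made-up data. The first is applied to a single column with Series.apply(), the second to whole rows with DataFrame.apply() and `axis=1`.

```python
import pandas as pd


def clean_name(value):
    """Trim whitespace and normalize the capitalization of one name."""
    return value.strip().title()


def amount_per_item(row):
    """Divide a row's amount by its item count."""
    return row["amount"] / row["items"]


# Toy data, invented for this example.
df = pd.DataFrame(
    {
        "name": ["  ada LOVELACE", "grace hopper  "],
        "amount": [10.0, 9.0],
        "items": [4, 3],
    }
)

# Apply a function to every value in one column.
df["name_clean"] = df["name"].apply(clean_name)

# Apply a function to every row; axis=1 passes each row as a Series.
df["per_item"] = df.apply(amount_per_item, axis=1)

print(df)
```

Small, named functions like these are easy to test on a single value before running them over a whole table.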
- #python on the NewsNerdery Slack
- NICAR-L mailing list
- Data science blogs
ipython-sql lets you write SQL queries in a Jupyter notebook! It makes the results of a query available as a DataFrame, so you can even mix SQL and Pandas.
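ipython-sql works through notebook magics, which do not run in a plain Python script. As a rough analogue of the same idea (this is not ipython-sql itself), the sketch below uses the standard-library sqlite3 module with pandas.read_sql_query() to run a SQL query and continue in Pandas. The table name and values are invented for the example.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database with a toy table (invented values).
conn = sqlite3.connect(":memory:")
pd.DataFrame(
    {"agency": ["A", "B", "A"], "amount": [10, 5, 7]}
).to_sql("payments", conn, index=False)

# Do the aggregation in SQL and get the result as a DataFrame.
totals = pd.read_sql_query(
    "SELECT agency, SUM(amount) AS total "
    "FROM payments GROUP BY agency ORDER BY agency",
    conn,
)

# Keep working on the result with Pandas.
totals["share"] = totals["total"] / totals["total"].sum()

conn.close()
print(totals)
```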
Teamwork makes the dream work. Groups like the California Civic Data Coalition and the Chicago Data Collaborative are working to make data easier to use, so that multiple newsrooms don't each repeat the same data cleaning. Consider collaborating to make reporting with ugly data easier.