Follow me on twitter @clarecorthell!
September 9 2014 @ Hackbright Academy, SF
- numpy - multi-dimensional container of data
- pandas - data structures and analysis tools
- matplotlib - Python plotting library
- iPython - browser-based code notebook / IDE (run blocks of code, not the whole program)
All python code for this talk was run in the browser-based iPython interpreter
import numpy as np
import pandas as pd
# Render our plots inline
%matplotlib inline
import matplotlib.pyplot as plt
turn a CSV into a DataFrame (for example, an Excel export saved in CSV form)
mattermark_df = pd.read_csv('mattermark_data.csv')
=> Mattermark data about funding rounds in New York City in the last five years
sample different parts of the data
sample the first ten rows of our DataFrame
use .iloc to index into row location 0
sample the column
show some standard statistics about that column (for numeric data)
show some standard statistics about all numeric columns
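The sampling steps above can be sketched roughly as follows. The real `mattermark_data.csv` isn't reproduced here, so this uses a tiny made-up table: the `company` column and all of the values are assumptions for illustration, while `series` and `amount` are the column names used elsewhere in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for mattermark_data.csv (values are invented)
mattermark_df = pd.DataFrame({
    'company': ['Acme', 'Birch', 'Cobalt', 'Dune'],
    'series': ['a', 'b', 'a', 'seed'],
    'amount': [5000000.0, 12000000.0, np.nan, 750000.0],
})

# sample the first ten rows (fewer if the table is shorter)
first_rows = mattermark_df.head(10)

# use .iloc to pull out the row at integer position 0 (returned as a Series)
first_record = mattermark_df.iloc[0]

# sample one column (also a Series)
amount = mattermark_df['amount']

# standard statistics for that column: count, mean, std, min, quartiles, max
amount_stats = amount.describe()

# the same statistics for every numeric column in the table
all_stats = mattermark_df.describe()

print(first_rows)
print(first_record)
print(amount_stats)
print(all_stats)
```

Note that `describe()` leaves missing values out of its count, which is a quick hint that a column has gaps.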
mattermark_df.sort_values('amount', ascending=False)
sort the entire table by funding amount, in descending order (older pandas releases spelled this `.sort`; current releases use `.sort_values`)
mattermark_df['amount'].isnull()
In the column, is the value at a given index null? (True or False)
Count the number of null values in the column
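One way to do that count, as a minimal sketch on made-up data: `isnull()` gives back a boolean column, and summing it treats each `True` as 1. The values below are invented; only the `amount` column name comes from this post.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two of the four funding amounts are missing
mattermark_df = pd.DataFrame({'amount': [5000000.0, np.nan, 750000.0, np.nan]})

# boolean Series: True wherever the amount is null
is_missing = mattermark_df['amount'].isnull()

# summing booleans counts the True values
null_count = is_missing.sum()
print(null_count)
```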
What is the most common stage for funding?
count the values in each category
plot in a bar graph (grouped by series) to get a quick idea of relative scale
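A sketch of the count-then-plot step, assuming the funding stage lives in the `series` column (the column grouped on elsewhere in this post). The rows are made up, and the `Agg` backend line is only there so the snippet also runs outside a notebook; in iPython with `%matplotlib inline` you would leave it out.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Hypothetical funding rounds, one row per round
mattermark_df = pd.DataFrame({'series': ['a', 'seed', 'a', 'b', 'seed', 'a']})

# count the values in each category, most common first
series_counts = mattermark_df['series'].value_counts()
print(series_counts)

# one bar per series, to get a quick idea of relative scale
ax = series_counts.plot(kind='bar')
ax.set_xlabel('series')
ax.set_ylabel('number of rounds')
```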
Leads to Question: What is the typical funding amount by round?
by_series = mattermark_df.groupby('series')
group records by series column (stored in a variable)
print(by_series['amount'].mean().astype(int))
within each grouping, calculate the mean (and do some explicit type conversion)
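To see what the grouping does, here is the same pattern on a toy table. The amounts are invented so the means are easy to check by hand; the column names match the ones above.

```python
import pandas as pd

# Hypothetical rounds: two series-a rounds, one series-b, one seed
mattermark_df = pd.DataFrame({
    'series': ['a', 'a', 'b', 'seed'],
    'amount': [4000000.0, 6000000.0, 12000000.0, 750000.0],
})

# group records by the series column
by_series = mattermark_df.groupby('series')

# mean amount within each group, converted from float to int for readability
mean_amount = by_series['amount'].mean().astype(int)
print(mean_amount)
```

The result is a Series indexed by the group labels, so `mean_amount['a']` looks up the typical series-a round.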
How many of these are mobile companies?
mobile_df = mattermark_df.dropna(subset=['cached_mobile_downloads'])
we make a brash inference here: if a company doesn't have a monthly count of mobile downloads, it doesn't have a mobile application. Using the .dropna function, we drop the rows that have no value in that column.
compare the shape of the two DataFrames to see how many companies (rows) have mobile app data, which gives a rough proportion
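The comparison might look something like this. It is a sketch on made-up rows, written for Python 3: the `company` column and the download numbers are assumptions, and `cached_mobile_downloads` is the column name from the `dropna` call above.

```python
import numpy as np
import pandas as pd

# Hypothetical table: two of the four companies have no download count
mattermark_df = pd.DataFrame({
    'company': ['Acme', 'Birch', 'Cobalt', 'Dune'],
    'cached_mobile_downloads': [1200.0, np.nan, 50000.0, np.nan],
})

# keep only the rows that have a value for mobile downloads
mobile_df = mattermark_df.dropna(subset=['cached_mobile_downloads'])

# .shape is (rows, columns); the row counts are what we compare
print(mattermark_df.shape)
print(mobile_df.shape)

# rough proportion of companies with mobile app data
mobile_share = len(mobile_df) / len(mattermark_df)
print(mobile_share)
```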
For more context, see the video & slides.
- The Open Source Data Science Masters - A curated curriculum of open source resources to get you working with and understanding data
- pandas cookbook - great beginning resource from Julia Evans
- Python for Data Analysis / Book - the nuts and bolts of manipulating, processing, cleaning, and crunching data in Python (with numpy, pandas, and matplotlib)