Follow me on twitter @clarecorthell!
September 9 2014 @ Hackbright Academy, SF
- numpy: multi-dimensional container of data
- pandas: data structures and analysis tools
- matplotlib: python plotting library
- iPython: browser-based code notebook / IDE (run blocks of code, not the whole program)
All python code for this talk was run in the browser-based iPython interpreter
import numpy as np
import pandas as pd
# Render our plots inline
%matplotlib inline
import matplotlib.pyplot as plt
turn a csv into a DataFrame (for example, an export from excel in csv form)
mattermark_df = pd.read_csv('mattermark_data.csv')
=> Mattermark data about funding rounds in New York City in the last five years
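To try read_csv without the original export, here is a minimal sketch that parses a tiny in-memory CSV. The column names match the ones used in this talk, but the rows are invented for illustration and are not Mattermark data.

```python
import io

import pandas as pd

# Hypothetical stand-in for mattermark_data.csv; the values are made up.
csv_text = """name,series,amount,cached_uniques
Acme,a,5000000,12000
Birch,seed,750000,3000
Cobalt,b,,45000
"""

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # column types inferred by pandas
```

The empty amount field in the last row is read as a missing value (NaN), which is why the null checks later in these notes matter.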
sample different parts of the data
mattermark_df[:10]
sample the first ten rows of our DataFrame
mattermark_df.iloc[0]
use .iloc to index into row location 0
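A small sketch of the difference between the two lookups above, using a made-up DataFrame (the names and amounts are hypothetical): slicing returns a smaller DataFrame, while .iloc with a single position returns one row as a Series.

```python
import pandas as pd

# Hypothetical data, for illustration only.
sample_df = pd.DataFrame({
    "name": ["Acme", "Birch", "Cobalt", "Dune"],
    "amount": [5000000, 750000, 12000000, 300000],
})

first_two = sample_df[:2]      # slice: a DataFrame holding the first two rows
first_row = sample_df.iloc[0]  # single position: that row as a Series

print(type(first_two).__name__, len(first_two))
print(type(first_row).__name__, first_row["name"])
```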
mattermark_df['cached_uniques']
sample the column
mattermark_df['cached_uniques'].describe()
show some standard statistics about that column (for numeric data)
mattermark_df.describe()
show some standard statistics about all numeric columns
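As a rough illustration of what .describe() returns, the sketch below runs it on a made-up frame with one text column and one numeric column. The values are hypothetical; only the cached_uniques column name comes from the talk.

```python
import pandas as pd

# Hypothetical data, for illustration only.
sample_df = pd.DataFrame({
    "name": ["Acme", "Birch", "Cobalt", "Dune"],
    "cached_uniques": [1000, 2000, 3000, 4000],
})

# On a single column: count, mean, std, min, quartiles, max.
column_stats = sample_df["cached_uniques"].describe()

# On the whole frame: by default only numeric columns are summarized.
frame_stats = sample_df.describe()

print(column_stats["mean"])       # 2500.0
print(list(frame_stats.columns))  # ['cached_uniques']
```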
mattermark_df.sort_values('amount', ascending=False)
sort the entire table (descending) by amount of funding
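In current pandas releases, sorting a DataFrame by a column is done with sort_values (the older DataFrame.sort method has been removed). A minimal sketch on made-up data, with hypothetical names and amounts:

```python
import pandas as pd

# Hypothetical data, for illustration only.
sample_df = pd.DataFrame({
    "name": ["Birch", "Cobalt", "Acme"],
    "amount": [750000, 12000000, 5000000],
})

# sort_values returns a new, sorted DataFrame; sample_df itself is unchanged.
by_amount = sample_df.sort_values("amount", ascending=False)

print(by_amount["name"].tolist())  # ['Cobalt', 'Acme', 'Birch']
```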
mattermark_df['amount'].isnull()
In the column, is the value at a given index null? (true or false)
len(np.where(mattermark_df['amount'].isnull())[0])
Count the number of null values in the column
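A quick sketch of the null count on made-up data (the amounts are hypothetical). It shows the np.where approach used above next to a shorter alternative, summing the boolean mask, which gives the same number because True counts as 1.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only: two of four amounts are missing.
sample_df = pd.DataFrame({"amount": [5000000.0, np.nan, 750000.0, np.nan]})

null_mask = sample_df["amount"].isnull()  # True where the value is missing

count_via_where = len(np.where(null_mask)[0])  # positions of the True values
count_via_sum = int(null_mask.sum())           # True counts as 1

print(count_via_where, count_via_sum)  # 2 2
```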
What is the most common stage for funding?
mattermark_df['series'].value_counts()
count the values in each category
mattermark_df['series'].value_counts().plot(kind='bar')
plot in a bar graph (grouped by series) to get a quick idea of relative scale
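To see what value_counts produces before plotting it, here is a minimal sketch on a made-up series column (the stage labels and their frequencies are hypothetical):

```python
import pandas as pd

# Hypothetical data, for illustration only.
sample_df = pd.DataFrame({"series": ["a", "seed", "a", "b", "seed", "a"]})

# One row per distinct value, sorted from most to least frequent.
stage_counts = sample_df["series"].value_counts()

print(stage_counts.index[0], stage_counts.iloc[0])  # a 3
```

Because the result is already sorted by frequency, the first entry answers the "most common stage" question directly, and the bar chart draws the categories in that same order.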
Leads to Question: What is the typical funding amount by round?
by_series = mattermark_df.groupby('series')
group records by series column (stored in a variable)
print(by_series['amount'].mean().astype(int))
within each grouping, calculate the mean (and do some explicit type conversion)
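The group-then-aggregate step can be sketched on made-up data as follows. The stages and amounts are hypothetical, chosen so the means are easy to check by hand.

```python
import pandas as pd

# Hypothetical data, for illustration only.
sample_df = pd.DataFrame({
    "series": ["seed", "seed", "a", "a", "b"],
    "amount": [500000, 1000000, 4000000, 6000000, 15000000],
})

by_series = sample_df.groupby("series")

# Mean amount per group, converted from float to int for readability.
mean_by_series = by_series["amount"].mean().astype(int)

print(mean_by_series)
```

One caveat: astype(int) raises an error if any group's mean is NaN (for example, a group whose amounts are all missing), so on real data it can be safer to drop or fill those groups first.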
How many of these are mobile companies?
mobile_df = mattermark_df.dropna(subset=['cached_mobile_downloads'])
we make the brash inference that a company without a monthly count of mobile downloads has no mobile application; the .dropna function then removes the rows that have no value in that column.
mattermark_df.shape
mobile_df.shape
compare the shape of the two DataFrames to see how many companies (rows) have mobile app data, which gives a rough proportion
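A minimal sketch of the dropna-and-compare step on made-up data. Only the cached_mobile_downloads column name comes from the talk; the companies and download counts are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only: two of four companies
# have no download count.
sample_df = pd.DataFrame({
    "name": ["Acme", "Birch", "Cobalt", "Dune"],
    "cached_mobile_downloads": [1200.0, np.nan, 300.0, np.nan],
})

# Keep only rows that have a value in the given column.
mobile_df = sample_df.dropna(subset=["cached_mobile_downloads"])

# shape is (rows, columns); the ratio of row counts is the rough proportion.
mobile_share = mobile_df.shape[0] / sample_df.shape[0]

print(sample_df.shape, mobile_df.shape, mobile_share)  # (4, 2) (2, 2) 0.5
```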
For more context, see the video & slides
- The Open Source Data Science Masters - A curated curriculum of open source resources to get you working with and understanding data
- pandas cookbook - great beginning resource from Julia Evans
- Python for Data Analysis / Book - the nuts and bolts of manipulating, processing, cleaning, and crunching data in Python (with numpy, pandas, and matplotlib)