# Exercise

# NumPy and pandas working together
# Pandas depends upon and interoperates with NumPy, the Python library for fast numeric array computations. For example, you can use the DataFrame attribute .values to represent a DataFrame df as a NumPy array. You can also pass pandas data structures to NumPy methods. In this exercise, we have imported pandas as pd and loaded world population data every 10 years since 1960 into the DataFrame df. This dataset was derived from the one used in the previous exercise.

# Your job is to extract the values and store them in an array using the attribute .values. You'll then use those values as input into the NumPy np.log10() method to compute the base 10 logarithm of the population values. Finally, you will pass the entire pandas DataFrame into the same NumPy np.log10() method and compare the results.

# Instructions

# Import numpy using the standard alias np.
# Assign the numerical values in the DataFrame df to an array np_vals using the attribute values.
# Pass np_vals into the NumPy method log10() and store the results in np_vals_log10.
# Pass the entire df DataFrame into the NumPy method log10() and store the results in df_log10.
# Inspect the output of the print() code to see the type() of the variables that you created.

# Import numpy
import numpy as np

# Create array of DataFrame values: np_vals
np_vals = df.values

# Create new array of base 10 logarithm values: np_vals_log10
np_vals_log10 = np.log10(np_vals)

# Create a new DataFrame of base 10 logarithm values by passing df to np.log10(): df_log10
df_log10 = np.log10(df)

# Print original and new data containers
[print(x, 'has type', type(eval(x))) for x in ['np_vals', 'np_vals_log10', 'df', 'df_log10']]
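
# A minimal, self-contained illustration of the type difference checked above;
# the two-row population DataFrame is placeholder data, not the course dataset.
# NumPy ufuncs such as np.log10 return an ndarray when given an ndarray, but
# return a DataFrame (with the same index and columns) when given a DataFrame.
example_df = pd.DataFrame({'population': [3.03e9, 6.92e9]}, index=[1960, 2010])
print(type(np.log10(example_df.values)))  # <class 'numpy.ndarray'>
print(type(np.log10(example_df)))         # <class 'pandas.core.frame.DataFrame'>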

# Exercise
# Zip lists to build a DataFrame
# In this exercise, you're going to make a pandas DataFrame of the top three countries to win gold medals since 1896 by first building a dictionary. list_keys contains the column names 'Country' and 'Total'. list_values contains the full names of each country and the number of gold medals awarded. The values have been taken from Wikipedia.

# Your job is to use these lists to construct a list of tuples, use the list of tuples to construct a dictionary, and then use that dictionary to construct a DataFrame. In doing so, you'll make use of the list(), zip(), dict() and pd.DataFrame() functions. Pandas has already been imported as pd.

# Note: The zip() function in Python 3 and above returns a special zip object, which is essentially a generator. To convert this zip object into a list, you'll need to use list(). You can learn more about the zip() function as well as generators in Python Data Science Toolbox (Part 2).

# Instructions

# Zip the 2 lists list_keys and list_values together into one list of (key, value) tuples. Be sure to convert the zip object into a list, and store the result in zipped.
# Inspect the contents of zipped using print(). This has been done for you.
# Construct a dictionary using zipped. Store the result as data.
# Construct a DataFrame using the dictionary. Store the result as df.

# Zip the 2 lists together into one list of (key, value) tuples: zipped
zipped = list(zip(list_keys, list_values))

# Inspect the list using print()
print(zipped)

# Build a dictionary with the zipped list: data
data = dict(zipped)

# Build and inspect a DataFrame from the dictionary: df
df = pd.DataFrame(data)
print(df)
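
# For reference: printing the raw zip object only shows something like
# <zip object at 0x...>, which is why list() is needed before inspecting it.
# The two lists below are placeholders standing in for the preloaded exercise data.
example_keys = ['Country', 'Total']
example_values = [['United States', 'Soviet Union', 'United Kingdom'], [100, 90, 80]]
print(zip(example_keys, example_values))        # <zip object at 0x...>
print(list(zip(example_keys, example_values)))  # [('Country', [...]), ('Total', [100, 90, 80])]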

# Exercise
# Labeling your data
# You can use the DataFrame attribute df.columns to view and assign new string labels to columns in a pandas DataFrame.

# In this exercise, we have imported pandas as pd and defined a DataFrame df containing top Billboard hits from the 1980s (from Wikipedia). Each row has the year, artist, song name and the number of weeks at the top. However, this DataFrame has the column labels a, b, c, d. Your job is to use the df.columns attribute to re-assign descriptive column labels.

# Instructions

# Create a list of new column labels with 'year', 'artist', 'song', 'chart weeks', and assign it to list_labels.
# Assign your list of labels to df.columns.

# Build a list of labels: list_labels
list_labels = ['year', 'artist', 'song', 'chart weeks']

# Assign the list of labels to the columns attribute: df.columns
df.columns = list_labels
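
# Note (not part of the exercise): the list assigned to df.columns must contain
# exactly one label per existing column; pandas raises a ValueError on a length mismatch.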

# Exercise
# Building DataFrames with broadcasting
# You can implicitly use 'broadcasting', a feature of NumPy, when creating pandas DataFrames. In this exercise, you're going to create a DataFrame of cities in Pennsylvania that contains the city name in one column and the state name in the second. We have imported the names of 15 cities as the list cities.

# Your job is to construct a DataFrame from the list of cities and the string 'PA'.

# Instructions

# Make a string object with the value 'PA' and assign it to state.
# Construct a dictionary with 2 key:value pairs: 'state':state and 'city':cities.
# Construct a pandas DataFrame from the dictionary you created and assign it to df.

# Make a string with the value 'PA': state
state = 'PA'

# Construct a dictionary: data
data = {'state': state, 'city': cities}

# Construct a DataFrame from dictionary data: df
df = pd.DataFrame(data)

# Print the DataFrame
print(df)
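
# A minimal, self-contained sketch of the broadcasting used above; the three
# city names are placeholders rather than the exercise's list of 15 cities.
# The scalar string 'PA' is repeated (broadcast) once per entry in the list,
# so every row of the resulting DataFrame gets the same state value.
example_cities = ['Philadelphia', 'Pittsburgh', 'Erie']
example_df = pd.DataFrame({'state': 'PA', 'city': example_cities})
print(example_df)  # three rows, each pairing state 'PA' with one city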

# Exercise
# Reading a flat file
# In previous exercises, we have preloaded the data for you using the pandas function read_csv(). Now, it's your turn! Your job is to read the World Bank population data you saw earlier into a DataFrame using read_csv(). The file is available in the variable data_file.

# The next step is to reread the same file, but simultaneously rename the columns using the names keyword input parameter, set equal to a list of new column labels. You will also need to set header=0 to rename the column labels.

# Finish up by inspecting the result with df.head() and df.info() in the IPython Shell (changing df to the name of your DataFrame variable).

# pandas has already been imported and is available in the workspace as pd.

# Instructions

# Use pd.read_csv() with the string data_file to read the CSV file into a DataFrame and assign it to df1.
# Create a list of new column labels - 'year', 'population' - and assign it to the variable new_labels.
# Reread the same file, again using pd.read_csv(), but this time, add the keyword arguments header=0 and names=new_labels. Assign the resulting DataFrame to df2.
# Print both the df1 and df2 DataFrames to see the change in column names. This has already been done for you.

# Read in the file: df1
df1 = pd.read_csv('/usr/local/share/datasets/world_population.csv')

# Create a list of the new column labels: new_labels
new_labels = ['year', 'population']

# Read in the file, specifying the header and names parameters: df2
df2 = pd.read_csv('/usr/local/share/datasets/world_population.csv', header=0, names=new_labels)

# Print both the DataFrames
print(df1)
print(df2)
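
# For reference: header=0 tells read_csv that the file's first non-blank line
# holds the existing column names, so that line is replaced by the labels in
# names= instead of being read in as a data row. Passing names= without
# header=0 would keep the original header line as the first row of data.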

# Exercise
# Delimiters, headers, and extensions
# Not all data files are clean and tidy. Pandas provides methods for reading those not-so-perfect data files that you encounter far too often.

# In this exercise, you have monthly stock data for four companies downloaded from Yahoo Finance. The data is stored as one row for each company and each column is the end-of-month closing price. The file name is given to you in the variable file_messy.

# In addition, this file has three aspects that may cause trouble for lesser tools: multiple header lines, comment records (rows) interleaved throughout the data rows, and space delimiters instead of commas.

# Your job is to use pandas to read the data from this problematic file_messy using non-default input options with read_csv() so as to tidy up the mess at read time. Then, write the cleaned up data to a CSV file with the variable file_clean that has been prepared for you, as you might do in a real data workflow.

# You can learn about the option input parameters needed by using help() on the pandas function pd.read_csv().

# Instructions

# Use pd.read_csv() without using any keyword arguments to read file_messy into a pandas DataFrame df1.
# Use .head() to print the first 5 rows of df1 and see how messy it is. Do this in the IPython Shell first so you can see how modifying read_csv() can clean up this mess.
# Using the keyword arguments delimiter=' ', header=3 and comment='#', use pd.read_csv() again to read file_messy into a new DataFrame df2.
# Print the output of df2.head() to verify the file was read correctly.
# Use the DataFrame method .to_csv() to save the DataFrame df2 to the variable file_clean. Be sure to specify index=False.
# Use the DataFrame method .to_excel() to save the DataFrame df2 to the file 'file_clean.xlsx'. Again, remember to specify index=False.

# Read the raw file as-is: df1
df1 = pd.read_csv(file_messy)

# Print the output of df1.head()
print(df1.head())

# Read in the file with the correct parameters: df2
df2 = pd.read_csv(file_messy, delimiter=' ', header=3, comment='#')

# Print the output of df2.head()
print(df2.head())

# Save the cleaned up DataFrame to a CSV file without the index
df2.to_csv(file_clean, index=False)

# Save the cleaned up DataFrame to an Excel file without the index
df2.to_excel('file_clean.xlsx', index=False)
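
# For reference: in the df2 call above, delimiter=' ' splits fields on spaces
# instead of commas, comment='#' discards everything from a '#' to the end of
# its line (dropping the interleaved comment records entirely), and header=3
# takes the line at position 3 (0-based, counted after blank and commented
# lines are ignored) as the row of column names.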

# Exercise
# Plotting series using pandas
# Data visualization is often a very effective first step in gaining a rough understanding of a data set to be analyzed. Pandas provides data visualization by both depending upon and interoperating with the matplotlib library. You will now explore some of the basic plotting mechanics with pandas as well as related matplotlib options. We have pre-loaded a pandas DataFrame df which contains the data you need. Your job is to use the DataFrame method df.plot() to visualize the data, and then explore the optional matplotlib input parameters that this .plot() method accepts.

# The pandas .plot() method makes calls to matplotlib to construct the plots. This means that you can use the skills you've learned in previous visualization courses to customize the plot. In this exercise, you'll add a custom title and axis labels to the figure.

# Before plotting, inspect the DataFrame in the IPython Shell using df.head(). Also, use type(df) and note that it is a single column DataFrame.

# Instructions

# Create the plot with the DataFrame method df.plot(). Specify a color of 'red'.
# Note: c and color are interchangeable as parameters here, but we ask you to be explicit and specify color.
# Use plt.title() to give the plot a title of 'Temperature in Austin'.
# Use plt.xlabel() to give the plot an x-axis label of 'Hours since midnight August 1, 2010'.
# Use plt.ylabel() to give the plot a y-axis label of 'Temperature (degrees F)'.
# Finally, display the plot using plt.show().

# Import matplotlib.pyplot for the title, label, and show() calls below
import matplotlib.pyplot as plt

# Create a plot with color='red'
df.plot(color='red')

# Add a title
plt.title('Temperature in Austin')

# Specify the x-axis label
plt.xlabel('Hours since midnight August 1, 2010')

# Specify the y-axis label
plt.ylabel('Temperature (degrees F)')

# Display the plot
plt.show()

# Exercise
# Plotting DataFrames
# Comparing data from several columns can be very illuminating. Pandas makes doing so easy with multi-column DataFrames. By default, calling df.plot() will cause pandas to over-plot all column data, with each column as a single line. In this exercise, we have pre-loaded three columns of data from a weather data set - temperature, dew point, and pressure - but the problem is that pressure has different units of measure. The pressure data, measured in Atmospheres, has a different vertical scaling than that of the other two data columns, which are both measured in degrees Fahrenheit.

# Your job is to plot all columns as a multi-line plot, to see the nature of the vertical scaling problem. Then, use a list of column names passed into the DataFrame df[column_list] to limit plotting to just one column, and then just 2 columns of data. When you are finished, you will have created 4 plots. You can cycle through them by clicking on the 'Previous Plot' and 'Next Plot' buttons.

# As in the previous exercise, inspect the DataFrame df in the IPython Shell using the .head() and .info() methods.

# Instructions

# Plot all columns together on one figure by calling df.plot(), and noting the vertical scaling problem.
# Plot all columns as subplots. To do so, you need to specify subplots=True inside .plot().
# Plot a single column of dew point data. To do this, define a column list containing a single column name 'Dew Point (deg F)', and call df[column_list1].plot().
# Plot two columns of data, 'Temperature (deg F)' and 'Dew Point (deg F)'. To do this, define a list containing those column names and pass it into df[], as df[column_list2].plot().

# Plot all columns (default)
df.plot()
plt.show()

# Plot all columns as subplots
df.plot(subplots=True)
plt.show()

# Plot just the Dew Point data
column_list1 = ['Dew Point (deg F)']
df[column_list1].plot()
plt.show()

# Plot the Dew Point and Temperature data, but not the Pressure data
column_list2 = ['Temperature (deg F)', 'Dew Point (deg F)']
df[column_list2].plot()
plt.show()
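
# Optional aside (not part of the exercise): another way to handle the vertical
# scaling problem is to move the pressure series onto its own right-hand axis
# with the secondary_y keyword of .plot(). The column name below is an
# assumption about how the pressure column is labeled in this dataset.
df.plot(secondary_y=['Pressure (atm)'])
plt.show()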
