DataScience Jupyter notebook in Docker

If you want to dive in deeper, these are great places to start:

Instructions

Make sure you update the path of the volume where your data will be saved in the docker-compose.yml file.
Run: docker-compose up
From the logs, copy the localhost URL that includes the token and paste it into your browser.
Stop: docker-compose down

Workshop

Data exploration

Read the data

import pandas as pd

filename = 'top50.csv'
df = pd.read_csv(filename, encoding='ISO-8859-1', index_col=0)

Preview the data
What is the shape of the data?
Rename the columns

df.rename(
    columns={
        'Track.Name': 'track_name',
        'Artist.Name': 'artist_name',
        'Beats.Per.Minute': 'beats_per_minute',
        'Loudness..dB..': 'Loudness(dB)',
        'Valence.': 'Valence',
        'Length.': 'Length',
        'Acousticness..': 'Acousticness',
        'Speechiness.': 'Speechiness',
    },
    inplace=True,
)
df.head()
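
The preview and shape questions above have no code of their own. A minimal sketch, assuming a small made-up frame in place of top50.csv (the name `toy` and all its values are hypothetical), might look like this:

```python
import pandas as pd

# Hypothetical three-row stand-in for top50.csv; the values are made up.
toy = pd.DataFrame(
    {
        "track_name": ["Song A", "Song B", "Song C"],
        "artist_name": ["Artist X", "Artist Y", "Artist X"],
        "Popularity": [80, 75, 90],
    }
)

preview = toy.head(2)        # first two rows only
n_rows, n_cols = toy.shape   # shape is a (rows, columns) tuple

print(preview)
print(f"{n_rows} rows, {n_cols} columns")
```

On the real data, the same two calls would be `df.head()` and `df.shape`.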

Check for null values

df.isnull().sum()

Fill the null values

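# Alternative: replace nulls with a constant, e.g. df.fillna(0)
# Fill nulls in numeric columns with the column mean;
# numeric_only=True skips text columns such as the artist name
df.fillna(df.mean(numeric_only=True), inplace=True)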
df.head()
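
To see what the null check and the mean fill do, here is a small sketch on a made-up frame with one missing value (the name `toy` and its contents are hypothetical). It passes `numeric_only=True` so that the mean is computed over numeric columns only, which avoids an error on text columns in recent pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a single missing Energy value
toy = pd.DataFrame(
    {
        "artist_name": ["Artist X", "Artist Y", "Artist Z"],
        "Energy": [60.0, np.nan, 80.0],
        "Popularity": [80, 75, 90],
    }
)

null_counts = toy.isnull().sum()       # nulls per column
means = toy.mean(numeric_only=True)    # column means, text columns skipped
filled = toy.fillna(means)             # each null gets its column's mean

print(null_counts)
print(filled)
```

Filling with the mean keeps the column average unchanged, but it is only one possible choice; a constant or the median may suit other data better.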

Get a list of all genres

genre_list = df['Genre'].values.tolist()

List the frequency of all artists

popular_artist=df.groupby('artist_name').size()
print(popular_artist)

List all artists

artist_list=df['artist_name'].values.tolist()
print(artist_list)
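
The lists above contain one entry per row, so repeated artists and genres appear several times. As an alternative sketch, `value_counts()` and `unique()` give sorted frequencies and distinct values directly. The frame `toy` and its contents are made up for illustration.

```python
import pandas as pd

# Hypothetical data; artist and genre names are placeholders
toy = pd.DataFrame(
    {
        "artist_name": ["Artist X", "Artist Y", "Artist X", "Artist Z"],
        "Genre": ["pop", "dance pop", "pop", "latin"],
    }
)

# Frequency per artist, most frequent first
artist_counts = toy["artist_name"].value_counts()

# Distinct genres, sorted alphabetically
unique_genres = sorted(toy["Genre"].unique())

print(artist_counts)
print(unique_genres)
```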

Describe the data
Make a nice plot

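import matplotlib.pyplot as plt

# Keep only the numeric columns, since a histogram needs numeric values
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = df.select_dtypes(include=numerics)
columns = list(newdf.columns)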

for col in columns:
    plt.ylabel('frequency')
    plt.xlabel(col)
    plt.hist(newdf[col], bins=20)
    plt.show()
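
The "describe the data" step has no code of its own. One way to approach it, sketched on a made-up two-column frame (`toy` is hypothetical), is `DataFrame.describe()`, which reports count, mean, standard deviation, min, quartiles and max for each numeric column:

```python
import pandas as pd

# Hypothetical numeric data standing in for the real columns
toy = pd.DataFrame({"Energy": [40, 60, 80], "Popularity": [70, 80, 90]})

summary = toy.describe()   # one column of statistics per numeric column
print(summary)
```

On the workshop data the equivalent call would be `df.describe()`.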

Training a model

Extract features and target:

x=df.loc[:,['Energy','Danceability','Length','Loudness(dB)','Acousticness']].values
y=df.loc[:,'Popularity'].values

Split into training and testing data

from sklearn.model_selection import train_test_split

# Create the training and test datasets (30% of the rows are held out for testing)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30)
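
Without a fixed seed, `train_test_split` shuffles differently on every run, so the errors computed later change from run to run. The sketch below assumes synthetic arrays with the same dimensions as the workshop data (50 rows, 5 features) and an arbitrary seed of 42, neither of which comes from the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 50-row dataset: 50 samples, 5 features
X = np.arange(250, dtype=float).reshape(50, 5)
y = np.arange(50, dtype=float)

# random_state=42 is an arbitrary seed chosen for this example
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Repeating the call with the same seed returns the same split
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.30, random_state=42)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test2))
```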

Make a linear regressor and get the model values

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)
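
The "model values" of a fitted `LinearRegression` are its coefficients and intercept, exposed as `coef_` and `intercept_`. The sketch below uses synthetic, noise-free data generated from a known line (an assumption made so the recovered values can be checked), not the workshop data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 2*x0 - 1*x1 + 5 exactly, with no noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 5.0

model = LinearRegression()
model.fit(X, y)

# One coefficient per feature, plus a single intercept
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
```

For the workshop model the same attributes would be `regressor.coef_` and `regressor.intercept_`, with one coefficient for each of the five selected features.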

Analyse the results

# Display the actual and predicted values side by side
y_pred = regressor.predict(X_test)
df_output = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(df_output)

Are these good or bad?

Quantify this

from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
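
An error value is hard to judge in isolation. One common way to decide whether it is good or bad is to compare it against a trivial baseline that always predicts the mean, and to look at R², which measures the same comparison on a scale where 1 is a perfect fit and 0 matches the baseline. The numbers below are made up for illustration, and the baseline uses the mean of the test targets for brevity; in practice the mean of the training targets would usually be used.

```python
import numpy as np
from sklearn import metrics

# Hypothetical actual and predicted popularity scores
y_true = np.array([70.0, 80.0, 90.0, 85.0])
y_hat = np.array([72.0, 78.0, 88.0, 86.0])

mae = metrics.mean_absolute_error(y_true, y_hat)
mse = metrics.mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)                    # same units as the target
r2 = metrics.r2_score(y_true, y_hat)

# Baseline: always predict the mean of the targets
baseline = np.full_like(y_true, y_true.mean())
baseline_mae = metrics.mean_absolute_error(y_true, baseline)

print("MAE:", mae, "baseline MAE:", baseline_mae)
print("MSE:", mse, "RMSE:", rmse)
print("R^2:", r2)
```

If the model's MAE is not clearly below the baseline MAE, the selected features carry little information about popularity.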
