Skip to content
This repository has been archived by the owner on Jan 26, 2021. It is now read-only.

Review from first #1

Open
wants to merge 3 commits into
base: review
Choose a base branch
from
Open

Review from first #1

wants to merge 3 commits into from

Conversation

fanneiOZ
Copy link

No description provided.

Copy link
Author

@fanneiOZ fanneiOZ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

By following requirements, this exercise work doesn't meet them.

Clean design and sound software engineering practices.
Structured repo so can be used as working repo in future.
Production-ready solution.

Schema designing need more critical thinking. For junior level, this is okay. for mid, acceptable, but for senior, sorry to say it's not there yet.

Comment on lines +1 to +5
DBHOST=localhost
DBUSER=demouser
DBPASSWORD=demopassword
DATABASE=financial_database
TABLE=financial
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ETL script refer to .env file but none found in repository. I add this one my self

cols = '`,`'.join([str(i) for i in list_col])

''' CONNECT TO THE DATABASE '''
connection = pymysql.connect(
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implies that we need MySQL running on default port 3306 only. No flexibility for port connection

USE financial_database;

DROP TABLE IF EXISTS financial;
CREATE TABLE financial ( financial_id int unsigned not null auto_increment, segment varchar(128), country varchar(128), product varchar(128), discount_band varchar(128), units_sold float, manufacturing_price int, sale_price int, gross_sales float, discounts float, sales float, cogs float, profit float, date datetime, month_number int, month_name varchar(128), year int, primary key (financial_id) );
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About schema design

month_name varchar(128)
We would never have any english month name exceed 128 length.

product varchar(128)
Excel has 32,767 characters limit per cells. These will easily lead to truncation issue if data in xlsx is change.

manufacturing_price int, sale_price int
Price values as Integer ???

date datetime
Why date time type if data will never specify the time?

Mostly Top set 128-char limit for every varchar fields. Some won't need that long, Some might exceed. So in point of foreseeing / scalable design, this design doesn't meet this scenario.

@@ -0,0 +1,9 @@
INSERT INTO mysql.user (User,Host,authentication_string,ssl_cipher,x509_issuer,x509_subject) VALUES('demouser','localhost',PASSWORD('demopassword'),'','','');
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PASSWORD() is compatible only for MySQL 5 or less. As the repository doesn't specific the version of MySQL (they released version 8) already. This simply breaks the execution.

@@ -0,0 +1,9 @@
INSERT INTO mysql.user (User,Host,authentication_string,ssl_cipher,x509_issuer,x509_subject) VALUES('demouser','localhost',PASSWORD('demopassword'),'','','');
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's CREATE USER for simply user creation, but Top chose to use INSERT into user table. This is not wrong but it looks like he thinks too complex for simply requirement but too simply for detailed requirement. IMO 🤔

for i, row in df['Sheet1'].iterrows():
sql = 'INSERT INTO `'+table+'` (`'+cols +'`) VALUES ('+'%s,'*(len(row)-1)+'%s)'
# row represents the data points in DataFrame
cursor.execute(sql, tuple(row))
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://medium.com/analytics-vidhya/speed-up-bulk-inserts-to-sql-db-using-pandas-and-python-61707ae41990

If there's bulk statement compatibility or cursor.executemany, why opt to INSERT one by one??

Comment on lines +9 to +14
load_dotenv() # Load .env
dbhost = os.getenv('DBHOST', "")
dbuser = os.getenv('DBUSER', "")
dbpassword = os.getenv('DBPASSWORD', "")
database = os.getenv('DATABASE', "")
table = os.getenv('TABLE', "")
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talking in developer world, this one needs encapsulation. It's okay to do this for single script, but what in real world, we never have only single magic script. Every scripts will have to change this part if connection have to be changed.

Comment on lines +9 to +14
load_dotenv() # Load .env
dbhost = os.getenv('DBHOST', "")
dbuser = os.getenv('DBUSER', "")
dbpassword = os.getenv('DBPASSWORD', "")
database = os.getenv('DATABASE', "")
table = os.getenv('TABLE', "")
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Environment variable for single table name ??

@@ -0,0 +1,58 @@
from dotenv import load_dotenv
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About ETL Script

This script is fixed to one task. It will require some work to reuse this for other task.

cursor.execute(sql, tuple(row))

# The connection is not autocommitted by default, so we must commit to save our changes
connection.commit()
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Txn will be commit but if this one failed it will never be rollback.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants