-
Notifications
You must be signed in to change notification settings - Fork 3
/
DataTransformation.malloysql
156 lines (126 loc) · 4.87 KB
/
DataTransformation.malloysql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
>>>markdown
This is Malloy's .malloysql file. It has the same format (notebook) as .malloynb but serves different purpose. We can write both Malloy and SQL syntax in this file so it is perfect for transforming data, creating views in database or copying csv to parquet.
I am using it here to:
1. change .csv source files to parquet (better performance) using duckdb's COPY..TO
2. clean the data (mainly changing 'NULL' values to proper nulls and not selecting some of the columns)
>>>markdown
As first step i've specified .csv sources in 100% Malloy cell. The names of these sources will be then used in COPY..TO operation.
This is a second version of this notebook. Thrown away first version as i was unaware that duckdb can substitue choosen string with null value. Thanks to Malloy's Slack channel i now know that and it simplified the code a lot.
(An example of my first approach where i used Malloy's pick expression to change 'NULL' to proper null:
```malloy
region is region ? pick region when != 'NULL' else null
```
It worked, but it is not efficient.)
In sources i mix sql (only SELECT * FROM) with Malloy (source specification, source extension with except to remove unwanted columns):
>>>malloy
source: csv_categories is duckdb.table('../northwind_test/raw_csv/categories.csv')
--duckdb.table('../northwind_test/raw_csv/products.parquet')
extend {
except: picture
}
source: csv_customers is duckdb.sql("""
SELECT * FROM read_csv('../northwind_test/raw_csv/customers.csv',
auto_detect=true,
nullstr = 'NULL',
HEADER = true)
""")
source: csv_employee_terr is duckdb.table('../northwind_test/raw_csv/employee_territories.csv')
source: csv_employees is duckdb.sql("""
SELECT * FROM read_csv('../northwind_test/raw_csv/employees.csv',
auto_detect=true,
nullstr = 'NULL')
""")
extend {
except: notes, photoPath, photo, homePhone
}
source: csv_orders_det is duckdb.table('../northwind_test/raw_csv/order_details.csv')
source: csv_orders is duckdb.sql("""
SELECT * FROM read_csv('../northwind_test/raw_csv/orders.csv',
auto_detect=true,
nullstr = 'NULL')
""")
source: csv_products is duckdb.sql("""
SELECT * FROM read_csv('../northwind_test/raw_csv/products.csv',
auto_detect=true,
nullstr = 'NULL')
""")
source: csv_regions is duckdb.table('../northwind_test/raw_csv/regions.csv')
source: csv_shippers is duckdb.table('../northwind_test/raw_csv/shippers.csv')
source: csv_territories is duckdb.table('../northwind_test/raw_csv/territories.csv')
source: csv_suppliers is duckdb.sql("""
SELECT * FROM read_csv('../northwind_test/raw_csv/suppliers.csv',
auto_detect=true,
nullstr = 'NULL')
""")
extend {
except: fax, phone, homePage
}
>>>markdown
Second step is writing Duckdb 'COPY...TO' for each of .csv files. Here i am using Malloy SQL cells. As i removed unwanted columns earlier i can use project: * in each activity:
>>>sql
-- connection: duckdb
copy %{
csv_categories -> {
project: *
}
}% to 'C:\Malloy\Northwind\northwind_test\parquet\categories.parquet' (FORMAT 'parquet', CODEC 'ZSTD')
>>>sql
copy %{
csv_customers -> {
project: *
}
}% to 'C:\Malloy\Northwind\northwind_test\parquet\customers.parquet' (FORMAT 'parquet', CODEC 'ZSTD')
>>>sql
copy %{
csv_employee_terr -> {
project: *
}
}% to 'C:\Malloy\Northwind\northwind_test\parquet\employee_territories.parquet' (FORMAT 'parquet', CODEC 'ZSTD')
>>>sql
copy %{
csv_employees -> {
project: *
}
}% to 'C:\Malloy\Northwind\northwind_test\parquet\employees.parquet' (FORMAT 'parquet', CODEC 'ZSTD')
>>>sql
copy %{
csv_orders_det -> {
project: *
}
}% to 'C:\Malloy\Northwind\northwind_test\parquet\order_details.parquet' (FORMAT 'parquet', CODEC 'ZSTD')
>>>sql
copy %{
csv_orders -> {
project: *
}
}% to 'C:\Malloy\Northwind\northwind_test\parquet\orders.parquet' (FORMAT 'parquet', CODEC 'ZSTD')
>>>sql
copy %{
csv_products -> {
project: *
}
}% to 'C:\Malloy\Northwind\northwind_test\parquet\products.parquet' (FORMAT 'parquet', CODEC 'ZSTD')
>>>sql
copy %{
csv_regions -> {
project: *
}
}% to 'C:\Malloy\Northwind\northwind_test\parquet\regions.parquet' (FORMAT 'parquet', CODEC 'ZSTD')
>>>sql
copy %{
csv_shippers -> {
project: *
}
}% to 'C:\Malloy\Northwind\northwind_test\parquet\shippers.parquet' (FORMAT 'parquet', CODEC 'ZSTD')
>>>sql
copy %{
csv_suppliers -> {
project: *
}
}% to 'C:\Malloy\Northwind\northwind_test\parquet\suppliers.parquet' (FORMAT 'parquet', CODEC 'ZSTD')
>>>sql
copy %{
csv_territories -> {
project: *
}
}% to 'C:\Malloy\Northwind\northwind_test\parquet\territories.parquet' (FORMAT 'parquet', CODEC 'ZSTD')