Currently, all numeric fields (int and float) are mapped to PostgreSQL's [`numeric`](https://www.postgresql.org/docs/current/datatype-numeric.html) type, which is very inefficient for both storage and performance.
In DP+, we're inferring data types using qsv's `stats` command. We can guarantee correct data type inferences as we scan the whole file. Whilst scanning, we also compile descriptive statistics, e.g. from the qsv whirlwind tour:
```
$ qsv stats wcp.csv --everything | qsv table
```

| field | type | sum | min | max | min_length | max_length | mean | stddev | variance | lower_fence | q1 | q2_median | q3 | iqr | upper_fence | skew | mode | cardinality | nullcount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Country | String | | ad | zw | 2 | 2 | | | | | | | | | | | ru | 231 | 0 |
| City | String | | al lusayli | Þykkvibaer | 1 | 87 | | | | | | | | | | | san jose | 2008182 | 0 |
| AccentCity | String | | Al Lusayli | özlüce | 1 | 87 | | | | | | | | | | | San Antonio | 2031214 | 0 |
| Region | String | | 00 | Z4 | 0 | 2 | | | | -29.5 | 5 | 11 | 28 | 23 | 62.5 | 1.3036035769599401 | 04 | 392 | 4 |
| Population | Integer | 2290536125 | 7 | 31480498 | 0 | 8 | 48730.66387966977 | 308414.0418510231 | 95119221210.88461 | -33018 | 3730.5 | 10879 | 28229.5 | 24499 | 64978 | 0.36819008290764255 | | 28460 | 2652350 |
| Latitude | Float | 76585211.19776328 | -54.9333333 | 82.483333 | 1 | 12 | 28.371681223643343 | 21.938373536960917 | 481.292233447227 | -35.9076389 | 12.9552778 | 33.8666667 | 45.5305556 | 32.5752778 | 94.3934723 | -0.7514210842155992 | 50.8 | 255133 | 0 |
| Longitude | Float | 75976506.66429423 | -179.9833333 | 180 | 1 | 14 | 28.14618114715278 | 62.472858625866486 | 3902.8580648875004 | -98.49166745000002 | 2.383333 | 26.8802778 | 69.6333333 | 67.25000030000001 | 170.50833375000002 | 0.060789759344963286 | 23.1 | 407568 | 0 |
With `min`/`max`, we can see if integer and float columns will fit into PostgreSQL's `smallint`/`integer`/`bigint` and `real`/`double precision` types respectively, and only use `numeric`/`decimal` when the values would overflow the more efficient types. Having `min`/`max` also allows us to explore PostgreSQL's range types.
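As a rough illustration, here is a minimal Python sketch of such a mapping; the helper name, its inputs, and the float fallback rule are assumptions for illustration, not DP+ code:

```python
# Hypothetical helper: pick the narrowest PostgreSQL numeric type that a
# column's observed min/max will fit. Type ranges are from the PostgreSQL
# docs; everything else here is an assumption.
def pg_numeric_type(qsv_type: str, col_min: float, col_max: float) -> str:
    if qsv_type == "Integer":
        if -32768 <= col_min and col_max <= 32767:
            return "smallint"  # 2 bytes
        if -2147483648 <= col_min and col_max <= 2147483647:
            return "integer"   # 4 bytes
        if -9223372036854775808 <= col_min and col_max <= 9223372036854775807:
            return "bigint"    # 8 bytes
        return "numeric"       # arbitrary precision, last resort
    if qsv_type == "Float":
        # min/max say nothing about required precision, so default to
        # double precision and keep numeric for values beyond float8's range
        if abs(col_min) <= 1.7976931348623157e308 and abs(col_max) <= 1.7976931348623157e308:
            return "double precision"
        return "numeric"
    return "numeric"
```

For the `Population` column above (min 7, max 31480498), this would pick `integer` instead of `numeric`.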
With `min_length`/`max_length`, we can probably use `character varying` as well (though, per the Postgres documentation, there is really no performance benefit over `text`; still, for short strings with a low `max_length`, `character varying(n)` can enforce the inferred schema constraint at the DB level).
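A corresponding sketch for string columns, with an assumed helper name and an arbitrary length threshold:

```python
# Hypothetical helper: per the Postgres docs, varchar(n) is no faster than
# text, so the length limit only serves as an inferred schema constraint.
def pg_text_type(max_length: int, short_threshold: int = 255) -> str:
    if 0 < max_length <= short_threshold:
        return f"character varying({max_length})"
    return "text"
```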
With `cardinality`, we can even automatically index certain columns based on some rules, e.g. cardinality = rowcount means a unique index, and low cardinality below a certain threshold means creating an index to speed up `datastore_search_sql` queries (see #30).
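A minimal sketch of those two indexing rules, assuming a hypothetical helper and threshold (the rules may well end up different in #30):

```python
# Hypothetical helper: derive index DDL from a column's cardinality.
def index_ddl(table: str, column: str, cardinality: int, rowcount: int,
              low_card_threshold: int = 1000) -> str | None:
    # Every value is distinct, so enforce uniqueness at the DB level.
    if cardinality == rowcount:
        return f'CREATE UNIQUE INDEX ON "{table}" ("{column}");'
    # Low-cardinality columns are common filter targets in
    # datastore_search_sql, so a plain btree index is cheap and useful.
    if cardinality <= low_card_threshold:
        return f'CREATE INDEX ON "{table}" ("{column}");'
    return None  # high-cardinality, non-unique: leave unindexed
```

For the stats above, `Country` (cardinality 231) would get a plain index, while a column whose cardinality equals the rowcount would get a unique one.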
With `qsv frequency`, we can also compile frequency tables, and perhaps even exploit PostgreSQL's enumerated types when a column's frequency table contains only a few distinct values under a certain threshold.
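For instance, a hypothetical sketch of turning a small frequency table into an enum type (label escaping and deduplication are omitted):

```python
# Hypothetical helper: when a column has only a few distinct values, a
# PostgreSQL enumerated type stores each row's value in 4 bytes instead of
# repeating the full string. The threshold is an assumption.
def enum_ddl(column: str, freq_table: dict[str, int],
             enum_threshold: int = 10) -> str | None:
    if 0 < len(freq_table) <= enum_threshold:
        labels = ", ".join(f"'{value}'" for value in sorted(freq_table))
        return f"CREATE TYPE {column}_enum AS ENUM ({labels});"
    return None
```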
More efficient data type mapping will make the datastore more performant and space-efficient, allowing it to support faster `datastore_search_sql` queries. It would take the datastore beyond what is currently, IMHO, a de facto "FTS-enabled tabular blob store", and would be a big step in taking CKAN beyond metadata-catalog use cases towards an enterprise datastore with "data lake"-like capabilities.