You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I have recently tried to implement the piicatcher library using api mode into my workflow to identfy any Pii data in our databases. I am running in AWS RDS Clusters which can have up to 660 databases inside it. When running PIICatcher in this environment, the library tries to make queries to the INFORMATION_SCHEMA table which holds information about every other database in the rds cluster as well. Here is the exact query it runs
SELECT
lower(c.column_name) AS col_name,
c.column_comment AS col_description,
lower(c.data_type) AS col_type,
lower(c.ordinal_position) AS col_sort_order,
c.table_catalog AS cluster,
lower(c.table_schema) AS "schema",
lower(c.table_name) AS name,
t.table_comment AS description,
case when lower(t.table_type) = "view" then "true" else "false" end AS is_view
FROM
INFORMATION_SCHEMA.COLUMNS AS c
LEFT JOIN
INFORMATION_SCHEMA.TABLES t
ON c.TABLE_NAME = t.TABLE_NAME
AND c.TABLE_SCHEMA = t.TABLE_SCHEMA
WHERE
c.table_schema NOT IN ('information_schema', 'performance_schema', 'sys', 'mysql')
ORDER by cluster, "schema", name, col_sort_order
This query still runs even when you are trying to scan a single database and pass include/exclude schema flags.
When you have a large number of databases, this query can take up to 100+ hours to complete.
I have come up with a more efficient query that can take up to 10min to run (even with a huge INFORMATION_SCHEMA table)
SELECT
c.column_name AS col_name,
c.column_comment AS col_description,
c.data_type AS col_type,
c.ordinal_position AS col_sort_order,
c.table_catalog AS cluster,
c.table_schema AS "schema",
c.table_name AS name,
t.table_comment AS description,
case when lower(t.table_type) = "view" then "true" else "false" end AS is_view
FROM
INFORMATION_SCHEMA.TABLES t
left outer JOIN INFORMATION_SCHEMA.COLUMNS AS c
on c.TABLE_SCHEMA = t.TABLE_SCHEMA
and c.TABLE_NAME = t.TABLE_NAME
where c.TABLE_SCHEMA="<insert_requested_db_name>";
I believe that this should be implemented into the lib in order to speed up the scanning of databases in cluster configurations (like RDS). This would be a huge time-saving method and I can see it increasing the adoption of the project. I would like to contribute to this, however, I have no idea where this needs to be added to the lib. I have taken a look into the PII catcher code and could not find where this query comes from. Is this from a dependency (sqlalchemy perhaps?).
The text was updated successfully, but these errors were encountered:
The only way to customise that query is via the where_clause_suffix, which is being set in the dbcat _create_mysql_extractor method. It defaults to
WHERE
c.table_schema NOT IN ('information_schema', 'performance_schema', 'sys', 'mysql')
Maybe you can try changing that clause to filter on only the selected DB names? I'm not sure if that would yield the same performance improvement as the modified query you suggested, though.
This is a useful optimization. @tokernjan is right about the flow control. Another possibility is to implement a custom extractor which is similar to mysql_metadata_extractor and rewrite the query on the information schema.
I have recently tried to implement the piicatcher library using api mode into my workflow to identfy any Pii data in our databases. I am running in AWS RDS Clusters which can have up to 660 databases inside it. When running PIICatcher in this environment, the library tries to make queries to the
INFORMATION_SCHEMA
table which holds information about every other database in the rds cluster as well. Here is the exact query it runsThis query still runs even when you are trying to scan a single database and pass include/exclude schema flags.
When you have a large number of databases, this query can take up to 100+ hours to complete.
I have come up with a more efficient query that can take up to 10min to run (even with a huge
INFORMATION_SCHEMA
table)I believe that this should be implemented into the lib in order to speed up the scanning of databases in cluster configurations (like RDS). This would be a huge time-saving method and I can see it increasing the adoption of the project. I would like to contribute to this, however, I have no idea where this needs to be added to the lib. I have taken a look into the PII catcher code and could not find where this query comes from. Is this from a dependency (sqlalchemy perhaps?).
The text was updated successfully, but these errors were encountered: