-
Notifications
You must be signed in to change notification settings - Fork 157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using information schema #364
Conversation
Hey @dbeatty10 is this |
Or do you want me to be the judge of that :D |
@Fleid I think that @colin-rogers-dbt might have taken a look at this after me, but I'm not sure if he was able to get a breakthrough or ran into the same walls. Trade-offsIf I recall correctly, there are some trade-offs if this PR is merged as-is. Here are the key considerations :
The degradations above would affect the backwards-compatibility for folks that do have elevated permissions today, so Potential paths forwardIt sounds like General Mills has a way to get the If we can somehow infer the relevant region(s) to template within the Two different ideas we could try:
|
Hey @dbeatty10, it looks like you're using the better regex pattern to identify shards, like discussed here. |
That's awesome @Fleid ! Credit goes to @hassan-mention-me for the regex (and basically the rest of the implementation seen in this PR). I branched from his implementation in #238, and if this PR is merged @hassan-mention-me is listed in the changelog entry and commits are preserved as well. Giving #260 a re-read, it does look like the regex in this PR would solve it. However, I'd prefer to see explicit test cases added to confirm that the regex is working properly. Here are some test cases listed here that should not be considered shards:
|
Hey @dbeatty10, do you think we can make that one not a draft, and flag it |
@Fleid there's two things that make me uncomfortable with marking this as being "ready for review", both of which I would consider a breaking change:
My opinion: we should ensure that it's non-breaking first. Here's two ideas to make it non-breaking, neither of which I have tried:
Advantage of option 1:
Advantage of option 2:
|
I hear you. I'm thinking that Do you want another stab at this, or are you good with passing the baton? |
I'm marking this as
I'm passing the baton to whoever reviews this! Here's the best TLDR for you to read about proposed things to resolve: #364 (comment) |
Now tracking at #585 |
If the the biggest concern is losing the row count and size, wouldn't you be able to solve that by using the INFORMATION_SCHEMA.PARTITIONS table? Something like the below: SELECT
table_catalog as table_database,
table_schema as table_schema,
table_name as original_table_name,
concat(table_catalog, '.', table_schema, '.', table_name) as relation_id,
row_count as row_count,
size_bytes as size_bytes,
case when table_type = 'EXTERNAL' then 'external' else 'table' end as table_type,
regexp_contains(table_name, '^.+[0-9]{8}$') and table_type = 'BASE TABLE' as is_date_shard,
regexp_extract(table_name, '^(.+)[0-9]{8}$') as shard_base_name,
regexp_extract(table_name, '^.+([0-9]{8})$') as shard_name
FROM (
SELECT
table_catalog,
table_schema,
table_name,
table_type,
sum(total_rows) row_count,
sum(total_logical_bytes) size_bytes
FROM dataset_name.INFORMATION_SCHEMA.TABLES
LEFT OUTER JOIN dataset_name.INFORMATION_SCHEMA.PARTITIONS
USING (table_catalog, table_schema, table_name)
GROUP BY table_catalog, table_schema, table_name, table_type
) |
I'm new to this issue but my naive impression is that this looks like a good suggestion, are there any issues with doing it this way?
|
Closing in favor of #1213 |
resolves #113
branched from #238
Problem
As described here, the current implementation uses the
project.dataset.__TABLES__
metadata table which requires elevated permissions versus the information schema tables. Service accounts provisioned for dbt usage frequently do not have access to this table. This affectsdbt docs generate
.Proposed solution
Use
project.dataset.INFORMATION_SCHEMA.TABLES
instead.Trade-offs and outstanding questions
The problem is that it doesn't contain
row_count
orsize_bytes
, so it is not a perfect drop-in replacement. This could be possibly overcome by using theINFORMATION_SCHEMA.TABLE_STORAGE
view to get the number of rows + the size in logical bytes, but that has its own complications (see below).0
for both?Alternatives
project.dataset.__TABLES__
and fall-back to the new method in case of failure?location
values withinINFORMATION_SCHEMA.SCHEMATA
and then iterate through them to generate all the uniqueregion-REGION.INFORMATION_SCHEMA.TABLE_STORAGE
queries? Then we'd have the row counts and sizes in bytes?Checklist
This PR includes tests, ornew tests are not required/relevant for this PRI have opened an issue to add/update docs, ordocs changes are not required/relevant for this PRchangie new
to create a changelog entry