-
Notifications
You must be signed in to change notification settings - Fork 14k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Remove expensive logs table migration #11920
Conversation
Codecov Report
@@ Coverage Diff @@
## master #11920 +/- ##
==========================================
- Coverage 67.68% 67.64% -0.05%
==========================================
Files 930 932 +2
Lines 45132 45161 +29
Branches 4331 4331
==========================================
+ Hits 30549 30550 +1
- Misses 14480 14508 +28
Partials 103 103
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
a5a8eb2
to
f89b9da
Compare
superset/utils/log.py
Outdated
@@ -224,6 +224,9 @@ def log( # pylint: disable=too-many-arguments,too-many-locals | |||
logs = list() | |||
for record in records: | |||
json_string: Optional[str] | |||
record.update( | |||
{"path": path, "path_no_int": path_no_int, "ref": ref,} | |||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably happen in log_context
so you don't have to pass these arguments around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
1f221c4
to
6cf8428
Compare
superset/migrations/shared/utils.py
Outdated
insp = reflection.Inspector.from_engine(engine) | ||
has_column = False | ||
try: | ||
for col in insp.get_columns(table): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe,
return any(col["name"] == column for col in insp.get_columns(table))
This would replace line 37 and lines 40 - 43.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fyi: https://stackoverflow.com/a/52865284
but i'm realizing now, this is a pretty janky function. let me fix it
6cf8428
to
23cf625
Compare
def upgrade(): | ||
with op.batch_alter_table("logs") as batch_op: | ||
if utils.table_has_column("logs", "path"): | ||
batch_op.drop_column("path") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@etr2460 this isn't quite right. If the column exists one should really iterate over the non-NULL values and migrate those to the json
column. That could be somewhat of an expensive operation but given the relevancy (or lack there of) and recency of the original migration it may be ok to accept the data loss.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, i made the executive decision to drop any extra metadata that came in in the last week from the previous change. I think it's reasonable since i doubt many folks are relying on them thus far
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's ok to drop the metadata
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you for picking up the pieces here.
For the record, listing out some reasons why we don't recommend using the DBEventLogger
at scale:
- analytics event logging happens in the main thread and adds a tax to every logged request (slows down the web server a bit)
- puts significant extra load on the metadata database, that's already busy with the OLTP workload
logs
quickly becomes the largest table in your database [by far!], making backup and restore operations much more expensive and time consuming than they should be
For the record, at Preset we use Segment and configure it to send our logs to BigQuery.
23cf625
to
89bd58e
Compare
I think it's useful to keep logs for things like object creation/deletion for auditing purposes. But for UX analytics and simple access logs, we should probably target moving it out of the metadata database? Should we create two separate loggers for this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with one naming suggestion.
superset/utils/log.py
Outdated
payload.update( | ||
{ | ||
"path": request.path, | ||
"path_no_int": strip_int_from_path(request.path), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we name this to path_no_param
in case we want to update the logic in strip_int_from_path
as discussed in #11714 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as mentioned in the summary, this PR:
Does not:
- Address any naming concerns about the new columns
I'd rather keep this PR just scoped to fixing the migration issue. We can fast follow with naming changes if we all agree they're preferred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think path_no_param
was agreed on. Just thought it's better to make that change sooner so we don't have any dirty data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if you just merge this after CI is green, I can make another followup PR soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to completely drop path_no_int
and the strip_int_from_path
method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also ok fine with ref
-> object_ref
too. I'm happy to do in a fast follow PR too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming is hard. I'm a fan of measure twice cut once, as renaming things (or dealing with legacy names) in the future is always somewhat painful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, i see the path_no_param
agreement from the previous pr. tbh, CI is sooo slow tonight i probably won't merge this until tomorrow. So i'll make the change now.
89bd58e
to
14fd7fc
Compare
Since CI is green, i'm going to merge this now. Feel free to follow up with any naming PRs today or next week |
(cherry picked from commit 77d362d)
@etr2460 @mistercrunch #11927 is ready for review |
SUMMARY
Addresses issues with #11714 that prevented the migration from being run on MySQL dbs with large
logs
tables.Does the following:
Does not:
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
TEST PLAN
ADDITIONAL INFORMATION
to: @mistercrunch @john-bodley @dpgaspar @ktmud @graceguo-supercat