SQL: Add is_active to sys.segments, update examples and docs. #11550
`is_active` is short for: `(is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1`

It's important because this represents "all the segments that should be queryable, whether or not they actually are right now". Most of the time, this is the set of segments that people will want to look at. The web console already adds this filter to a lot of its queries, proving its usefulness.

This patch also reworks the caveat at the bottom of the sys.segments section, so its information is mixed into the description of each result field. This should make it more likely for people to see the information.
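To make the expression in the description concrete, here is a minimal Java sketch of the same predicate. The class `SegmentFlags` and the method `isActive` are hypothetical names used only for illustration; they are not part of this patch or of Druid. The sketch assumes nothing beyond the 1/0 LONG convention that `sys.segments` uses for boolean columns.

```java
// Hypothetical helper, not part of Druid: mirrors the is_active expression
// from the PR description using the 1 = true, 0 = false LONG convention.
final class SegmentFlags {
  private SegmentFlags() {}

  /** Returns 1 when (is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1, else 0. */
  static long isActive(long isPublished, long isOvershadowed, long isRealtime) {
    boolean publishedAndNotOvershadowed = isPublished == 1 && isOvershadowed == 0;
    boolean realtime = isRealtime == 1;
    return (publishedAndNotOvershadowed || realtime) ? 1L : 0L;
  }
}
```

Under this reading, a published segment that has been fully overshadowed is not active, while a segment served only by realtime tasks is active even before it is published.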
The doc part is quite clear and helpful, thanks. Suggested a few refinements.
docs/querying/sql.md
Outdated
|is_available|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is currently being served by any process(Historical or realtime). See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
|is_realtime|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is _only_ served by realtime tasks, and 0 if any historical process is serving this segment.|
|is_overshadowed|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, is_overshadowed is always false for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
|num_rows|LONG|Number of rows in this segment. This field is updated in the background and cached on the Broker. It may be null if the Broker has not gathered a row count for this segment yet. It may not match the result of `count(*)` queries on realtime data, because the cached value on the Broker may be out of date, and because different replicas of realtime segments may not be in sync with each other. Once a segment is published, its row count will settle and stop changing.|
Change the wording a bit? Seems the key bit for a user to know is: For a published segment, the number will either be null or accurate. If null, then the Broker has not received the row count yet. For an unpublished segment, the number will be slightly out of date as new data arrives. (Assuming this is an accurate statement.)
There's a little bit of delay between when a segment is published and when num_rows becomes fully accurate, because it's fetched via doing a query to a data server, rather than appearing in the published segment descriptor. I updated the wording to the following, which is hopefully more clear:
Number of rows in this segment, or zero if the number of rows is not known.
This row count is gathered by the Broker in the background. It will be zero if the Broker has not gathered a row count for this segment yet. For segments ingested from streams, the reported row count may lag behind the result of a `count(*)` query because the cached `num_rows` on the Broker may be out of date. This will settle shortly after new rows stop being written to that particular segment.
(I also changed "null" to "zero" because that's what it actually is.)
It is unfortunate that the state of "we don't know the number of rows yet because we haven't finished checking" is zero. I'd rather have it be null or some other indication ("?" or "processing"... I know it is not easy to find an alternative). Until now I was puzzled by why there is a lag during which rows are zero in the web console and then suddenly they are not. This happened when I was working on tombstones, because "zero" rows was an indication to me that the segment might be a tombstone, until it was not, lol.
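For readers who hit the same ambiguity on the client side, one possible workaround is to surface a zero `num_rows` as "unknown" rather than as a real count. The Java sketch below is only an illustration: `RowCounts` and `knownRowCount` are hypothetical names, not Druid APIs, and the sketch assumes the caller is willing to treat every zero as unknown. That assumption is lossy, because a segment that genuinely has zero rows (such as the tombstones mentioned above) cannot be told apart from one whose count has not been gathered yet.

```java
import java.util.OptionalLong;

// Hypothetical client-side helper, not part of Druid.
final class RowCounts {
  private RowCounts() {}

  /**
   * Maps the raw num_rows value to an optional count.
   * Assumption: zero is treated as "not known yet", so a genuinely
   * empty segment is also reported as unknown.
   */
  static OptionalLong knownRowCount(long numRows) {
    return numRows == 0 ? OptionalLong.empty() : OptionalLong.of(numRows);
  }
}
```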
docs/querying/sql.md
Outdated
|is_realtime|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is _only_ served by realtime tasks, and 0 if any historical process is serving this segment.|
|is_overshadowed|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, is_overshadowed is always false for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
|num_rows|LONG|Number of rows in this segment. This field is updated in the background and cached on the Broker. It may be null if the Broker has not gathered a row count for this segment yet. It may not match the result of `count(*)` queries on realtime data, because the cached value on the Broker may be out of date, and because different replicas of realtime segments may not be in sync with each other. Once a segment is published, its row count will settle and stop changing.|
|is_active|LONG|Boolean represented as long type where 1 = true, 0 = false. True for segments that are either available and queryable, or _should be_ available and querayble. Equivalent to `(is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1`.|
This is the second (third) place in the docs that emphasizes should. Is this notion explained anywhere? Does this mean that the segment is scheduled to load into a Historical, but has not yet done so? Or, does it mean there is some kind of problem that the user must resolve?
The context for the "should be" is that everything related to ingestion and segment availability happens in the background and is asynchronous. So some segments should be available but aren't right now, and the system will work to make them available. Others are available but shouldn't be (because they were dropped or replaced), and the system will work to make them unavailable.
I changed the wording to hopefully be more clear:
True for segments that represent the latest state of a datasource.
Equivalent to `(is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1`. In steady state, when no ingestions or data management operations are happening, `is_active` will be equivalent to `is_available`. However, they may differ from each other when ingestions or data management operations have executed recently. In these cases, Druid will load and unload segments appropriately to bring actual availability in line with the expected state given by `is_active`.
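The relationship between the two flags described above can be sketched in Java as a small classifier. `AvailabilityCheck`, `State`, and `classify` are hypothetical names chosen for this illustration and do not exist in Druid; the sketch assumes both inputs use the 1/0 LONG convention of `sys.segments`.

```java
// Hypothetical illustration, not part of Druid: compares the expected state
// (is_active) with the observed state (is_available) for one segment.
final class AvailabilityCheck {
  private AvailabilityCheck() {}

  enum State {
    IN_SYNC,              // expected and actual availability agree (steady state)
    ACTIVE_NOT_AVAILABLE, // expected to be served, but no process serves it right now
    AVAILABLE_NOT_ACTIVE  // still served, but no longer part of the latest state
  }

  static State classify(long isActive, long isAvailable) {
    if (isActive == isAvailable) {
      return State.IN_SYNC;
    }
    return isActive == 1 ? State.ACTIVE_NOT_AVAILABLE : State.AVAILABLE_NOT_ACTIVE;
  }
}
```

The two mismatch states describe what is observed at one point in time. They deliberately do not promise that a segment will eventually move to `IN_SYNC`.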
docs/querying/sql.md
Outdated
|is_overshadowed|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, is_overshadowed is always false for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
|num_rows|LONG|Number of rows in this segment. This field is updated in the background and cached on the Broker. It may be null if the Broker has not gathered a row count for this segment yet. It may not match the result of `count(*)` queries on realtime data, because the cached value on the Broker may be out of date, and because different replicas of realtime segments may not be in sync with each other. Once a segment is published, its row count will settle and stop changing.|
|is_active|LONG|Boolean represented as long type where 1 = true, 0 = false. True for segments that are either available and queryable, or _should be_ available and querayble. Equivalent to `(is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1`.|
|is_published|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 represents this segment has been published to the metadata store with `used=1`. See the [segment lifecycle documentation](../design/architecture.md#segment-lifecycle) for more details.|
Presumably "published to the metadata store" means "by the MiddleManager at the completion of ingestion"?
Yes.
docs/querying/sql.md
Outdated
|is_published|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 represents this segment has been published to the metadata store with `used=1`. See the [segment lifecycle documentation](../design/architecture.md#segment-lifecycle) for more details.|
|is_available|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 if this segment is currently being served by any process(Historical or realtime). See the [segment lifecycle documentation](../design/architecture.md#segment-lifecycle) for more details.|
|is_realtime|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 if this segment is _only_ served by realtime tasks, and 0 if any historical process is serving this segment.|
|is_overshadowed|LONG|Boolean represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, is_overshadowed is always 0 for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [segment lifecycle documentation](../design/architecture.md#segment-lifecycle) for more details.|
Nit: consistent use of code font: `is_overshadowed`.
Thanks, fixed.
docs/querying/sql.md
Outdated
|shard_spec|STRING|JSON-serialized form of the segment `ShardSpec`|
|dimensions|STRING|JSON-serialized form of the segment dimensions|
|metrics|STRING|JSON-serialized form of the segment metrics|
|last_compaction_state|STRING|JSON-serialized form of the compaction task's config (compaction task which created this segment). May be null if segment was not created by compaction task.|

For example to retrieve all segments for datasource "wikipedia", use the query:
For example to retrieve all currently-active segments for datasource "wikipedia", use the query:
For example to retrieve all currently-active segments for datasource "wikipedia", use the query:
For example to retrieve all currently active segments for datasource "wikipedia", use the query:
I've merged master with this branch and re-pushed it. The doc changes are now made in |
|is_overshadowed|LONG|Boolean is represented as long type where 1 = true, 0 = false. 1 if this segment is published and is _fully_ overshadowed by some other published segments. Currently, is_overshadowed is always false for unpublished segments, although this may change in the future. You can filter for segments that "should be published" by filtering for `is_published = 1 AND is_overshadowed = 0`. Segments can briefly be both published and overshadowed if they were recently replaced, but have not been unpublished yet. See the [Architecture page](../design/architecture.md#segment-lifecycle) for more details.|
|shard_spec|STRING|JSON-serialized form of the segment `ShardSpec`|
|num_rows|LONG|Number of rows in this segment, or zero if the number of rows is not known.<br /><br />This row count is gathered by the Broker in the background. It will be zero if the Broker has not gathered a row count for this segment yet. For segments ingested from streams, the reported row count may lag behind the result of a `count(*)` query because the cached `num_rows` on the Broker may be out of date. This will settle shortly after new rows stop being written to that particular segment.|
|is_active|LONG|True for segments that represent the latest state of a datasource.<br /><br />Equivalent to `(is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1`. In steady state, when no ingestion or data management operations are happening, `is_active` will be equivalent to `is_available`. However, they may differ from each other when ingestion or data management operations have executed recently. In these cases, Druid will load and unload segments appropriately to bring actual availability in line with the expected state given by `is_active`.|
Very minor nit: ... At the end of this great explanation, just to repeat it so it sticks: "given by `is_active`. In other words, a segment that is in the `is_active` state may not be available, not queryable, yet, but it will be in the near future".
"might be"? I guess it is possible that due to some other activities (segment was overshadowed before being available for instance) a segment in `is_active` may never make it to `is_available`....
Yeah: there's a couple reasons a segment in `is_active` state won't eventually become `is_available`. Maybe it's dropped before that happens. Or maybe something is broken. In the interest of keeping the doc from getting too long I'm thinking to leave it as-is. But I invite follow-up patches that improve things 🙂
@@ -313,7 +316,10 @@ public Enumerable<Object[]> scan(DataContext root)
(long) segment.getShardSpec().getPartitionNum(),
numReplicas,
numRows,
IS_PUBLISHED_TRUE, //is_published is true for published segments
//is_active is true for published segments that are not overshadowed
val.isOvershadowed() ? IS_ACTIVE_FALSE : IS_ACTIVE_TRUE,
Hmm, isn't it a requirement for being active that `is_published` be true and `is_overshadowed` be false? Oh, got it. We already know that it is published if we are here. So it is fine, never mind.
Yes, that's the idea. The branch is for published segments only.
LGTM. Thanks @gianm, this PR will be very helpful to the community.
Thanks for reviewing, @loquisgon!