Synthetic source #85649
I hacked together something to test the differences:
Which spits out:
The test data has values like |
One thing I've noticed that we probably don't want but I don't know how to get rid of is |
I've forbidden copy_to for synthetic source indices in this PR. We can figure out how to allow it later.
Two skipped
@romseygeek could you have another look at this? I've pushed some extra testing for round trips and it all passes. Well, sort of. I have to stub out a little of it because of mystery precision things. But I think we can get those in a follow up change.
@@ -203,4 +206,24 @@ protected void randomFetchTestFieldConfig(XContentBuilder b) throws IOException
    protected boolean allowsNullValues() {
        return false; // null is an error for constant keyword
    }
We have enough test cases that have to implement these four identical 'empty' methods that maybe it's worth consolidating them into a NoSyntheticSourceTest interface with default methods, and the test cases can just implement it?
For those following along at home this used to be activated with |
I didn't need him.
Now that this is merged I've moved the follow up work to a meta issue: #86603
LGTM. Thanks for all the back and forth, let's get this merged and look at the follow-ups.
I have some perf numbers from a hack that turns off
The short version is about 11% improvement in docs per second in TSDB, probably more in non-TSDB. Significantly faster merges, flushes, and refreshes - at least in TSDB, probably much faster in non-TSDB. TSDB in its current form has a somewhat inefficient indexing pipeline, mostly because it can never skip the
The merge time is funny to read - it looks like a 2% speed up, but I believe a lot of that speed up is being throttled. See the 55% bump in merge throttling time. My guess is that we're looking at a reduction in load from merge in the 25% range, similar to flush and refresh. Here's what the disk looks like with
Note the bursty writes. Here's what it looks like without
The writes are less bursty. Still bursty, but less so. I believe the infrastructure that I used to run this captured graphs of this data over a longer period of time, but I don't know how to access it. I'm digging. Edit:
This one is better: in the neighborhood of 17.5% rather than 11%.
Is there any way to disable creation of |
We've talked a little about this - rebuilding the _source on the fly using synthetic _source. At the time we decided it wasn't worth it because folks were looking at doing other kinds of replication. I believe they are still working on that. In that replication mechanism we wouldn't need |
Nik, thank you for your answer!
Bleh. And I do think |
This attempts to shrink the index by implementing a "synthetic _source" field.
You configure it in the mapping:
And we just stop storing the _source field - kind of. When you go to access the _source we regenerate it on the fly by loading doc values. Doc values don't preserve the original structure of the source you sent so we have to make some educated guesses. And we have a rule: the source we generate would result in the same index if you sent it back to us. That way you can use it for things like _reindex.
Fetching the _source from doc values does slow down loading somewhat. See numbers further down.
Supported fields
This only works for the following fields:
boolean
byte
date
double
float
geo_point (with precision loss)
half_float
integer
ip
keyword
long
scaled_float
short
text (when there is a keyword sub-field that is compatible with this feature)
Educated guesses
The synthetic source generator makes _source fields with the following properties, which are mostly artifacts of how doc values are stored:
sorted alphabetically
becomes
as "objecty" as possible
becomes
pushes all arrays to the "leaf" fields
becomes
sorts most array values
becomes
removes duplicate text and keyword values
becomes
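The rules above can be illustrated with a small sketch. This is not the Elasticsearch implementation: the class name SyntheticSourceSketch and the method rebuild are hypothetical, and the input is assumed to be a flat map from dotted field path to keyword-like string values, roughly what you would have after reading doc values. It only shows why the output ends up sorted, nested, and deduplicated, with arrays at the leaves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; not the code from this PR.
public class SyntheticSourceSketch {

    /**
     * Rebuilds a nested source-like map from flat "dotted.path" -> values input.
     * Assumes keyword-like values and no path that is both a leaf and an object.
     */
    @SuppressWarnings("unchecked")
    public static Map<String, Object> rebuild(Map<String, List<String>> flatValues) {
        // TreeMap keeps field names sorted alphabetically at every level.
        Map<String, Object> root = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : flatValues.entrySet()) {
            // TreeSet sorts the values and drops duplicates, like keyword doc values.
            List<String> values = new ArrayList<>(new TreeSet<>(entry.getValue()));
            if (values.isEmpty()) {
                continue; // no values stored for this field, so nothing to emit
            }
            // Split "a.b.c" into nested objects: as "objecty" as possible.
            String[] path = entry.getKey().split("\\.");
            Map<String, Object> current = root;
            for (int i = 0; i < path.length - 1; i++) {
                current = (Map<String, Object>) current.computeIfAbsent(
                    path[i], k -> new TreeMap<String, Object>());
            }
            // Arrays only ever appear on the leaf field; a single value is emitted bare.
            String leaf = path[path.length - 1];
            current.put(leaf, values.size() == 1 ? values.get(0) : values);
        }
        return root;
    }

    public static void main(String[] args) {
        Map<String, List<String>> flat = Map.of(
            "foo.bar", List.of("b", "a", "b"),
            "a", List.of("1"));
        System.out.println(rebuild(flat)); // prints {a=1, foo={bar=[a, b]}}
    }
}
```

Numeric fields would differ from this sketch: the description above only says duplicates are removed for text and keyword values.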
_recovery_source
Elasticsearch's shard "recovery" process needs _source sometimes. So does cross cluster replication. If you disable source or filter it somehow we store a _recovery_source field for as long as the recovery process might need it. When everything is running smoothly that's generally a few seconds or minutes. Then the field is removed on merge. This synthetic source feature continues to produce _recovery_source and relies on it for recovery. It's possible to synthesize _source during recovery but we don't do it.
That means that synthetic source doesn't speed up writing the index. But in the future we might be able to turn this on to trade writing less data at index time for slower recovery and cross cluster replication. That's an area of future improvement.
perf numbers
I loaded the entire tsdb data set with this change and the size:
A second _forcemerge a few minutes after rally finishes should remove the remaining 47.6MB of _recovery_source.
With this, fetching source for 1,000 documents seems to take about 500ms. I spot checked a lot of different areas and haven't seen a different hit anywhere. I expect this performance impact is based on the number of doc values fields in the index and how sparse they are.