Synthetic source (#85649)
This attempts to shrink the index by implementing a "synthetic _source" field.
You configure it in the mapping:
```
{
  "mappings": {
    "_source": {
      "synthetic": true
    }
  }
}
```

And we just stop storing the `_source` field - kind of. When you go to access
the `_source` we regenerate it on the fly by loading doc values. Doc values
don't preserve the original structure of the source you sent so we have to
make some educated guesses. And we have a rule: the source we generate would
result in the same index if you sent it back to us. That way you can use it
for things like `_reindex`.

Fetching the `_source` from doc values does slow down loading somewhat. See
numbers further down.

## Supported fields
This only works for the following fields:
* `boolean`
* `byte`
* `date`
* `double`
* `float`
* `geo_point` (with precision loss)
* `half_float`
* `integer`
* `ip`
* `keyword`
* `long`
* `scaled_float`
* `short`
* `text` (when there is a `keyword` sub-field that is compatible with this feature)


## Educated guesses

The synthetic source generator:
* sorts fields alphabetically
* makes fields as "objecty" as possible
* pushes all arrays to the "leaf" fields
* sorts most array values
* removes duplicate text and keyword values

These are mostly artifacts of how doc values are stored.
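
To make the rules concrete, here is a rough Java sketch that applies them to an already-parsed document. It is not the `SourceLoader` code from this change: `SyntheticSourceSketch` and its methods are made-up names, it works on plain `java.util` maps instead of doc values, and it assumes that each field holds values of a single type, that nulls are simply dropped, that numeric duplicates are kept, and that no field is both a leaf and an object.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names exist in Elasticsearch.
final class SyntheticSourceSketch {

    /** Rewrites a parsed JSON document following the five rules listed above. */
    static Map<String, Object> normalize(Map<String, Object> doc) {
        // Step 1: gather every leaf value under its full dotted path.
        // Arrays of objects are walked through, which pushes arrays down to the leaves.
        Map<String, List<Object>> leaves = new TreeMap<>();
        collect("", doc, leaves);

        // Step 2: rebuild nested objects from the dotted paths ("as objecty as possible").
        // TreeMap keeps the keys of every object sorted alphabetically.
        Map<String, Object> out = new TreeMap<>();
        for (Map.Entry<String, List<Object>> leaf : leaves.entrySet()) {
            String[] path = leaf.getKey().split("\\.");
            Map<String, Object> current = out;
            for (int i = 0; i < path.length - 1; i++) {
                current = child(current, path[i]);
            }
            current.put(path[path.length - 1], normalizeValues(leaf.getValue()));
        }
        return out;
    }

    private static void collect(String path, Object value, Map<String, List<Object>> leaves) {
        if (value instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                String key = String.valueOf(e.getKey());
                collect(path.isEmpty() ? key : path + "." + key, e.getValue(), leaves);
            }
        } else if (value instanceof List) {
            for (Object element : (List<?>) value) {
                collect(path, element, leaves);
            }
        } else if (value != null) {
            leaves.computeIfAbsent(path, k -> new ArrayList<>()).add(value);
        }
    }

    @SuppressWarnings("unchecked")
    private static Map<String, Object> child(Map<String, Object> parent, String name) {
        // Assumes the name is not already taken by a leaf value.
        return (Map<String, Object>) parent.computeIfAbsent(name, k -> new TreeMap<String, Object>());
    }

    private static Object normalizeValues(List<Object> values) {
        List<Object> result;
        if (values.get(0) instanceof String) {
            // Strings are sorted and de-duplicated, like text and keyword values.
            TreeSet<String> unique = new TreeSet<>();
            for (Object v : values) {
                unique.add((String) v);
            }
            result = new ArrayList<>(unique);
        } else if (values.get(0) instanceof Number) {
            // Numbers are sorted; duplicates are kept in this sketch.
            result = new ArrayList<>(values);
            result.sort(Comparator.comparingDouble(v -> ((Number) v).doubleValue()));
        } else {
            result = values;
        }
        // A single value is emitted as a scalar instead of a one-element array.
        return result.size() == 1 ? result.get(0) : result;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("b", 1);
        doc.put("x.y", List.of(2, 3, 1));
        doc.put("a", List.of(Map.of("b", "foo", "c", "bar"), Map.of("c", "bort"), Map.of("b", "snort")));
        System.out.println(normalize(doc));
    }
}
```

Calling `normalize` on its own output returns an equal map. That is a small-scale analogue of the rule that the generated source must produce the same index when it is sent back.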

### sorted alphabetically
```
{
  "b": 1,
  "c": 2,
  "a": 3
}
```
becomes
```
{
  "a": 3,
  "b": 1,
  "c": 2
}
```

### as "objecty" as possible
```
{
  "a.b": "foo"
}
```
becomes
```
{
  "a": {
    "b": "foo"
  }
}
```

### pushes all arrays to the "leaf" fields
```
{
  "a": [
    {
      "b": "foo",
      "c": "bar"
    },
    {
      "c": "bort"
    },
    {
      "b": "snort"
    }
  ]
}
```
becomes
```
{
  "a": {
    "b": ["foo", "snort"],
    "c": ["bar", "bort"]
  }
}
```

### sorts most array values
```
{
  "a": [2, 3, 1]
}
```
becomes
```
{
  "a": [1, 2, 3]
}
```

### removes duplicate text and keyword values
```
{
  "a": ["bar", "baz", "baz", "baz", "foo", "foo"]
}
```
becomes
```
{
  "a": ["bar", "baz", "foo"]
}
```

## `_recovery_source`

Elasticsearch's shard "recovery" process needs `_source` *sometimes*. So does
cross cluster replication. If you disable source or filter it somehow we store
a `_recovery_source` field for as long as the recovery process might need it.
When everything is running smoothly that's generally a few seconds or minutes.
Then the field is removed on merge. This synthetic source feature continues
to produce `_recovery_source` and relies on it for recovery. It's *possible*
to synthesize `_source` during recovery but we don't do it.

That means that synthetic source doesn't speed up writing the index. But in the
future we might be able to turn this on to trade writing less data at index
time for slower recovery and cross cluster replication. That's an area of
future improvement.

## Perf numbers

I loaded the entire tsdb data set with this change and compared the size:

```
           standard -> synthetic
store size  31.0 GB ->  7.0 GB  (77.5% reduction)
_source  24695.7 MB -> 47.6 MB  (99.8% reduction - synthetic is in _recovery_source)
```

A second _forcemerge a few minutes after rally finishes should remove the
remaining 47.6MB of _recovery_source.

With this change, fetching source for 1,000 documents seems to take about 500ms. I
spot checked a lot of different areas and haven't seen a different hit. I
*expect* this performance impact is based on the number of doc values fields
in the index and how sparse they are.
nik9000 authored May 10, 2022
1 parent cf2fcae commit a589456
Showing 88 changed files with 2,886 additions and 144 deletions.
5 changes: 5 additions & 0 deletions docs/changelog/85649.yaml
@@ -0,0 +1,5 @@
pr: 85649
summary: Synthetic source
area: Mapping
type: feature
issues: []
228 changes: 116 additions & 112 deletions docs/reference/search/profile.asciidoc

Large diffs are not rendered by default.

@@ -34,6 +34,7 @@
import org.elasticsearch.test.VersionUtils;
import org.elasticsearch.xcontent.ToXContent;
import org.elasticsearch.xcontent.XContentBuilder;
import org.junit.AssumptionViolatedException;

import java.io.IOException;
import java.util.Collection;
@@ -677,4 +678,14 @@ protected Object generateRandomInputValue(MappedFieldType ft) {
        assumeFalse("Test implemented in a follow up", true);
        return null;
    }

    @Override
    protected SyntheticSourceSupport syntheticSourceSupport() {
        throw new AssumptionViolatedException("not supported");
    }

    @Override
    protected IngestScriptSupport ingestScriptSupport() {
        throw new AssumptionViolatedException("not supported");
    }
}
@@ -33,6 +33,7 @@
import org.elasticsearch.index.mapper.MapperBuilderContext;
import org.elasticsearch.index.mapper.NumberFieldMapper;
import org.elasticsearch.index.mapper.SimpleMappedFieldType;
import org.elasticsearch.index.mapper.SourceLoader;
import org.elasticsearch.index.mapper.SourceValueFetcher;
import org.elasticsearch.index.mapper.TextSearchInfo;
import org.elasticsearch.index.mapper.TimeSeriesParams;
@@ -349,6 +350,20 @@ private double scale(Object input) {
        public TimeSeriesParams.MetricType getMetricType() {
            return metricType;
        }

        @Override
        public String toString() {
            StringBuilder b = new StringBuilder();
            b.append("ScaledFloatFieldType[").append(scalingFactor);
            if (nullValue != null) {
                b.append(", nullValue=").append(nullValue);
                ;
            }
            if (metricType != null) {
                b.append(", metricType=").append(metricType);
            }
            return b.append("]").toString();
        }
    }

    private final Explicit<Boolean> ignoreMalformed;
@@ -641,4 +656,29 @@ public Object nextValue() throws IOException {
            };
        }
    }

    @Override
    public SourceLoader.SyntheticFieldLoader syntheticFieldLoader() {
        if (hasDocValues == false) {
            throw new IllegalArgumentException(
                "field [" + name() + "] of type [" + typeName() + "] doesn't support synthetic source because it doesn't have doc values"
            );
        }
        if (ignoreMalformed.value()) {
            throw new IllegalArgumentException(
                "field [" + name() + "] of type [" + typeName() + "] doesn't support synthetic source because it ignores malformed numbers"
            );
        }
        if (copyTo.copyToFields().isEmpty() != true) {
            throw new IllegalArgumentException(
                "field [" + name() + "] of type [" + typeName() + "] doesn't support synthetic source because it declares copy_to"
            );
        }
        return new NumberFieldMapper.NumericSyntheticFieldLoader(name(), simpleName()) {
            @Override
            protected void loadNextValue(XContentBuilder b, long value) throws IOException {
                b.value(value / scalingFactor);
            }
        };
    }
}
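
The loader above rebuilds each value by dividing the stored long by `scalingFactor`, and the `round` helper in the test further down predicts that result with `Math.round(d * scalingFactor) / scalingFactor`. The stand-alone sketch below shows that round trip. The class and method names are invented for this illustration and are not Elasticsearch APIs.

```java
// Hypothetical stand-alone sketch of the scaled_float round trip.
final class ScaledFloatSketch {

    /** Encoding as in the test's round() helper: scale, then round to a long. */
    static long encode(double value, double scalingFactor) {
        return Math.round(value * scalingFactor);
    }

    /** Decoding as in loadNextValue: divide the stored long by the scaling factor. */
    static double decode(long encoded, double scalingFactor) {
        return encoded / scalingFactor;
    }

    public static void main(String[] args) {
        double scalingFactor = 100;
        long encoded = encode(1.234, scalingFactor);
        double decoded = decode(encoded, scalingFactor);
        // The decoded value has lost the digits beyond the scaling factor.
        System.out.println(encoded + " -> " + decoded);
        // Encoding the decoded value again yields the same long.
        System.out.println(encode(decoded, scalingFactor) == encoded);
    }
}
```

The synthetic value can differ from the number that was originally sent, but re-encoding it gives back the same stored long in this example.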
@@ -28,6 +28,7 @@
import org.elasticsearch.xcontent.XContentBuilder;
import org.elasticsearch.xcontent.XContentFactory;
import org.hamcrest.Matchers;
import org.junit.AssumptionViolatedException;

import java.io.IOException;
import java.util.Collection;
@@ -162,4 +163,14 @@ protected Object generateRandomInputValue(MappedFieldType ft) {
    protected void randomFetchTestFieldConfig(XContentBuilder b) throws IOException {
        assumeFalse("We don't have a way to assert things here", true);
    }

    @Override
    protected SyntheticSourceSupport syntheticSourceSupport() {
        throw new AssumptionViolatedException("not supported");
    }

    @Override
    protected IngestScriptSupport ingestScriptSupport() {
        throw new AssumptionViolatedException("not supported");
    }
}
@@ -23,6 +23,7 @@
import org.elasticsearch.index.mapper.ParsedDocument;
import org.elasticsearch.plugins.Plugin;
import org.elasticsearch.xcontent.XContentBuilder;
import org.junit.AssumptionViolatedException;

import java.io.IOException;
import java.util.Arrays;
@@ -157,4 +158,14 @@ protected Object generateRandomInputValue(MappedFieldType ft) {
        assumeFalse("Test implemented in a follow up", true);
        return null;
    }

    @Override
    protected SyntheticSourceSupport syntheticSourceSupport() {
        throw new AssumptionViolatedException("not supported");
    }

    @Override
    protected IngestScriptSupport ingestScriptSupport() {
        throw new AssumptionViolatedException("not supported");
    }
}
@@ -20,6 +20,7 @@
import org.elasticsearch.plugins.Plugin;
import org.elasticsearch.xcontent.XContentBuilder;
import org.hamcrest.Matchers;
import org.junit.AssumptionViolatedException;

import java.io.IOException;
import java.util.Arrays;
@@ -176,4 +177,14 @@ protected Object generateRandomInputValue(MappedFieldType ft) {
    protected boolean allowsNullValues() {
        return false; // TODO should this allow null values?
    }

    @Override
    protected SyntheticSourceSupport syntheticSourceSupport() {
        throw new AssumptionViolatedException("not supported");
    }

    @Override
    protected IngestScriptSupport ingestScriptSupport() {
        throw new AssumptionViolatedException("not supported");
    }
}
@@ -8,10 +8,12 @@

package org.elasticsearch.index.mapper.extras;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocValuesType;
import org.apache.lucene.index.IndexableField;
import org.elasticsearch.common.Strings;
import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.core.Tuple;
import org.elasticsearch.index.mapper.DocumentMapper;
import org.elasticsearch.index.mapper.MappedFieldType;
import org.elasticsearch.index.mapper.MapperParsingException;
@@ -23,6 +25,7 @@
import org.elasticsearch.xcontent.XContentBuilder;
import org.elasticsearch.xcontent.XContentFactory;
import org.elasticsearch.xcontent.XContentType;
import org.junit.AssumptionViolatedException;

import java.io.IOException;
import java.util.Arrays;
@@ -31,6 +34,7 @@

import static java.util.Collections.singletonList;
import static org.hamcrest.Matchers.containsString;
import static org.hamcrest.Matchers.equalTo;

public class ScaledFloatFieldMapperTests extends MapperTestCase {

@@ -349,4 +353,83 @@ protected Object generateRandomInputValue(MappedFieldType ft) {
            default -> throw new IllegalArgumentException();
        };
    }

    @Override
    protected SyntheticSourceSupport syntheticSourceSupport() {
        return new SyntheticSourceSupport() {
            private final double scalingFactor = randomDoubleBetween(0, Double.MAX_VALUE, false);
            private final Double nullValue = usually() ? null : round(randomValue());

            @Override
            public SyntheticSourceExample example() {
                if (randomBoolean()) {
                    Tuple<Double, Double> v = generateValue();
                    return new SyntheticSourceExample(v.v1(), v.v2(), this::mapping);
                }
                List<Tuple<Double, Double>> values = randomList(1, 5, this::generateValue);
                List<Double> in = values.stream().map(Tuple::v1).toList();
                List<Double> outList = values.stream().map(Tuple::v2).sorted().toList();
                Object out = outList.size() == 1 ? outList.get(0) : outList;
                return new SyntheticSourceExample(in, out, this::mapping);
            }

            private Tuple<Double, Double> generateValue() {
                if (nullValue != null && randomBoolean()) {
                    return Tuple.tuple(null, nullValue);
                }
                double d = randomValue();
                return Tuple.tuple(d, round(d));
            }

            private double randomValue() {
                return randomBoolean() ? randomDoubleBetween(-Double.MAX_VALUE, Double.MAX_VALUE, true) : randomFloat();
            }

            private double round(double d) {
                long encoded = Math.round(d * scalingFactor);
                return encoded / scalingFactor;
            }

            private void mapping(XContentBuilder b) throws IOException {
                b.field("type", "scaled_float");
                b.field("scaling_factor", scalingFactor);
                if (nullValue != null) {
                    b.field("null_value", nullValue);
                }
                if (rarely()) {
                    b.field("index", false);
                }
                if (rarely()) {
                    b.field("store", false);
                }
            }

            @Override
            public List<SyntheticSourceInvalidExample> invalidExample() throws IOException {
                return List.of(
                    new SyntheticSourceInvalidExample(
                        equalTo("field [field] of type [scaled_float] doesn't support synthetic source because it doesn't have doc values"),
                        b -> b.field("type", "scaled_float").field("scaling_factor", 10).field("doc_values", false)
                    ),
                    new SyntheticSourceInvalidExample(
                        equalTo(
                            "field [field] of type [scaled_float] doesn't support synthetic source because it ignores malformed numbers"
                        ),
                        b -> b.field("type", "scaled_float").field("scaling_factor", 10).field("ignore_malformed", true)
                    )
                );
            }
        };
    }

    @Override
    protected IngestScriptSupport ingestScriptSupport() {
        throw new AssumptionViolatedException("not supported");
    }

    @Override
    protected void validateRoundTripReader(String syntheticSource, DirectoryReader reader, DirectoryReader roundTripReader)
        throws IOException {
        // Intentionally disabled because it doesn't work yet
    }
}
@@ -54,6 +54,7 @@
import org.elasticsearch.index.search.QueryStringQueryParser;
import org.elasticsearch.plugins.Plugin;
import org.elasticsearch.xcontent.XContentBuilder;
import org.junit.AssumptionViolatedException;

import java.io.IOException;
import java.util.ArrayList;
@@ -793,4 +794,14 @@ protected Object generateRandomInputValue(MappedFieldType ft) {
        assumeFalse("We don't have doc values or fielddata", true);
        return null;
    }

    @Override
    protected SyntheticSourceSupport syntheticSourceSupport() {
        throw new AssumptionViolatedException("not supported");
    }

    @Override
    protected IngestScriptSupport ingestScriptSupport() {
        throw new AssumptionViolatedException("not supported");
    }
}
@@ -26,6 +26,7 @@
import org.elasticsearch.index.mapper.SourceToParse;
import org.elasticsearch.plugins.Plugin;
import org.elasticsearch.xcontent.XContentBuilder;
import org.junit.AssumptionViolatedException;

import java.io.IOException;
import java.util.Arrays;
@@ -186,4 +187,14 @@ protected String generateRandomInputValue(MappedFieldType ft) {
    protected void randomFetchTestFieldConfig(XContentBuilder b) throws IOException {
        b.field("type", "token_count").field("analyzer", "standard");
    }

    @Override
    protected SyntheticSourceSupport syntheticSourceSupport() {
        throw new AssumptionViolatedException("not supported");
    }

    @Override
    protected IngestScriptSupport ingestScriptSupport() {
        throw new AssumptionViolatedException("not supported");
    }
}
@@ -0,0 +1,18 @@
unsupported:
  - skip:
      version: " - 8.2.99"
      reason: introduced in 8.3.0

  - do:
      catch: bad_request
      indices.create:
        index: test
        body:
          mappings:
            _source:
              synthetic: true
            properties:
              join_field:
                type: join
                relations:
                  parent: child