Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP First draft for version field #58256

Closed
wants to merge 21 commits into from

Conversation

cbuescher
Copy link
Member

This is a first draft of a new field type aimed at storing software version strings like e.g. versions following the Semantic Versioning scheme. The field should offer exact matching and range queries similar to e.g. what we currently have for IP fields.
This POC contains of some early ideas about how we could encode the version strings into a binary representation that would allow ordering according to e.g. the Semantiv Versioning specification (i.e. precedence in part 11. in https://semver.org) but also potentially allow for more flexibility. Early tests for a "version" field type and a "version_range" range type show how basic search and aggs are possible with this encoding.

@cbuescher cbuescher added the WIP label Jun 17, 2020
@cbuescher cbuescher force-pushed the versionField-draft branch from 638ada3 to 9b9b556 Compare June 17, 2020 13:45
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a first quick round, this looks like a great start. The encoding makes sense to me.

}
System.arraycopy(min.bytes, 0, bytes, 0, BYTES);
System.arraycopy(max.bytes, 0, bytes, BYTES, BYTES);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it take the BytesRef offset into account?

import org.elasticsearch.index.mapper.VersionEncoder;
import org.elasticsearch.index.mapper.VersionEncoder.SortMode;

public class VersionRangeField extends Field {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should rename it to Binary16RangeField or something like that since it doesn't do anything that is specific to versions?

+ VersionEncoder.decodeVersion(new BytesRef(min), SortMode.SEMVER)
+ " : "
+ VersionEncoder.decodeVersion(new BytesRef(max), SortMode.SEMVER)
+ "]";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can do that since the BytesRef may be truncated?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure either, the alternative is only to print the BytesRef hex output if we need some "toString".

VERSION("version_range", LengthType.VARIABLE) {

// TODO check if these are really safe min/max values
private BytesRef MIN_VALUE = new BytesRef(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this helps not create an empty byte[]

Suggested change
private BytesRef MIN_VALUE = new BytesRef(0);
private BytesRef MIN_VALUE = new BytesRef();

public BytesRef nextDown(Object value) {
// TODO currently no up/down
return (BytesRef) value;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might want to look at NumericUtils#add in Lucene which does something along those lines

throw e;
}
}
if (fieldType.indexOptions() != IndexOptions.NONE || fieldType.stored()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather like that we encode the first 16 bytes of the encoded version in points to get efficient range queries

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added this as TODO, some pointers would be helpful


if (includeDefaults || ignoreMalformed.explicit()) {
builder.field("ignore_malformed", ignoreMalformed.value());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this must support all options that are supported by the parser, so it looks like some of them are missing

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added output for "mode" for now

// encoded string, need to re-encode
return encodeVersion(((BytesRef) value).utf8ToString(), mode);
} else {
throw new IllegalArgumentException("Illegal value type: " + value.getClass());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe include the value as well in the error message?

// encode whether version has pre-release parts
if (preReleaseId != null) {
encodedVersion.append(PRERELESE_SEPARATOR_BYTE); // versions with pre-release part sort before ones without
String[] preReleaseParts = preReleaseId.substring(1).split(DOT_SEPARATOR_REGEX);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to support version strings that have consecutive dots, then I think we should use another split method as this one would skip the empty string in such a case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't seen reason for supporting consecutive dots yet, currently I would try to catch that in initial validation. Do you have something in mind that would require this?

if (first == false) {
encodedVersion.append(DOT_SEPARATOR_BYTE);
}
boolean isNumeric = preReleasePart.chars().allMatch(x -> Character.isDigit(x));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should treat the empty string as a numeric or as a string, does the spec say anything about this corner case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The specs say that "Identifiers MUST NOT be empty". Identifiers are the dot-separated parts, so I think we can check this in some form of validation already (e.g. check and error on consecutive dots like mentioned above)

@cbuescher
Copy link
Member Author

@jpountz thanks for the reviews and guidance, I opened #59773 with the "version" field part of this draft, will follow up with a separate PR for the "version_range" work thats based on that, so I will close this draft.

@cbuescher cbuescher closed this Jul 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants