Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

start migrate Field to &str #1772

Merged
merged 1 commit into from
Jan 18, 2023
Merged

start migrate Field to &str #1772

merged 1 commit into from
Jan 18, 2023

Conversation

PSeitz
Copy link
Contributor

@PSeitz PSeitz commented Jan 10, 2023

start migrate Field to &str in preparation of columnar
return Result for get_field

@codecov-commenter
Copy link

Codecov Report

Merging #1772 (b3a4367) into main (7c6cc81) will increase coverage by 0.01%.
The diff coverage is 95.97%.

@@            Coverage Diff             @@
##             main    #1772      +/-   ##
==========================================
+ Coverage   94.12%   94.13%   +0.01%     
==========================================
  Files         281      282       +1     
  Lines       52824    52944     +120     
==========================================
+ Hits        49719    49839     +120     
  Misses       3105     3105              
Impacted Files Coverage Δ
src/query/query_parser/logical_ast.rs 80.00% <ø> (ø)
src/indexer/index_writer.rs 96.77% <90.00%> (ø)
src/query/range_query/range_query.rs 93.41% <92.45%> (+2.05%) ⬆️
src/fastfield/readers.rs 90.27% <94.28%> (+0.79%) ⬆️
src/aggregation/agg_req_with_accessor.rs 95.48% <100.00%> (-0.20%) ⬇️
src/aggregation/bucket/histogram/histogram.rs 99.56% <100.00%> (-0.01%) ⬇️
src/aggregation/intermediate_agg_result.rs 98.42% <100.00%> (-0.01%) ⬇️
src/collector/filter_collector_wrapper.rs 84.84% <100.00%> (ø)
src/collector/tests.rs 98.18% <100.00%> (ø)
src/collector/top_score_collector.rs 96.85% <100.00%> (ø)
... and 23 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@PSeitz PSeitz requested a review from fulmicoton January 10, 2023 09:25
@fulmicoton fulmicoton marked this pull request as draft January 16, 2023 05:01
@fulmicoton
Copy link
Collaborator

Switching as draft. This kind of change is a one way street, so this requires a lot of more reflexion.

@adamreichold
Copy link
Collaborator

I would also be interested if this has any performance impact &str is more costly to search/compare/store than u64 (which Field currently wraps).

I also fear that this makes API usage much more awkward as downstream code might be forced to manage Cow<'static, str> instead of Field. ATM, I take care to convert field names as early as possible into Field at my module boundaries so that I can easily pass/store/clone the Field values in its implementation.

This is doubly worrying as neither the commit message nor the PR cover letter give any rationale as to why this change is made beyond "in preparation of columnar" so that it is hard to understand whether this is a fundamental requirement or whether there are options/alternatives that could be discussed.

@PSeitz
Copy link
Contributor Author

PSeitz commented Jan 16, 2023

I would also be interested if this has any performance impact &str is more costly to search/compare/store than u64 (which Field currently wraps).

I also fear that this makes API usage much more awkward as downstream code might be forced to manage Cow<'static, str> instead of Field. ATM, I take care to convert field names as early as possible into Field at my module boundaries so that I can easily pass/store/clone the Field values in its implementation.

This is doubly worrying as neither the commit message nor the PR cover letter give any rationale as to why this change is made beyond "in preparation of columnar" so that it is hard to understand whether this is a fundamental requirement or whether there are options/alternatives that could be discussed.

Thanks for the feedback. The preparation which was mentioned is adding fastfield support for JSON fields.
We can't use Field for that use case and we want to keep the API consistent between JSON and non-JSON fastfields. The name of the field is a better identifier that can cover both use cases.
The typical workflow as of involves &str anyway. Schema::get_field("name") => load fastfield via Field

As of performance impact, a single lookup in a Hashmap won't show in any performance profile. Even if there are hundreds of fastfields, the lookups will likely be dwarfed by operations on the fastfield.

@fulmicoton
Copy link
Collaborator

The change has a lot of other benefits. It will evenutally make it possible to modify tantivy's schema

@adamreichold
Copy link
Collaborator

The preparation which was mentioned is adding fastfield support for JSON fields.

The change has a lot of other benefits. It will evenutally make it possible to modify tantivy's schema

If I understand you correctly, this is basically a prerequisite for schema-less operation? Otherwise, the schema could for example contain paths into JSON fields instead of just names, couldn't it?

The typical workflow as of involves &str anyway. Schema::get_field("name") => load fastfield via Field

I think the main ergonomic downside (that is admittedly very Rust-specific) is that borrowing &str "infects" code with lifetimes (Cow<'static, str> often being the compromise used to avoid that) whereas with Field(u64), passing by value is even cheaper than passing by reference.

Looking at this from an efficiency angle again, since the number of distinct field names is low in schema-bound operation, I would probably end up deduplicating those strings using something string_cache which is just a generic and hence less efficient variant of what Field enables today. (For example, during indexing I currently do the above &str to Field conversion exactly once during start up.)

As of performance impact, a single lookup in a Hashmap won't show in any performance profile. Even if there are hundreds of fastfields, the lookups will likely be dwarfed by operations on the fastfield.

In principle, I agree. There is such a thing "distributed fat" though, which individually is hard to measure, yet it doesn't preclude that it has a measurable cost globally. For example, a hash table look-up might increase instruction cache pressure compared which can result in slowdowns elsewhere. But of course, without benchmarks this is just FUD.

And in the end, you are not accountable to me in any way and I also do not want to hold back any changes necessary for Tantivy to grow. I would be glad if big changes like this one would come with detailed rationales so that an outsider like me can understand that this is either necessary to move forward at all or will have additional benefits that are worth the costs.

start migrate Field to &str in preparation of columnar
return Result for get_field
@PSeitz PSeitz marked this pull request as ready for review January 17, 2023 06:33
@fulmicoton fulmicoton merged commit f687b3a into main Jan 18, 2023
@fulmicoton fulmicoton deleted the use_columnar branch January 18, 2023 07:12
Hodkinson pushed a commit to Hodkinson/tantivy that referenced this pull request Jan 30, 2023
start migrate Field to &str in preparation of columnar
return Result for get_field
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants