This repository has been archived by the owner on Mar 29, 2024. It is now read-only.

short names in collection: smaller state.json #167

Open: noblepaul wants to merge 14 commits into base: release/8.8

@noblepaul noblepaul commented Jul 21, 2022

PoC: do not merge

  1. 100% backward compatible.
  2. Use the flag compact=true when creating a collection. This is an opt-in feature.

Sample collection with 10 shards, replication factor = 1:

normal COLL:   state.json: 3460, children sz: 166
compact COLL:  state.json: 3158, children sz: 76

there is a 10% savings in state.json
and PRS data has a savings of > 55%
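As a quick check on the figures quoted above, the relative savings can be computed directly from the four sizes. This is only a small sketch; the helper name `savings_pct` is made up for illustration and is not part of the PR.

```python
def savings_pct(before: int, after: int) -> float:
    """Return the size reduction from `before` to `after` as a percentage."""
    return (before - after) / before * 100.0

# Sizes copied from the sample collection above.
state_json = savings_pct(3460, 3158)
prs_children = savings_pct(166, 76)

print(f"state.json savings:   {state_json:.1f}%")
print(f"PRS children savings: {prs_children:.1f}%")
```

For these particular figures this works out to about 8.7% and 54.2%, close to the rounded numbers quoted above.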

sample state.json

"shards":{
    "0":{
      "range":"80000000-9998ffff",
      "state":"active",
      "replicas":{"2":{
          "core":"COLL2_1",
          "state":"active",
          "node_name":"127.0.0.1:63936_solr",
          "type":"NRT",
          "base_url":"http://127.0.0.1:63936/solr",
          "leader":"true"}}}}}

PRS data

[6:2:A:L, 18:2:A:L, 12:2:A:L, 20:2:A:L, 14:2:A:L]
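Each entry above appears to follow the per-replica state layout `replicaName:version:state[:L]`, so with compact names the replica name shrinks to a bare number. Below is a minimal sketch of splitting such an entry, assuming that layout. The `PrsEntry` type and `parse_prs` function are hypothetical names for this example, not Solr APIs.

```python
from typing import NamedTuple

class PrsEntry(NamedTuple):
    replica: str   # replica (core node) name, e.g. "6" in compact mode
    version: int   # per-replica state version
    state: str     # single-letter state code, e.g. "A" as seen above
    leader: bool   # True when the optional trailing "L" marker is present

def parse_prs(entry: str) -> PrsEntry:
    """Split one PRS child name, assuming `name:version:state[:L]`."""
    parts = entry.split(":")
    if len(parts) not in (3, 4):
        raise ValueError(f"unexpected PRS entry: {entry!r}")
    leader = len(parts) == 4 and parts[3] == "L"
    return PrsEntry(parts[0], int(parts[1]), parts[2], leader)

entries = ["6:2:A:L", "18:2:A:L", "12:2:A:L", "20:2:A:L", "14:2:A:L"]
parsed = [parse_prs(e) for e in entries]
for p in parsed:
    print(p)
```

Note that a replica name containing a colon would break this simple split, so the sketch only suits names like the ones shown here.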

@noblepaul changed the title from "short in collection names" to "short names in collection: smaller state.json" on Jul 21, 2022
@hiteshk25

Some thoughts for compact versions

  1. Things that can be computed can be omitted, for example base_url
  2. Similarly, the prefix can be removed: "core": "15S_shard1_0_replica_n69" => n69
  3. Shorten property names: coreNodename => core, or "stateTimestamp" => ts
  4. Shorten property values: active => A, or True => T
  5. A shard can have replicas, so we probably don't need "replicas":

Then see if we can reduce the overall size of state.json to ~1 MB for 4096 shards with NRT+PULL replicas.

With the indent size set to 0, 4096 shards with NRT replicas take 1120096 bytes.

It would be good to put together a one-page writeup.
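To illustrate how much the indent setting alone matters, here is a rough sketch that serializes a synthetic collection state with and without indentation. The replica fields mirror the sample state.json above, but the `build_state` helper, the names, and the ports are invented for the example, so the byte counts will not match the 1120096 figure quoted here.

```python
import json

def build_state(num_shards: int) -> dict:
    """Build a synthetic collection state with one NRT replica per shard."""
    shards = {}
    for i in range(num_shards):
        shards[f"shard{i + 1}"] = {
            "range": "80000000-9998ffff",  # placeholder, same for every shard
            "state": "active",
            "replicas": {
                f"core_node{i + 1}": {
                    "core": f"COLL_shard{i + 1}_replica_n{i + 1}",
                    "state": "active",
                    "node_name": "127.0.0.1:8983_solr",
                    "type": "NRT",
                    "base_url": "http://127.0.0.1:8983/solr",
                    "leader": "true",
                }
            },
        }
    return {"COLL": {"shards": shards}}

state = build_state(4096)
pretty = json.dumps(state, indent=2).encode("utf-8")
dense = json.dumps(state, separators=(",", ":")).encode("utf-8")  # no whitespace

print("indented bytes:", len(pretty))
print("dense bytes:   ", len(dense))
```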

@justinrsweeney

My two cents is that we probably don't need to go this far in terms of changes. With compression we can pretty easily reduce the size of state.json to something reasonable and it is a simpler change in my view.

What is the case for making these more invasive changes if compression solves the size over the network issue?
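For a sense of what compression buys on this kind of data, the sketch below deflates a synthetic, repetitive JSON document with Python's `zlib`. This is not how Solr compresses state.json; the document shape and names are invented for illustration, and the ratio on a real state.json will differ.

```python
import json
import zlib

# Synthetic document: many near-identical replica entries, as in a large collection.
doc = {
    "shards": {
        str(i): {
            "state": "active",
            "replicas": {
                str(i): {
                    "core": f"COLL_{i}",
                    "state": "active",
                    "node_name": "127.0.0.1:8983_solr",
                    "type": "NRT",
                    "leader": "true",
                }
            },
        }
        for i in range(1024)
    }
}

raw = json.dumps(doc, separators=(",", ":")).encode("utf-8")
packed = zlib.compress(raw, level=6)
restored = json.loads(zlib.decompress(packed))  # round trip back to the same dict

print("raw bytes:       ", len(raw))
print("compressed bytes:", len(packed))
```

The trade-off raised later in this thread still applies: the compressed bytes are no longer readable or editable as plain text.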

@hiteshk25

Agree compression should be enough!

The only thing is that we will lose the text format, which is very helpful for debugging any issue. Sometimes we may need to edit state.json, which is very convenient with zk-shell.

@noblepaul
Collaborator Author

noblepaul commented Jul 21, 2022

Some thoughts for compact versions

The changes in this PR are 100% backward compatible. Solr does not really care whether the name of a shard is shard1 or 1. It's just an opaque string value. So I have not tried to change any other values, e.g. ACTIVE to A. I think those changes are more far-reaching.

Things that can be computed can be omitted, for example base_url

Useful, but this requires changes in reading.

Similarly, the prefix can be removed: "core": "15S_shard1_0_replica_n69" => n69

That's mostly achieved in this PR. We need to prefix the collection name to avoid collisions of names in a node.

Shorten property names: coreNodename => core, or "stateTimestamp" => ts

This is doable, but not backcompat.

Shorten property values: active => A, or True => T

This is doable, but not backcompat.

My two cents is that we probably don't need to go this far in terms of changes. With compression we can pretty easily reduce the size of state.json to something reasonable and it is a simpler change in my view

Yes. A combination of compression and this can take us pretty far.

The next level of optimization has to be done on the memory footprint of parsing and storing this object in memory. We should avoid using Map<String, Object> and move this to an efficient object deserialization mechanism.

@hiteshk25

We need the compact format so that less data is stored in ZK. When the data is read, it should remain as it is today. Verbose names are very useful for logging and debugging. That means we can update the Replica and Slice classes to handle this while serializing/deserializing state.json.
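The idea of storing short names while keeping verbose names in memory can be sketched as a pair of encode/decode functions. This is only an illustration: the mapping tables and the `compact`/`expand` helpers are hypothetical, and the short codes are invented here, not taken from Solr.

```python
# Hypothetical verbose -> short property names (not Solr's actual mapping).
KEY_TO_SHORT = {"core": "c", "state": "s", "node_name": "n", "type": "t", "leader": "l"}
SHORT_TO_KEY = {short: key for key, short in KEY_TO_SHORT.items()}

# Hypothetical value codes, scoped per property so unrelated values are untouched.
VALUE_TO_SHORT = {"state": {"active": "A"}, "leader": {"true": "T"}}
SHORT_TO_VALUE = {
    key: {short: value for value, short in codes.items()}
    for key, codes in VALUE_TO_SHORT.items()
}

def compact(replica: dict) -> dict:
    """Verbose in-memory replica properties -> short form for storage."""
    return {
        KEY_TO_SHORT.get(key, key): VALUE_TO_SHORT.get(key, {}).get(value, value)
        for key, value in replica.items()
    }

def expand(stored: dict) -> dict:
    """Short stored form -> verbose replica properties for readers and logs."""
    result = {}
    for short, value in stored.items():
        key = SHORT_TO_KEY.get(short, short)
        result[key] = SHORT_TO_VALUE.get(key, {}).get(value, value)
    return result

replica = {
    "core": "COLL2_1",
    "state": "active",
    "node_name": "127.0.0.1:63936_solr",
    "type": "NRT",
    "leader": "true",
}
stored = compact(replica)
print(stored)
print(expand(stored) == replica)
```

Unknown keys and values pass through unchanged in both directions, so properties outside the tables survive a round trip.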

Having said that, if we can't compact 50% or so then there is not much value.

Compression works very well with state.json; my only concern is that it's not a text format. We look at the state.json file every day: we go to zk-shell and look at various data.

@noblepaul
Collaborator Author

Verbose names are very useful for logging and debugging.

Yes. But do we ever log the replica name anywhere? Even if we do, is core_node1 any more readable than 1?

Having said that, if we can't compact 50% or so then there is not much value.

Just by shortening the replica name we are saving >55% in PRS states.

@hiteshk25

Verbose names are very useful for logging and debugging.

Yes. But do we ever log the replica name anywhere? Even if we do, is core_node1 any more readable than 1?

Having said that, if we can't compact 50% or so then there is not much value.

Just by shortening the replica name we are saving >55% in PRS states.

  1. Most of the time we find the core name from the replica and then use that. Also, we put more emphasis on the replica name in solrman because of PRS. Some day we should do the same in Solr.
  2. I was talking more about the state.json file. It looks like PRS state consumes 20 bytes per replica, so 80k for 4096 replicas, and 160k with HA.
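The PRS estimate in the second point is simple arithmetic, sketched below. The 20 bytes per replica is the figure from the comment; treating "HA" as doubling the replica count is an assumption made for this example.

```python
BYTES_PER_REPLICA = 20   # estimate quoted in the comment above
REPLICAS = 4096
HA_FACTOR = 2            # assumption: HA means two replicas per shard

single = BYTES_PER_REPLICA * REPLICAS
with_ha = single * HA_FACTOR

print(f"single copy: {single} bytes ({single // 1024} KiB)")
print(f"with HA:     {with_ha} bytes ({with_ha // 1024} KiB)")
```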

@noblepaul
Collaborator Author

I was talking more about the state.json file. It looks like PRS state consumes 20 bytes per replica, so 80k for 4096 replicas, and 160k with HA.

We can reduce the PRS state size only if we reduce the size of the replica name (core node name).

@hiteshk25

I was talking more about the state.json file. It looks like PRS state consumes 20 bytes per replica, so 80k for 4096 replicas, and 160k with HA.

We can reduce the PRS state size only if we reduce the size of the replica name (core node name).

I would leave it as it is unless we see some major gain! It is more readable.
