Remove SASI index on dependency column family #790

Closed
vprithvi opened this issue Apr 26, 2018 · 5 comments

Comments

@vprithvi
Contributor

vprithvi commented Apr 26, 2018

While most SASI indexes were removed as part of #80, the one in the dependencies column family still exists. This causes problems with older versions of Cassandra that don't support SASI indexes, and with alternate storage backends such as ScyllaDB.

We can update the dependency schema to not have SASI indexes, and provide a migration script from the old schema to the new schema.

We should also ensure that https://github.com/jaegertracing/spark-dependencies works with the new schema.

@yurishkuro
Member

btw, I suggest adding a "source" column to the schema, potentially to represent data coming from different sources, i.e. not just from traces (where source would be an aggregation job), but say from service mesh, or network sniffing. The UI diagram can aggregate all sources together, and use different viz to distinguish the links.

@vprithvi
Contributor Author

I suggest adding a "source" column to the schema, potentially to represent data coming from different sources

I like this idea, but I don't think it should be part of the migration. I created #791 to capture it.

@yurishkuro
Member

I am not suggesting we implement all of the relevant business logic, but if we are already making a breaking schema change, why not include an extra field?

@vprithvi
Contributor Author

but if we are already making a breaking schema change, why not include an extra field?

Because it's unrelated to this change, and is unusable without the business logic. Why shouldn't the change be done along with the business logic?

@vprithvi
Contributor Author

I'm thinking that changing the data model to use a date bucket as the partition key and the timestamp as a clustering key would let us keep the current query patterns while removing the SASI index.

The schema looks something like this:

CREATE TABLE IF NOT EXISTS ${keyspace}.dependencies (
    ts           timestamp,
    date_bucket  text,
    dependencies list<frozen<dependency>>,
    PRIMARY KEY (date_bucket, ts)
) WITH CLUSTERING ORDER BY (ts DESC);

While the write path is largely unaffected, reads become a bit more involved, as we need to compute buckets that we want to retrieve dependencies from.
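To illustrate the extra work on the read path, here is a minimal sketch of computing the buckets for a query window. It assumes one bucket per UTC day, formatted as `YYYY-MM-DD`; the bucket granularity and format are not decided in this issue, and `date_buckets` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def date_buckets(start, end):
    """Return the daily bucket keys (YYYY-MM-DD, UTC) covering [start, end].

    Assumes daily buckets; both arguments should be timezone-aware datetimes.
    """
    if end < start:
        raise ValueError("end must not precede start")
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    buckets = []
    while day <= last:
        buckets.append(day.isoformat())
        day += timedelta(days=1)
    return buckets
```

A reader would then issue one query per bucket, restricting `date_bucket` to that key and `ts` to the requested range, and merge the results.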

We also need to update the spark dependencies job.

The migration path seems to be the following:

  1. Stop dependencies job
  2. Use the Cassandra COPY command to export to a CSV file (which works for tables with fewer than 2 million rows)
  3. Delete existing dependencies column family
  4. Create dependencies column family with new schema
  5. Massage the CSV file into the new format and load into new schema
  6. Run new version of dependencies job which writes to new schema
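Step 5 could be sketched roughly as follows. This is an illustration under assumptions, not a tested migration script: it assumes the exported CSV has the timestamp in the first column, that timestamps look like `2018-04-26 23:30:00+0000` (the actual format depends on the cqlsh configuration), and that buckets are UTC days. `add_date_bucket` and `TS_FORMAT` are hypothetical names.

```python
import csv
import io
from datetime import datetime, timezone

# Assumed timestamp format of the exported CSV; adjust to match the real export.
TS_FORMAT = "%Y-%m-%d %H:%M:%S%z"


def add_date_bucket(src, dst, ts_col=0, ts_format=TS_FORMAT):
    """Copy CSV rows from src to dst, inserting a date_bucket value
    right after the timestamp column. Returns the number of rows written."""
    reader = csv.reader(src)
    writer = csv.writer(dst)
    count = 0
    for row in reader:
        ts = datetime.strptime(row[ts_col], ts_format)
        # Bucket by UTC day, matching the assumed bucket format on the read path.
        bucket = ts.astimezone(timezone.utc).date().isoformat()
        writer.writerow(row[:ts_col + 1] + [bucket] + row[ts_col + 1:])
        count += 1
    return count
```

The column order of the output (`ts`, `date_bucket`, then the remaining columns) would need to match the column list given when loading the file into the new table.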

@yurishkuro @black-adder @jpkrohling @pavolloffay wdyt?
