
create a go kafka consumer to save off kafka data to cassandra #6

Open
joestein opened this issue May 13, 2015 · 3 comments

@joestein

We want to do this in a table with DataStax Enterprise running so we can have it Solr-indexed. This is VERY important for us so we can search logline.line and similar fields. It also allows us to index the data in the different ways you can categorize the object (e.g. based on source, tag, logtype, etc.).

@joestein (Author)

We should use https://github.com/gocql/gocql

@edgefox edgefox self-assigned this Jun 19, 2015
@edgefox commented Jun 19, 2015

@joestein What structure should we expect from Kafka, and what should be exported to Cassandra?

@joestein (Author)

We are getting Avro LogLine messages from Kafka. These could be materialized as a few different Cassandra tables. You can build the timeuuid from the timestamp carried in the LogLine itself, which is nice, clean and unique (http://docs.datastax.com/en/cql/3.1/cql/cql_reference/timeuuid_functions_r.html), and save that instead of using now(). That keeps the rows ordered by when the event actually happened rather than by when the system thought it happened. We should have another table that stores now() as a clustering key for the same partition keys. A Spark job reading from Kafka and Cassandra can then provide continued monitoring, audits and alerts for when these two times are drifting, what the current drift is at each point "touching" the event, and what time the system actually thinks it is. This value should accompany the NTP drift value so we can also compare against the time the server thinks it is.

Other tables should be built from the tag structures. Every tag value should be a partition key, with the clustering key being the time of the event. We should do the same for source, and for every combination of source and tag too. The actual breaking-up of the partition index should come from the log type index value. Sometimes several tag keys together, along with some of their matched values, will make up a composite primary key.
