Q. What kind of data providers can I use as a data source? What media can the results be written to?
A. At the moment, Capillaries can only read and write CSV and Parquet files. It can read from:
- file system
- http/https
- sftp URIs
- S3 buckets
and write to:
- file system
- sftp URIs
- S3 buckets
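For example, a file reader can point at source URIs such as /tmp/capi_in/orders.csv, https://example.com/orders.csv, sftp://user@example.com/inbound/orders.csv, or s3://some-bucket/inbound/orders.csv (the paths, host names, and exact URI formats here are only illustrative).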
Q. I have my data file(s) ready, but I am not sure how to create a Capillaries script that would process them. What's the best way to start?
A. You may want to try the Toolbelt proto_file_reader_creator command. It can read your source file and generate a simple Capillaries script that reads data from your file in /tmp/capi_in, saves it to a Capillaries table, and writes it back to a file with the same name in /tmp/capi_out. Check out the proto_file_reader_creator integration test, which creates Capillaries scripts on the fly and runs them against your Capillaries setup.
Make sure the script generated from your data file works as you expect, then add more processing nodes to it.
Q. Is there a limit on the number of nodes, runs, and indexes?
A. The number of nodes in the script and runs performed for a keyspace are virtually unlimited. But keep in mind that each run-node pair creates a separate table in Cassandra (if an index is created, account for an extra table for each run-node-index triad), and Cassandra does not scale well when the number of tables in a keyspace exceeds a couple of hundred.
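For example, a script with 20 table-creating nodes, each maintaining one index, executed across 5 runs produces 20 × 5 = 100 data tables plus another 20 × 5 = 100 index tables, so that keyspace is already at roughly 200 tables.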
Q. I can't see any code/example that works with NULLs. Are they supported?
A. There is no support for NULL values. To mitigate this, Capillaries offers support for custom default values. See default_value in the Table write field definition: whenever an empty value is found in the source CSV or Parquet file, this default_value is written to the table.
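As a rough illustration of that behavior (this is not Capillaries source code; the function and field are hypothetical), the substitution amounts to something like:

```go
// applyDefaultValue mimics the default_value behavior described above for a
// single string field: an empty cell in the source file is replaced with the
// configured default before the value reaches the Cassandra table.
func applyDefaultValue(rawSourceValue string, defaultValue string) string {
	if rawSourceValue == "" {
		return defaultValue
	}
	return rawSourceValue
}
```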
Q. I can see that Capillaries offers runs as a tool to handle scenarios when some data was not processed properly. After making all necessary fixes, operators can re-start a run (or multiple runs) to overwrite data for all nodes affected by this run - in intermediate Cassandra tables and in the result files. But it may unnecessarily affect too many nodes and take too long. Can I re-process a single failed script node? A single failed batch?
A. Re-processing nodes: yes, to some extent. But it has to be part of the script design. You can design your Capillaries script so that the node in question and all its dependants can only be started manually. This means that even on successful script execution, the operator will have to manually start a run that processes only the node in question, and manually start a run that processes its dependants.
Re-running single batches: no. Capillaries data tables that hold the results of a run are immutable; re-running batches would violate this restriction.
Q. When a run is started, I can see RabbitMQ messages created immediately for every batch of every node affected by the run, and those messages linger in the queue for a while until the node is ready for processing. Why doesn't Capillaries send RabbitMQ messages to a node only after dependency node processing is complete?
A. Because it would be extremely hard (if even possible) to guarantee the atomicity/idempotency of batch handling by code that:
- modifies data tables
- modifies workflow tables
- generates the next set of RabbitMQ messages
The trickiest part would be to guarantee that only one copy of a batch message for a specific node is created and handled. The approach that Capillaries takes (creating the full set of messages in one shot at the run start) guarantees it. If a node is not ready to process a batch message, the message goes to the dead letter exchange, where it waits for dead_letter_ttl milliseconds before being re-routed to the queue again.
This article explains RabbitMQ dead letter exchange use.
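For background, here is a minimal sketch of how such a retry loop is commonly wired in RabbitMQ with the Go amqp091 client. It is not the actual Capillaries topology: the exchange and queue names are made up, and only the dead_letter_ttl idea comes from the paragraph above. The work queue dead-letters rejected messages to a "wait" queue, whose per-queue TTL sends them back to the work queue after the configured delay.

```go
package main

import (
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}

	const deadLetterTtlMs = 5000 // stands in for dead_letter_ttl

	// Work exchange/queue: batch messages are consumed from here. Messages
	// rejected by a consumer are dead-lettered to the "wait" exchange.
	ch.ExchangeDeclare("batches", "direct", true, false, false, false, nil)
	ch.QueueDeclare("batches_q", true, false, false, false, amqp.Table{
		"x-dead-letter-exchange": "batches_wait",
	})
	ch.QueueBind("batches_q", "batch", "batches", false, nil)

	// Wait exchange/queue: messages sit here for deadLetterTtlMs, then are
	// dead-lettered back to the work exchange and become consumable again.
	ch.ExchangeDeclare("batches_wait", "direct", true, false, false, false, nil)
	ch.QueueDeclare("batches_wait_q", true, false, false, false, amqp.Table{
		"x-message-ttl":          int32(deadLetterTtlMs),
		"x-dead-letter-exchange": "batches",
	})
	ch.QueueBind("batches_wait_q", "batch", "batches_wait", false, nil)

	// A consumer that receives a batch whose dependency nodes are not finished
	// yet rejects it without requeueing (delivery.Nack(false, false)), sending
	// the message on the wait-queue detour described above.
}
```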
Q. Can Capillaries survive RabbitMQ service disruption?
A. No. Guaranteed delivery of RabbitMQ messages for each Capillaries batch is one of the cornerstones of Capillaries architecture.
Q. For each row in my data table, I need to acquire some new data from an external source (say, via a web service), providing some row fields as arguments. How can I do that?
A. Start a run that dumps the table into files via a file writer, including some unique row identifiers; acquire data from the external source; save the acquired data into new files that use the same unique row identifiers; and start a run that uses those new files.
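A minimal sketch of the middle step, assuming the first run dumped rows as CSV with a unique row id in the first column and a lookup key in the second (the file names, column layout, and web service URL are all hypothetical):

```go
package main

import (
	"encoding/csv"
	"io"
	"log"
	"net/http"
	"net/url"
	"os"
)

func main() {
	// Dump produced by the first run: row_id, lookup_key.
	in, err := os.Open("/tmp/capi_out/rows_to_enrich.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer in.Close()

	// New file consumed by the second run, keyed by the same row ids.
	out, err := os.Create("/tmp/capi_in/acquired_data.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	r := csv.NewReader(in)
	w := csv.NewWriter(out)
	defer w.Flush()
	w.Write([]string{"row_id", "acquired_value"})

	if _, err := r.Read(); err != nil { // skip the header row
		log.Fatal(err)
	}
	for {
		rec, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		rowId, key := rec[0], rec[1]

		// Call the external source, passing a row field as an argument.
		resp, err := http.Get("https://external-source.example.com/lookup?key=" + url.QueryEscape(key))
		if err != nil {
			log.Fatal(err)
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()

		w.Write([]string{rowId, string(body)})
	}
}
```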
Q. Is there a UI for Capillaries?
A. Yes. See the Capillaries UI project, which is a simple single-page web application that shows the status of every run in every keyspace. UI requirements tend to be very business-specific, so it's not an easy task to come up with a cookie-cutter UI framework that would be flexible enough. Dedicated solution developers are encouraged to develop their own UI for Capillaries workflows, using Capillaries Webapi as a back-end and Capillaries UI as an example.
Also please note that the Toolbelt can produce rudimentary visuals using the DOT diagram language - see the Toolbelt validate_script and get_run_status_diagram commands.
Q. Can I run Capillaries in Docker?
A. Yes. See the Capillaries 100% Docker-based demo.
Q. Can I run Capillaries against cloud-based Cassandra?
A. As of this writing (2022), Azure CosmosDB and AWS Keyspaces have notoriously high latency. For example, Azure can complete a CREATE TABLE command successfully, but an INSERT command executed immediately after that may return an error saying that the table does not exist.
This situation can be mitigated to some extent by creating all tables for a specific run in advance and verifying that all tables are in place. A toolbelt command producing the CQL statements that create all tables for a run may look like this:
go run capitoolbelt.go get_table_cql -script_file=... -params_file=... -keyspace=... -run_id=... -start_nodes=...
The tricky part is to specify the correct run id for a run that has not started yet.
Another tricky part is to run this CQL against the cloud infrastructure and wait until all tables are guaranteed to be created.
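A minimal sketch of that waiting step, assuming the gocql driver and a list of expected table names (for example, taken from the get_table_cql output); the polling interval and timeout are illustrative:

```go
package waitutil

import (
	"fmt"
	"time"

	"github.com/gocql/gocql"
)

// waitForTables polls system_schema.tables until every expected table is
// visible in the keyspace, or the timeout expires.
func waitForTables(session *gocql.Session, keyspace string, expected []string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		found := map[string]bool{}
		iter := session.Query(
			"SELECT table_name FROM system_schema.tables WHERE keyspace_name = ?",
			keyspace).Iter()
		var name string
		for iter.Scan(&name) {
			found[name] = true
		}
		if err := iter.Close(); err != nil {
			return err
		}
		allPresent := true
		for _, t := range expected {
			if !found[t] {
				allPresent = false
				break
			}
		}
		if allPresent {
			return nil // all tables are visible, safe to start the run
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out waiting for tables in keyspace %s", keyspace)
		}
		time.Sleep(2 * time.Second)
	}
}
```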
Bottom line: Capillaries' use of cloud-based Cassandra is questionable at the moment.
Q. What are the potential directions for improvement?
A. Here are some, in no particular order:
- Performance enhancements, especially those related to the efficient use of Cassandra.
- Read/write from/to other file formats, maybe databases. Update 2023: Apache Parquet support was added, see Parquet reader and Parquet writer.
- Creating node configuration is a tedious job. Consider adding a toolbelt command that takes a CSV or Parquet file as an input and generates JSON for a corresponding file_table/table_file node. Update 2023: done, see the proto_file_reader_creator test.
- Is the lack of NULL values support a deal-breaker? Update March 2024: support for *_if aggregate functions was added, which should help mitigate the lack of NULL support.
- Python formulas: (a) need a strategy to mitigate potential security threats introduced by py_calc (SELinux/AppArmor?); (b) consider generic lambda functions?
- Keep an eye on Azure/AWS/GCP progress with Cassandra-compatible databases (latency!) and RabbitMQ offerings.
- Select distinct field values from a table: it can be implemented easily using a set, but it will not scale and will be limited by the size of the in-memory map. Alternatively, it can be implemented using Cassandra features, but that would require Capillaries to support tables without rowid (so the unique values are stored in a partitioning key field). Update March 2024: done, see the distinct_table node.
- Keep adding support for Go library functions.