ApacheCon, Upgrading Cassandra Using Automation
Upgrades using cstar - A presentation from ApacheCon @Home 2020
I recently upgraded more than 200 Cassandra nodes, spread across multiple environments and serving multiple applications, using the cstar tool. I chose cstar because, among the automation options, it has topology awareness specific to Cassandra. I will share my experience with this upgrade, including observations and surprises, as well as a walk-through of the process using a Cassandra cluster provisioned in Docker.
Note: This presentation was given in the middle of the Covid pandemic, and you can see that we were still adapting to holding conferences and giving presentations remotely!
Key points:
Why cstar?
- Chosen for its Cassandra-specific topology awareness, which is crucial for safe rolling upgrades in distributed environments.
- Preferred over custom scripts and other tools due to its robustness, community support, and ability to handle complex cluster topologies.
How cstar works:
- Runs commands in parallel across nodes, respecting Cassandra's token distribution and data center layout.
- Requires only minimal dependencies (Python 3 on a jump host, SSH access to the nodes).
- Does not need to be installed on each Cassandra node, only on the jump host.
- Supports running custom scripts, distinct tasks, and custom commands for flexible automation.
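To picture what "respecting data center layout" means for a rolling job, here is a minimal sketch of a wave planner. It is not cstar's internal algorithm: it only groups nodes so that each wave holds at most one node per data center, and it ignores token ownership, which cstar also takes into account. The functions `plan_waves` and `ssh_argv`, the `cassandra` SSH user, and the IP addresses are all hypothetical, and the demo only prints the SSH commands a jump host would run instead of executing them.

```python
from collections import defaultdict
from itertools import zip_longest


def plan_waves(nodes):
    """Group (host, dc) pairs into waves holding at most one host per DC.

    Hosts in the same wave sit in different data centers, so a wave can run
    in parallel while each DC still loses only one node at a time.
    """
    by_dc = defaultdict(list)
    for host, dc in nodes:
        by_dc[dc].append(host)
    # One sorted column of hosts per DC, with DCs in name order.
    columns = [sorted(hosts) for _, hosts in sorted(by_dc.items())]
    # zip_longest takes the i-th host from every DC; exhausted DCs yield None.
    return [[h for h in row if h is not None] for row in zip_longest(*columns)]


def ssh_argv(host, command, user="cassandra"):
    """Build the ssh argument list a jump host would run for one node."""
    # Only plain SSH is assumed; nothing is installed on the node itself.
    return ["ssh", f"{user}@{host}", command]


if __name__ == "__main__":
    cluster = [
        ("10.0.0.1", "dc1"), ("10.0.0.2", "dc1"), ("10.0.0.3", "dc1"),
        ("10.1.0.1", "dc2"), ("10.1.0.2", "dc2"),
    ]
    for number, wave in enumerate(plan_waves(cluster), start=1):
        for host in wave:
            # Dry run: print the command instead of calling subprocess.run().
            print(number, " ".join(ssh_argv(host, "nodetool version")))
```

With two data centers, the first two waves each touch one node in `dc1` and one in `dc2`, and the third wave handles the remaining `dc1` node on its own.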
Upgrade process:
- Preparation: Pre-checks included verifying SSH access, permissions, disk space (with 60% overhead recommended), and ensuring no leftover SSTables or snapshots from previous upgrades.
- Execution: Used cstar for rolling upgrades and configuration changes, ensuring only one node per data center was down at a time (using "strategy one").
- Verification: Used cstar to quickly check Cassandra versions and disk space cluster-wide post-upgrade.
- Post-upgrade: Ran custom scripts for SSTable upgrades and cleanup using cstar's built-in commands.
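The preparation and verification steps reduce to simple checks once per-node data has been collected. The sketch below assumes that collection already happened (for example by running `df` and `nodetool version` on each node through cstar) and only evaluates the results. Reading "60% overhead" as "free space of at least 60% of the data on the node" is an assumption, as are the function names, host names, and version strings. `nodetool version` does print a `ReleaseVersion:` line, which is what the parser looks for.

```python
import re

# Assumption: "60% overhead" is read here as free space of at least 60% of the
# data already on the node; adjust HEADROOM_PCT to your own rule.
HEADROOM_PCT = 60

# `nodetool version` prints a line such as "ReleaseVersion: 3.11.9".
_VERSION_RE = re.compile(r"^ReleaseVersion:\s*(\S+)", re.MULTILINE)


def has_headroom(data_bytes, free_bytes, pct=HEADROOM_PCT):
    """Pre-check: is there enough free disk to rewrite this node's SSTables?"""
    # Integer arithmetic avoids float rounding at the boundary.
    return free_bytes * 100 >= data_bytes * pct


def parse_release_version(nodetool_output):
    """Return the version from `nodetool version` output, or None if absent."""
    match = _VERSION_RE.search(nodetool_output)
    return match.group(1) if match else None


def hosts_not_on(target, outputs):
    """Post-check: sorted hosts whose reported version is not `target`."""
    return sorted(
        host for host, out in outputs.items()
        if parse_release_version(out) != target
    )


if __name__ == "__main__":
    # Illustrative inputs only; gather real values via cstar or SSH.
    print(has_headroom(data_bytes=500 * 2**30, free_bytes=250 * 2**30))  # False
    outputs = {
        "node1": "ReleaseVersion: 3.11.9\n",
        "node2": "ReleaseVersion: 3.0.22\n",
        "node3": "",
    }
    print(hosts_not_on("3.11.9", outputs))  # ['node2', 'node3']
```

A node with empty or unparseable output is reported alongside nodes on the wrong version, so a silent failure is treated the same as an upgrade that did not happen.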
Lessons learned:
- cstar's output can be too quiet; enabling verbose logging is recommended for troubleshooting.
- The tool halts on errors; sometimes manual intervention is needed, but jobs can often resume without restarting the whole process.
- The cstar jobs folder is useful for tracking job status and output, especially when using screen sessions for long-running tasks.
- Automating more verification steps and integrating backup tools like Medusa would improve future upgrades.
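The "halt on error, then resume" behavior described above follows a common checkpointing pattern. The sketch below is a generic illustration of that pattern, not cstar's jobs-folder format or its resume logic: the JSON state file, `run_resumable`, and the `action` callback are all hypothetical. Each successful host is recorded, the loop stops at the first failure so an operator can step in, and a rerun skips hosts that already finished.

```python
import json
import tempfile
from pathlib import Path


def run_resumable(hosts, action, state_file):
    """Run action(host) for each host, recording successes in state_file.

    Returns the first failing host (after stopping there), or None if every
    host succeeded. A later call skips hosts already recorded as done.
    """
    path = Path(state_file)
    done = set(json.loads(path.read_text())) if path.exists() else set()
    for host in hosts:
        if host in done:
            continue  # finished in an earlier run
        if not action(host):
            return host  # halt so the operator can intervene
        done.add(host)
        # Persist after every host so a crash loses at most one step.
        path.write_text(json.dumps(sorted(done)))
    return None


if __name__ == "__main__":
    state = Path(tempfile.mkdtemp()) / "upgrade-state.json"
    hosts = ["node1", "node2", "node3"]
    failed = run_resumable(hosts, lambda h: h != "node2", state)
    print("halted at:", failed)  # halted at: node2
    # After fixing node2 by hand, the rerun skips node1.
    failed = run_resumable(hosts, lambda h: True, state)
    print("halted at:", failed)  # halted at: None
```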
Takeaways:
- cstar is a powerful and reliable tool for automating large-scale Cassandra upgrades, especially when cluster topology and operational safety are priorities.
- Proper preparation, monitoring, and iterative testing (in staging before production) are essential for success.
- Community tools like cstar offer advantages over custom scripts in terms of maintainability and shared expertise.
See the video here: https://www.youtube.com/watch?v=xcX_0UXjEvo