ApacheCon, Upgrading Cassandra Using Automation

Upgrades using cstar - A presentation from ApacheCon @Home 2020

I recently upgraded 200+ Cassandra nodes across multiple environments, serving multiple applications, using the cstar tool. I chose cstar because, of all the automation options, it is the one with topology awareness specific to Cassandra. I will share my experience with this upgrade, including observations and surprises, and walk through the process using a Cassandra cluster provisioned in Docker.

Note: This presentation was given in the middle of the Covid-19 pandemic, and you can see we were still adapting to holding conferences and presenting remotely!

Key points:

Why cstar:

  • Chosen for its Cassandra-specific topology awareness, which is crucial for safe rolling upgrades in distributed environments.

  • Preferred over custom scripts and other tools due to its robustness, community support, and ability to handle complex cluster topologies.

How cstar works:

  • Runs commands in parallel across nodes, respecting Cassandra’s token distribution and data center layout.

  • Requires only minimal dependencies (Python 3 on a jump host, SSH access to nodes).

  • Does not need to be installed on each Cassandra node—just the jump host.

  • Supports running custom scripts, distinct tasks, and custom commands for flexible automation.
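As a rough illustration of the points above, here is a minimal Python sketch that assembles a `cstar run` command line on the jump host. It only builds and prints the command; it does not execute anything. The flags used (`--command`, `--seed-host`, `--strategy`, `--ssh-username`, `-v`) reflect my reading of the cstar README and should be checked against `cstar run --help` for your installed version. The helper name `build_cstar_run` and the host name are hypothetical.

```python
import shlex

# Strategy names as I understand them from the cstar README (assumption:
# verify against `cstar run --help` before relying on them).
KNOWN_STRATEGIES = {"topology", "one", "all"}


def build_cstar_run(command, seed_host, strategy="topology",
                    ssh_username=None, verbose=False):
    """Return an argv list for a `cstar run` invocation (hypothetical helper)."""
    if strategy not in KNOWN_STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    argv = [
        "cstar", "run",
        f"--command={command}",      # shell command executed on each node
        f"--seed-host={seed_host}",  # node used to discover the cluster topology
        f"--strategy={strategy}",
    ]
    if ssh_username:
        argv.append(f"--ssh-username={ssh_username}")
    if verbose:
        argv.append("-v")
    return argv


# Hypothetical seed host; prints a copy-pasteable, shell-quoted command line.
example = build_cstar_run("nodetool version", "cassandra-seed-1.example.com",
                          strategy="one", verbose=True)
print(shlex.join(example))
```

Building the command as a list and quoting it with `shlex.join` avoids quoting mistakes when the remote command itself contains spaces.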

Upgrade process:

  • Preparation: Pre-checks included verifying SSH access, permissions, disk space (with 60% overhead recommended), and ensuring no leftover SSTables or snapshots from previous upgrades.

  • Execution: Used cstar for rolling upgrades and configuration changes, ensuring only one node per data center was down at a time (using cstar's “one” strategy).

  • Verification: Used cstar to quickly check Cassandra versions and disk space cluster-wide post-upgrade.

  • Post-upgrade: Ran custom scripts for SSTable upgrades and cleanup using cstar’s built-in commands.
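The verification step above can be sketched in Python. This is not the script used in the talk; it assumes you have already collected the output of `nodetool version` from every node (for example from a cstar job) into a dictionary keyed by host name. `nodetool version` prints a line of the form `ReleaseVersion: <version>`. The helper name, host names, and version numbers below are made up for illustration.

```python
import re

# `nodetool version` prints a line such as "ReleaseVersion: 3.11.8".
VERSION_RE = re.compile(r"ReleaseVersion:\s*(\S+)")


def find_stragglers(outputs, target_version):
    """Return {host: reported_version} for hosts not on target_version.

    A host whose output has no ReleaseVersion line maps to None, so that
    unreachable or failed nodes are flagged rather than silently skipped.
    """
    stragglers = {}
    for host, text in outputs.items():
        match = VERSION_RE.search(text)
        reported = match.group(1) if match else None
        if reported != target_version:
            stragglers[host] = reported
    return stragglers


# Hypothetical sample data: one upgraded node, one old node, one with no output.
sample = {
    "node1": "ReleaseVersion: 3.11.8\n",
    "node2": "ReleaseVersion: 3.11.4\n",
    "node3": "",
}
for host, version in sorted(find_stragglers(sample, "3.11.8").items()):
    print(f"{host}: {version or 'no version reported'}")
```

Treating missing output as a failure is a deliberate choice here: on a 200+ node cluster, a node that returned nothing is more worrying than one that reports an old version.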

Lessons learned:

  • cstar’s output can be too quiet; enabling verbose logging is recommended for troubleshooting.

  • The tool halts on errors; sometimes manual intervention is needed, but jobs can often resume without restarting the whole process.

  • The cstar jobs folder is useful for tracking job status and output, especially when using screen sessions for long-running tasks.

  • Automating more verification steps and integrating backup tools like Medusa would improve future upgrades.
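Because cstar's console output can be quiet, a small script over the jobs folder can help when checking on a long-running job from another terminal. The sketch below is built on an assumption about the layout: that a job directory (under `~/.cstar/jobs/<JOB_ID>/` as I understand it) contains one sub-directory per host, each with a `status` file holding the command's exit code. Inspect a real job folder first and adjust the file names if yours differ. The function name `summarize_job` is hypothetical.

```python
import tempfile
from pathlib import Path


def summarize_job(job_dir):
    """Group host directories of one job into ok / failed / pending.

    Assumed layout (verify on your system): <job_dir>/<host>/status contains
    the exit code as text. Hosts without a status file count as pending.
    """
    summary = {"ok": [], "failed": [], "pending": []}
    host_dirs = sorted(p for p in Path(job_dir).iterdir() if p.is_dir())
    for host_dir in host_dirs:
        status_file = host_dir / "status"
        if not status_file.exists():
            summary["pending"].append(host_dir.name)
        elif status_file.read_text().strip() == "0":
            summary["ok"].append(host_dir.name)
        else:
            summary["failed"].append(host_dir.name)
    return summary


# Demo against a throwaway directory that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    for host, code in [("node1", "0"), ("node2", "1"), ("node3", None)]:
        host_dir = Path(tmp) / host
        host_dir.mkdir()
        if code is not None:
            (host_dir / "status").write_text(code + "\n")
    print(summarize_job(tmp))
```

Once the failed hosts are fixed by hand, the cstar README describes a `cstar continue <JOB_ID>` subcommand for resuming a halted job; check `cstar --help` to confirm it is available in your version.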

Takeaways:

  • cstar is a powerful and reliable tool for automating large-scale Cassandra upgrades, especially when cluster topology and operational safety are priorities.

  • Proper preparation, monitoring, and iterative testing (in staging before production) are essential for success.

  • Community tools like cstar offer advantages over custom scripts in terms of maintainability and shared expertise.

See the video here: https://www.youtube.com/watch?v=xcX_0UXjEvo