Seamless Upgrade Strategy for Apache Cassandra and OS on EC2

Problem:

The client was running Apache Cassandra 4.1.5 on an AWS EC2 machine, installed manually by extracting the tarball, and wanted to upgrade both Cassandra and the operating system. They needed to understand the feasibility of the upgrade and the challenges it might involve. They were particularly concerned about minimizing downtime, avoiding data corruption, and keeping client applications compatible throughout the upgrade.

Process:

Step 1: Initial analysis

The expert began by analyzing the Cassandra setup, which had been installed manually by extracting the tarball (e.g., tar -xzvf apache-cassandra-4.1.5-bin.tar.gz). The client provided key details about the cluster, such as the cluster name (Production IPAAS Cluster), seed nodes, replication strategy, and partitioner. The configuration indicated a well-established Cassandra deployment, with specific settings for heap memory, storage, and performance tuning.

To provide a thorough upgrade strategy, the expert requested additional details, such as the OS version, Java compatibility, and a few outputs from Cassandra tools like nodetool status and the contents of cassandra.yaml and cassandra-env.sh. The goal was to assess the current state of the system, identify any potential incompatibilities, and prepare a detailed upgrade plan.
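The information requested above can be gathered in one pass. The following is a minimal sketch, not the expert's actual procedure: the function name collect_diagnostics, the default CASSANDRA_HOME of /opt/cassandra, and the output file names are all assumptions chosen for illustration.

```shell
#!/bin/sh
# Collect pre-upgrade diagnostics into one directory for review.
# CASSANDRA_HOME and the output layout are assumptions; adjust to the real install.

collect_diagnostics() {
    out_dir="$1"
    cassandra_home="${CASSANDRA_HOME:-/opt/cassandra}"   # hypothetical tarball location
    mkdir -p "$out_dir" || return 1

    # OS and kernel details
    if [ -r /etc/os-release ]; then
        cp /etc/os-release "$out_dir/os-release"
    fi
    uname -a > "$out_dir/uname.txt"

    # java -version writes to stderr, so capture both streams
    if command -v java >/dev/null 2>&1; then
        java -version > "$out_dir/java-version.txt" 2>&1
    else
        echo "java not found on PATH" > "$out_dir/java-version.txt"
    fi

    # Cluster state as seen from this node
    if command -v nodetool >/dev/null 2>&1; then
        nodetool status  > "$out_dir/nodetool-status.txt" 2>&1
        nodetool version > "$out_dir/nodetool-version.txt" 2>&1
    else
        echo "nodetool not found on PATH" > "$out_dir/nodetool-status.txt"
    fi

    # Configuration files, copied only if they are readable
    for f in cassandra.yaml cassandra-env.sh; do
        if [ -r "$cassandra_home/conf/$f" ]; then
            cp "$cassandra_home/conf/$f" "$out_dir/$f"
        fi
    done
}
```

Calling collect_diagnostics /tmp/cassandra-diag on each node would produce a small directory that can be attached to the support request.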

Step 2: Proposed solutions

The expert proposed a structured approach to upgrade both the operating system and Cassandra. The solution was designed to ensure data integrity, compatibility, and minimal downtime during the transition.

Cassandra upgrade considerations:

The expert recommended a rolling upgrade approach for Cassandra. This would involve upgrading one node at a time, ensuring that the rest of the cluster remains operational. The upgrade path would be as follows:

Minor upgrade: From Apache Cassandra 4.1.5 to a later 4.1.x patch release (recommended, since it carries fewer compatibility risks).
Major upgrade: From 4.1.5 to 5.0.x, with additional compatibility checks due to breaking changes, particularly with schema metadata storage and internode messaging.
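
Which of the two paths applies can be read straight off the version numbers. The helper below is an illustrative sketch; classify_upgrade is a hypothetical name, and it uses the usual three-part convention, so the move the text calls a minor upgrade (4.1.5 to 4.1.x) is reported as patch.

```shell
# Classify an upgrade as patch, minor, or major from two x.y.z version strings.
classify_upgrade() {
    from="$1"; to="$2"
    from_major="${from%%.*}";      to_major="${to%%.*}"
    from_rest="${from#*.}";        to_rest="${to#*.}"
    from_minor="${from_rest%%.*}"; to_minor="${to_rest%%.*}"

    if [ "$from_major" != "$to_major" ]; then
        echo "major"      # e.g. 4.1.5 -> 5.0.x: extra compatibility checks needed
    elif [ "$from_minor" != "$to_minor" ]; then
        echo "minor"      # e.g. 4.0.x -> 4.1.x
    else
        echo "patch"      # e.g. 4.1.5 -> 4.1.6: lowest-risk path
    fi
}
```

For example, classify_upgrade 4.1.5 4.1.6 prints patch, while classify_upgrade 4.1.5 5.0.2 prints major.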

The expert also noted the importance of restarting nodes one at a time, so that client applications can keep connecting to the remaining nodes and the cluster stays available. Each node would be stopped, upgraded, and restarted individually. Before upgrading, it was crucial to back up the configuration files, snapshot the data, copy the commit logs, and disable hinted handoff to avoid potential corruption.
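
The backup step can be sketched as a small function. This is an assumption-laden example rather than the client's script: backup_node, the BACKUP_ROOT default, the /opt/cassandra location, and the tag format are all hypothetical. It relies only on nodetool flush and nodetool snapshot -t, and returns the backup directory on standard output.

```shell
# Back up configuration and take a snapshot before touching a node.
# CASSANDRA_HOME, BACKUP_ROOT and the tag format are illustrative assumptions.
backup_node() {
    cassandra_home="${CASSANDRA_HOME:-/opt/cassandra}"
    backup_root="${BACKUP_ROOT:-/var/backups/cassandra}"
    tag="pre-upgrade-$(date +%Y%m%d-%H%M%S)"
    dest="$backup_root/$tag"

    mkdir -p "$dest/conf" || return 1
    cp "$cassandra_home/conf/cassandra.yaml" \
       "$cassandra_home/conf/cassandra-env.sh" "$dest/conf/" || return 1

    # Flush memtables to disk, then snapshot the SSTables under the given tag
    nodetool flush >/dev/null || return 1
    nodetool snapshot -t "$tag" >/dev/null || return 1

    # Report where the configuration copy was written
    echo "$dest"
}
```

Snapshots stay on the same disks as the data, so copying them to separate storage before the upgrade is a sensible extra precaution.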

OS upgrade considerations:

The operating system upgrade was to be performed after the Cassandra upgrade. The expert emphasized the need to check Java compatibility before upgrading the OS, because an OS upgrade can change the installed Java version: Cassandra 4.x runs on Java 8 or 11, while Cassandra 5.0 runs on Java 11 or 17. The OS upgrade should follow these steps:

Stop the Cassandra service.
Perform the OS upgrade using the package manager (sudo yum update -y for Amazon Linux).
Reboot the machine.
Restart Cassandra, ensuring that the Java version is compatible and that no errors occur during startup.
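
The Java check in the last step can be scripted so that it runs before Cassandra is restarted. The sketch below is illustrative: the function names are hypothetical, and the supported-version table (Cassandra 4.x on Java 8 or 11, Cassandra 5.0 on Java 11 or 17) is an assumption taken from the upstream documentation that should be verified against the exact release being installed.

```shell
# Extract the Java major version from a version string
# ("1.8.0_392" -> 8, "11.0.21" -> 11, "17.0.9" -> 17).
java_major() {
    v="$1"
    case "$v" in
        1.*) v="${v#1.}" ;;      # legacy scheme: 1.8.x means Java 8
    esac
    echo "${v%%[!0-9]*}"         # keep only the leading digits
}

# Return 0 when the Java major version suits the Cassandra major version.
# Assumed mapping: Cassandra 4.x -> Java 8 or 11; Cassandra 5.0 -> Java 11 or 17.
java_supported() {
    cassandra_major="$1"; jmajor="$2"
    case "$cassandra_major:$jmajor" in
        4:8|4:11|5:11|5:17) return 0 ;;
        *) return 1 ;;
    esac
}

# Read the installed version; java -version writes to stderr.
installed_java_version() {
    java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}
```

A typical use would be java_supported 4 "$(java_major "$(installed_java_version)")" || echo "unsupported Java", run after the reboot and before starting Cassandra.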

The expert highlighted that if the OS upgrade crossed a major release (e.g., from Amazon Linux 2 to Amazon Linux 2023), Cassandra might need to be reinstalled. Testing the upgrade process in a non-production environment was therefore strongly recommended to confirm compatibility and avoid surprises during the actual upgrade.

Solution:

The expert recommended a step-by-step approach to upgrading both the Cassandra version and the OS. For the Cassandra upgrade, the following actions were to be taken:

Backup everything: Including cassandra.yaml, cassandra-env.sh, data, and commitlogs.
Rolling upgrade: Perform the upgrade on each node one by one. For each node:
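- Drain the node (nodetool drain), stop Cassandra (sudo systemctl stop cassandra, assuming a systemd unit has been set up for the tarball install), and perform the upgrade by extracting the new Cassandra version (e.g., apache-cassandra-4.1.6).
- Update environment variables (for example, CASSANDRA_HOME) to point at the new directory, carry over the configuration, and restart the node (sudo systemctl start cassandra).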
- Verify the node’s health (nodetool status).
Once all nodes were upgraded, the expert recommended running a nodetool repair -pr to ensure data consistency across the cluster.
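
The per-node steps above can be sketched as a single function. This is a hedged illustration, not the procedure the client ran: the INSTALL_ROOT layout with a cassandra symlink pointing at the active release, the systemd unit name, and the function names are all assumptions. Copying the old configuration unchanged is only reasonable within the same release line; for a major upgrade the files should be merged by hand.

```shell
# Upgrade one node of a tarball-based install. Run on each node in turn,
# moving on only after the previous node is back to UN (Up/Normal).
# Paths, the symlink layout, and the systemd unit name are assumptions.
INSTALL_ROOT="${INSTALL_ROOT:-/opt}"

# Succeeds when `nodetool status` output on stdin lists the address as UN.
node_is_un() {
    awk -v ip="$1" '$1 == "UN" && $2 == ip { found = 1 } END { exit !found }'
}

upgrade_node() {
    tarball="$1"      # e.g. apache-cassandra-4.1.6-bin.tar.gz
    node_ip="$2"      # this node's address as shown by nodetool status
    new_dir="$INSTALL_ROOT/$(basename "$tarball" -bin.tar.gz)"

    nodetool drain                               || return 1   # flush, stop accepting writes
    sudo systemctl stop cassandra                || return 1
    sudo tar -xzf "$tarball" -C "$INSTALL_ROOT"  || return 1

    # Carry the existing configuration over, then repoint the symlink
    sudo cp "$INSTALL_ROOT/cassandra/conf/cassandra.yaml" \
            "$INSTALL_ROOT/cassandra/conf/cassandra-env.sh" \
            "$new_dir/conf/"                     || return 1
    sudo ln -sfn "$new_dir" "$INSTALL_ROOT/cassandra" || return 1
    sudo systemctl start cassandra               || return 1

    # Poll until the node reports Up/Normal (30 attempts)
    tries=0
    while [ "$tries" -lt 30 ]; do
        if nodetool status 2>/dev/null | node_is_un "$node_ip"; then
            return 0
        fi
        tries=$((tries + 1))
        sleep "${POLL_SECONDS:-10}"
    done
    return 1
}
```

Keeping the previous release directory in place means a rollback is a matter of stopping the service, repointing the symlink, and starting it again.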

For the OS upgrade, the process was as follows:

Stop Cassandra and perform the OS upgrade via the package manager (sudo yum update -y for Amazon Linux).
Reboot the machine, check Java compatibility (java -version), and restart Cassandra.
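After upgrading, verify the kernel and OS versions (uname -r, cat /etc/os-release), check for any Java issues, and monitor the Cassandra logs (tail -f /var/log/cassandra/system.log, or logs/system.log under the Cassandra directory, which is the default location for a tarball install).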
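
The post-upgrade checks can be collected into a short report. The sketch below is illustrative only: the function names and the default log path under /opt/cassandra are assumptions, and the error count relies on log lines starting with their level, as they do with Cassandra's default logging pattern.

```shell
# Print the OS name and version from an os-release style file.
os_pretty_name() {
    file="${1:-/etc/os-release}"
    sed -n 's/^PRETTY_NAME="\{0,1\}\([^"]*\)"\{0,1\}$/\1/p' "$file"
}

# Count ERROR-level lines in a Cassandra log.
count_log_errors() {
    grep -c '^ERROR' "$1"
}

# Summarize the state of the node after the OS upgrade.
post_upgrade_report() {
    log_file="${1:-${CASSANDRA_HOME:-/opt/cassandra}/logs/system.log}"  # assumed path
    echo "Kernel: $(uname -r)"
    echo "OS:     $(os_pretty_name)"
    echo "Java:   $(java -version 2>&1 | head -n 1)"
    echo "Errors in $log_file: $(count_log_errors "$log_file")"
}
```

A non-zero error count after the restart is a cue to read the log in full before moving on to the next node.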

Conclusion:

The solution provided a comprehensive, structured approach to upgrading both Cassandra and the operating system. By performing a rolling upgrade, the client would avoid downtime and keep production services running throughout. The steps were tailored to the client's specific setup, including the tarball-based Cassandra installation and custom configuration. The expert also emphasized testing the upgrade in a non-production environment to identify potential issues before applying the changes in production. This approach would allow the client to upgrade the system while minimizing the risk of data loss, service interruptions, and compatibility issues.