Creation of Sorted String Table Folders in Cassandra - Proactive Insights and Support For Open-Source Applications

Problem:

The client wanted to understand how Cassandra creates SSTable (sorted string table) folders when data is written to a table within a keyspace. Specifically, they asked why a table can have two directories in certain scenarios and how this affects snapshot-based data backups.

Process:

The expert team began a preliminary investigation and asked the client for details. The client reported that one of their tables has two folders named table_name-unique_ID/, even though, according to their history, the table had never been dropped or recreated. The client wanted to know whether anything else could cause this.

The folder structure provided by the client:

table_name-uniqueID-1/
- backups
- data files
table_name-uniqueID-2/
- backups

Solution:

After the investigation, the expert team explained how Cassandra organizes backups and data files on disk, based on the following folder structure:

1. Folder Creation:

Cassandra creates a folder named table_name-uniqueID/ for each table.

table_name: Represents the exact table name in Cassandra.
uniqueID: A unique identifier appended to the table name. It distinguishes tables from one another, so a table recreated with the same name gets a new folder and does not overwrite the old one.
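
As an illustration of this naming scheme, the following minimal Python sketch splits a folder name into its two parts. It assumes that the uniqueID suffix is a 32-character lowercase hexadecimal string, which is how recent Cassandra versions write the table ID; the helper name parse_table_dir is hypothetical and not part of any Cassandra tooling.

```python
import re
from typing import NamedTuple, Optional

# Assumption: the suffix is the table ID written as 32 lowercase hex characters.
DIR_PATTERN = re.compile(r"^(?P<table>.+)-(?P<table_id>[0-9a-f]{32})$")


class TableDir(NamedTuple):
    table: str      # the table name part of the folder name
    table_id: str   # the unique identifier part of the folder name


def parse_table_dir(dirname: str) -> Optional[TableDir]:
    """Split 'table_name-uniqueID' into its parts, or return None if it does not match."""
    match = DIR_PATTERN.match(dirname)
    if match is None:
        return None
    return TableDir(match.group("table"), match.group("table_id"))
```

A folder name that does not follow the pattern (for example, one without the suffix) yields None rather than raising an error, so the helper can be applied to every entry in a keyspace directory.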

2. Subfolders:

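backups: Holds incremental backups of the table’s data (hard links to newly flushed SSTables) when incremental backups are enabled. Snapshots are stored separately, in a snapshots subfolder of the same table folder.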
data files: Hold the actual data files associated with the table.

3. Explanation:

Purpose of table_name: Indicates the specific Cassandra table for easy identification.
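Concatenated uniqueID: Keeps the data of a dropped table separate from that of a table later recreated with the same name. Cassandra does not automatically remove the old folder; with the default auto_snapshot setting, it retains a snapshot taken when the table was dropped.
backups Folder: Holds incremental backup copies of the table’s data, which, together with snapshots, are used for data restoration.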
Data Files: Contains the actual data files related to the table, crucial for maintaining data integrity and availability.
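
To relate the uniqueID in a folder name to a live table, the ID can be compared with the table's ID in the schema. The sketch below assumes Cassandra 3.0 or later, where the ID is available as the id column of system_schema.tables, and assumes the folder suffix is that UUID with the hyphens removed. The function names and the sample UUID are hypothetical.

```python
import uuid


def expected_dir_name(table: str, table_id: str) -> str:
    """Build the folder name expected for a table from its name and its UUID string."""
    # uuid.UUID(...).hex is the UUID as 32 lowercase hex characters, without hyphens.
    return f"{table}-{uuid.UUID(table_id).hex}"


def is_active_dir(dirname: str, table: str, table_id: str) -> bool:
    """Return True if dirname is the folder that belongs to the table's current ID."""
    return dirname == expected_dir_name(table, table_id)
```

A folder whose suffix does not match the current ID of the table with that name belongs to an earlier incarnation of the table.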

Detailed steps for restoring data from snapshots in Cassandra are described in a separate article.

When a table has only one folder, the table has not been dropped and recreated with the same name; the unique identifier in the folder name was generated when the table was first created. The folder structure serves Cassandra's internal organization: it keeps tables distinct and protects data integrity, and it has no other purpose.

The expert team concluded that Cassandra creates a folder whose name consists of the table name followed by a unique identifier. The first part identifies the underlying table. The second part keeps folders apart when a table is dropped and recreated with the same name: each creation generates a new identifier and therefore a new folder. If the table has never been dropped and recreated, there is only one folder. The identifier is generated and appended at table creation, so it is always present in the folder name regardless of other conditions.
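
To check a node for the situation described above, a keyspace directory can be scanned for tables that have more than one folder. The following is a minimal sketch, not an official tool. It assumes the 32-character hexadecimal suffix described earlier and assumes that SSTable data files have names ending in -Data.db; the function name find_duplicate_table_dirs is hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumption: folder names look like 'table_name-<32 lowercase hex characters>'.
DIR_PATTERN = re.compile(r"^(.+)-([0-9a-f]{32})$")


def find_duplicate_table_dirs(keyspace_dir):
    """Map each table name with more than one folder to [(folder_name, has_data_files), ...]."""
    groups = defaultdict(list)
    for entry in sorted(Path(keyspace_dir).iterdir()):
        match = DIR_PATTERN.match(entry.name)
        if not entry.is_dir() or match is None:
            continue
        # Assumption: SSTable data components end in '-Data.db'.
        has_data = any(
            p.is_file() and p.name.endswith("-Data.db") for p in entry.iterdir()
        )
        groups[match.group(1)].append((entry.name, has_data))
    # Keep only tables that appear under more than one folder.
    return {table: dirs for table, dirs in groups.items() if len(dirs) > 1}
```

The function takes the path of one keyspace directory under the node's data directory. In the client's layout, it would report the table with both of its folders and show that only one of them holds data files.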

Conclusion:

The client sought clarity on the logic behind the creation of SSTable (Sorted String Table) folders in Cassandra, particularly the occurrence of two directories per table in specific scenarios and its impact on data backup via snapshots. The investigation revealed that Cassandra uses a unique folder structure for each table, which includes a table name and a concatenated unique identifier. This structure is designed to ensure data integrity and distinguish tables, even when they are dropped and recreated.

Understanding the logic behind the creation of SSTable directories in Cassandra allows for effective data management and backups. The unique directory structure ensures that data remains organized and distinguishable, even if tables are deleted and recreated. This knowledge helps maintain data integrity and simplifies the backup processes in Cassandra.