Problem:

Ceph Storage Almost Full but Should Have Space. The client reported that the Ceph storage is nearly full even though sufficient space should be available. The output of ceph osd status shows that some OSDs are low on available space. The most common cause identified in such cases is the lost+found directory not being cleaned out after a crash or after files were deleted.

Process:

Step 1: Manual Deletion in lost+found

The immediate workaround is to free up space by manually deleting files inside the lost+found directory: identify files that are no longer needed and remove them.
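Before deleting anything, it helps to see which files in lost+found actually consume the most space. A minimal sketch (the mount point path below is an assumption; adjust it to your environment):

```python
import os

def largest_files(root, top_n=10):
    """Walk a directory tree and return the top_n largest files as (size, path) pairs."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # file may have vanished mid-walk; skip it
    return sorted(sizes, reverse=True)[:top_n]

# Hypothetical CephFS mount point; adjust for your cluster.
for size, path in largest_files("/mnt/cephfs/lost+found"):
    print(f"{size:>12}  {path}")
```

Reviewing the list first avoids deleting files that may still be recoverable or needed.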

Step 2: Debugging Information

Determine whether the issue is with CephFS or whether the client is using RADOS or RBD. Request the output of the following commands:
ceph osd tree
ceph df
rados lspools
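These commands also accept --format json for machine-readable output. A sketch that flags pools above a usage threshold from ceph df JSON (the field names follow the layout of recent Ceph releases, where percent_used is a fraction between 0 and 1; verify against your version):

```python
import json

def nearly_full_pools(ceph_df_json, threshold=0.85):
    """Return names of pools whose percent_used exceeds the threshold.

    Assumes the 'pools' list layout of 'ceph df --format json' in recent
    Ceph releases; percent_used is assumed to be a 0..1 fraction.
    """
    data = json.loads(ceph_df_json)
    return [p["name"] for p in data.get("pools", [])
            if p["stats"].get("percent_used", 0) > threshold]

# Abbreviated sample output for illustration:
sample = ('{"pools": ['
          '{"name": "rbd", "stats": {"percent_used": 0.91}},'
          '{"name": "cephfs_data", "stats": {"percent_used": 0.42}}]}')
print(nearly_full_pools(sample))  # -> ['rbd']
```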

Step 3: Additional Information

  • Review the output of ceph osd tree to check the weights and statuses of OSDs.
  • Analyze the output of ceph df to understand the overall storage utilization.
  • Verify the pools using rados lspools.
Step 4: Cluster and Pool Configuration

  • Ensure that clocks are synchronized across the cluster.
  • Check if each OSD has a unique ID.
  • Assess the replication factor and adjust it if necessary.
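The unique-ID check in the list above can be automated against the JSON form of ceph osd tree. A rough sketch (the 'nodes' list with type == "osd" entries is the layout used by recent Ceph releases; treat it as an assumption):

```python
import json
from collections import Counter

def duplicate_osd_ids(osd_tree_json):
    """Return any OSD IDs that appear more than once in 'ceph osd tree --format json'.

    Assumes the 'nodes' list layout of recent Ceph releases, where OSD
    entries carry type == "osd". This is a rough sanity check only.
    """
    data = json.loads(osd_tree_json)
    ids = [n["id"] for n in data.get("nodes", []) if n.get("type") == "osd"]
    return [osd_id for osd_id, count in Counter(ids).items() if count > 1]

# Abbreviated sample with a deliberately duplicated osd.0:
sample = ('{"nodes": ['
          '{"id": -1, "type": "root", "name": "default"},'
          '{"id": 0, "type": "osd", "name": "osd.0"},'
          '{"id": 0, "type": "osd", "name": "osd.0"},'
          '{"id": 1, "type": "osd", "name": "osd.1"}]}')
print(duplicate_osd_ids(sample))  # -> [0]
```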
Step 5: Cluster Expansion

  • Consider adding more OSDs to the cluster to increase storage capacity.
  • Evaluate changing OSD disks to larger ones, if feasible.
  • Adjust the replication factor to optimize disk space, keeping in mind the trade-off with data redundancy.
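The trade-off above can be made concrete with a back-of-the-envelope capacity estimate: usable space in a replicated pool is roughly the raw capacity divided by the replication factor, capped at the near-full ratio (0.85 is the Ceph default for mon_osd_nearfull_ratio). A sketch:

```python
def usable_capacity_tb(raw_tb, replication_size, nearfull_ratio=0.85):
    """Rough usable capacity of a replicated pool: raw space divided by the
    replication factor, capped at the near-full warning ratio (Ceph's
    default mon_osd_nearfull_ratio is 0.85)."""
    return raw_tb / replication_size * nearfull_ratio

# Example: 12 OSDs of 4 TB each, comparing replication size 3 vs 2.
raw = 12 * 4
print(usable_capacity_tb(raw, 3))  # -> 13.6 TB before near-full warnings
print(usable_capacity_tb(raw, 2))  # -> 20.4 TB, at the cost of one fewer replica
```

Dropping from size 3 to size 2 gains roughly 50% more usable space but leaves the data able to survive only a single failure.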
Step 6: Additional Recommendations

  • Run ceph df regularly to observe storage space growth over time.
  • Create another Ceph pool.
  • Add this pool to OpenStack as storage.
  • Migrate running VMs and the needed templates to the new pool.
  • Delete the old pool.
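Creating the new pool requires choosing a placement-group count. A common rule of thumb (recent Ceph releases can also manage this automatically via the pg_autoscaler) is to target roughly 100 PGs per OSD, divided by the replication factor and rounded up to a power of two. A sketch:

```python
def suggested_pg_num(num_osds, replication_size, target_pgs_per_osd=100):
    """Rule-of-thumb PG count for a new pool: ~100 PGs per OSD divided by
    the replication factor, rounded up to the next power of two."""
    raw = num_osds * target_pgs_per_osd / replication_size
    power = 1
    while power < raw:
        power *= 2
    return power

# Example: 12 OSDs with 3-way replication -> 400 raw, rounded up to 512.
print(suggested_pg_num(12, 3))  # -> 512
```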
Solution:

The immediate solution involves manually clearing space by deleting unnecessary files. However, for a more sustainable solution, understanding the cluster and pool configurations is crucial. Debugging information, such as outputs from relevant commands, helps identify the root cause. The long-term approach may involve adjusting replication factors, adding OSDs, or changing disk configurations to meet storage requirements. Regular monitoring and maintenance are recommended to prevent similar issues in the future.

Conclusion:

This case study provides a structured approach to addressing the reported Ceph storage issue, combining immediate remedies with long-term strategies for better storage management.