Problem:

The client reported a critical issue with Solr 6.6.5: Solr cores became corrupt when new data was pushed through the Data Import Handler (DIH). The corruption caused significant data loss and loss of functionality. Existing indexed data became inaccessible, and searches returned zero results even though Solr still held content.

Process:

Step 1 – Initial Investigation and Troubleshooting:

The expert team began the preliminary investigation by requesting information from the client, then replicated the client’s production environment. This involved setting up virtual machines, containers, and virtual networking infrastructure based on the provided configuration, log, and data files. The environment and configurations were replicated successfully.

Step 2 – Detailed Analysis and Hypotheses:

The team recreated the client’s production cluster setup, consisting of one master node and five slave nodes, and analyzed the log files to identify patterns and potential indicators of the issue. The working hypotheses were that the corruption might be caused by network disruptions, by disk or other resource limitations, or by synchronization issues between the ZooKeeper service and the Apache Solr configuration.
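As an illustration of this kind of log triage, the following Python sketch counts the lines that mention a few failure indicators. It is not the team's actual tooling: the indicator list, the function name count_indicators, and the sample lines are assumptions made for illustration, and the sample lines are synthetic stand-ins rather than real Solr or ZooKeeper output.

```python
import re
from collections import Counter
from typing import Iterable

# Hypothetical indicator list; the exception class names are real Java/Lucene
# classes, but which ones mattered in the client's logs is not stated here.
PATTERNS = {
    "corrupt_index": re.compile(r"CorruptIndexException"),
    "out_of_memory": re.compile(r"OutOfMemoryError"),
    "io_error": re.compile(r"IOException"),
    "zk_session": re.compile(r"session expired", re.IGNORECASE),
}


def count_indicators(lines: Iterable[str]) -> Counter:
    """Count how many lines match each indicator pattern."""
    counts: Counter = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # Synthetic stand-in lines, not real log output.
    sample = [
        "placeholder line mentioning CorruptIndexException",
        "placeholder line mentioning OutOfMemoryError",
        "placeholder line with nothing of interest",
        "another placeholder mentioning CorruptIndexException",
    ]
    for name, hits in count_indicators(sample).most_common():
        print(name, hits)
```

In practice the same function could be fed the lines of each file under the master and slave log directories, so that the counts can be compared per node.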

Step 3 – Client Communication and Further Data Collection:

Maintained regular communication with the client to gather more details:

  1. Requested information on network topology and VPN/overlay technologies.
  2. Requested command execution output and the configurations used to start the services (Start Solr, Start ZooKeeper).
  3. The client provided the following information:
    • Log files (Solr logs, ZooKeeper logs);
    • Network details and information on DIH configurations;
    • Startup scripts (JEE/SolrProduct/scripts/SolrDomain_SolrSlaveServer/SolrSlaveServer/startSolrSlaveServer.sh);
    • Logs structure:
      • JEE/SolrProduct/logs/SolrDomain_SolrMasterServer/SolrMasterServer
      • JEE/SolrProduct/logs/SolrDomain_SolrSlaveServer/SolrSlaveServer

Solution:

To improve the availability of the slave servers, it was recommended to reduce the frequency of replication. This can be achieved by increasing the pollInterval setting, which at the time of the problem made each slave poll the master every 60 seconds.
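Solr's replication handler expects pollInterval in HH:mm:ss form. The Python sketch below formats a duration that way and shows how the number of polls per day falls as the interval grows. The 300-second value is a hypothetical candidate used only for comparison, not a figure recommended to the client, and the function names are illustrative.

```python
def seconds_to_poll_interval(total_seconds: int) -> str:
    """Format a duration in seconds as HH:mm:ss, the form pollInterval uses."""
    if total_seconds <= 0:
        raise ValueError("poll interval must be a positive number of seconds")
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def polls_per_day(total_seconds: int) -> int:
    """Number of whole polling cycles that fit into 24 hours."""
    return 86400 // total_seconds


if __name__ == "__main__":
    # 60 s is the interval reported above; 300 s is a hypothetical alternative.
    for interval in (60, 300):
        print(seconds_to_poll_interval(interval), polls_per_day(interval))
```

A longer interval means fewer replication cycles competing with search traffic, at the cost of slaves serving slightly older data between polls.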

Based on the log analysis, it was suggested that the issue may be caused by network disruptions or by disk and resource limitations on the slave server during peak replication times. Each replication event can temporarily double the amount of index data the slave holds, because the slave keeps serving searches from the older version while the new data is being transferred. Replicating less often reduces how frequently the slave hits this peak, which can relieve these bottlenecks and improve overall system stability.
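The doubling effect can be turned into a rough capacity check. The Python sketch below assumes the worst case, in which a slave briefly holds one complete extra copy of the index during replication. The sizes are hypothetical and the function names are illustrative; actual usage depends on how many index files change between replications.

```python
GIB = 1024 ** 3


def peak_index_bytes(index_bytes: int, copies: int = 2) -> int:
    """Worst-case index footprint while old and new versions coexist."""
    return index_bytes * copies


def has_replication_headroom(index_bytes: int, free_bytes: int, copies: int = 2) -> bool:
    """True if free space can hold the extra copies beyond the current index."""
    extra_needed = index_bytes * (copies - 1)
    return free_bytes >= extra_needed


if __name__ == "__main__":
    index_size = 40 * GIB  # hypothetical index size on a slave
    print(peak_index_bytes(index_size) // GIB)                 # peak footprint in GiB
    print(has_replication_headroom(index_size, 30 * GIB))      # too little free space
    print(has_replication_headroom(index_size, 50 * GIB))      # enough free space
```

A slave that fails this check during peak replication is a candidate for the disk and resource limitations described above.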

  1. Increase Memory: The experts recommended increasing the memory for all Solr nodes to handle peak replication times better.
  2. Network and Resource Optimization: The team identified potential network issues and suggested improvements to the network topology and resource allocation to prevent the issue from recurring.

Conclusion:

The client faced critical Solr core corruption when using the Data Import Handler, which affected search functionality and resulted in data loss. Our team successfully replicated the client’s environment and identified potential causes, including network issues and resource limitations. We provided detailed recommendations, such as adjusting replication settings, increasing memory on the Solr nodes, and enhancing system monitoring, to mitigate the problem.