SANGFOR aSAN Actual Test Reveals: How Hard Disk Soft Isolation Technology Surpasses VMware and Solves the Sub-Health Problem of Storage

In the process of digital transformation, enterprises have increasingly high requirements for the efficiency and stability of storage systems. However, the hard disk sub-health problem faced by distributed storage in complex hardware environments has become an "invisible killer" affecting business continuity and stability.

The aSAN hard disk soft isolation technology launched by SANGFOR has effectively solved this problem through an innovative soft isolation framework, bringing a revolutionary breakthrough to the stability of storage systems.

Hard Disk Sub-Health: The "Invisible Killer" of Business Stability

Distributed storage systems usually adopt a strong consistency algorithm for multi-replica writes and return to the application only after all replicas have been written. When a component such as a hard disk or host enters a sub-health state (e.g., disk IO response time rises from 10ms to more than 100ms), every write waits on the slow replica. The result is write IO stalls that seriously degrade business performance and can even trigger large-scale business interruptions. The problem is more prominent in the context of independent innovation in information technology (IT), where the hardware failure rate has increased.
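The effect of a write-all-replicas rule can be shown with a few lines of Python. This is a minimal sketch with illustrative latency figures only; the function name and the sample values are assumptions, not aSAN internals.

```python
def write_latency_ms(replica_latencies_ms):
    """Latency the application sees for one strongly consistent write.

    The write is acknowledged only after every replica has persisted it,
    so the slowest replica determines the result.
    """
    if not replica_latencies_ms:
        raise ValueError("at least one replica is required")
    return max(replica_latencies_ms)


# Three healthy replicas at roughly 10 ms each.
healthy = write_latency_ms([10, 9, 11])

# The same volume after one disk becomes sub-healthy (over 100 ms per IO).
degraded = write_latency_ms([10, 120, 11])

print(healthy, degraded)  # prints: 11 120
```

A single sub-healthy disk is enough to multiply the write latency of every volume that keeps a replica on it, which is why one slow component can affect many workloads at once.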

Currently, traditional solutions convert sub-healthy hard disks to a faulty state through out-of-band detection. Although this can maintain business continuity, it has two major defects: first, the sub-health state may return to normal; second, treating sub-health as a fault will accelerate hardware wear and increase maintenance costs.

To address these problems, the industry has proposed an improved approach: a business-linked soft isolation framework. The framework periodically monitors hard disk indicators such as latency and IOPS. When an indicator crosses a specific threshold, the disk is classified as stuck or slow, and a series of disposal actions is initiated.

Although this approach is an improvement, it still has shortcomings:

  • Poor timeliness: Replica consistency detection requires listing all shards on the hard disk. At least several minutes elapse between the disk becoming stuck and isolation completing, by which time business continuity may already be seriously affected.
  • Wide impact range: If even a few shards on a stuck or slow disk are inconsistent, the entire disk cannot be isolated, and the business interruption continues.

SANGFOR aSAN Hard Disk Soft Isolation: Reshaping a New Paradigm for Storage Fault Response

SANGFOR's aSAN hard disk soft isolation solution has powerful functions such as business-linked fault perception, heuristic fault diagnosis, and precise silencing of failed components, effectively avoiding the impact of single-point problems on business continuity.

Soft Isolation Framework Architecture: A Two-Pronged Approach to Ensure Storage Stability

The aSAN soft isolation framework takes the physical virtual storage volume as the management unit and consists of two parts:

  • Storage client data plane: Through the sub-health perception technology of data replicas, it ensures the availability of data replicas, temporarily isolates sub-healthy replicas within seconds, and quickly reports fault information to ensure that the business is not affected.
  • Soft isolation framework control plane: With the Fault Disposal Center (DFC) as the core, it collects fault information reported by the Distributed Fault Node (DFN) plug-ins. It comprehensively analyzes the reported data for accurate diagnosis and avoids false alarms. For faults that can be recovered in a short time, the sub-healthy replicas are re-enabled after recovery to prevent the data from running with insufficient replicas for a long time; for faults that cannot be recovered for a long time, the faulty data replicas are completely isolated and reconstructed to ensure data reliability.
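The control plane's choice between re-enabling a replica and rebuilding it can be sketched as a small state record per replica. The class and method names below are hypothetical, not the DFC's actual interface. The thresholds (a fault persisting for 5 minutes, or 3 self-recovered faults within 1 hour) are the aSAN reconstruction conditions quoted in this article's comparison with vSAN.

```python
from dataclasses import dataclass, field
from typing import List, Optional

PERSIST_LIMIT_S = 5 * 60    # a fault lasting this long is treated as long-term
RECUR_LIMIT = 3             # this many self-recovered faults ...
RECUR_WINDOW_S = 60 * 60    # ... within this window also trigger a rebuild


@dataclass
class ReplicaFaultHistory:
    """Hypothetical per-replica record kept by the control plane."""
    fault_started_at: Optional[float] = None          # None while healthy
    recoveries: List[float] = field(default_factory=list)

    def on_fault(self, now: float) -> None:
        # Remember only the start of the current fault episode.
        if self.fault_started_at is None:
            self.fault_started_at = now

    def on_recover(self, now: float) -> None:
        # Close the current episode and count it as one self-recovery.
        if self.fault_started_at is not None:
            self.fault_started_at = None
            self.recoveries.append(now)

    def decide(self, now: float) -> str:
        """Return 'enabled', 'isolated' (temporary), or 'rebuild'."""
        recent = [t for t in self.recoveries if now - t <= RECUR_WINDOW_S]
        if len(recent) >= RECUR_LIMIT:
            return "rebuild"            # flapping replica: stop trusting it
        if self.fault_started_at is None:
            return "enabled"            # short fault that has recovered
        if now - self.fault_started_at >= PERSIST_LIMIT_S:
            return "rebuild"            # long-term fault
        return "isolated"               # still within the grace period


history = ReplicaFaultHistory()
history.on_fault(now=0)
print(history.decide(now=30))     # isolated
history.on_recover(now=40)
print(history.decide(now=41))     # enabled
```

Keeping a grace period before rebuilding is what lets a briefly stalled disk return to service without triggering a full data reconstruction.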

(Architecture diagram. Components shown: the host where the physical volume master control is located, running the Fault Disposal Center (DFC) with fault alarm, fault event persistence, MongoDB database, fault data reconstruction, and fault status display and configuration (DTS UI); Host 1, Host 2 through Host n, each running a Distributed Fault Node (DFN) with fault plug-ins that report events and status and receive diagnosis results; and the storage client data plane processes, which report IO stuck/slow events and obtain and set status for Physical Volume 1.)

SANGFOR aSAN Hard Disk Soft Isolation Framework Architecture

New Breakthrough in Application-Layer Software Isolation: Abandoning the Traditional Operation of Hardware Disk Removal

Compared with the industry's mainstream hardware disk removal and business-linked soft isolation solutions, SANGFOR aSAN's soft isolation framework fully adopts a pure software isolation mechanism, abandons the hardware disk removal operation, avoids compatibility issues of hardware from different brands, and improves the versatility and stability of the solution.

At the same time, DFN provides a fault plug-in interface, which integrates the data plane client plug-in and the stuck/slow disk detection plug-in to expand fault handling capabilities and achieve precise disposal.

In short, the SANGFOR aSAN hard disk soft isolation solution is innovative and practical in both its architecture design and its application layer, offering a new approach to fault handling in distributed storage systems. So what is the actual effect? Two sets of comparative data show the results.

Actual Tests Prove: SANGFOR aSAN Performance Leads in All Aspects

Comparative Test with VMware Stuck/Slow Disks

We compared aSAN with VMware vSAN through rigorous testing and practical application verification across five aspects: disk stuck IO detection, disk slow IO detection, RAID card slow fault detection, disk stuck fault reconstruction, and disk stuck fault business IO.

Item: IO Detection - Disk Stuck
  • aSAN (SANGFOR): If a disk IO is stuck for more than 500ms, the disk is classified as stuck; the disk is not kicked out.
  • vSAN (VMware): For HDDs, if the disk does not respond to the abort command within 120 seconds (the default timeout period), vSAN sets the disk or disk group to the offline state to prevent it from affecting the entire vSAN cluster.
  • Scheme comparison: SANGFOR aSAN detects faster. Combined with the subsequent reconstruction scheme, it effectively avoids misjudgment caused by temporary faults and achieves faster and more accurate detection.

Item: IO Detection - Disk Slow
  • aSAN (SANGFOR): For HDDs, if the IO latency is greater than 70ms for 1 minute within 5 minutes, the disk is classified as slow; again, the disk is not kicked out.
  • vSAN (VMware): A drive is judged degraded when, within approximately six hours, the average write IO round-trip latency in four or more randomly distributed delay intervals exceeds the drive's predetermined delay threshold. The write IO latency threshold for hard disk drives (HDD) is 200ms, and the read IO latency threshold for flash memory devices (SSD) is 50ms.
  • Scheme comparison: SANGFOR aSAN determines slow disks sooner, with relatively simple and direct conditions; the vSAN standard is relatively complex and spans a longer time.

Item: RAID Card Slow Fault Detection
  • aSAN (SANGFOR): The storage layer does not separately detect stuck IO for RAID hardware; at the system software layer, a stuck RAID card is treated the same as a stuck disk.
  • vSAN (VMware): When real-time IO transmission between the RAID card and the host is unresponsive for 20 to 30 seconds, vSAN immediately marks the device as Degraded.
  • Scheme comparison: SANGFOR aSAN does not rely on monitoring and disposal at the hardware layer, which reduces its dependence on specific hardware monitoring and gives it wider adaptability.

Item: Disk Stuck Fault Reconstruction
  • aSAN (SANGFOR): Reconstruction is initiated only after the disk fault has persisted for 5 minutes, or after the fault has recurred and automatically recovered 3 times within 1 hour.
  • vSAN (VMware): After the disk is degraded, reconstruction is initiated immediately if the capacity meets the conditions.
  • Scheme comparison: SANGFOR aSAN's continuous fault detection mechanism significantly reduces the chance of misjudgment and unnecessary reconstruction caused by temporary faults, improving system stability.

Item: Disk Stuck Fault Business IO
  • aSAN (SANGFOR): When an IO times out for 8 seconds or remains slow, the faulty replica is degraded and business IO is no longer sent to it. However, for a single data shard, data can still be sent to the stuck replica in the case of multi-point faults, to maximize business continuity.
  • vSAN (VMware): After the disk status is marked as degraded, all IO is suspended and the accessibility of each related object is recalculated. The entire inspection process takes about 5-7 seconds; if the FTT of a VM is 0, the original disk or backup plug-in must be reinserted for vSAN to recover the VM.
  • Scheme comparison: SANGFOR aSAN handles faults at a finer, shard-level granularity in the case of multi-point faults, with more accurate disposal.

Comparative Test with VMware Stuck/Slow Disks

The comparison with VMware vSAN shows that, in handling stuck and slow disks, the SANGFOR aSAN hard disk soft isolation solution performs better and has a more comprehensive monitoring mechanism.
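The aSAN detection thresholds quoted in the comparison (500ms for a stuck IO; more than 70ms for 1 minute within 5 minutes for a slow HDD) can be expressed as a sliding-window check. The sketch below is an illustration under stated assumptions, not the product's implementation: it assumes one latency sample per second and reads "1 minute within 5 minutes" as 60 slow samples, not necessarily consecutive, inside a 300-sample window. The class name is hypothetical.

```python
from collections import deque

STUCK_MS = 500          # a single IO slower than this marks the disk as stuck
SLOW_MS = 70            # HDD latency threshold for a slow sample
SLOW_SECONDS = 60       # cumulative slow time that marks the disk as slow
WINDOW_SECONDS = 300    # length of the sliding observation window


class DiskHealthDetector:
    """Classifies a disk from one latency sample per second (an assumption)."""

    def __init__(self):
        # Each entry is True when that second's sample exceeded SLOW_MS.
        # maxlen makes the deque drop samples older than the window.
        self._window = deque(maxlen=WINDOW_SECONDS)

    def observe(self, latency_ms):
        """Record one sample and return 'stuck', 'slow', or 'healthy'."""
        self._window.append(latency_ms > SLOW_MS)
        if latency_ms > STUCK_MS:
            return "stuck"
        if sum(self._window) >= SLOW_SECONDS:
            return "slow"
        return "healthy"


detector = DiskHealthDetector()
states = [detector.observe(80) for _ in range(60)]
print(states[58], states[59])  # prints: healthy slow
```

In both cases the disk stays in the cluster; the classification only feeds the later isolation and reconstruction decisions.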

Comparison with a Competitor's Soft Isolation Solution

Scheme detail: Disposal Scheme
  • SANGFOR: Adopts a multi-level isolation strategy based on intelligent IO analysis to achieve flexible and precise hierarchical isolation, minimizing the impact on the overall storage system.
  • Competitor (a certain manufacturer): Uses hard disk-level isolation; after iostat detects an abnormality, the entire hard disk is isolated.
  • Scheme comparison: SANGFOR has a finer isolation granularity and more accurate disposal. The competitor's out-of-band monitoring can only achieve hard disk-level isolation, which involves a small overall workload but gives poor results and may affect the use of some normal data.

Scheme detail: Detection Point
  • SANGFOR: Employs embedded monitoring to track each IO in real time, yielding rich and detailed indicator data.
  • Competitor (a certain manufacturer): Relies on out-of-band monitoring and focuses on overall performance indicators at the hard disk level.
  • Scheme comparison: SANGFOR's embedded monitoring has a finer granularity, more comprehensive indicator dimensions, and a lower misjudgment risk. The competitor's out-of-band monitoring is simple to operate and involves a small workload, but its granularity is coarse, its indicator dimensions are limited, and it struggles to judge abnormal situations accurately.

Overall Scheme Effect Comparison with a Competitor's Soft Isolation Solution

Through the comparison with a competitor's overall soft isolation scheme, it is found that:

  • In terms of the disposal scheme, the SANGFOR aSAN hard disk soft isolation solution adopts a multi-level isolation strategy based on intelligent IO analysis, with finer isolation granularity and more accurate disposal. The competitor can only perform hard disk-level isolation.
  • In terms of detection points, the SANGFOR aSAN hard disk soft isolation solution uses embedded monitoring to track each IO of the storage client, with finer granularity, more indicator dimensions, and a lower misjudgment risk. The competitor uses out-of-band monitoring of hard disks through iostat, which involves a small workload but gives poor results.
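The difference in isolation granularity can be illustrated with a toy model. The data layout and function names below are hypothetical, chosen only to contrast the two rules described in this article: a disk-level scheme must refuse isolation if any shard would lose its last consistent replica, while a shard-level scheme decides per shard and falls back to the faulty replica only when no other copy is usable.

```python
# shard id -> {disk name: True if that replica is consistent (in sync)}
shards = {
    "shard-1": {"disk-a": True, "disk-b": True},
    "shard-2": {"disk-a": True, "disk-b": False},  # disk-b copy is stale
}


def can_isolate_disk(shards, bad_disk):
    """Disk-level rule: every shard on the disk needs a good copy elsewhere."""
    return all(
        any(ok for disk, ok in replicas.items() if disk != bad_disk)
        for replicas in shards.values()
        if bad_disk in replicas
    )


def write_targets(replicas, bad_disk):
    """Shard-level rule: skip the faulty replica unless it is the last good copy."""
    others = [disk for disk, ok in replicas.items() if ok and disk != bad_disk]
    if others:
        return others
    # Multi-point fault: keep serving from the stuck replica if it is consistent.
    return [bad_disk] if replicas.get(bad_disk) else []


# disk-a is stuck. Disk-level isolation is blocked by shard-2 ...
print(can_isolate_disk(shards, "disk-a"))          # prints: False
# ... while shard-level handling still moves shard-1 off the stuck disk.
print(write_targets(shards["shard-1"], "disk-a"))  # prints: ['disk-b']
print(write_targets(shards["shard-2"], "disk-a"))  # prints: ['disk-a']
```

With per-shard decisions, one inconsistent shard no longer forces every other shard on the disk to keep waiting on the stuck device.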

In conclusion, relying on accurate and efficient detection, intelligent and flexible strategies, and full-scenario response capabilities, SANGFOR aSAN hard disk soft isolation solution provides a strong guarantee for the stable operation of businesses and is a reliable choice for enterprise storage systems.

Practical Application: SANGFOR aSAN Empowers MES to Achieve a Leap in Stability and Reduce O&M Costs

"Take our company's MES (Manufacturing Execution System) as an example. In the past, the MES system response speed dropped sharply frequently due to hard disk sub-health problems. All links such as production scheduling, material management, and quality monitoring were affected, resulting in an economic loss of hundreds of thousands of yuan per month. In addition, O&M was time-consuming and labor-intensive, which seriously affected the normal production and operation of our company." — An executive of a manufacturing enterprise.

After introducing SANGFOR aSAN hard disk soft isolation solution, the stability of the MES system has been significantly improved:

  1. Rapid response: Low-frequency stalls on HDD hard disks are resolved in 15 seconds, quickly restoring normal business operations.
  2. Continuous stability: It has dealt with sub-health problems multiple times within half a year, and the business interruption time has been controlled within an extremely short range each time.
  3. Efficiency improvement: The MES system operates stably and efficiently, making production scheduling more timely and accurate, greatly improving the efficiency of material management, enabling real-time and effective quality monitoring, and significantly reducing the product defect rate.

Of course, the IT O&M team has also been freed from the tedious hardware fault troubleshooting, and can focus more on optimizing the functions of the internal IT system and improving user experience, injecting new vitality into the enterprise's digital transformation and production efficiency improvement.

The innovation of the SANGFOR aSAN hard disk soft isolation solution stems from years of accumulated self-developed storage technology and continuous innovation by the R&D team. In the future, SANGFOR will continue to adhere to the concept of technological leadership, provide users with more high-quality and reliable storage solutions, and help enterprises move forward steadily in the wave of digitalization.

"Cloud Talk Technology" is a cloud technology content column created by SANGFOR. It will regularly push content related to cloud computing, such as technical analysis and scenario practices, to deeply explain SANGFOR's innovation capabilities, technical trends, scenario applications, and forward-looking analysis in the field of cloud computing.