
Introduction

InfiniBox Active-Active replication provides zero RPO and zero RTO, enabling mission-critical business services to keep operating even through a complete site failure:

  • A symmetric synchronous replication solution that allows applications to be geographically clustered
  • Fully integrated into InfiniBox, allowing simple management of applications spread across data centers
  • Non-disruptive data migration between InfiniBox systems at the storage level, without interrupting customer-facing services

InfiniBox Active-Active Solution

From the ground up, three key concepts guide the architecture of InfiniBox Active-Active replication:

  • Resilience and high-availability
  • Minimal performance impact, and 
  • Ease-of-use 

Therefore, the solution is built on the following basic components, which describe the behavior of datasets on the two InfiniBox systems that participate in Active-Active replication:

Peer-to-peer replication 

Both sides of the Active-Active replication are equal: they are peers in every sense, with no roles and no hierarchy between the datasets.

As long as the datasets are synchronized, both systems serve I/O and behave exactly the same. Both datasets are readable and writable by the mapped hosts.

Low latency sync replication 

Active-Active replication uses the same replication mechanism as Sync replication, so it adds very little overhead beyond the latency of the link. Just like Sync replication, Active-Active replication uses an IP connection and requires a maximum round-trip latency of 5 ms on the replication link.

Specifically for Active-Active, any I/O that updates a dataset is transferred once over the replication link, which means the latency is the same regardless of which system serves the I/O.
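As a rough illustration of this point (the numbers below are assumptions for the example, not measured values), the write latency a host observes is approximately the local write time plus one round trip of the replication link, whichever peer receives the write:

```python
# Illustration only: hypothetical latency figures, not InfiniBox measurements.
local_write_ms = 0.3   # assumed back-end write latency on either system
link_rtt_ms = 2.0      # assumed replication link round trip (must be <= 5 ms)

# A write received by either peer crosses the replication link exactly once,
# so the host-observed latency is the same on both sides.
latency_via_system_a = local_write_ms + link_rtt_ms
latency_via_system_b = local_write_ms + link_rtt_ms
print(latency_via_system_a == latency_via_system_b)  # True
```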

Automatic fail-over and fail-back

Once the replica is created, everything is automatic: the system detects failures, takes the appropriate measures to make sure the dataset remains available, and when the problem is resolved, automatically performs fail-back (if necessary).

When properly configured, no human intervention is required at any point.

Highest level of protection and HA

Many failure scenarios are possible with Active-Active replication, and the hardest to mitigate is a split brain. A split brain is a situation in which two systems serve I/O to the same dataset without keeping the data in sync between them (for clarity: InfiniBox never allows this to happen). It might occur, for example, if the replication link fails and there is no connectivity between the systems, so information cannot be synchronized. InfiniBox provides several ways to prepare for and mitigate such failures.

To avoid a split brain, InfiniBox provides two levels of protection, the witness and the preferred system, which together provide complete protection:

  • The witness is an arbitrator (an outside observer) connected to both systems, acting as another path of communication between them. Using the witness, replicas can handle failures in the optimal way while keeping the data accessible to applications.
  • A preferred system provides a fallback for cases where the witness is not available to the systems. When both systems are up and both can provide access to a dataset, the preferred system definition lets that system keep the dataset accessible while the non-preferred system prevents access to the data, eliminating the danger of a split brain.

I/O flow with Active-Active replication

With InfiniBox Active-Active replication the data is available on both systems (peers) with zero RPO and zero RTO. Moreover, Active-Active replication allows hosts and applications to read and write data to both systems at the same time, and with optimal performance.

InfiniBox uses ALUA (Asymmetric Logical Unit Access), a standard SCSI mechanism that allows a storage array to communicate path priorities and availability to the host multipathing software. The host sees both datasets (one on each system) as a single device with multiple paths, and the InfiniBox systems control access to the datasets using ALUA hints to the host.

When creating an Active-Active replica, the systems assign the same serial ID to both datasets in the replica. As a result, hosts that are mapped to these datasets see them as the same device with multiple paths. Hosts are required to use multipathing software that supports ALUA, and they can deliver I/O requests through any path, depending on the hints that InfiniBox provides. The systems (not hosts or applications) maintain I/O ordering, protect the data, and ensure its consistency at all times.

Read operations are served by the system that receives the I/O request. Write operations are replicated from the system that receives the I/O request to the peer system before the acknowledgement is sent to the host (similar to Sync replication). It is important to note that the I/O latency is the same regardless of which system receives the write request, since InfiniBox makes sure that the data is transferred only once over the replication link.
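The following minimal sketch models the write path described above; the class and function names are hypothetical and only illustrate the ordering (persist locally, replicate once to the peer, then acknowledge), not the actual InfiniBox implementation:

```python
class Peer:
    """Toy model of one InfiniBox system participating in an Active-Active replica."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def persist(self, volume_id, offset, data):
        self.blocks[(volume_id, offset)] = data

def handle_host_write(receiving, remote, volume_id, offset, data):
    """The peer that receives the write persists it, replicates it once over the
    replication link to the other peer, and only then acknowledges the host."""
    receiving.persist(volume_id, offset, data)
    remote.persist(volume_id, offset, data)   # single crossing of the replication link
    return "ACK"                              # sent only after both peers hold the data

site_a, site_b = Peer("A"), Peer("B")
handle_host_write(site_a, site_b, volume_id="vol1", offset=0, data=b"block")
```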

How InfiniBox handles failures

If an InfiniBox system becomes unavailable, e.g. due to a power outage of the entire site, the peer system will provide access to all the datasets.

If the replication link between the systems fails, only one system will continue to serve I/O until the failure is resolved and the replica is back in a synchronized state. The system that remains active may differ per replica, which means some datasets will remain active on site A while others remain active on site B, depending on the replica configuration.

InfiniBox has two mechanisms for handling failures of an Active-Active replica: the witness and the preferred system.

InfiniBox Witness

The witness is an arbitrator entity residing at a third site (separate from the two InfiniBox systems involved in Active-Active replication) that acts as a quorum in case of failure. The witness is lightweight, stateless software deployed as a VM.

When a failure occurs, the decision as to which system remains active is based on (A) the witness's connectivity to the systems and (B) the preferred system definition of each replica.

As long as both systems can communicate with the witness, the witness makes the take-over decisions. The decision is made per replica, based on the following logic (a simplified sketch follows the list):

  • If the witness can communicate with the preferred system of the replica:
    The paths to the dataset on the preferred system will stay active, and the paths to the remote dataset will become stand-by (preventing hosts from writing to the non-preferred system).
  • If the witness cannot communicate with the preferred system of the replica:
    The paths to the dataset on the preferred system will become stand-by and the paths to the remote dataset will stay active (preventing hosts from writing to the preferred system).
  • In any case, only one system will keep the paths to the dataset active.
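The sketch below models this per-replica decision, assuming only the connectivity facts listed above; it describes the documented behavior and is not InfiniBox code:

```python
def witness_decision(witness_sees_preferred: bool) -> dict:
    """Per-replica take-over decision while the link is in witness mode:
    exactly one side keeps its paths active."""
    if witness_sees_preferred:
        return {"preferred_paths": "active", "non_preferred_paths": "standby"}
    return {"preferred_paths": "standby", "non_preferred_paths": "active"}

# Example: the witness lost contact with the preferred system of a replica,
# so the non-preferred side keeps serving I/O for that replica.
print(witness_decision(witness_sees_preferred=False))
```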

While the witness makes the decision, I/O to the datasets on both systems pauses for a few seconds.

If the witness is inaccessible, the InfiniBox systems automatically switch to follow the preferred system configuration, which is designed so that the systems' decision can never interfere with or contradict the witness decisions. A witness failure has no effect on the I/O path, and all replicas continue to replicate data regardless of the witness state, as long as the replication link is intact.

Preferred system 

Each replica has a preferred system definition, which the witness uses to make correct decisions. This definition also affects the behavior should the witness become unavailable.

If the witness is inaccessible, the InfiniBox systems automatically switch to follow the preferred system configuration (preferred system mode), which is designed so that the systems' decision can never interfere with or contradict the witness decisions.

If the witness is not available to the systems, the decision on which side stays active is made per replica based on the preferred system, using this logic:

  • The paths to the dataset on the preferred system will stay active and the paths to the remote dataset will become stand-by.

InfiniBox Active-Active Topology

An InfiniBox Active-Active solution contains two InfiniBox systems running version 5.0 or above, and a witness VM at a third site.

The systems are connected via a TCP/IP-based replication link, which uses the InfiniBox replication network spaces and can serve all types of replicas.

The witness is lightweight software deployed as a VM by the customer in a separate failure domain (a third site). This is very important for the redundancy of the solution: if the witness is installed at the same site as one of the InfiniBox systems, an entire site failure might cause data unavailability of the replicated volumes.

Managing the Replica

Creating an Active-Active replica 

The replica entity in InfiniBox pairs the replicated datasets, along with other settings essential for the replication. When the replica is created, the connection between the datasets on the two InfiniBox systems is established.

When an Active-Active replica is created, the remote InfiniBox system changes the serial ID of its dataset to match the local dataset, so the datasets on the two systems have the same serial ID. This allows hosts to identify the datasets as one device.

When an Active-Active replica is created between existing datasets, i.e. created using a staging area, the remote dataset cannot have any host mapping.

Creating a replica in InfiniBox is very simple and can be done using the GUI or CLI: all the user needs to select is the replication type, the local dataset (source), the remote system, and the remote dataset or pool.

There are two options for creating a replica in InfiniBox:

  1. Create new 
    The user selects a pool on the remote system.
    A remote dataset is automatically created on the remote system in the selected pool, and the replica is created between the source dataset and the new remote dataset.
  2. Select existing 
    The user selects a target dataset on the remote system.
    The replica is created between the source dataset and the existing remote dataset.

Once the replica is created, the source dataset starts replicating to the target dataset until they are synchronized. During this time the remote replica is in the lagging state (see below), and the paths to the remote dataset remain in the standby ALUA state.
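To make the two options concrete, here is an illustrative sketch that only builds the parameters a replica creation would need; the field and function names are hypothetical and do not represent the InfiniBox CLI, GUI, or API syntax:

```python
def build_replica_request(source_volume, remote_system, remote_pool=None,
                          existing_remote_volume=None):
    """Illustrative only: hypothetical field names, not the InfiniBox API."""
    request = {"replica_type": "ACTIVE_ACTIVE",
               "source": source_volume,
               "remote_system": remote_system}
    if existing_remote_volume is not None:
        # Option 2, "select existing": the remote dataset must not be mapped to hosts.
        request["remote_volume"] = existing_remote_volume
    else:
        # Option 1, "create new": a target volume is created in the chosen remote pool.
        request["remote_pool"] = remote_pool
    return request

print(build_replica_request("vol1", "ibox-site-b", remote_pool="pool-b"))
```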

Active-Active replica states

The Active-Active replica state presents the local sync state of the replica. The replica state appears in the GUI under the Replication workspace, or in the output of the CLI replica.query command.

These are the possible states:

Replica state (local system) | Serving I/O (local system) | Replica state (remote system) | Serving I/O (remote system) | Additional info
Synchronized | Yes | Synchronized | Yes |
Lagging | No | Sync in progress, Sync stalled, Initializing, or Initializing pending | Yes (unless it is fenced) | The local dataset is lagging behind the peer. Depending on the replication link state, the remote replica is synchronizing to the local replica.
Sync in progress | Yes | Lagging | No | The remote dataset is lagging behind the local data. The local replica is re-syncing the remote dataset.
Sync stalled | Yes | Lagging | No | The remote dataset is lagging behind the local data. The local replica cannot re-sync the remote dataset (link disconnect or configuration error).
Initializing or Initializing pending | Yes | Lagging | No | The remote dataset is lagging behind the local data. The local replica is initializing the remote dataset for a newly created replica.
Fenced | No | Lagging | No | The remote dataset is lagging behind the local data. The local dataset holds the updated data but is not serving I/O due to a system failure. This situation may occur when a link failure (which caused the remote system to be lagging) is followed by a system failure.
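As a compact way to read the table, the sketch below (illustrative names, not InfiniBox terminology or API) maps each local replica state to whether that side serves host I/O:

```python
from enum import Enum

class ReplicaState(Enum):
    SYNCHRONIZED = "synchronized"
    LAGGING = "lagging"
    SYNC_IN_PROGRESS = "sync in progress"
    SYNC_STALLED = "sync stalled"
    INITIALIZING = "initializing"   # covers "initializing pending" as well
    FENCED = "fenced"

# Whether the dataset on a given side serves host I/O in each state (per the table above).
SERVES_IO = {
    ReplicaState.SYNCHRONIZED: True,
    ReplicaState.SYNC_IN_PROGRESS: True,
    ReplicaState.SYNC_STALLED: True,
    ReplicaState.INITIALIZING: True,
    ReplicaState.LAGGING: False,
    ReplicaState.FENCED: False,
}
```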

Deleting an Active-Active replica

Deleting an Active-Active replica disconnects the relationship between the peer datasets, but leaves the datasets on both InfiniBox systems.

There are several requirements in order to delete an Active-Active replica:

  • When an Active-Active replica is deleted, the dataset on the remote system remains, but its serial ID is modified in order to avoid the presence of two unrelated volumes with the same serial ID.
    This means that the dataset on the remote system cannot be mapped to hosts: before deleting the replica, make sure the remote dataset is not mapped.
  • The serial ID remains as-is for the dataset on the system where the delete operation is executed. The serial ID of the peer dataset is modified.
  • An Active-Active replica can be deleted only when the replication link between the systems is up.
  • If the replica is in a synchronized state, it can be deleted from either system. Otherwise, the replica can only be deleted from the non-lagging system. A simplified sketch of these checks follows the list.
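The sketch below expresses the same checks as a hypothetical helper; it is not an InfiniBox command:

```python
def can_delete_replica(link_is_up, local_state, remote_dataset_is_mapped):
    """Return whether the delete can run from this system, per the rules above."""
    if not link_is_up:
        return False, "the replication link between the systems must be up"
    if remote_dataset_is_mapped:
        return False, "unmap the remote dataset first (its serial ID will change)"
    if local_state == "lagging":
        return False, "delete from the non-lagging (up-to-date) system instead"
    return True, "ok"

print(can_delete_replica(link_is_up=True, local_state="synchronized",
                         remote_dataset_is_mapped=False))
```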

When a replica is deleted, the user can choose to keep the staging area of the replica. The staging area is a snapshot containing the last consistent data that was replicated between the datasets, and it can be used when re-creating the replica later on to avoid a full initial sync. We highly recommend keeping the staging area for cases that call for re-creating the replica.

Host Topologies

InfiniBox Active-Active replication supports hosts connected to InfiniBox using FC (Fibre Channel) only.

InfiniBox Active-Active replication supports two major types of host topologies: uniform and non-uniform. There is no need to define which topology is deployed; the topology is determined by the way hosts are connected to the storage.

ALUA mechanism

InfiniBox Active-Active replication uses SCSI Asymmetric Logical Unit Access (ALUA), also known as SCSI Target Port Groups or Target Port Group Support. ALUA is an industry standard protocol for identifying available and optimized paths between a storage system and hosts.

ALUA allows the initiator to query the target about path attributes, such as active paths and standby paths. It also allows the target to communicate events back to the initiator. 
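The sketch below models how an ALUA-aware host chooses paths: optimized paths are preferred, non-optimized paths are used only when no optimized path is available, and standby paths are never used for I/O. The path names and helper are illustrative, not taken from any specific multipathing software:

```python
from enum import Enum

class AluaState(Enum):
    ACTIVE_OPTIMIZED = "active/optimized"
    ACTIVE_NON_OPTIMIZED = "active/non-optimized"
    STANDBY = "standby"

def usable_paths(paths):
    """Prefer optimized paths, fall back to non-optimized, never use standby."""
    for wanted in (AluaState.ACTIVE_OPTIMIZED, AluaState.ACTIVE_NON_OPTIMIZED):
        chosen = [name for name, state in paths.items() if state is wanted]
        if chosen:
            return chosen
    return []

# Example: a uniform topology where paths to the local system are optimized
# and paths to the remote system are reported as non-optimized.
paths = {"path-a1": AluaState.ACTIVE_OPTIMIZED, "path-a2": AluaState.ACTIVE_OPTIMIZED,
         "path-b1": AluaState.ACTIVE_NON_OPTIMIZED, "path-b2": AluaState.ACTIVE_NON_OPTIMIZED}
print(usable_paths(paths))  # ['path-a1', 'path-a2']
```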

Uniform host topology 

In a uniform topology the hosts are connected to both InfiniBox systems, and the volumes are mapped to the hosts on both systems. In this topology a host can perform I/O on both systems simultaneously.

In this topology, the application is fully protected, with zero recovery time. There is no need to manage VM or database instance affinity to a site, as hosts can immediately continue to send I/O requests to the paths connected to the correct InfiniBox system.

In the uniform topology, where hosts are connected to both InfiniBox systems, we highly recommend setting the host as optimized on one system and non-optimized on the other, especially when, for example, there is added latency or lower bandwidth between the host and the remote site. This allows the systems to perform better and maximize the cache and system resources on both systems.

InfiniBox Active-Active solution uses ALUA, which allows the storage array to send hints about path priorities to a host. This way, the host can distribute the I/O operations to optimized paths and avoid sending them to non-optimized paths. 

The administrator has the option of defining the host paths as optimized or non-optimized on each InfiniBox system, and applying this setting to all mapped volumes. By default, all hosts are defined as optimized, which means a symmetric uniform topology. If the host setting is changed on one of the systems, the topology becomes an asymmetric uniform topology.

In all cases, it is recommended to define the host on the remote system as non-optimized.

It is important to note that setting a host as non-optimized does not reduce the level of resilience; InfiniBox still guarantees zero RPO and zero RTO.

The optimized/non-optimized definition will only affect Active-Active volumes mapped to the host.

Non-uniform host topology 

In a non-uniform topology each host is connected to its local InfiniBox only. This architecture ensures zero-RPO and near-zero RTO.

Since the host has no access to the remote array, in the event of a complete failure of the local InfiniBox system the application will have to fail over to the remote site in order to access the data that is available on the remote system. In the event of a complete site failure, applications at the remote site can immediately access data on the remote system.

As an example of a non-uniform topology, consider a VMware vSphere cluster that spans the two sites. Each VM represents a workload that accesses data from its local InfiniBox system (local relative to the vSphere host where the VM is running). If an InfiniBox system fails, the VM can be migrated to a host at the other site and resume work instantly, since the datastore is synchronized across the InfiniBox systems.


Although hosts are not connected to the remote InfiniBox systems, connectivity between the InfiniBox systems and the witness must still exist.


Connectivity

InfiniBox Active-Active replication has the same network requirements as Sync replication. It requires a TCP/IP connection between the two InfiniBox systems, with a maximum round-trip time (RTT) latency of 5 ms between the systems.

To use Active-Active replication, there are a few additional setup activities that the administrator must perform:

  1. Deploy the witness on a VM in a separate failure domain.
    See Active-Active replication witness for more information.

  2. Create replication network space on each InfiniBox system.
    An existing replication network space that supports Sync replication can be used.

  3. Create a replication link between the InfiniBox systems or, if a Sync replication link already exists, add the witness definition to the link.

See InfiniBox Best Practices Guide for Setting Up the Replication Service for more information regarding network setup.

Failure scenarios

InfiniBox Active-Active replication is designed so that data will be available and protected at all times. InfiniBox systems detect failures automatically and initiate a fail-over / recovery, without any intervention from the user.

Note that application behavior during some failure scenarios depends on the host topology, for example uniform vs. non-uniform. However, the InfiniBox systems will always strive to allow host access to the data.

InfiniBox will automatically fail over access to datasets in the following cases:

  • A link disconnection between the sites 
  • An InfiniBox system failure or complete loss of access 
  • Failure of an entire site

During these failures the InfiniBox systems will maintain access to the data on one of the systems, depending on the failure and the replica definition.


If both systems are available but not synchronized (e.g. when the replication link disconnects), the dataset on the preferred system of the replica will remain active. The preferred system is defined when the replica is created, and can be either the local or the remote system.

There are other failures during which the replica remains active on both InfiniBox systems, e.g. a witness failure.

Witness failure

As long as both InfiniBox systems can communicate with the witness, the systems set the link resiliency mode to witness. If the witness fails or disconnects from either InfiniBox system, the systems change the link resiliency mode to preferred only.

This in itself does not imply any change to the replicas' behavior or the data flow: Active-Active datasets remain online on both systems. The systems continue to replicate I/O operations synchronously over the link regardless of the witness state and the resiliency mode of the link.

The resiliency mode of the link affects decisions that the InfiniBox systems make in case of further failures. The witness resiliency mode offers higher protection and resiliency.
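The mode transition can be summarized in a one-line rule; this is a sketch of the documented behavior, not InfiniBox code:

```python
def link_resiliency_mode(a_sees_witness: bool, b_sees_witness: bool) -> str:
    """The link stays in witness mode only while both systems can reach the witness;
    otherwise it drops to preferred only. Replication continues either way."""
    return "witness" if (a_sees_witness and b_sees_witness) else "preferred only"

print(link_resiliency_mode(True, False))   # -> 'preferred only'
```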

Link failure

Normally, both systems have access to the witness. When InfiniBox detects a replication link failure, it notifies the witness that there is a disconnection. Since I/O operations cannot be replicated to the peer system, both systems briefly pause I/O on the Active-Active volumes on this link. The InfiniBox systems wait for the witness to indicate which system needs to resume I/O for each Active-Active volume. The witness sends these instructions as soon as both systems report the link failure, which takes a few seconds.

If either system was unable to access the witness before the link failure, the systems switch to the preferred only resiliency mode. In this mode, the InfiniBox systems do not rely on the witness to choose which system remains active; instead, each replica remains active on its preferred system.

System or site failure

If a peer InfiniBox system becomes unresponsive, which means neither the local InfiniBox system nor the witness can communicate with the peer system, then Active-Active datasets remain active on the surviving InfiniBox system.

If the link resiliency mode before the failure was witness (i.e. both systems were able to communicate with the witness), then the Active-Active datasets will resume I/O as soon as the witness detects the failure. This takes several seconds for replicas where the failed system is non-preferred, and up to 15 seconds for replicas where the failed system is preferred.

This is why it is important that the witness resides at a third site, in a different failure domain from both systems, and that the replication link connectivity is separate from the connection to the witness.

If the link resiliency mode before the failure was preferred only (i.e. at least one of the systems was unable to communicate with the witness), then the surviving InfiniBox system will resume I/O to Active-Active datasets for which it is the preferred system. Active-Active datasets whose preferred system is the failed one will remain unavailable, in order to avoid split brain scenarios, and will require manual action to allow access to the datasets.

This scenario describes a multi-point failure, in which both the witness and an InfiniBox system become unavailable.

Failures impact on host I/O operations

Failure scenario: Replication link failure
  • InfiniBox behavior: Datasets are accessible on their preferred InfiniBox system. When the link recovers, datasets remain accessible on the preferred system while they re-sync to the peer system. Fail-over and recovery are automatic and transparent.
  • Impact on host I/O: Hosts connected in a uniform topology continue working using the paths to the preferred InfiniBox system. Hosts connected in a non-uniform topology continue to work only with those datasets whose preferred system is the one the host is connected to. Use host clustering software (e.g. VMware vCenter) to migrate applications to a host connected to the preferred InfiniBox; this may be automatic, depending on the application cluster.

Failure scenario: InfiniBox system failure
  • InfiniBox behavior: Datasets are accessible on the active system. Fail-over and recovery are automatic and transparent.

Failure scenario: Witness failure followed by a second failure of the replication link
  • InfiniBox behavior: Datasets remain accessible on the preferred system, or on the system holding the up-to-date data in the case of re-syncing replicas. Fail-over and recovery are automatic and transparent.

Failure scenario: Entire site failure
  • InfiniBox behavior: Datasets are accessible on the active system. Fail-over and recovery are automatic and transparent.
  • Impact on host I/O: Hosts at the surviving site connected in a uniform topology continue working using the paths to the preferred InfiniBox system. Hosts at the surviving site connected in a non-uniform topology continue to work only with those datasets whose preferred system is the one the host is connected to. Use host clustering software (e.g. VMware vCenter) to migrate applications to a host connected to the preferred InfiniBox; this may be automatic, depending on the application cluster.

Failure scenario: Witness failure or loss of access to the witness
  • InfiniBox behavior: No change to the replicas; the replication process continues undisturbed.
  • Impact on host I/O: Host I/O continues through all paths on both InfiniBox systems.

Failure recovery

InfiniBox Active-Active recovery is completely automatic; no storage administrator intervention is necessary to trigger a re-sync and recover replication.

If the InfiniBox systems become disconnected, replication internally falls back to async mode. Once the connectivity between the systems recovers, synchronization jobs start replicating the missing data to the lagging system. During this time, from the disconnection and throughout the re-sync, the Active-Active datasets on the synchronized system serve I/O operations, while the remote side remains in the lagging state until all data is synchronized between the datasets.

Once the datasets are nearly in sync, they smoothly transition back to synchronous replication mode. The host paths to the lagging side are automatically restored, allowing the hosts to perform I/O operations through both systems again.
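The recovery described above can be summarized as a simple progression; the state names below are illustrative, not InfiniBox terminology:

```python
# Illustrative progression of automatic recovery after a link failure.
RECOVERY_SEQUENCE = [
    "disconnected",     # link down: the synchronized side serves I/O, the peer is lagging
    "re-syncing",       # link restored: async jobs copy the missing data to the lagging side
    "nearly in sync",   # almost caught up: transition back to synchronous replication
    "synchronized",     # host paths to the recovered side return to active
]

for step in RECOVERY_SEQUENCE:
    print(step)
```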

Failures and dataset availability summary

Scenario | InfiniBox A | InfiniBox B | Replication link | Witness | Dataset availability (preferred system is A) | Dataset availability (preferred system is B)
Optimal | UP | UP | UP | UP | Available on both systems | Available on both systems
Systems and link are up, no witness | UP | UP | UP | DOWN | Available on both systems | Available on both systems
Systems are up, link is down | UP | UP | DOWN | UP | Available on InfiniBox A | Available on InfiniBox B
Systems are up, link and witness are down | UP | UP | DOWN | DOWN | Available on InfiniBox A | Available on InfiniBox B
System A up, system B down | UP | DOWN | DOWN | UP | Available on InfiniBox A | Available on InfiniBox A
System A up, system B and witness down | UP | DOWN | DOWN | DOWN | Available on InfiniBox A | Unavailable*
System A down, system B up | DOWN | UP | DOWN | UP | Available on InfiniBox B | Available on InfiniBox B
System A and witness down, system B up | DOWN | UP | DOWN | DOWN | Unavailable* | Available on InfiniBox B

When one of the systems is down, the link is also considered down, since the systems are unable to communicate.

* These scenarios represent multiple failures. If the witness is down, the InfiniBox systems' resiliency mode switches to preferred only mode, where only the preferred system can resume I/O, to avoid a split brain.


