
Replication Overview

InfiniBox replication is a complete feature set for data recovery, remote copy, and high availability of customer data between InfiniBox systems.

InfiniBox Async Replication is a snapshot-based solution that protects data by replicating it to a remote site without adding latency to host I/Os. The data is replicated asynchronously with a minimum RPO (Recovery Point Objective) of 4 seconds.

InfiniBox Sync Replication is a synchronous replication solution that protects data with zero RPO. Every write to the local system is written to a copy of the dataset on the remote system before the write returns to the host.

InfiniBox Active-Active Replication is a synchronous replication solution that protects data with zero RPO and zero RTO (Recovery Time Objective). The replicated datasets can be accessed by the host on both systems at the same time.

Async Replication Overview

InfiniBox Async Replication is a snapshot-based solution that allows users to protect their data by replicating it to a remote site without adding latency to host I/Os.

The async replication allows overcoming large geographic distances by:

  • Sending the I/O to the remote site after it has already been acknowledged to the host
  • Allowing the user to define the interval between the snapshots that are sent to the remote site
  • Supporting a minimum RPO (Recovery Point Objective) of 4 seconds, provided the link quality requirements between the sites are fulfilled

Async replication topology diagram

With InfiniBox async replication, when the host sends a write request to a dataset on the source system, the write is acknowledged to the host immediately and then replicated to the target system.

The following diagram shows the async replication data flow:

  1. The host sends a write I/O request to the source InfiniBox
  2. The source InfiniBox acknowledges the write I/O to the host
  3. The source InfiniBox replicates the data to the target InfiniBox
  4. The target InfiniBox acknowledges the replication to the source InfiniBox
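A minimal Python sketch of steps 1-4 above, illustrative only (not the InfiniBox implementation): the host write is acknowledged first, and replication to the target happens in the background.

```python
import queue
import threading

local_store: list = []
remote_store: list = []
replication_queue: "queue.Queue[bytes]" = queue.Queue()

def handle_host_write(data: bytes) -> str:
    local_store.append(data)        # 1. the host write lands on the source
    replication_queue.put(data)     # queued for background replication
    return "ACK"                    # 2. acknowledged before replication happens

def replication_worker() -> None:
    while True:
        data = replication_queue.get()  # 3. the source replicates to the target
        remote_store.append(data)       # 4. the target's ack is the call returning

threading.Thread(target=replication_worker, daemon=True).start()
assert handle_host_write(b"block-0") == "ACK"   # host is not blocked on replication
```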

Sync Replication Overview

The InfiniBox Sync replication allows the user to protect the data synchronously, sending the I/O to the remote site before acknowledging the host.

Synchronous replication has an impact on host write latency, since the acknowledgment is sent to the host only after the data has been written at both sites.

Read I/O is served locally from the source system, so no latency is added.

The InfiniBox synchronous replication solution depends on the quality of the link between the sites. If the link requirements are not fulfilled, InfiniBox moves the replica to an internal asynchronous mode until the link requirements are met.

Synchronous replication topology diagram

With InfiniBox synchronous replication, when the host sends a write request to a dataset on the source system, the write is replicated to a dataset on the target system prior to any acknowledgment to the host. Only after the target system has acknowledged the write to the source system does the source system acknowledge the write to the host.

The following diagram shows the synchronous replication data flow:

  1. The host sends a write I/O request to the source InfiniBox
  2. The source InfiniBox synchronously replicates the data to the target InfiniBox
  3. The target InfiniBox acknowledges the replication to the source InfiniBox
  4. The source InfiniBox acknowledges the write I/O to the host
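The same four steps as a minimal Python sketch (illustrative only): the host acknowledgment waits for the target's acknowledgment.

```python
local_store: list = []
remote_store: list = []

def replicate(data: bytes) -> str:
    remote_store.append(data)       # write applied on the target
    return "ACK"

def handle_host_write_sync(data: bytes) -> str:
    local_store.append(data)        # 1. host write arrives at the source
    remote_ack = replicate(data)    # 2. replicated synchronously to the target
    assert remote_ack == "ACK"      # 3. target acknowledges the replication
    return "ACK"                    # 4. only now is the host acknowledged

assert handle_host_write_sync(b"block-0") == "ACK"
```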

Active-Active Replication Overview

InfiniBox Active-Active replication is the ideal solution for maintaining business continuity for clustered applications.

Active-Active replication is a symmetric synchronous replication solution that allows the application to run with no downtime and immediate recovery in case of failures.

It helps spread the application workload across data centers with minimal management, and even provides non-disruptive data mobility at the storage level.

InfiniBox Active-Active solution basics

  1. Peer-to-Peer replication - Both sides of the replication are equal. Both are writable and readable by the hosts.
  2. Low-latency sync replication - Data is transferred only once over the link, so the latency is the same on both systems.
  3. Automatic failover and failback - Zero RPO and zero RTO. The system automatically detects failures; when the problem is resolved, an automatic failback is performed.
  4. Highest level of protection and HA - Two levels of protection: Witness and "Preferred system" definition for a replica.

I/O with Active-Active replication

  • Active-Active replicated volumes are two separate volumes with the same serial ID.
  • Hosts that are mapped to these volumes see them as the same entity with multiple paths to it.
  • The volumes are available for reads and writes on both systems: reads are served locally, and writes are replicated to the remote system before the acknowledgment is sent to the host.
  • Write latency is the same on both systems, since the data is transferred only once over the link.
  • In case of a link failure between the systems, only one system will continue to serve I/Os (see the sketch after this list).
  • The solution is based on ALUA (Asymmetric Logical Unit Access).
  • InfiniBox has two mechanisms to handle failures for an Active-Active replica: Witness and Preferred system definition.
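A minimal sketch of the link-failure tie-break, assuming a simple "preferred system" rule; the actual InfiniBox behavior, including the Witness, is richer than this:

```python
def keeps_serving(link_up: bool, local_system: str, preferred_system: str) -> bool:
    """True if this system should keep serving I/O for the replica."""
    if link_up:
        return True                           # both peers serve I/O while the link is up
    return local_system == preferred_system  # on link failure, only the preferred side

assert keeps_serving(False, "box-A", preferred_system="box-A")
assert not keeps_serving(False, "box-B", preferred_system="box-A")
```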

For more information about the InfiniBox Active-Active solution, see Active-Active Replication.

Mobility Replication Overview

InfiniBox Mobility replication enables non-disruptive workload movement between InfiniBox systems without any downtime (Online Data Mobility).

  • Mobility replicas' behavior and functionality are identical to those of an Active-Active replica.

Mobility replicas are used during an Online Data Mobility process, which is initiated and managed from within the hosts using Host PowerTools.

  • The InfiniBox GUI and InfiniShell cannot manage Online Data Mobility; however, they allow querying the Mobility replica status.
  • For more information about the Online Data Mobility process refer to the Host PowerTools documentation.

Replication Systems Connectivity

Overview

In order to replicate data from one InfiniBox system to another, the user needs to connect the two systems by defining Replication Network Spaces on each of the InfiniBox systems.

On top of the Network Spaces, the user creates a bi-directional Replication Link that defines the connection between the two Network Spaces.

The same link can be used for all replication types.

Access control for connecting between the systems

Accessibility to the replication provisioning commands is available only from the InfiniBox systems that participate in the replication.

All of the user operations relevant to Replication Network Spaces and Replication Links require the Admin user role.

The Admin permissions are required on both local and remote systems. When running replication operations, the local system tries to log into the remote system with the same credentials. If access is denied, the system asks the user to provide credentials for the remote system.

Replicating data from the local system to the remote system can be carried out by both Admin and Pool Admin user roles.

  • The Admin user can replicate any of the system's datasets that are available for replication.
  • The Pool Admin can replicate only datasets from the relevant pool.

Defining a Replication Network Space

See the InfiniBox Best Practices Guide for setting up the Replication service.

The Network Space groups Ethernet interfaces from all three InfiniBox nodes to ensure a reliable and redundant replication service between InfiniBox systems.

The Network Space definition requires the user to define a minimum of 4 or 7 IPs, depending on the replication type (a short sketch of the IP math follows the list):

  • All replication types network space
    • A replication network space requires a minimum of 7 IPs
      • The first IP will be used as the control IP
      • 3 IP addresses will be used as data IPs for Sync/Active-Active replication
      • 3 to 6 IP addresses will be used as data IPs for Async replication (3 IP addresses will suffice, but 6 IP addresses allow for a smooth failover in case of node unavailability)
  • Async Only network space
    • Choosing Async Only when defining a network space assigns all the IPs to async replicas; Sync/Active-Active replicas cannot be defined on links using this network space
    • An Async Only network space requires a minimum of 4 IPs
      • The first IP will be used as the control IP
      • 3 to 6 IP addresses will be used as data IPs for Async replication
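The minimum IP counts above reduce to simple arithmetic; a tiny illustrative Python sketch (names are this document's own, not a product API):

```python
def minimum_ips(network_space_type: str) -> int:
    control = 1                        # the first IP is the control IP
    if network_space_type == "ALL_REPLICATION_TYPES":
        return control + 3 + 3         # 3 Sync/Active-Active data IPs + 3 Async data IPs
    if network_space_type == "ASYNC_ONLY":
        return control + 3             # 3 Async data IPs (up to 6 for smoother failover)
    raise ValueError(f"unknown network space type: {network_space_type}")

assert minimum_ips("ALL_REPLICATION_TYPES") == 7
assert minimum_ips("ASYNC_ONLY") == 4
```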

The control IP transfers the replica configuration and management commands between the local and remote systems.

Firewall ports that need to be open:

  • Control - TCP 80, TCP 443
  • Data - TCP 8067
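A generic way to verify these ports are open, hedged: this is a plain TCP connect test with a placeholder address, not an InfiniBox tool, and it only shows that a connection can be opened.

```python
import socket

REPLICATION_PORTS = {"control": [80, 443], "data": [8067]}

def check_replication_ports(remote_ip: str, timeout: float = 3.0) -> dict:
    """Try to open a TCP connection to each replication port."""
    results = {}
    for role, ports in REPLICATION_PORTS.items():
        for port in ports:
            try:
                with socket.create_connection((remote_ip, port), timeout=timeout):
                    results[(role, port)] = "open"
            except OSError:
                results[(role, port)] = "unreachable"
    return results

# Example with a placeholder address:
# print(check_replication_ports("192.0.2.10"))
```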

The link is the entity that connects the local InfiniBox system to the remote system using predefined Network Spaces on both systems.

For link creation, the user must have Admin permissions on both the local and remote systems.

The link is bi-directional. It can be created on any of the systems by identifying the second system via the control IP address.

Link states 

The replica link can be in either of the following states:

  • Connected - All of the IP addresses are reachable
  • Degraded - Some, but not all, of the IP addresses are reachable
  • Disconnected - None of the data IP addresses are available
  • Unknown - The remote management could not be reached, or the local async service is not available
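A sketch of how these states could be derived from reachability counts; illustrative logic only, assuming the systems probe each IP:

```python
def link_state(management_reachable: bool, reachable_ips: int, total_ips: int) -> str:
    if not management_reachable:
        return "Unknown"        # remote management (or local async service) unavailable
    if reachable_ips == total_ips:
        return "Connected"      # all IP addresses reachable
    if reachable_ips > 0:
        return "Degraded"       # some, but not all, IPs reachable
    return "Disconnected"       # no data IPs available

assert link_state(True, 7, 7) == "Connected"
assert link_state(True, 5, 7) == "Degraded"
```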

If the user wants to change IPs on a linked replication network space, the link has to be detached from the network space and re-attached after the network space is updated, or attached to a new network space.

  • Detaching a link - When the link is detached, the relationship between the link and the network space is disconnected. All of the replicas that use the link are automatically suspended.
  • Attaching the link - When the link is attached, all of the replicas that were suspended as a result of the detach operation will be automatically resumed.
    • Replicas that were suspended prior to the detach, will not be automatically resumed.

InfiniBox Replica

Overview

A replica is an entity that pairs a local and a remote dataset and holds other settings essential for the replication.

  • A replica can be of the async (asynchronous), sync (synchronous) or Active-Active replication type.
  • All replication types share most of the commands and attributes described below.
  • For each local replica entity, there is a paired replica entity on the remote system. Actions on the replica can be local or cross-system, depending on the action and the connection state between the systems.
  • In async and sync replication, one side of the replica is defined as the source - the side that holds the replicated dataset. The linked side is defined as the target and holds the dataset being replicated to.
  • The replica is created by the user on the local dataset, and the remote system automatically creates the linked replica on the remote dataset.

Supported datasets for replication

InfiniBox version 5.0 supports the creation of Async replica on:

  • Volumes
  • File Systems
  • Consistency Groups

InfiniBox version 5.0 supports the creation of Sync replica on:

  • Volumes
  • Consistency Groups

InfiniBox version 5.0 supports the creation of Active-Active replica on:

  • Volumes

General replica attributes

Managing a replica is mainly based on the following attributes:

  • Replica type - Sync, Async or Active-Active
  • Replica role - Source, Target or Active-Active Peer
  • Replicated dataset - Volume, filesystem or consistency group on the local system
  • Remote dataset - Volume, filesystem or consistency group on the remote system
  • RPO (for Async only) - The amount of data, measured in time units, that is at risk and might be lost, in the case of a disaster.
  • Interval (for Async only) - The amount of time between two planned sync jobs.
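These attributes map naturally to a small data model; a hedged sketch, with field names that are illustrative and not the product's API:

```python
from dataclasses import dataclass

@dataclass
class Replica:
    replica_type: str             # "SYNC", "ASYNC" or "ACTIVE_ACTIVE"
    role: str                     # "SOURCE", "TARGET" or "ACTIVE_ACTIVE_PEER"
    local_dataset: str            # volume, filesystem or CG on the local system
    remote_dataset: str           # the paired dataset on the remote system
    rpo_seconds: int | None = None       # Async only: data at risk, in time units
    interval_seconds: int | None = None  # Async only: time between sync jobs

r = Replica("ASYNC", "SOURCE", "vol1", "vol1-replica",
            rpo_seconds=240, interval_seconds=60)
```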

Replica states

InfiniBox uses replica states to manage the replication. The replica state definitions are the same for sync and async replicas:

  • Active - The replica is active and data is replicated between the datasets as defined
  • Suspended - The replica is suspended by the user. No data is transferred between the datasets
  • Auto-suspended - The replica is automatically suspended by InfiniBox due to a permanent error or a timeout. The user will have to manually resume the replica once the cause of the suspension is fixed.

For Active-Active replicas, the replica state is not available since the replica is always Active.

Initial Sync

When the replica is defined, and the source is not yet replicated to the target, the replica is in an Initialization state. In this state, all of the source data has to be replicated.

This state may take a long time depending on the amount of data that needs to be replicated.

Once the replica is synchronized (the source is fully replicated to the target), the InfiniBox system will replicate only the new I/Os or differences between the source dataset and the target.

When the replica is in an Initialization state:

  • The replica role cannot be changed
  • It is impossible to take a snapshot on the target dataset (as the target is not yet consistent with the source)

Snapshot-based Initial Sync (For Async and Active-Active replicas)

For async and active-active replicas, the user can retain the internal snapshots when the replica is deleted and see them as regular snapshots.

These retained snapshots are uniquely identified by the system and can be used as a base for re-creating the replica between the source and target datasets, as long as they were not changed by the user.

If the user creates the replica using the retained snapshots as a base, the initial sync job is skipped: a new snapshot of the source dataset is taken, and only the difference between the retained snapshot and the new snapshot is transferred to the remote dataset.
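A minimal sketch of this delta logic, modeling snapshots as {block: data} dictionaries (illustrative only):

```python
def changed_blocks(base: dict, fresh: dict) -> dict:
    """Blocks that are new or different relative to the retained snapshot."""
    return {blk: data for blk, data in fresh.items() if base.get(blk) != data}

def initial_transfer(fresh_snapshot: dict, retained: dict | None) -> dict:
    if retained is not None:
        return changed_blocks(retained, fresh_snapshot)  # initial sync job skipped
    return dict(fresh_snapshot)                          # full initialization

retained = {1: "a", 2: "b"}
fresh = {1: "a", 2: "c", 3: "d"}
assert initial_transfer(fresh, retained) == {2: "c", 3: "d"}
```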

Operations on replicated datasets

Most of the user operations on a dataset will work as they normally do when the dataset is replicated.

The following user operations that are usually available on datasets are blocked on all replicated datasets:

  • Delete dataset
  • Restore dataset

Specific restrictions for Async replicas:

  • Disable Write-protect dataset on a target dataset
  • Export a filesystem on a target dataset

Specific restrictions for Sync replicas:

  • Resize dataset

Specific restrictions for Active-Active replicas:

  • Resize dataset
  • Add QoS policy
  • Enable Write-protect dataset
  • Change dataset provisioning type

Once the replica object is deleted, operations on the dataset are allowed as usual.

Replica Operations

Replica Create

The replica is created on the source system, since this is the system that holds the dataset's data and is connected to the application's active host.

When creating a replica, the user provides the following:

  • The replication type
  • The source dataset - the volume, filesystem or consistency group that will be replicated
  • The target system
    • As a prerequisite, the target system has to be connected to the source system via a link
    • The link has to be defined prior to the replica creation
    • The link has to be in a Connected state
  • The target pool or target dataset:
    • If the target pool is provided - the dataset will be created as a part of the replica creation in the specified pool
    • If the target dataset is provided (for volumes/CGs only), it has to be created in advance, be empty, and be the same size as the source dataset (for consistency groups, see the CG section below)
  • Interval and RPO have to be supplied during the creation of an Async replica only
  • Preferred system can be supplied (optional) during the creation of an Active-Active replica only
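The prerequisites above can be expressed as creation-time checks; a hedged sketch whose function name and error strings are illustrative, not the product's API:

```python
def validate_replica_create(replica_type: str, link_state: str,
                            target_pool: str | None = None,
                            target_dataset: str | None = None,
                            interval: int | None = None,
                            rpo: int | None = None) -> None:
    if link_state != "CONNECTED":
        raise ValueError("the link must be in a Connected state")
    if (target_pool is None) == (target_dataset is None):
        raise ValueError("provide exactly one of: target pool, target dataset")
    if replica_type == "ASYNC" and (interval is None or rpo is None):
        raise ValueError("async replicas require both Interval and RPO")

validate_replica_create("ASYNC", "CONNECTED", target_pool="pool1",
                        interval=60, rpo=240)
```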

Replica Delete

A replica entity should be deleted from the source replica if possible. The system will automatically delete the target replica as well.

When the replica is deleted, the pairing between the replicated datasets is deleted and both source and target datasets return to regular usage. Deleting the replica will not delete the replicated datasets on either side.

In case there is no connectivity between the systems or if there is a configuration mismatch between the replicas, there is an option to delete the replica locally using a force flag. This will require the user to do the cleanup on the other system.

When deleting an async or active-active replica, there is an option to retain the staging area (the last snapshot replicated) and expose it to the user for future use (see snapshot-based initial sync above). If the replica is in an Initializing state, there are no snapshots to retain.


In Active-Active replication, deleting the replica changes the serial number of the remote volume. Due to this behavior, the remote volume has to be unmapped before the replica is deleted.

If the force flag is used, the user has the option to keep the serial number on the local volume. This has to be selected carefully, since it can cause a "split brain" between the two volumes, where both volumes keep the same serial number but are no longer connected to each other.

Replica Suspend and Resume (Sync and Async replicas only)

The user can suspend and resume the replica at any time from the source replica only.

The Suspend Replica command causes the source replica to stop transferring data to the target replica, and Resume Replica resumes the data transfer between the datasets if possible.

When there is no connectivity between the source and target or in case of a configuration mismatch between them, the resume command will fail.

Replica Change Role (Sync and Async replicas only)

The user can change the replica role at any time except when the replica is not yet initialized.

Change role can be done on the source or the target replica; the source replica has to be suspended prior to changing the role.

When changing source to target, the source dataset becomes a target dataset and no longer accepts user writes. This may cause a loss of updated source data that was not yet replicated to the target.

When changing target to source, the target dataset becomes a source dataset and accepts host writes. I/O from the other system will be blocked.

After a change role command, the replica has to be manually resumed in order to continue replicating.

Replica Switch Role (For Sync replicas only)

In Sync replication there is an option to switch the replica direction by synchronously changing the roles of both sides.

The switch role command can be used only from the source replica, and only if the link between the systems is connected and the replica is in a Synchronized state.

Consistency Group Replica (Sync and Async replicas only)

Consistency Group is an entity that groups several volumes together and allows the user to take a consistent snapshot of these volumes.

A replicated CG (consistency group) keeps the volumes in the CG on the target side consistent with each other.

The consistency group is replicated as a whole.

Create a CG Replica

Creating a CG replica is done just like creating a volume or filesystem replica.

When creating a CG replica, there are several options for the local dataset:

  • An empty CG - The replica entity is created between the two empty CGs; once the user adds volumes to the CG, they are automatically replicated.
  • A CG with volumes - All the volumes in the source CG are paired with target volumes and a replication process starts.
    • The members on the target side can be created automatically, or previously created by the user and paired explicitly.

Delete a CG replica

Deleting a CG replica is similar to deleting a volume replica.

CG replica deletion should be done from the source replica if possible, and deletes the replica entity on both sides.

Deleting the CG replica will not delete the CG on either side or change it in any way.

In case there is no connectivity between the systems or if there is a configuration mismatch between the replicas, there is an option to delete the replica locally using a force flag. This will require the user to do the cleanup on the other system.

When deleting an async CG replica, there is an option to retain the staging area; this exposes a snap-group of all the last replicated snapshots of the CG.

Add and remove a replicated CG member

Adding and removing a member of a replicated CG can be done on an Async replica only.

For a Sync CG replica, the user has to change the replica type of the CG replica to async and wait for the completion of the running sync job prior to the add/remove operation.

Add a member to Async CG Replica

Adding a member to a replicated CG can be done on the source replica only.

The added member can be a volume or an async replica. In both cases, the new member will get the async replica definitions from the CG (RPO and interval).

Adding a member to a replicated CG will change the CG replica sync state to initializing until all the volumes are replicated and the targets are consistent.

Remove a member from an Async CG Replica

Removing a member from a replicated CG can be done on the source replica only.

In order to remove a member, the replica link must be connected and the replica has to be of the async type. To remove a member from a sync CG, the replica type has to be changed to async beforehand.

When removing a member, the user can choose to keep the member replicated, and a new replica entity will be created for the removed member.

The user can also choose to retain the staging area for the removed member, as done in delete replica.

Async replication Specifics

Async replication mechanism

The InfiniBox Async replication feature is snapshot-based replication, based on Sync jobs that are scheduled automatically by the system.

The sync job creates a snapshot on the source dataset and delivers it to the target. The next sync job takes a new snapshot of the source, calculates the diff from the previous snapshot and sends only the data that was changed since the previous sync job. 

The amount of time between two scheduled sync jobs is called sync interval. The sync interval can be changed by the user. Changes to the replica sync interval take effect on the next sync job. 

The async snapshots are internal snapshots and are not visible to the user. Their capacity is presented to the user in the replica information as the staging area capacity.
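A compact sketch of one sync-job cycle under these rules, modeling datasets and snapshots as {block: data} dictionaries; illustrative only, not the InfiniBox implementation:

```python
def run_sync_job(source_dataset: dict, previous_snapshot: dict | None,
                 target_dataset: dict) -> dict:
    snapshot = dict(source_dataset)               # point-in-time copy of the source
    if previous_snapshot is None:
        changes = snapshot                        # initializing job: send everything
    else:
        changes = {blk: data for blk, data in snapshot.items()
                   if previous_snapshot.get(blk) != data}   # diff since last job
    target_dataset.update(changes)                # ship only the changed data
    return snapshot                               # becomes the base for the next job

source, target = {1: "a"}, {}
base = run_sync_job(source, None, target)         # initial sync job
source[2] = "b"
run_sync_job(source, base, target)                # only block 2 is transferred
assert target == {1: "a", 2: "b"}
```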

Sync Now Command

There is an option for the user to define a None interval for a replica. In this case, the system does not initiate any sync jobs for the replica, and the user has to manually trigger a sync job using the "Sync now" command.

The user can use the "Sync now" operation on any async replica, regardless of the interval defined, and a sync job will be initiated (if there is no sync job currently replicating).

Sync Job states

When the replica is in Active state, it initiates a Sync Job as needed. InfiniBox manages the Sync Job through the following states:

  • Pending - This sync job is planned but not yet executed
  • Initializing - This is an initial sync job that replicates all of the source data to the target
  • Replicating - The sync job is now running
  • Done - The sync job has finished
  • Paused - The sync job is paused because the replica is suspended by the user
  • Stalled - The sync job is stalled due to link problems

The sync job states and the replica state are visible on the source replica only. On the target replica, the state is N/A.


RPO State

In addition to the replica state, async replicas also have an RPO state.

The RPO state is presented on both sides of the replica and is calculated locally.

The RPO state can be one of the following (a small calculation sketch follows the list):

  • RPO OK - The replica recovery point is within the RPO defined
  • RPO Lagging - The replica passed the defined RPO, and the potential data loss in case of a disaster might be larger than planned. This state might be reached when there are connectivity issues preventing proper data flow, or when the RPO definition is incorrect.
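The locally calculated RPO state reduces to a simple comparison; a hedged, illustrative sketch:

```python
def rpo_state(seconds_since_recovery_point: float, rpo_seconds: int) -> str:
    if seconds_since_recovery_point <= rpo_seconds:
        return "RPO OK"          # recovery point within the defined RPO
    return "RPO Lagging"         # potential data loss larger than planned

assert rpo_state(120, rpo_seconds=240) == "RPO OK"
assert rpo_state(600, rpo_seconds=240) == "RPO Lagging"
```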

Possible Async replica states

| Replica State | Sync Job States | RPO State |
|---|---|---|
| Active | Pending, Initializing, Replicating, Done, Stalled | RPO OK / RPO Lagging |
| Suspended | Paused, Done | |
| Auto-suspended | Paused, Done | |

Best practices for setting the Sync interval and RPO

Since the recovery point is based on the defined sync interval, the best practice is to set the RPO to at least twice the sync interval. For example, with a sync interval of 30 minutes, the RPO should be set to at least 60 minutes.
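This rule of thumb can be encoded as a simple check (illustrative only):

```python
def rpo_meets_best_practice(rpo_seconds: int, interval_seconds: int) -> bool:
    return rpo_seconds >= 2 * interval_seconds

assert rpo_meets_best_practice(rpo_seconds=3600, interval_seconds=1800)
assert not rpo_meets_best_practice(rpo_seconds=2400, interval_seconds=1800)
```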

Sync Replication Specifics

Sync replication mechanism

In sync replication, as each host write is replicated to the target system prior to acknowledging the host, the replica depends on the link quality.

InfiniBox takes measures to handle the synchronous replica in case the link between the source and the target cannot support the replica, including safely returning to synchronous replication when connectivity conditions return to normal.

Internal fail-over to async replication

In case of a problem that prevents the replica from being synchronized, the replica falls back to an internal async replication mode.

The replica operates as if it were async until a Synchronized state is reached.

The replica state will be Out Of Sync or Sync In Progress until InfiniBox returns the replica to a Synchronized state.

The replication type stays sync at all times, and the user cannot perform async replication operations, such as Sync now, nor configure the replica RPO and interval.

The return to synchronous replication does not require an initialization and is done automatically by the system.

Replica Sync states

Sync and Active-Active replicas have a sync state similar to the sync job states in async:

Possible states in sync state:

  • Synchronized - The source and target datasets are identical
  • Sync in progress - The replica is in the process of returning to a Synchronized state (using the internal async mode)
  • Initializing - The replica initialization process is running, copying the entire source data to the target
  • Initializing pending - The initialization process will start once other initialization processes end
  • Out of sync - The replica source cannot send data to the target, and the replication process is paused

The sync state of the replica is visible only on the source. On the target, the state is N/A.

Possible Sync replica states of the source replica

| Replica State | Sync States |
|---|---|
| Active | Synchronized, Initializing, Sync in progress, Out of sync |
| Suspended | Out of sync |
| Auto-suspended | Out of sync |

Changing the replication type between Sync and Async

The replication type of a replica is determined by the user when the replica is created and can be changed anytime later on (except when the replica undergoes Initialization).

  • When the replication type is changed from Synchronous to Asynchronous the user has to specify the Interval and RPO
  • When the replication type is changed from Asynchronous to Synchronous:
    • The Sync state is set to Sync In Progress until the replica is Synchronized

Disaster Recovery scenarios for async and sync replication

Disaster Recovery using target snapshots

Since the target dataset is always write-protected, for DR scenarios, InfiniBox supports taking a snapshot of the target.

In async replicas, the snapshot that is taken on the target is always consistent with the last replication cycle.

This ability to take a snapshot on the target is ideal for Disaster Recovery tests that aim to verify the integrity of the data on the target without affecting the replication process.

It allows the user to map the snapshot to the remote host without stopping the replication of the source dataset to the target.

Note that it is also possible to map the target dataset itself to a remote host, taking into account that the dataset will be consistent and write-enabled only after the replica is changed to source or deleted.

Testing the Disaster Recovery site (Firedrill)

Failover and failback are operations that handle a situation in which the connectivity between the local and the remote systems is down. These operations switch the roles of the source and the target.

Failover

  • The link between the source and the target is down and the target is connected to a host and can serve it.
  • The target has to have its role changed to a source and will now accept host writes.
  • The replica on the original source side gets into an auto-suspended state.
  • During this phase, the target and the source are no longer consistent.

Failback

The user returns both the source and target to their original roles, and the replica returns to the state it was in prior to the failover (a condensed sketch follows the steps below).

  • The original target should be changed back to target (was changed to source in the failover).
  • The replica should be resumed from the original source side.
  • A sync job will start in order to return the replica to synchronize state.
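The firedrill failover/failback sequence above, condensed into illustrative bookkeeping of roles and replica state (not a product API):

```python
def firedrill(replica: dict) -> None:
    # Failover: the link is down, promote the target so the remote host can write
    replica["target_role"] = "SOURCE"           # target now accepts host writes
    replica["source_state"] = "AUTO_SUSPENDED"  # original source auto-suspends
    # ... DR test runs against the promoted side; sides are no longer consistent ...
    # Failback: restore the original roles and resume from the original source
    replica["target_role"] = "TARGET"           # demote back to target
    replica["source_state"] = "ACTIVE"          # resume from the original source
    replica["sync_job"] = "RUNNING"             # sync job restores a synchronized state

replica = {"target_role": "TARGET", "source_state": "ACTIVE"}
firedrill(replica)
assert replica["target_role"] == "TARGET" and replica["sync_job"] == "RUNNING"
```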

Switching the replica in case of a real disaster

In some disaster scenarios, the workload will have to be moved to the target system and the applications will continue working on the target system.

In this case, the user might want to replicate the data changed on the target datasets back to the source datasets.

To do so, the user will need to do the following:

Failover

  • The source system is down and the target is connected to a host and can serve it.
  • The target has to have its role changed to a source and will now accept host writes.
  • During this phase, the source datasets are unavailable to the user due to the disaster.

Failback

The user switches the original roles of the replica and synchronizes the data from the new source to the old source.

  • The original source should be changed to target.
  • The replica should be resumed from the new source side.
  • A sync job will start in order to return the replica to synchronize state.

After this procedure, the user can decide whether to keep the replica roles reversed, or change them back to the original sites (using switch role for sync if possible, or change role on both sides when there is no new data on the source, to prevent data loss in the process).

Active-Active Replication Specifics

For more information about the InfiniBox Active-Active solution, see Active-Active Replication.
