- High Availability and RAS
- Backup and recovery
- Database Cloning using Snapshots
- Database Replication vs. Storage Replication
- Oracle ASM, crash consistency and recovery
- Ease of use
- Oracle I/O Profile and INFINIDAT storage synergy
- Host-based Configuration Guidelines
- Red Hat Enterprise Linux (RHEL) I/O schedulers
- RHEL File system types
- Windows NTFS
- AIX jfs2 and LVM
- Oracle ASM
- Oracle Databases on NFS
- Data Reduction
RDBMS databases are the backbone of organizations large and small. They are an asset that requires the highest availability, reliability, performance and flexibility for every part of the organization they support. They typically store mission-critical transactional data about customers, patients, suppliers and orders. Databases are also used to analyze the performance of the organization through data warehouses, data marts and data lakes, as well as non-structured data analysis as the organization moves into capturing and analyzing what is now popularly known as “Big Data”, or data outside of the “Systems of Record” data stores.
InfiniBox represents a new age of data storage, departing from the traditional dual-controller, RAID-set storage mentality. It provides a solution for the most demanding application and database environments, using a best-of-breed storage architecture that offers unmatched ease of use and fast start-to-finish storage deployment tools. By avoiding these legacy storage architectures, InfiniBox gives the applications it hosts higher, more predictable performance as well as much simpler, easier-to-manage host-side configurations.
The net result is a much lower TCO for applications migrated to InfiniBox, and a platform for unparalleled database and application consolidation where up to 2PB of data can be stored in a single floor tile. In Infinidat’s view, no other storage vendor can store mission-critical databases at this density.
This paper describes the features InfiniBox provides, how specific database activities are streamlined, how the InfiniBox architecture encourages simpler database architecture designs, and how InfiniBox reduces the time and complexity of managing these critical database resources.
Large databases (tens to hundreds of TB) pose a unique challenge to enterprise storage arrays: their I/O profile is unpredictable and often overwhelms the storage frame, resulting in high latencies that increase the run time of database workloads. Some database activities are very latency sensitive, and in many cases high latency will affect the end-user population that the application supports.
InfiniBox provides benefits that are requirements for enterprise database deployment:
Consistent, high performance: Infinidat is designed with massive parallelization, huge compute power and large L1 and L2 caches, while its data distribution architecture ensures even access across all 480 NL-SAS drives at all times. This provides consistent, predictable performance, an absolute requirement for all enterprise databases. The storage snapshot architecture of InfiniBox can take thousands of snapshots without affecting performance, which Oracle databases can benefit from. Our customers use snapshots to augment their Oracle database backup and recovery architecture.
High availability and reliability: The InfiniBox architecture provides a robust, highly available storage environment with 99.99999% uptime, one of the highest rated uptimes for any storage platform. That equates to about 3 seconds of downtime a year. Drive rebuild times are the best in the storage industry. Oracle customers using Infinidat systems report no loss of data, even after multiple disk failures. InfiniBox offers end-to-end business continuity features, including synchronous and asynchronous remote mirroring, and supports the multi-node, multi-initiator Oracle Real Application Clusters (RAC) platform for highly available clustered databases. Using snapshots, recovery of a database can be reduced to the time it takes to map the volumes to hosts: minutes, instead of the hours a more traditional RMAN recovery process takes.
Exceptional ease in storage management: The InfiniBox architecture, along with the elegant simplicity of its web-based GUI, allows for easy, fast deployment and management of storage for database environments. The time saved on traditional storage administration tasks is substantial. Also, because of the open InfiniBox architecture and its aggressive support for RESTful APIs, platforms such as OpenStack and emerging container-based application environments such as Docker can perform storage administration tasks at the application level, without the need to use the InfiniBox GUI. Storage deployment and management can also be performed directly from VMware’s vCenter console through support for the major VMware APIs such as VAAI, VASA and VADP.
Lower total cost of ownership: Massive parallelization, extreme availability, the highest data density in the industry, consistent performance and ease of use all point to unmatched TCO. This is important for environments that need to consolidate mission-critical databases into smaller and smaller physical footprints while experiencing an explosion of data from sources such as mobile devices, machine-generated data and huge amounts of analytic data. There is no other storage platform on the market that provides all of these benefits, particularly for mission-critical enterprise database environments.
This paper will go through some of the major requirements and characteristics and provide guidance on best practices, as well as any observed behavior that is unique to running Oracle Databases on InfiniBox.
High Availability and RAS
InfiniBox is designed to provide the highest availability, while reducing the physical data footprint to store that data by utilizing a unique and patented data distribution and parity-based protection mechanism that distributes data from each volume across EVERY drive in the InfiniBox frame. That is 480 NL-SAS drives supporting each and every volume (F6240).
The parity-based storage architecture ensures that the largest amount of usable capacity is available.
Drive rebuild times can directly affect availability. InfiniBox has a maximum rebuild time of 15 minutes for a 6TB drive that is completely full. For drives with less space used, rebuild times are lower. The reason is that InfiniBox is not built from RAID sets, or a limited number of spindles grouped together. InfiniRAID is a new way to store data, with a unique and patented way to distribute large amounts of data across every spindle in the frame. This significantly reduces drive rebuild times, in part because all drives in the system take part when data needs to be rebuilt. Also, because of the unique and patented way that InfiniBox stores data along with parity, most of the data is rebuilt without having to move data from one place to another.
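The effect of spreading a rebuild across every drive can be illustrated with simple arithmetic. The following is a minimal Python sketch using the figures quoted above (a full 6TB drive, 15 minutes, 480 drives). The 150 MB/s single-spare rate is a hypothetical value chosen only for comparison, and decimal units are assumed throughout.

```python
TB = 10**12
GB = 10**9
MB = 10**6


def rebuild_rate_bytes_per_s(capacity_bytes: float, rebuild_seconds: float) -> float:
    """Aggregate rate needed to rebuild a full drive in the given time."""
    return capacity_bytes / rebuild_seconds


drive_capacity = 6 * TB        # full 6TB drive, from the text
rebuild_time = 15 * 60         # 15 minutes, from the text
surviving_drives = 480 - 1     # every remaining drive takes part

aggregate = rebuild_rate_bytes_per_s(drive_capacity, rebuild_time)
per_drive = aggregate / surviving_drives

# Hypothetical comparison: a rebuild bottlenecked on a single spare drive
# that sustains an assumed 150 MB/s.
single_spare_rate = 150 * MB
single_spare_hours = drive_capacity / single_spare_rate / 3600

print(f"aggregate rebuild rate: {aggregate / GB:.2f} GB/s")
print(f"per-drive share:        {per_drive / MB:.1f} MB/s")
print(f"single-spare rebuild:   {single_spare_hours:.1f} hours")
```

Under these assumptions each drive contributes only a small share of its bandwidth to the rebuild, whereas a rebuild limited by one spare drive would run for many hours.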
Parity-based storage architecture, low drive rebuild times and fully redundant hardware (many components have triple redundancy) give InfiniBox the ability to provide 99.99999% uptime per year. This equates to roughly 3 seconds of downtime per year. To put this into perspective, 3 seconds a year is much shorter than an average SCSI timeout sequence, which means that even if there was downtime, it would not cause the host to lose its connection to the data, or even recognize that there was a short disruption of data availability.
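The downtime figure follows directly from the availability percentage. As a quick check, this short Python sketch converts an availability percentage into seconds of downtime per year, assuming a 365-day year; the other percentages are included only for comparison.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year assumed


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Expected downtime per year for a given availability percentage."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


for pct in (99.9, 99.999, 99.99999):
    print(f"{pct}% uptime -> {downtime_seconds_per_year(pct):.2f} s of downtime per year")
```

For 99.99999% this gives a little over 3 seconds per year, matching the figure above.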
With such high availability, database consolidation and a reduction in mirrored copies of data become possible. More customers are considering consolidating databases onto less hardware to reduce costs.
Backup and recovery
Most Oracle customers use a more traditional backup and recovery architecture, primarily using Oracle RMAN (Recovery Manager) to back up and recover database data. RMAN is the preferred method for producing a complete, comprehensive backup image of your database, while supporting full or piecemeal recovery of data for more focused recovery.
When an RMAN backup is called for, InfiniBox is a very good target for backing up a database directly to disk. The backup will be fast, and it will be cost effective due to the massive density of storage on a single frame. Many Oracle database shops use VTL-based RMAN backup strategies as the first line of recovery defense.
The following is an example RMAN configuration:
This configuration example uses a total of 8 channels and sets parallelism to 8. This ensures that I/O multi-threading and CPU capacity are maximized to drive the backup hard. You don’t have to adopt this technique; the point of driving the backup hard is to complete it in as short a time as possible. If this is not a requirement, reduce the number of channels configured and the level of parallelism to the desired level. One reason to do this is to avoid consuming all of the server’s CPU resources and I/O bandwidth while the backup is running, because other activities are also being performed on the same server.
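A configuration of this shape (parallelism 8, eight disk channels) could be rendered with a small helper like the one below. This is a sketch, not the configuration used in the test: the helper name and the round-robin assignment are assumptions for illustration, and the emitted lines use the standard RMAN `CONFIGURE DEVICE TYPE DISK PARALLELISM` and `CONFIGURE CHANNEL ... FORMAT` commands with `/backup` as the target.

```python
def rman_disk_config(parallelism: int, target_dirs: list[str]) -> list[str]:
    """Render RMAN CONFIGURE commands for disk channels.

    Channels are assigned to the target directories round-robin.
    """
    if parallelism < 1 or not target_dirs:
        raise ValueError("need at least one channel and one target directory")
    commands = [
        f"CONFIGURE DEVICE TYPE DISK PARALLELISM {parallelism} BACKUP TYPE TO BACKUPSET;"
    ]
    for channel in range(1, parallelism + 1):
        directory = target_dirs[(channel - 1) % len(target_dirs)]
        # %U is RMAN's substitution variable for a unique backup piece name.
        commands.append(
            f"CONFIGURE CHANNEL {channel} DEVICE TYPE DISK FORMAT '{directory}/%U';"
        )
    return commands


config = rman_disk_config(8, ["/backup"])
print("\n".join(config))
```

Lowering the first argument backs off both the parallelism and the number of channels. Passing several directories spreads the channels evenly across them.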
To back up the database using this level of parallelism, perform the following at the RMAN command prompt:
Note the use of section size 100g in the backup command. Without it, every tablespace backup is stored in a single backup piece. If you instead use maxpiecesize at the channel level, RMAN breaks a large datafile into maxpiecesize chunks, but in this case the schema had only a single large tablespace (made up of a single large ASM datafile), so the backup still used only one channel. With maxpiecesize set, a single channel was observed to write about 350MB/sec to the /backup file system, while reading from the ASM datafiles at the same 350MB/sec. This backup was started at 10:54 and ended at 20:50, a total of almost 10 hours. Below is a screen shot of the backup process from the InfiniBox performance screen:
After changing the backup command to use the section size parameter, the figure below shows a backup of the 5TB database using all 8 channels and maxing out at a combined 3GB/sec (reading from the ASM datafiles at 1.5GB/sec and writing to the /backup file system at 1.5GB/sec), which is line speed for this 4 x 8Gb HBA system. The backup took 35 minutes, a significant improvement over the default backup. The database size was 4.7TB.
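The reason the section size parameter matters can be sketched numerically. A multisection backup (for example `BACKUP SECTION SIZE 100G DATABASE;`) splits one large datafile into sections that separate channels can process in parallel. The Python sketch below treats the 4.7TB database as a single 4700GB datafile (decimal units, an assumption) and counts how many of the 8 channels have work to do with and without sections.

```python
import math


def backup_sections(datafile_gb: float, section_size_gb: float) -> int:
    """Number of sections a datafile is split into for a multisection backup."""
    return math.ceil(datafile_gb / section_size_gb)


def busy_channels(work_units: int, channels: int) -> int:
    """Channels that can be kept busy when there are this many units of work."""
    return min(work_units, channels)


datafile_gb = 4700   # single large datafile; 4.7TB in decimal units (assumed)
channels = 8

# Without SECTION SIZE the whole datafile is one unit of work.
without_sections = busy_channels(1, channels)

# With SECTION SIZE 100G the same file becomes many independent sections.
sections = backup_sections(datafile_gb, 100)
with_sections = busy_channels(sections, channels)

print(f"sections: {sections}")
print(f"busy channels without section size: {without_sections}")
print(f"busy channels with section size:    {with_sections}")
```

With a single large datafile, only the sectioned backup gives every channel something to read.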
So, again, the choice is yours: run the backup at full speed, or run it at a reduced speed.
RMAN backup to NAS
Here is a 7.6TB Oracle database backed up to a single 10TB NFS mount point. The InfiniMetrics view of the NAS side shows the writing side of the RMAN backup process to the NAS mount. We are running at 1GB/s, line speed for this server. Note the 1.5ms write latency. NAS is an efficient backup target for RMAN. Since this is an NFS mount, the file system can also be mounted on another server running IBM’s TSM or Commvault backup software to back up the RMAN backup sets to tape. This backup took roughly 2 hours. A single 12TB NFS mount point called /backup was created, with 4 sub-directories. The RMAN configuration was modified to set parallelism to 8, and 8 separate channels were created, 2 per sub-directory, to ensure maximum parallelization. The default filesperset was used to ensure that all channels stay busy.
Here is the SAN side, which was the reading side of the backup process.
RMAN Backup and Compression
RMAN supports in-line data compression of the data as it is being written to the backup target. There are a few levels of compression supported by RMAN. The RMAN configuration and backup command are modified to enable compression as follows:
- Set the <'compression level'> compression algorithm to 'DEFAULT'
- This configuration line executed before the backup command sets the level of compression desired. The levels are:
- BASIC – This is the default
Advanced Compression options:
- The backup section size should be 20GB
Here is an example of a backup performed against a 7.6TB database using all variations of compression options, and their respective impacts on backup set size, and host server utilization. This test was run on an Oracle 12c database, running on RHEL 7.1.
| Test | Size [MB] | Compression Ratio (n:1) | Time [hh:mm] | Bandwidth [MB/s] | InfiniBox R/W Latency [ms] | Server [%usr] | Server [%sys] | Server [%I/O wait] | Server [%idle] |
RMAN Backup in a RAC environment
Chapter 7, “Managing Backup and Recovery”, of the “Oracle Real Application Clusters Administration and Deployment Guide” for 11g or 12c goes into more detail about configuring RMAN to work in a RAC environment. Specifically, page 7-4 provides examples of configuring and launching multiple channels against certain nodes in the cluster. You can also use Oracle’s server-side load balancing to dynamically launch channels against any and all surviving nodes in the cluster, maximizing the parallelization of your backup by using all nodes in the cluster.
Alternatively, you can manually direct RMAN to launch specific channels against specific nodes/DB instances in the cluster. Be aware that with this approach, if any of those nodes is or becomes unavailable, the backup will no longer be load balanced and could take longer to execute.
In either case, Infinidat makes for an excellent backup target, either as a block storage file system or as a network file system as described above.
InfiniBox Snapshots as a backup / recovery option
InfiniBox provides an elegant, high-speed storage-based snapshot system (InfiniSnap) that allows you to take thousands of snapshots with no performance impact. Storage-based snapshots are another way to back up and recover database data at the storage level, performing the task within microseconds rather than the hours it takes for a traditional RMAN backup. RMAN must read all the data from the database and write it all to a virtual or physical tape library, whereas an InfiniSnap can be used to restore the database immediately from the snapshot.
InfiniBox provides consistency groups, which group storage volumes together so that a single command line or mouse click takes a snapshot across all volumes in the group simultaneously. There is no time gap between the snapshot images of volumes within a consistency group, no matter how many volumes it contains. This ensures that the snapshot images within the consistency group carry essentially the same timestamp, so the snapshot contains data from a specific point in time and is therefore a crash-consistent image of the database.
This consistent snapshot image of the database (or any one of the tens of thousands of snapshots that can be taken), which uses the efficient InfiniBox redirect-on-write snapshot mechanism, can then be used as a direct recovery option for the database. The recovery is simple and fast. Within the GUI you just click on “Restore from this group” to overwrite the original volume contents with the snapshot image. This action can also be done on the command line within InfiniShell.
Finally, InfiniBox consistency groups can be used as part of the InfiniBox volume replication system to copy the data to a secondary disaster recovery site, shipped as a point in time snapshot of the data with a Recovery Point Objective as low as 4 seconds, the lowest in the entire storage industry.
For consistent images of the database (as opposed to a crash-consistent image), Oracle provides a mechanism that supports storage-based snapshot technologies by allowing the administrator to place the database in “Backup mode”. The command to do this is ALTER DATABASE BEGIN BACKUP;
You then execute an InfiniBox snapshot of all volumes supporting this database. Then, issue ALTER DATABASE END BACKUP;
The amount of time in backup mode is very short, as long as it takes you to either go to the InfiniBox GUI and execute the snapshot or use the CLI commands to execute the snapshot directly from the server.
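Because the database should spend as little time as possible in backup mode, it helps to script the sequence so that backup mode is always ended, even if the snapshot step fails. The following Python sketch shows one way to structure this. `run_sql` and `take_snapshot` are hypothetical callables supplied by the caller (for example a wrapper around a SQL client and around the storage CLI); only the two `ALTER DATABASE ... BACKUP` statements are standard Oracle SQL.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def backup_mode(run_sql: Callable[[str], None]) -> Iterator[None]:
    """Hold the database in backup mode for the duration of the block."""
    run_sql("ALTER DATABASE BEGIN BACKUP;")
    try:
        yield
    finally:
        # Always leave backup mode, even if the snapshot step raised.
        run_sql("ALTER DATABASE END BACKUP;")


def snapshot_in_backup_mode(
    run_sql: Callable[[str], None], take_snapshot: Callable[[], None]
) -> None:
    """Begin backup mode, snapshot all database volumes, end backup mode."""
    with backup_mode(run_sql):
        take_snapshot()


# Demonstration with stand-ins that only record what would be executed.
executed: list[str] = []
snapshot_in_backup_mode(
    executed.append,
    lambda: executed.append("<snapshot of the database consistency group>"),
)
print("\n".join(executed))
```

In a real deployment the snapshot callable would trigger a consistency group snapshot so that all volumes are captured together.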
This backup mode allows Oracle to continue normal operation even if users are writing data to the database. Please visit support.oracle.com for more details. Specifically, the Oracle Backup Recovery Reference Guides for any version of Oracle provide more details about using this series of commands.
Essentially, backup mode places the database in a special mode where the contents of the data across the database can be seen as “consistent” and “recoverable”. When the snapshot is taken while in backup mode, the data, the control files, the data file headers and the redo logs are all in a consistent state. If you need to use this snapshot to recover the database, you simply use the snapshot images as the primary volumes and mount the database. Upon mounting, Oracle reviews the timestamps and contents of the control files, the data file headers and the redo logs and determines what point in time the data represents. The database then opens in a mode that allows further recovery: the database administrator can recover the database and apply archive logs (copies of historical redo logs) until an exact point in time is reached.
Obviously the number of archive logs applied is highly dependent on how frequently storage snapshots are taken. If you take a snapshot every 15 minutes, and your archive logs are written less frequently, you may not need to apply archive logs to get to the closest RPO (Recovery Point Objective) desired.
Best practice: Use both InfiniBox Snapshots AND RMAN backups
A good strategy is to use both InfiniBox snapshots and RMAN backups for a well-rounded backup and recovery solution that covers a wide range of recovery scenarios. Having both allows for a more customized restoration approach. We have seen customers use this mixture, with snapshots as the daily backup and near-line recovery option for full system restore, and a weekend RMAN full backup to allow piecemeal restoration as well as to support off-site backup media, which InfiniBox snapshots alone cannot achieve.
Database Cloning using Snapshots
To clone a database, one option is to use RMAN with what is called a “redirected restore”, which allows you to point the RMAN restore process to a different set of LUNs. Several steps are needed to ensure that proper device naming is used to restore the database to the right locations, but once in place, redirected restore is a very viable option. The time it takes to restore depends largely on how long it took to back up the database. With multiple channels, and if the backup was broken up with the section size parameter, the restore could take 30 minutes to 1 hour, as in the example above. If the database was backed up using default parameters, the restore could take hours.
Another, easier way to clone a database is to use InfiniSnap to take a snapshot of the database LUNs and mount it to another server for read/write use. InfiniBox Consistency Groups simplify the cloning of databases by allowing a single-action snapshot of a group of volumes simultaneously. These snapshot groups can then be mounted to another server as described above.
This type of snapshot usage saves a significant amount of time, because there is no need to wait for a restore process. The snapshot is an entirely storage frame-based activity capable of standing up these LUNs on another server. The Oracle database can be either file-system-based or ASM-based. Mounting the snapshot to the same server is not recommended because, particularly for an ASM-based database, the ASM disk headers hold specific information about the data being stored and the physical nature of the existing LUNs. This prevents the ASM instance from mounting the disk group as a different disk group than the one already mounted. There are techniques that involve scrubbing the header of each device, but it is best not to attempt them.
Take that snapshot and map the LUNs to another server running ASM (with Linux and oracleasmlib you must also add the device pointers to allow ASM to see the devices with the proper permissions). Once Oracle ASM completes the scan for new devices, it reads the headers of the new devices, recognizes from each header that the device belonged to an ASM disk group in the past, and marks it as a member rather than a candidate.
Since ASM now knows that the mapped devices belong to a disk group, you can immediately mount the disk group with the snapshots. Since the snapshots are writable, after copying over and creating the proper $ORACLE_BASE directory structure for the database and adding it to /etc/oratab, you can start up the database and create a listener entry for it.
If your organization has reservations about working with snapshots in a non-prod environment, you can use more traditional methods for copying the data such as an RMAN redirected restore, or an RMAN traditional restore to the new structure.
To clone a database on a remote InfiniBox storage system, simply take a snapshot of the replicated volumes on the secondary side of the replicated InfiniBox volume pair and mount that snapshot group to a remote server.
More about storage replication in the next section.
Database Replication vs. Storage Replication
Oracle provides a few tools to replicate database data to a secondary site. All of the tools allow for the database to be available within minutes on the secondary site. Each tool uses some form of transactional replication where database transactions are shipped from the primary site to the DR site.
Oracle Data Guard provides database replication and allows the secondary site to be either cold or read-only (Active Data Guard), where queries can be run against the secondary site. It requires that both databases are on the same version of Oracle. There are two versions: Oracle Data Guard, whose license is included in the purchase of Oracle Enterprise Edition, and Oracle Active Data Guard, an optional product with additional cost and more features. The most prominent feature of Active Data Guard is the ability to have both the target and source databases open for users. The target database is in read-only mode, but queries can be run on it to support data warehouse style application access. Oracle Active Data Guard allows fan-out replication from 1 primary to up to 30 targets.
Oracle 12c Active Data Guard introduces a new concept called Real-Time Cascade, which allows Oracle Data Guard to replicate from the primary to the secondary site, then from the secondary to a third site. This daisy-chain replication can be a very powerful add-on capability for environments that have more unique three-site requirements.
Oracle GoldenGate is a more comprehensive tool, allowing full two-way active-active replication and live read/write access to the target database as well as the source database. It also supports different versions of Oracle and different operating systems on either side of the DR pair. It is the most comprehensive database replication product Oracle has. It is an optional product with additional license cost.
InfiniBox includes the tools to replicate data at the storage level from one site to a second InfiniBox site through IP-based asynchronous or synchronous replication. Each individual Oracle datafile (either a file or a volume) can be replicated to the DR site, so that in the event of a disaster at the primary site, the data will already be at the secondary site. The data can then be presented to a series of servers on the DR side and the database can be brought up fairly quickly.
InfiniBox replication is included in the price of the storage. The unique advantage that InfiniBox provides is very fast and frequent sync intervals, achieving low or even zero RPO. This is the shortest RPO interval in the storage industry. It means that the data update delta between sites is smaller than with any other storage vendor, so there is less chance of corruption and a better chance of recovering to a point very near real time when a primary site failure occurs.
Recovery from storage-based long-distance replication is similar to recovering the database from a local snapshot taken without Oracle backup mode. A crash-consistent image is what will be available on the DR side in the event of recovery using storage replication. If a disaster occurs and the database must be started on the DR side, the database will go into crash recovery mode upon startup, rolling back any transactions not fully committed, reconciling the control file with the datafile headers and synchronizing all files to a specific database generation ID. Because this is a crash-consistent image of the database, there will be no opportunity to roll forward any archive logs to a point in time. For some customers this is acceptable, and storage replication is therefore a solution that satisfies both RPO and RTO requirements.
InfiniBox Consistency Groups (SnapGroups) are critical to remote data replication for database disaster recovery. Instead of replicating individual volumes, you can create a consistency group and replicate all objects within it to the DR site at one time. Because InfiniBox asynchronous replication is based on InfiniSnap technology, the entire contents of the consistency group are replicated with a consistent timestamp for each object in the group, assuring a consistent recovery. The DR target data is then just as consistent as a local snapshot of the source. The database objects within the consistency group have the exact same timestamp and provide the highest level of consistency for a database.
Here is a screen shot of the process of setting up consistency group replication for a set of Oracle volumes. With a single mouse click, all members of the consistency group are included in the replication setup. The storage admin simply points the replication to the specific remote InfiniBox system, points to the remote pool, sets the replication interval and RPO, and clicks “Create”. The whole process to start the initialization took under a minute. This example doesn’t show all storage elements required to replicate. Other items that should be included in the consistency group are redo logs, archive logs and executables.
Another option is to combine Active Data Guard and InfiniBox asynchronous replication for yet another three-site replication architecture: Data Guard from the primary to the secondary site, then storage replication from the secondary to the tertiary site.
Oracle ASM, crash consistency and recovery
When using Oracle ASM, a header is placed on each individual device belonging to a disk group. Within the header is a flag that allows Oracle to determine whether it is a candidate or member disk. If it is a member disk, the disk group that the device belongs to is also stored. Since InfiniBox replication is based on snapshots, to recover and start up the database at the target replication site you must have a server up and running with Oracle Grid Infrastructure already installed and an ASM instance running. Map the replicated volumes to the server (and, on Linux with oracleasmlib, add the device pointers as well), then start up asmca. Oracle will scan the device headers, see that they are member disks, and notice that they are already associated with a disk group. You can then just mount the disk group.
You do need to make sure that the replicated volumes are now master devices of the replicated pair, and when the original source InfiniBox comes back online, re-start the replication in the reverse direction. Once satisfied that all volumes are now back in sync, you now have the data in both sites in a consistent state, and can reverse the disaster recovery process to point the primary production applications back to their original pre-disaster state.
Ease of use
InfiniBox GUI and CLI are very easy to use
There are a couple of dimensions to Ease of Use. The more visible one is the incredibly easy to use InfiniBox GUI and command line interface. Storage administrators and database administrators will see that InfiniBox provides tools like the GUI, the CLI and the Host Power Tools that simplify the creation and management of storage for databases.
InfiniBox provides a management system that can isolate storage pools and volumes to specific users, providing multi-tenancy so that application users such as Oracle DBAs can manage their own storage pools, volumes and snapshots. This is important for shops that are moving to Oracle Automatic Storage Management (ASM) to store data and moving away from O/S-based file system storage. With ASM, the storage management function is mainly moved to the DBA support organization. With the strong user management functions of InfiniBox, Oracle DBAs can manage their own objects within one or more storage pools. All the storage administrator has to do is initially set up the pool and add the Oracle users to the InfiniBox management system to manage that pool.
InfiniBox Architecture Promotes Simplified Data Layout
The second dimension of ease of use comes from the storage architecture. Because each volume is broken up and its data is spread across all 480 spindles in the frame, there is no need to be concerned about RAID groups, hot spot management, volume size or the number of spindles in each RAID group. There is no need to create a large number of small volumes to spread the I/O load across more spindles. As a result, the best data layout is the simplest: use a small number of large LUNs for data. Choose a LUN size that best fits the growth needs of the database, rather than the performance limitations of the underlying storage. Most customers choose a LUN size of 500GB to 2TB so that when a new LUN is required, adding a LUN of that size to the database doesn’t waste too much space between new allocations. Here is a typical configuration for a 3TB database using Oracle ASM.
In this example, there are 3 separate LUNs for redo logs, each in its own ASM disk group, and 6 x 1TB LUNs for Oracle tablespaces in a single Data disk group. Note that we do not configure redo group mirroring in this example. Whether to create a mirrored copy of the redo groups is up to you. Since InfiniBox provides such a highly available storage platform for these LUNs, there is no need to mirror the log groups unless your specific Oracle administrative needs dictate it. Some customers still mirror log groups for administrative purposes. That is fine. Just note that mirroring log groups requires writes to both log groups, rather than a single write, before the write can be flagged as successful and complete.
There are considerations to explore if you store the data files on file systems rather than using Oracle ASM on raw devices. In some cases, a bottleneck can be introduced on the server at the O/S file system level. Some server environments have limitations on how much data can be pushed through a single file system. For Linux ext3 file systems, the limit is roughly 300MB/s. With Oracle ASM, the database directly manages data on the raw devices as presented by the storage. There is no intermediate layer between storage and database, and therefore you get maximum performance from database to storage. And, because you have chosen InfiniBox to store the data, there is no need to mirror the data disk groups for added protection. So, when you set up the disk group, choose “External” redundancy rather than “Normal” or “High”, which set up a 2-way or 3-way mirror of data across the specified set of LUNs.
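The layout described above (three redo disk groups with one LUN each and one data disk group with six LUNs, all using the “External” setting, which is EXTERNAL REDUNDANCY in ASM SQL) could be created with statements like those rendered by this sketch. The disk group names and the `/dev/oracleasm/disks/...` device paths are hypothetical placeholders; the actual disk strings depend on how the devices are presented to ASM on your host.

```python
def create_diskgroup_sql(name: str, disk_paths: list[str]) -> str:
    """Render a CREATE DISKGROUP statement that leaves protection to the array."""
    if not disk_paths:
        raise ValueError("a disk group needs at least one disk")
    disks = ",\n       ".join(f"'{path}'" for path in disk_paths)
    return f"CREATE DISKGROUP {name} EXTERNAL REDUNDANCY\n  DISK {disks};"


# Hypothetical device names for 6 x 1TB data LUNs and 3 redo LUNs.
data_luns = [f"/dev/oracleasm/disks/DATA{i}" for i in range(1, 7)]
redo_luns = [f"/dev/oracleasm/disks/REDO{i}" for i in range(1, 4)]

statements = [create_diskgroup_sql("DATA", data_luns)]
statements += [
    create_diskgroup_sql(f"REDO{i}", [lun]) for i, lun in enumerate(redo_luns, start=1)
]

print("\n\n".join(statements))
```

The statements would be run against the ASM instance, or the same choices can be made interactively in asmca.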
One other consideration is that when using Oracle ASM, as in this example, you can use ASM to move data from one storage frame to another by simply adding LUNs from another storage frame to the ASM Data disk group and running an ASM rebalance command. Once the command is complete, all of the data that was originally on the 6 LUNs will be spread across all LUNs in the disk group. Then you can flag the original 6 LUNs for removal and re-run a rebalance so that ASM moves the data to whatever LUNs remain in the disk group. All of this can be done online without downtime.
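The two-step migration described above maps onto two ALTER DISKGROUP statements. The sketch below only builds the statement text; the function name, disk labels and rebalance power are hypothetical, and the statements would be issued against the ASM instance.

```python
def migration_steps(diskgroup, new_disk_paths, old_disk_names, power=4):
    """Return the two ALTER DISKGROUP statements for an online migration."""
    add = ", ".join("'{}'".format(path) for path in new_disk_paths)
    drop = ", ".join(old_disk_names)
    return [
        # Step 1: add the LUNs from the new frame and spread data onto them.
        "ALTER DISKGROUP {} ADD DISK {} REBALANCE POWER {};".format(diskgroup, add, power),
        # Step 2: drop the original LUNs; the rebalance evacuates their data.
        "ALTER DISKGROUP {} DROP DISK {} REBALANCE POWER {};".format(diskgroup, drop, power),
    ]

# Hypothetical labels: two new LUNs in, two original LUNs out.
for statement in migration_steps("DATA",
                                 ["ORCL:NEWDATA001", "ORCL:NEWDATA002"],
                                 ["ORA11DATA001", "ORA11DATA002"]):
    print(statement)
```

Rebalance progress can be followed in the V$ASM_OPERATION view. The original LUNs should only be unmapped once the second rebalance has finished.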
Oracle I/O Profile and INFINIDAT storage synergy
Oracle uses a multi-process model to manage the database. Each process is responsible for specific tasks, and tasks can run simultaneously and asynchronously. This means that I/O to and from database components is performed in parallel, using a “fire and forget” approach to reads and writes. Oracle can ask for many large blocks of I/O to be read from the database structures on storage, and the process doesn’t have to wait for all requests to complete; it can move on to the next task. An internal table of I/O requests is managed by the process asking for the data, and when all of the I/O slots within the table have been answered or acknowledged, the entire I/O request is complete. Typically async I/O is performed with multi-block read and write operations: large table scans and large index scans for reads, and big block sequential writes from operations like deleting or truncating a table, or a typical ETL process for data warehouses where massive amounts of data are imported into the database.
Oracle uses several methods to retrieve large blocks of data. If Oracle deems that the data has significant reuse value (several processes are going after the same blocks), it will ask storage for a big chunk of data and place it in the Oracle buffer cache, which is part of the System Global Area, or SGA. That way, the next process asking for those blocks will find them in the Oracle buffer cache, rather than resorting to a storage request.
If the data is deemed a one-time-only request, from a table with a very low number of block accesses or very infrequent block access (Oracle reads through each block header to determine the date/time stamp and access info), it will perform what is called a direct read, which is sequential in nature and sends the data directly to the requesting process without caching those blocks. A lot of this type of activity indicates that the applications are reading most or all of the database footprint and are not re-using that data for other processes.
A fair amount of this type of activity is sequential in nature, although big data moves like these can look more random.
This type of access, big block sequential reads, is well suited to Infinidat. We have an extremely large main DIMM cache and a massive SSD cache to support these activities, backed by a sophisticated, analytics-driven pre-fetch mechanism that stays ahead of Oracle large block read requests. Most of our Oracle shops enjoy 90+% cache hit rates, from either cache. And if the database is mostly read intensive, eventually the SSD cache will contain the vast majority of the blocks read, and in some cases can contain the contents of the entire database, as each InfiniBox frame supports up to 86TB of SSD cache. This results in an overall read latency for tablespaces of under 5ms for even the busiest Infinidat frames. Typically, 1-2ms reads are seen.
Another large portion of the read activity is small block random reads. Typically, these are quick index reads, where a query asks for and gets only a single block of data (typically 8kb-16kb, depending on the block size chosen), randomly. Indexes are built for b-tree walking speed, not for storage optimization, so these small block I/Os are typically random in nature. This is where our SSD cache shines, particularly for hot indexes, where specific pieces of data are read over and over again in rather quick succession.
For writes, no storage platform supports databases better than Infinidat. Unlike all-flash arrays, which experience a write-cliff effect due to the large amount of housekeeping required, we can run line-speed writes all day long, faster than any hybrid array and most all-flash arrays, due to our patented log-write technology. Refer to the RMAN section of this document, which shows our Infinimetrics graphs of how we performed during an Oracle RMAN backup.
Here is an example of what an Oracle database running Swingbench OLTP workload looks like. This is a picture of our Infinimetrics showing the performance of the system supporting this Swingbench run. Note the read and write latencies. The SAN throughput graph shows spikes, which are log switch / archivelog writes.
Here is a picture of the Swingbench console running the 100 user OLTP test.
Host-based Configuration Guidelines
There are several items that need to be taken into consideration when configuring host operating systems to support Oracle Databases, particularly when connected to InfiniBox.
Some performance guidelines apply universally across all operating systems; here are some of them. The Infinidat Host Power Tools will adjust these by default, but it doesn’t hurt to understand what they do and what they should be set to.
Queue depth is the number of I/O commands, along with their blocks of data, that can be queued at the host bus adapter (HBA) at one time. Queuing at the HBA ensures that the application is free to send more I/O as soon as it can. This allows a high degree of work parallelization and makes massive asynchronous I/O possible. Oracle, along with SQL Server and DB2, performs both synchronous and asynchronous I/O depending on where and why the I/O is generated. Asynchronous I/O is when an application like Oracle batches up a group of data blocks and sends the entire group to storage as a single I/O request. For Oracle, the process that sends the request is most likely DBWR, the database writer process. DBWR scans the Oracle buffer cache for dirty blocks, consolidates a list of the addresses of those blocks, and sends the blocks to be written. It issues a single I/O request of many blocks to storage, signals the database that they are written, and then clears the list in the background as the acknowledgement for each written block is received from storage. This allows the database to immediately recycle the original blocks back to the free list for more buffered reads. DBWR can be very aggressive with the write list, with as many as several hundred blocks gathered and written in a fire-and-forget fashion. This high block count requires a queuing mechanism between the server and the storage, and that is where the HBA queue depth comes in. When the queues start to fill and reach the maximum queue depth set for the server (on AIX, there are separate queues for each LUN / hdisk), a stop request is issued to DBWR to stop sending data until the queue drains. This is not a desirable condition, as it delays how fast DBWR can evacuate dirty blocks.
When more free blocks are needed, DBWR becomes the choke point, slowing every other user process requesting buffer cache space. If Oracle senses that a shortage of free blocks is slowing all transaction activity, it will revert to backup methods like direct path reads, which are read requests sent to storage with the responses sent directly back to the user process, bypassing the Oracle buffer cache. There is no re-use of this data in this mode, which is not good, because Oracle relies on high data reuse to improve performance. Typically, a well tuned database will service 100 times more logical (buffered) reads through the Oracle buffer cache than physical reads. This reduction of physical reads improves end user performance and keeps the storage system from having to perform them.
Queue depths are changed when you install the Infinidat Host Power Tools. The setting for most operating systems is 128.
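To get a feel for what a queue depth of 128 allows, Little's Law (outstanding I/Os = IOPS x latency) gives an upper bound on the I/O rate a single queue can sustain. The sketch below is a back-of-the-envelope illustration only: the function name and the latency values are assumptions, and real throughput also depends on the HBA, the number of paths and the storage.

```python
def max_iops(queue_depth, latency_ms):
    """Upper bound on IOPS for one queue: depth divided by per-I/O latency."""
    if queue_depth <= 0 or latency_ms <= 0:
        raise ValueError("queue_depth and latency_ms must be positive")
    return queue_depth * 1000.0 / latency_ms

# Assumed service times in milliseconds, with the queue depth of 128 from the text.
for latency in (1.0, 2.0, 5.0):
    print("latency {} ms -> up to {:.0f} IOPS".format(latency, max_iops(128, latency)))
```

The same formula shows why a shallow queue stalls DBWR. At a fixed latency, halving the queue depth halves the number of writes that can be in flight.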
Red Hat Enterprise Linux (RHEL) I/O schedulers
The I/O scheduler plays an important part in supporting a very specific I/O profile for the host OS. There are three classifications of I/O scheduler available for RHEL 4, 5, 6 and 7.
cfq – Completely Fair Queueing. This is the default scheduler you will get when you install RHEL 4, 5 or 6; RHEL 7 defaults to deadline for most devices. The purpose of this scheduler is to ensure that the I/O profile of the application supported doesn’t overwhelm the underlying storage. Typically cfq is used for desktop/workstation uses, even when using RHEL. cfq paces I/O to and from storage on the assumption that there is a single dedicated hard drive supporting the workstation.
The other two choices are deadline and noop.
The Infinidat Host Power Tools set the scheduler to noop, so there is no need to make any changes.
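If you want to confirm the setting yourself, the kernel exposes the scheduler for each block device in /sys/block/<device>/queue/scheduler, with the active choice shown in square brackets. The following is a minimal sketch; the function names are illustrative, and the device name you pass in depends on your host.

```python
import re

def active_scheduler(sysfs_text):
    """Return the scheduler shown in [brackets], or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    return match.group(1) if match else None

def read_scheduler(device):
    """Read the scheduler file for a block device such as 'sda' or 'dm-27'."""
    with open("/sys/block/{}/queue/scheduler".format(device)) as handle:
        return active_scheduler(handle.read())

# Example file contents before and after the scheduler is changed.
print(active_scheduler("noop deadline [cfq]"))
print(active_scheduler("[noop] deadline cfq"))
```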
RHEL File system types
Two default file system types are available for RHEL 5 and higher: ext3 and ext4.
Typically ext3 will be used on RHEL5 and ext4 on RHEL6 and above.
A word about fragmentation. Both ext3 and ext4 fragment easily, as do many journal-based file systems. There are some tools available to measure the level of fragmentation, but not many available to fix it. The only known technique to remove fragmentation is to bring the application down, create a brand new file system, copy the data from the old file system to the new one, unmount the old file system, mount the new file system at the old file system’s mount point, and then bring the application up.
This is a time-consuming and painful process, but a necessary one, as both ext3 and ext4 can become heavily fragmented over time.
The other option would be to use file system types that simply fragment less. ReiserFS and XFS are two modern journaled file systems available for RHEL that reduce the probability of fragmentation down to 20%.
Ext4 and RHEL6 write I/O pacing
RHEL6 and ext4 introduce another layer of I/O pacing called write barriers. Again, the design assumption is that, by default, an application should not be allowed to overwhelm the underlying storage infrastructure. With InfiniBox, we don’t worry about writes, as all writes go to cache, and there is a very large cache available, along with a very elegant de-stage mechanism that uses a multi-modal log writing architecture to move modified blocks out of cache very quickly. When configuring RHEL6 or higher with ext4, or any other file system type, mount the file systems with the nobarrier option (mount -o nobarrier). This turns off the I/O pacing of the O/S and allows maximum write capabilities straight from application to storage.
You can determine if write barriers are turned off by using the mount command.
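The same check can be scripted against /proc/mounts, where each line lists the device, mount point, file system type and mount options. The sketch below makes some assumptions: depending on the kernel version, disabled barriers appear as either nobarrier or barrier=0, and the function names and sample mount points are hypothetical.

```python
def barriers_disabled(options):
    """True if the comma-separated mount options turn write barriers off."""
    opts = options.split(",")
    return "nobarrier" in opts or "barrier=0" in opts

def mounts_with_barriers(proc_mounts_text, fs_types=("ext3", "ext4")):
    """Return mount points of the given types that still have barriers on."""
    flagged = []
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        if fs_type in fs_types and not barriers_disabled(options):
            flagged.append(mount_point)
    return flagged

# Hypothetical /proc/mounts content; on a real host, read the file instead.
sample = (
    "/dev/mapper/mpath1 /u01 ext4 rw,nobarrier 0 0\n"
    "/dev/mapper/mpath2 /u02 ext4 rw,barrier=1 0 0\n"
)
print(mounts_with_barriers(sample))
```

Any mount point the function returns is still running with barriers enabled and would need to be remounted with the nobarrier option.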
Windows NTFS
When deploying Oracle on Windows with NTFS file systems (not using ASM), NTFS uses data clusters to store blocks of data in groups. The default block size is 4kb, which works well on InfiniBox. The Allocation Unit (AU) size, also called the extent size or cluster size, determines how NTFS groups data blocks together. When Windows submits an I/O to the storage subsystem, it normally uses this cluster size to access and pre-fetch data. A 64kb AU size matches the block size that InfiniBox uses to store and access data, so a 64kb AU size works best.
NTFS does fragment, as do all journal-based file systems, so defragment whenever possible. Oracle likes to update data in place, and this causes high fragmentation on NTFS. Fortunately, the defrag tool works really well here, and it requires no downtime, unlike other file systems such as ext3, ext4 and AIX jfs2.
AIX jfs2 and LVM
AIX uses a journaled file system called jfs2. Jfs2 on top of the AIX Logical Volume Manager is a powerful storage environment for Oracle databases.
AIX supports 2 types of data striping mechanisms. One is very fine grained and one is coarse grained.
PP spanning is a coarse-grained data layout and access technique that allows Oracle data to be laid out in chunks of up to 256MB in size for a single Physical Partition, or PP. PP spanning then takes each 256MB of data and stores it on each of the hdisks within a given volume group. Data is then accessed using this coarse-grained “striping” mechanism. This is the preferred method, as it allows the InfiniBox to sense the data access pattern and pre-fetch accordingly.
LVM striping is performed at the logical volume level. You can create a logical volume that spans multiple hdisks / LUNs so that many disks support a single logical volume. That logical volume is then formatted with jfs2 and mounted as a single mount point. The default stripe width at the LVM level is 128kb. This is not the recommended data layout for InfiniBox, as the individual stripes are seen as 128kb random I/O rather than the much larger block sequential I/O profile that a table scan or index range scan would present.
Oracle ASM
Oracle Automatic Storage Management (ASM) was introduced in Oracle 10g and is becoming more widely used by Oracle shops. There are several advantages and disadvantages to using ASM. ASM runs on every operating system that the Oracle database can run on, and it is managed in exactly the same way no matter what OS is used.
ASM provides a layer between the database and storage and acts as the Oracle database “file system”. It actually deals directly with raw devices, so it eliminates any issues and bottlenecks that Operating System file systems introduce.
The performance advantages are then very obvious. ASM also provides its own striping mechanism, to ensure that all LUNs / devices in each disk group are evenly used.
ASM also provides other very nice features, like the ability to “move” tablespaces off of one set of LUNs onto another without bringing the database down. This is very advantageous for Oracle DBAs, providing a layer of protection from storage migrations.
There are some disadvantages to using ASM. Setting up Oracle ASM is not trivial. You are essentially setting up a single node cluster, by installing the Oracle Grid Infrastructure software under the Database software.
Once ASM is set up, it is very easy to manage using the built-in tools such as ASMCMD, the X-based asmca tool, or Oracle Enterprise Manager (OEM).
Database backup using RMAN is the same for ASM as it is for file system-based Oracle databases.
Storage-based snapshot backup and recovery, particularly when using cloning techniques to stand up non-prod environments from an InfiniBox snapshot, is slightly different with ASM. ASM writes headers to each LUN within each disk group. If you take a snapshot and map it back to the same server from which it was taken, the Oracle ASM instance running on that server will notice that the headers found on those devices are already in use, and will not allow you to mount this disk group on the same server. If you have a different server running ASM, mapping the snapshot to it is fast and easy. Basically, you map the clones to the new host, configure them as Oracle ASM devices (with oracleasm on Linux), and start up asmca, which already sees that the devices make up a disk group, so you can immediately mount the disk group.
ASM and AU size
ASM is set up initially with an extent size, or Allocation Unit (AU) size, used for data layout. The default is 1MB. What this tells ASM is that for every 1MB of data stored on the disk group, an extent marker is placed. This has no direct bearing on read access and stripe width, but it does impact performance. When Oracle is scanning through the data and is told to read a larger block of data, an extent marker signals the end of the current extent, and Oracle must then issue another I/O to continue the read request until the request is satisfied. The larger the extent size, the less often an extent marker will be hit, and therefore the less physical I/O is performed for the same operation. The suggestion for the data disk group is therefore to use an AU larger than 1MB; 8MB or 16MB works very well on InfiniBox.
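The effect of AU size on I/O count can be illustrated with a little arithmetic. Under the model described above, where a read is split wherever it crosses an AU boundary, the number of I/Os for one contiguous read equals the number of AUs it touches. The function name and the 4MB read size below are assumptions for illustration.

```python
def ios_for_read(offset_bytes, length_bytes, au_bytes):
    """Count the I/Os needed when a read is split at every AU boundary."""
    if offset_bytes < 0 or length_bytes <= 0 or au_bytes <= 0:
        raise ValueError("offset must be >= 0; length and AU size must be > 0")
    first_au = offset_bytes // au_bytes
    last_au = (offset_bytes + length_bytes - 1) // au_bytes
    return last_au - first_au + 1

MB = 1024 * 1024
# An assumed 4MB contiguous read starting at the beginning of the disk group.
for au_mb in (1, 8, 16):
    print("AU {:>2}MB -> {} I/O(s)".format(au_mb, ios_for_read(0, 4 * MB, au_mb * MB)))
```

A read that happens to straddle an AU boundary still costs two I/Os even with a large AU, but the chance of that falls as the AU grows.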
Oracle Real Application Clusters (RAC)
Another advantage of ASM is that it works directly with Oracle Real Application Clusters (RAC), which requires a multi-initiator storage layer, such as a shared file system, that can be read and written by multiple physical hosts simultaneously. Without ASM, you will need a “cluster aware” file system like OCFS (Oracle Clustered File System), Veritas Storage Foundation, or IBM’s GPFS. Cluster awareness is the ability to allow multiple nodes in a cluster to read and write data stored in one location, while synchronizing those activities to ensure write order and preserve data integrity.
RAC increases the availability of the database by allowing the database to be supported by more than one server. If any of the host nodes in the cluster fails, the database remains up, supported by the remaining nodes in the host cluster complex. This is an active-active system, and it is widely used by shops requiring maximum uptime. One side effect of RAC is that the amount of I/O generated by a 3 node RAC environment is roughly 3 times more than if the database were run on a single node. The increased I/O works well on InfiniBox, and is something to consider when customers are planning for a RAC deployment.
Oracle ASM and RHEL using oracleasm
There is one extra step involved in supporting SAN LUNs with RHEL for use with Oracle ASM: Oracle ASMLib (oracleasm) must be installed on the RHEL server. It is required to create hard links to the real /dev/mapper and /dev/dm-* devices created by infinihost and the RHEL multipath software. These hard links allow Oracle ASM to see and own the devices as user oracle, group oinstall. You download the ASMLib software directly from oracle.com. Be mindful that there are different versions depending on the Linux kernel version you are using.
In this case, the uname -r command reveals that we are running an el5 (RHEL 5) kernel.
You then run the oracleasm tool to configure and initialize the service. Then run oracleasm createdisk <logical name> <physical device> for each LUN mapped by infinihost.
The oracleasm querydisk command allows you to view the physical nature of the device label, in this case ORA11DATA006.
In this example, the logical device ORA11DATA006 is made up of a hard link to /dev/mapper/mpath28, which points to /dev/dm-27. There are 12 subordinate devices that support /dev/mapper/mpath28, which are the 12 individual paths created by the multipath software, defined by the physical connections between the server and InfiniBox. You can check this by:
Note the major and minor number of both devices is the same.
The /etc/sysconfig/oracleasm file is the main configuration file for oracleasm:
The ORACLEASM_SCANORDER and ORACLEASM_SCANEXCLUDE entries are required and force oracleasm to build the hard links to the right devices. Without these two lines, oracleasm will pick the top path from each physical device and use it as the hard link. That would force all I/O traffic through a single path, rather than across all of the paths that have been configured between the host and the InfiniBox.
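A quick way to confirm that both entries are present is to parse the configuration file. The sketch below is illustrative only. The sample values "dm" (scan device-mapper multipath devices first) and "sd" (exclude the single-path sd devices) are commonly used settings but are assumptions here, not values taken from this document, and the function names are hypothetical.

```python
REQUIRED_KEYS = ("ORACLEASM_SCANORDER", "ORACLEASM_SCANEXCLUDE")

def parse_sysconfig(text):
    """Parse KEY=value lines, ignoring blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings

def missing_scan_settings(text):
    """Return the required scan keys that are absent or empty."""
    settings = parse_sysconfig(text)
    return [key for key in REQUIRED_KEYS if not settings.get(key)]

# Assumed example content for /etc/sysconfig/oracleasm.
sample = """
ORACLEASM_ENABLED=true
ORACLEASM_UID=oracle
ORACLEASM_GID=oinstall
ORACLEASM_SCANORDER="dm"
ORACLEASM_SCANEXCLUDE="sd"
"""
print(missing_scan_settings(sample))
```

An empty list means both scan settings are populated. Any key the function returns still needs a value before oracleasm scans the disks.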
Oracle Databases on NFS
Running Oracle databases on NFS storage is an emerging option for customers seeking to simplify their Oracle database storage environments. The simplicity comes in several forms.
- Simplified data layout. A single, or two NFS mount(s) to store all Oracle data is much simpler than an ASM environment using 10-20 LUNs to support a database.
- Removal of Fibre Channel complexity from the server. No more need for FC zoning, multipath, or complex switch gear. Removal of extra FC HBA hardware from the server.
- Good enough performance. 2-4ms reads, 1-2ms writes. Not as fast as raw devices on ASM, but in most cases, this level of performance provides sufficient support for most workloads.
- Tighter consolidation of databases on servers, on storage.
This is a relatively new concept, one that Oracle has been trying to sell to its customers with the ZFS storage appliance. The adoption rate for Oracle on NFS has been slow, primarily due to performance and availability issues on other NAS platforms. InfiniBox removes these obstacles for Oracle on NFS by providing the same seven nines of availability, and excellent performance, using the same architecture that supports block storage.
The process of implementing Oracle on NFS is simple. Set up storage by creating a NAS file system, rather than block storage, on InfiniBox; export the file systems to the hosts; and mount them on each host using the Oracle-suggested NFS mount options. With the Oracle dNFS client enabled on the host running the database software, Oracle can perform the same direct, asynchronous, non-buffered I/O that it uses with Fibre Channel devices. Standard NFS v3 file shares are supported, and the dNFS client is available for just about any host operating system.
Oracle performed their own tests comparing the standard host-based NFS client with their dNFS client, and the performance difference was fairly significant.
Since InfiniBox 2.2 natively supports NFSv3, it makes perfect sense to use it as a replacement for the ZFS storage appliance, since InfiniBox characteristics go far beyond those of the ZFS appliance. The ZFS appliance uses a legacy dual controller architecture with a RAID set data layout. To provide the same dual drive failure protection that InfiniBox provides by default, you need to create many triple mirror RAID sets that span the 1.5PB raw capacity of the ZS4-4 Enterprise, reducing the usable capacity to less than 500TB.
Here are the results of a Swingbench OLTP test comparing a block storage ASM database, using 8 x 1TB LUNs, with an Oracle database on a single NFS mount point. Both databases were the same size, about 4.7TB.
The Swingbench transaction rates were fairly close, 194k TPM for ASM, 177k TPM for NFS.
Swingbench transaction response times were also very similar, 31ms for ASM, 33ms for NFS.
From the AWR report, Oracle reports I/O latency for both ASM and NFS. In the AWR report, this metric is the I/O response time to the SOE tablespace. This is where the difference between ASM and NFS shows: the data shows that I/O latency is 3x higher on NFS. However, based on the actual transaction profile, this resulted in little overall difference to application throughput in terms of transactions executed, or to overall Swingbench transaction latency. This is where “good enough” performance is really good enough.
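To put the reported numbers side by side, the short calculation below expresses the NFS results as a percentage change relative to ASM, using only the figures quoted above (194k vs. 177k TPM, 31ms vs. 33ms). The function name is illustrative.

```python
def pct_change(baseline, other):
    """Percentage change of 'other' relative to 'baseline'."""
    return (other - baseline) / baseline * 100.0

tpm_change = pct_change(194000, 177000)   # transactions per minute, ASM vs. NFS
resp_change = pct_change(31, 33)          # transaction response time in ms

print("TPM change on NFS: {:.1f}%".format(tpm_change))
print("Response time change on NFS: {:.1f}%".format(resp_change))
```

So a 3x difference in I/O latency at the tablespace level shows up as a single-digit percentage difference at the transaction level in this test.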
There are a couple of white papers published by Oracle on how to mount the file systems.
This is what was used for the test:
ibox1082-nfs2:/ora11fsdata1 on /u01/app/oracle/nfs/data1 type nfs (rw,bg,hard,nolock,rsize=32768,wsize=32768,addr=172.19.0.78)
ibox1082-nfs4:/ora11fslog1 on /u01/app/oracle/nfs/log1 type nfs (rw,bg,hard,nolock,rsize=32768,wsize=512,addr=172.19.0.80)
ibox1082-nfs5:/ora11fslog2 on /u01/app/oracle/nfs/log2 type nfs (rw,bg,hard,nolock,rsize=32768,wsize=512,addr=172.19.0.81)
ibox1082-nfs1:/ora11fslog3 on /u01/app/oracle/nfs/log3 type nfs (rw,bg,hard,nolock,rsize=32768,wsize=512,addr=172.19.0.77)
To enable the dNFS client, you must shut down the database:
This make command replaces the standard Oracle ODM library with the Direct NFS ODM library.
Start up the database:
Review the database alert log and note that there are mentions of Direct NFS as part of the startup process.
Oracle instance running with ODM: Oracle Direct NFS ODM Library Version 2.0.
Data Reduction
Starting with the 3.0 version of InfiniBox, data reduction is included with the software. You can now create volumes on Infinidat as compressible, SSD-supported, thin provisioned volumes for Oracle consumption. Oracle does not know or care that the data underneath is compressed. With traditional host-based compression capabilities, such as Oracle’s Hybrid Columnar Compression and compression of RMAN backups using the RMAN compression engine, the host CPU is used to compress the data in-line as it is written to storage. This has the unfortunate effect of increasing the core count needed on the server to support this effort, which increases Oracle software licensing costs, which are charged by the server core.
InfiniBox uses a unique compression engine that compresses the data as it is being destaged out of memory, rather than attempting to compress it before it gets into DRAM or SSD. The end result is that write performance is not affected when writing to a compressed volume. Writes on InfiniBox, as you may already know, are gathered, staged and executed at a regular interval, typically once every 5 minutes. This ensures that the data is written smartly and more efficiently. The data is not compressed in cache. Not to worry, we have a ton of cache.
The performance penalty is paid when a read miss occurs. A read miss on InfiniBox results in a spindle read, which on a compressed volume involves reading the blocks, uncompressing the data, and then placing it in DRAM and SSD.
Since Infinidat has such a high cache hit rate, typically in the mid-90% range, the end result is very little difference in performance.
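The reasoning can be made concrete with a weighted average: effective read latency = hit rate x hit latency + miss rate x miss latency. In the sketch below, the 95% hit rate reflects the mid-90% range mentioned above, while the 1ms hit latency and the 10ms and 12ms miss latencies are purely assumed values for illustration, not measurements.

```python
def effective_read_latency(hit_rate, hit_ms, miss_ms):
    """Average read latency given a cache hit rate and hit/miss latencies."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms

# Assumed latencies: the compressed volume pays extra only on a read miss.
plain = effective_read_latency(0.95, 1.0, 10.0)
compressed = effective_read_latency(0.95, 1.0, 12.0)
print("uncompressed: {:.2f} ms, compressed: {:.2f} ms".format(plain, compressed))
```

Because only the small miss fraction pays the decompression cost, even a noticeable per-miss penalty moves the average very little.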
Here is a test run of Swingbench against a non-compressed set of volumes on an F6240, compared with a set of compressed volumes on a smaller F1130. The systems are vastly different in spindle count (480 for the F6240 vs. 60 for the F1130) and SSD cache size (86TB for the F6240 vs. 23TB for the F1130). The I/O workload does not translate to a high workload for the storage, but the larger SSD cache and spindle count should give the F6240 a performance advantage.
To use compressed volumes, you must create them on Infinibox as compressed, and they must be thin volumes.
The Swingbench database used in the testing uses 8 x 1TB volumes for the data disk group, 3 x 5GB volumes for 3 individual redo log groups, 1 x 2TB volume for the archive destination disk group, all volumes are ASM disk group based.
The RMAN backup test volumes were 4 x 2TB volumes mounted as Linux ext3 file systems. RMAN was told to create 4 channels, one for each file system, and to break up the backup pieces into chunks no larger than 10GB. This ensured maximum parallelization (set to 4) for the backup.
The data disk group compressed the best, at 2.5:1. All other objects compressed at 1.7:1.
Here is a screen shot of the Swingbench console showing the overview screen showing transactions per minute, average transaction response time and transactions per second for the uncompressed run.
Here is a screen shot of a Swingbench run against a set of compressed volumes.
The difference is negligible. Roughly the same number of transactions per minute between the two runs. The average transaction latency is roughly the same as well. These are actual Swingbench transactions.
The actual data compression efficiency (based upon Infinidat GUI) breakdown is as follows:
|Data / indexes||2.5:1|
|redo / archive logs||1.7:1|
|RMAN backup pieces||1.7:1|
Each of the panels identifies a volume type on InfiniBox with the data reduction status showing in the lower right corner of each box.
The AWRs show very little difference between the uncompressed (top) and compressed (bottom) top 5 wait events. The majority of the work done by this Swingbench test was small block index reads, or to InfiniBox, small block random reads.
Uncompressed:
|Event||Waits||Time (s)||Avg wait (ms)||% DB time||Wait Class|
|db file sequential read||127,502,017||268,446||2||78.23||User I/O|
|log file sync||12,781,586||15,939||1||4.64||Commit|
|library cache: mutex X||801,753||2,811||4||0.82||Concurrency|
|latch: cache buffers chains||550,664||1,026||2||0.30||Concurrency|
Compressed:
|Event||Waits||Time (s)||Avg wait (ms)||% DB time||Wait Class|
|db file sequential read||125,059,497||266,664||2||78.89||User I/O|
|log file sync||11,782,395||15,468||1||4.58||Commit|
|library cache: mutex X||692,545||3,432||5||1.02||Concurrency|
|latch: cache buffers chains||596,214||1,208||2||0.36||Concurrency|
Here are the tablespace data file stats, with uncompressed on top and compressed on the bottom. Note the similar number of reads executed, as well as the recorded read latency, which doesn’t even register. For writes, latency is slightly higher.
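The rounded "Avg wait (ms)" column hides how close the two runs are. Dividing total wait time by the number of waits gives the unrounded averages. The sketch below uses the figures from the AWR rows above, with the first wait count read as 127,502,017; the function and variable names are illustrative.

```python
def avg_wait_ms(waits, time_s):
    """Average wait per event in milliseconds: total seconds over wait count."""
    if waits <= 0:
        raise ValueError("waits must be positive")
    return time_s * 1000.0 / waits

# (waits, time in seconds) for the two dominant events in each run.
runs = {
    "uncompressed": {
        "db file sequential read": (127502017, 268446),
        "log file sync": (12781586, 15939),
    },
    "compressed": {
        "db file sequential read": (125059497, 266664),
        "log file sync": (11782395, 15468),
    },
}

for run, events in runs.items():
    for event, (waits, time_s) in events.items():
        print("{:<13} {:<24} {:.2f} ms".format(run, event, avg_wait_ms(waits, time_s)))
```

The single-block read averages differ by only a few hundredths of a millisecond between the two runs, which is consistent with the transaction-level results.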