Wednesday, October 30, 2013

Oracle Exadata -- Infiniband -- OFED -- Delivering messages by bypassing the kernel, avoiding those interrupts, and using RDMA to place data directly into memory..

InfiniBand has been deployed in Oracle Exadata Database Machine, Oracle Exalogic Elastic Cloud, Oracle SPARC SuperCluster and more. It is used in these engineered systems mainly for high performance clustering: both for the database-to-storage connections and for the RAC interconnect.

In Exadata X2, each server has one dual-port PCIe 2.0 HCA with two InfiniBand 4x QDR (40 Gbit/s) ports. With these two ports, each server can be connected to two different InfiniBand switches.

There are several types of InfiniBand link: single data rate (SDR), double data rate (DDR), quad data rate (QDR), fourteen data rate (FDR), and enhanced data rate (EDR).

The following table shows the effective data rate of each link width and type:

Width | SDR       | DDR       | QDR       | FDR-10         | FDR           | EDR
1X    | 2 Gbit/s  | 4 Gbit/s  | 8 Gbit/s  | 10.3125 Gbit/s | 13.64 Gbit/s  | 25 Gbit/s
4X    | 8 Gbit/s  | 16 Gbit/s | 32 Gbit/s | 41.25 Gbit/s   | 54.54 Gbit/s  | 100 Gbit/s
12X   | 24 Gbit/s | 48 Gbit/s | 96 Gbit/s | 123.75 Gbit/s  | 163.64 Gbit/s | 300 Gbit/s

The Oracle Exa family uses 40 Gbit/s InfiniBand QDR with Sun switches. 40 Gbit/s raw becomes a 32 Gbit/s effective data rate because SDR, DDR and QDR links use 8b/10b encoding -- every 10 bits sent carry 8 bits of data -- making the effective data transmission rate four-fifths the raw rate.
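The four-fifths arithmetic is easy to check. Here is a quick Python sketch (the function name is mine, just for illustration):

```python
# Effective InfiniBand data rate = lanes * raw rate per lane * encoding
# efficiency. SDR/DDR/QDR links use 8b/10b encoding, so only 8 of every
# 10 bits on the wire carry data.

def effective_rate_gbps(lanes, raw_per_lane_gbps, data_bits=8, coded_bits=10):
    """Effective data rate in Gbit/s for a multi-lane InfiniBand link."""
    return lanes * raw_per_lane_gbps * data_bits / coded_bits

# 4x QDR as used in Exadata X2: 4 lanes x 10 Gbit/s raw = 40 Gbit/s raw
print(effective_rate_gbps(4, 10))    # 32.0 -- the effective rate quoted above
# 1x SDR: 2.5 Gbit/s raw per lane
print(effective_rate_gbps(1, 2.5))   # 2.0, matching the table
```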
On the Exa family, the InfiniBand fabric is used to connect the compute nodes with the storage nodes.
InfiniBand provides transport protocols in hardware and Direct Memory Access capability, so the interconnected systems have more CPU available for processing. Actually, we are talking about Remote Direct Memory Access (RDMA), which provides direct and efficient access to host or client memory without involving processor overhead.

InfiniBand technology on the server side is provided by Host Channel Adapters (HCAs), which are connected through a PCI Express slot.
All of the InfiniBand functionality is supplied by the HCA hardware -- the server's CPU is not used for InfiniBand transport.

Some notes about this kind of offloading:

Normally/traditionally, when a network data transfer occurs, the network interface card receives the data packet and interrupts the server's CPU. The server's CPU extracts the data from the network packet and writes it to memory, where the relevant applications can reach it.
If a network packet needs to be sent, again, the server's CPU copies the data from memory to the network buffer.
In InfiniBand, data packets are moved directly to memory without any intervention from the host processor. Also, unlike traditional software-based transport protocol processing, InfiniBand provides hardware support for all of the services required to move data between hosts (RDMA).
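To make the contrast concrete, here is a minimal Python sketch of the traditional (non-RDMA) path described above, where the CPU copies the payload between kernel buffers and application memory on every transfer. This uses ordinary sockets, nothing InfiniBand-specific:

```python
import socket

# A connected socket pair stands in for the NIC + kernel network stack.
a, b = socket.socketpair()

payload = b"data block from storage"
a.sendall(payload)            # CPU copies user data into kernel buffers

buf = bytearray(64)           # application-side memory
n = b.recv_into(buf)          # CPU copies kernel data into the app buffer
print(bytes(buf[:n]))         # the payload arrives via two CPU copies

a.close()
b.close()
```

With RDMA, the HCA instead places the payload directly into pre-registered application memory, so neither of these per-message CPU copies (nor the per-packet interrupt handling) is needed.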
Oracle employs the OpenFabrics (OFED) driver stack for InfiniBand in Exadata. Of course, Oracle has made some improvements to the OFED stack to make it Exadata-ready.
About Open Fabrics: The OpenFabrics Alliance (OFA) develops, tests, licenses, supports and distributes OpenFabrics Enterprise Distribution (OFED™) open source software for high-performance networking applications that demand low latency and high scalability.
Let's take a look at the following diagram;

Here you see the layers of the InfiniBand stack..

When you look at the upper level protocols section: OpenFabrics supplies several protocols for different application needs. So, through these upper level protocols, various application types can take advantage of InfiniBand-accelerated technology.

The OpenFabrics Upper Level Protocols are:

IPoIB -> IP over InfiniBand
SDP -> Sockets Direct Protocol; bypasses the TCP stack for TCP sockets
EoIB -> Ethernet over InfiniBand; a network interface implementation over InfiniBand that enables routing of packets from the InfiniBand fabric to a single GbE or 10 GbE subnet
SRP -> SCSI RDMA Protocol; for tunneling SCSI request packets over InfiniBand
iSER -> iSCSI Extensions for RDMA; eliminates the traditional iSCSI and TCP bottlenecks by enabling zero-copy RDMA
NFS over RDMA -> extends NFS to take advantage of the RDMA features of InfiniBand and other RDMA-enabled fabrics
RDS/ZDP -> Reliable Datagram Sockets; provides reliable transport services and can be used by applications such as Oracle RAC for both interprocess communication and storage access

On top of InfiniBand, Exadata uses the ZDP protocol. ZDP is open source software developed by Oracle. It is like UDP, but more reliable. Its full technical name is RDS (Reliable Datagram Sockets) v3. The ZDP protocol has very low CPU overhead, with tests showing only 2 percent CPU utilization while transferring 1 GB/sec of data.
Using RDS, Oracle's internal interconnect test tool shows:
50% less CPU than IP over IB or UDP
half the latency of UDP (no user-mode acks)
50% faster cache-to-cache Oracle block throughput
With this implementation, the interconnect looks like normal Ethernet to host software.
All IP-based tools work transparently -- TCP/IP, UDP, HTTP, SSH..

That is it for now, but I will write more about this topic when I have time for the research...


  1. This is common in any communications protocol. In TCP/IP (with TCP offload or full kernel bypass), each of these units of information is called a "datagram", and data sets are sent as separate messages.

  2. Thank you for your concern.. I didn't mention those TCP offloads. This comment of yours motivated me to do some research on them; as I see it, they are different. InfiniBand, TOE, etc. seem to be built for the same purpose, but they operate differently. Each has its advantages and disadvantages, it seems.
    For example, the following is from the RDMA Consortium FAQ, April 29, 2003:
    TCP Offload Engines remove much of the TCP/IP protocol processing burden from the main CPU. However, the ability to perform zero copy of incoming data streams on a TOE is very dependent on the TOE design, the operating system's programming interface, and the application's communication model. In many cases, a TOE doesn't directly support zero copy of incoming data streams. RDMA directly supports a zero-copy model of incoming data over a wider range of application environments than a TOE. The combination of TCP offload and RDMA in the same network adapter is expected to provide an optimal architecture for high-speed networking with the lowest demands on both CPU and memory resources.


