Digital storage is an important element of the Open Compute Project, and there were many announcements and presentations focused on it at the 2019 OCP Summit. Digital storage is also an important enabler of advanced applications such as machine learning training and inference accelerators for putting AI to work. We will go over some of the presentations and announcements from the 2019 OCP Summit.
As Vijay Rao from Facebook discussed, his company has a big focus on using AI for video/image applications. He said that CPUs and GPUs are not enough and that custom accelerators are coming into wider use. The original Glacier Point was an NVMe storage module for Yosemite, fitting six NVMe SSDs into the same form factor as the Intel Twinlake accelerator modules. Glacier Point v2 has 12 M.2 interfaces for a mix of accelerators and SSDs in Yosemite 2.0. Using a mix of processor (e.g. the Intel NNP I-1000) and storage M.2 modules, Facebook plans to use Yosemite for compute, video and AI inference applications. The figure below shows how Yosemite is put together.

Vijay also displayed an image of upcoming neural networks and the activities that dominate each of these different workloads. He pointed out that the OCP Angel’s Landing box with 8 CPUs addresses memory-bandwidth- and memory-capacity-dominated machine learning workloads, while the Emerald Pool box addresses compute- and communication-dominated workloads.
Nav Kankani and Vijayan Rajan from Facebook spoke about hardware and software co-design to achieve predictable IO latency. The figure below shows their view of today’s storage/memory types, going from the core through cache layers, DRAM, flash, HDDs and magnetic tape.

They also pointed out that as the density of NAND flash has increased, the IOPS/TB has decreased. With fewer dies per TB there is an increase in IO latency and therefore less predictable system latency, as shown below.

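As a rough illustration of that trend, the short calculation below (a sketch using assumed numbers, not figures from the talk) shows how fewer, larger dies per terabyte mean less parallelism available to service random reads, and therefore fewer IOPS per TB.

```python
# Rough illustration with assumed numbers: as NAND die capacity grows,
# fewer dies sit behind each TB, so fewer operations can run in parallel
# and per-die queueing (and thus latency) grows.

DIE_CAPACITIES_GB = [32, 64, 128, 256]   # example die capacities
READ_IOPS_PER_DIE = 10_000               # assumed per-die random read capability

for die_gb in DIE_CAPACITIES_GB:
    dies_per_tb = 1024 / die_gb
    iops_per_tb = dies_per_tb * READ_IOPS_PER_DIE
    print(f"{die_gb:>4} GB die: {dies_per_tb:5.1f} dies/TB, "
          f"~{iops_per_tb:,.0f} IOPS/TB")
```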
With the massive levels of data sharding, fetching data incurs a large fanout on the system backend, resulting in reads from several servers and many different pieces of data from each server. The result is a wide degree of variability (e.g. asymmetry in read and write access patterns) in hyperscale workloads. Because of the amplification from user requests to backend requests, long tail latencies matter more to overall performance than average latency, and the more backend requests a user request generates, the greater the probability of hitting a high tail latency. Large-scale distributed systems nevertheless need predictable latency despite the unpredictable latency in the storage stack.
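The impact of that fanout on tail latency follows from a simple probability argument; the sketch below uses an assumed per-server slow-response rate to show that even rare per-server slowness becomes the common case once a request touches many servers.

```python
# If each backend server misses its latency target with probability p,
# a request that fans out to n servers is slow whenever at least one
# backend is slow: P_slow = 1 - (1 - p)**n.

P_SLOW_PER_SERVER = 0.01   # assumed 99th-percentile miss rate per backend

for fanout in (1, 10, 50, 100):
    p_request_slow = 1 - (1 - P_SLOW_PER_SERVER) ** fanout
    print(f"fanout {fanout:>3}: {p_request_slow:.1%} of user requests hit the tail")
```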
In order to create this predictable system latency, several things are done at the hardware layer, including parallel operation paths, priority queues, and new device features such as write/erase suspend. One must also be able to isolate streams and NVMe Sets, set maximum read recovery limits, and define and maintain predictable latency modes.
At the software layer, optimizing for predictable latency requires shard management and rebalancing, pooling and striping, block and page caching, tiering using persistent memory (also known as storage class memory, SCM), write coalescing and dynamic resizing. There are many trade-offs to make among various performance parameters and power management to enable more predictable latency.
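As one illustration of those software techniques, the sketch below shows a minimal, hypothetical write-coalescing buffer: small writes are accumulated and issued to the device as fewer, larger writes, which is friendlier to flash and makes device latency easier to predict.

```python
# Minimal write-coalescing sketch (illustrative only, not Facebook's code):
# accumulate small writes and issue them as one larger device write once a
# size threshold is reached.

class WriteCoalescer:
    def __init__(self, device_write, flush_bytes=128 * 1024):
        self.device_write = device_write   # callable that performs the device write
        self.flush_bytes = flush_bytes     # coalesce up to this many bytes
        self.pending = bytearray()

    def write(self, data: bytes):
        self.pending += data
        if len(self.pending) >= self.flush_bytes:
            self.flush()

    def flush(self):
        if self.pending:
            self.device_write(bytes(self.pending))   # one large write to the device
            self.pending.clear()

# Example: 1,000 small 512-byte writes become a handful of ~128 KB writes.
issued = []
coalescer = WriteCoalescer(device_write=issued.append)
for _ in range(1000):
    coalescer.write(b"\x00" * 512)
coalescer.flush()
print(f"{len(issued)} device writes instead of 1000")
```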
In addition to HW/SW co-design, knowledge of the application domain opens new optimization opportunities and architectures. Facebook’s OCP flash-based hardware is shown below.

The speakers gave key value stores (KVS) as an application example. Key value stores incur high write amplification and a lot of data sharding. There are huge differences in bandwidth: 120 MB/s of reads and writes of small keys and values (256 bytes) vs. 3,000 MB/s of disk reads and writes. Keeping the amplified I/O local saves network bandwidth and improves latency, especially tail latency.
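The value of keeping that amplified I/O local can be seen with a rough calculation based on the quoted bandwidths (my arithmetic, not the presenters'): roughly 25 MB of disk traffic is generated for every MB of user-visible key/value traffic, traffic that would otherwise have to cross the network.

```python
# Rough amplification estimate from the bandwidths quoted in the talk.
user_kv_mb_s = 120    # small key/value reads and writes seen by clients
disk_mb_s = 3000      # resulting disk reads and writes (compaction, etc.)

amplification = disk_mb_s / user_kv_mb_s
print(f"I/O amplification: ~{amplification:.0f}x")
print(f"Network bandwidth avoided by keeping the amplified I/O local: "
      f"~{disk_mb_s - user_kv_mb_s} MB/s")
```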
Key/value stores are CPU- and DRAM-hungry. Lightning JBOF-based designs achieve good sharing and great capacity management. This is perfect for block protocols, but difficult for running RocksDB: 1 JBOF + 5-15 RocksDB hosts works well, but 1 JBOF + 5-15 RocksDB instances + 5-15 DB hosts is extremely unbalanced. One needs a flexible combination of CPU + DRAM + SSD.
For RocksDB use cases, better ratios are achieved with two NVMe SSDs per server in a Yosemite chassis: 1 CPU + X DRAM + 4-8 TB of SSD. Compare this against a JBOF-based design with 2 CPUs + Y DRAM + 60-240 TB of SSD.
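A quick calculation of the SSD-capacity-per-CPU ratios implied by those two configurations shows why the Yosemite approach is better balanced for RocksDB (the arithmetic below simply restates the figures above).

```python
# SSD terabytes per CPU for the two configurations described above.
yosemite_tb_per_cpu = (4 / 1, 8 / 1)     # 1 CPU + 4-8 TB SSD
jbof_tb_per_cpu = (60 / 2, 240 / 2)      # 2 CPUs + 60-240 TB SSD

print(f"Yosemite: {yosemite_tb_per_cpu[0]:.0f}-{yosemite_tb_per_cpu[1]:.0f} TB of SSD per CPU")
print(f"JBOF:     {jbof_tb_per_cpu[0]:.0f}-{jbof_tb_per_cpu[1]:.0f} TB of SSD per CPU")
```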
According to the presenters, HW re-architecting goes hand-in-hand with SW reconfiguration. At scale, getting efficiency is hard. What is needed is a flexible set of building blocks: the right ratios of CPU, DRAM and SSD within each server, connected with low-cost, high-speed networks, and architectures customized to be application aware.
Manoj Wadekur from Facebook and Anjaneya “Reddy” Chagam from Intel spoke about disaggregated storage architecture. The figure below compares local attached storage with disaggregation at the compute node (logical disaggregation) and at a storage head node (physical disaggregation). In both cases disaggregation results in more efficient utilization of storage resources.

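The utilization argument is easy to quantify. The sketch below uses assumed numbers (not from the presentation): with local-attached flash, capacity stranded on lightly loaded servers cannot be lent to busy ones, while a disaggregated pool only needs to be sized for aggregate demand plus headroom.

```python
# Assumed example: 10 servers, each provisioned with 8 TB of local flash,
# but with uneven actual demand per server.
local_tb_per_server = 8
demand_tb = [2, 7, 3, 8, 1, 6, 4, 2, 5, 3]   # assumed per-server demand

local_provisioned = local_tb_per_server * len(demand_tb)
local_utilization = sum(demand_tb) / local_provisioned

# Disaggregated pool sized for total demand plus 20% headroom.
pool_provisioned = sum(demand_tb) * 1.2
pool_utilization = sum(demand_tb) / pool_provisioned

print(f"Local-attached: {local_provisioned} TB provisioned, "
      f"{local_utilization:.0%} utilized")
print(f"Disaggregated:  {pool_provisioned:.0f} TB provisioned, "
      f"{pool_utilization:.0%} utilized")
```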
NVM Express (NVMe) is the standardized interface designed around the capabilities of fast non-volatile memory and storage. NVMe, based on the fast PCIe computer bus, is rapidly displacing older SSD interfaces such as SATA and SAS. NVMe has low latency due to its direct CPU connection and needs no host bus adapter, resulting in lower power and lower cost. NVMe SSDs are now available in a great many form factors, such as PCIe add-in cards (AIC), SFF-8639 (U.2), M.2, SATA Express and BGA.
With the latest NVMe standard release, NVMe can work over an Ethernet fabric, as shown below. This enables sharing of NVMe flash storage over the network using traditional block protocols (e.g. iSCSI) or NVMe-optimized protocols (e.g. NVMe/TCP). NVMe over Fabrics (NVMe-oF) supports Fibre Channel and InfiniBand transports in addition to Ethernet.

Disaggregating flash storage enables independent scaling of compute and storage resources for cloud workloads, and using NVMe over TCP/IP enables disaggregation of flash storage without requiring changes to the networking infrastructure.
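As a concrete illustration, attaching a remote NVMe/TCP namespace on a Linux host is typically a single nvme-cli command over the existing Ethernet network; the sketch below wraps that command in Python and uses a hypothetical target address and subsystem NQN (it also assumes nvme-cli is installed and the nvme-tcp kernel module is available).

```python
# Sketch: attach a remote NVMe/TCP namespace with the Linux nvme-cli tool.
# The target address and NQN are hypothetical placeholders; run as root.
import subprocess

TARGET_ADDR = "192.0.2.10"                      # hypothetical target IP
TARGET_NQN = "nqn.2019-03.example:flash-pool1"  # hypothetical subsystem NQN

subprocess.run(
    ["nvme", "connect",
     "--transport=tcp",
     f"--traddr={TARGET_ADDR}",
     "--trsvcid=4420",          # conventional NVMe/TCP port
     f"--nqn={TARGET_NQN}"],
    check=True,
)

# The remote namespace now appears as a local /dev/nvmeXnY block device.
subprocess.run(["nvme", "list"], check=True)
```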
Mellanox (which announced its coming acquisition by NVIDIA just before the OCP Summit) is playing an important role in composable infrastructure using NVMe-oF. At the OCP Summit it announced its NVMe SNAP (Software-defined Network Accelerated Processing) storage virtualization solution for public cloud, private cloud and enterprise computing.
The company’s NVMe SNAP solution is said to eliminate the inefficiency of local storage while addressing a growing need for compute and storage disaggregation and composability at cloud scale. NVMe SNAP enables faster adoption of NVMe over Fabrics (NVMe-oF) in different data center environments by enabling seamless integration into almost any server with any operating system or hypervisor, effectively enabling immediate deployment of NVMe-oF technology for any application.
The NVMe SNAP solution, delivered as part of the BlueField family of PCIe SmartNIC adapters, makes networked flash storage appear as local NVMe storage, effectively virtualizing the physical storage. NVMe SNAP makes use of existing NVMe drivers to give customers the composability and flexibility of networked flash storage combined with the advantages of local SSD performance, management, and software transparency.

This NVMe SNAP technology is combined with BlueField’s multicore Arm processors and virtual switch and RDMA offload engines, to enable a broad range of accelerated storage, software defined networking, and domain specific application solutions. The Arm processors in combination with SNAP can be used to accelerate distributed file systems, compression, de-duplication, big data, artificial intelligence, load balancing, security and many other applications.
In its briefing talk, Mellanox identified many partners who are using its BlueField NVMe-oF technology, including Celestica, Dell EMC, E8 Storage, Excelero, MITAC, NetApp, Pure Storage and VMware.
Computational storage was another theme at the OCP Summit. Computational storage processes data within the storage device, reducing the need to move data out of the device for processing. There are several companies working on this technology. At the OCP Summit NGD Systems announced the general availability of its 16TB U.2 NVMe SSD, the first of its Newport Platform of products for broad deployment of computational storage devices.

The Newport Platform offers the highest capacity NVMe SSDs and leverages In-Situ processing to eliminate the need for data movement to main memory prior to processing. The Newport Platform makes it possible to implement intelligent edge and hyperscale environments with much greater efficiency and lower cost, providing customers a significant TCO benefit.
Silicon Motion Technology Corporation released its SM2271, an SSD controller ASIC with a firmware stack for high-performance enterprise SATA SSDs with a maximum capacity of 16TB. The SM2271 is a high-performance, eight-channel enterprise SATA SSD controller solution that supports 3D TLC and QLC flash technologies. It exploits the potential of the SATA interface, achieving a maximum sequential read speed of 540MB/s and sequential write speed of 520MB/s.
The SM2271 combines enterprise endurance, support for high SSD capacity, and performance for read-intensive workloads while enabling users to maintain their existing SATA storage infrastructure for a low TCO. Providing more consistent latency than client SSDs for more reliable data write and read operation, it is a platform for SATA SSD designs intended to replace lower-performance HDDs in enterprise storage, data center and industrial storage systems.
Erich Haratsch from Seagate spoke about using SSDs with compression in compute and storage infrastructure. He said that confusion exists about the benefits, use cases and data integrity when SSDs implement compression. Some data (such as databases, OS files and various application data) is typically highly compressible. Other data, in particular images and video files, is often much harder to compress.
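That difference in compressibility is easy to demonstrate; the small sketch below uses Python's general-purpose zlib (not Seagate's DuraWrite algorithm) to compare structured, repetitive data against random data that stands in for already-compressed media.

```python
# Illustrative only: lossless compression ratios for structured, text-like
# data versus incompressible (random) data.
import os
import zlib

structured = b'{"user_id": 12345, "status": "active", "balance": 0}\n' * 2000
random_like = os.urandom(len(structured))   # stands in for compressed images/video

for name, data in (("structured", structured), ("random", random_like)):
    compressed = zlib.compress(data)
    print(f"{name:>10}: {len(data)} -> {len(compressed)} bytes "
          f"({len(data) / len(compressed):.1f}:1)")
```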
He pointed out that SSD compression algorithms need to be lossless (that is, all the data can be reconstructed from the compressed content). Compression needs to run inline at the rate of the data flow so that there is little impact on write and read latencies. Compression must also be done before encryption and ECC coding. Used correctly, compression reduces the amount of data written to the media (here flash memory) for many types of data. An SSD with compression must have a flash translation layer (FTL) that can manage physical data units of variable size.
If done properly, compression can increase effective overprovisioning. Although the logical capacity doesn’t change, the write amplification is reduced, which increases endurance. It also increases random write and mixed read/write performance. Compression can also be used to increase the logical capacity of the SSD; the resulting extra logical capacity depends upon the entropy of the stored data.
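A rough calculation (with assumed numbers, not Seagate's) shows how this works: if the user's data compresses 2:1, only half as many bytes land on the flash, so the spare area available for garbage collection, and hence the effective overprovisioning, grows substantially, which is what reduces write amplification and improves endurance.

```python
# Assumed example: an SSD exposing 1.00 TB of logical capacity with
# 1.07 TB of physical flash (about 7% nominal overprovisioning).
physical_tb = 1.07
logical_tb = 1.00

for compression_ratio in (1.0, 1.5, 2.0):
    data_on_flash = logical_tb / compression_ratio          # bytes actually stored
    effective_op = (physical_tb - data_on_flash) / data_on_flash
    print(f"{compression_ratio:.1f}:1 compression -> "
          f"effective overprovisioning ~{effective_op:.0%}")
```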
Applied to quad-level cell (QLC) NAND, compression can increase flash endurance, performance and/or user capacity. Seagate’s Nytro 1000 SATA SSD uses the company’s DuraWrite lossless data reduction technology to increase performance and deliver high power efficiency. Erich made a call to action for compression-enabled SSD devices in the Open Compute Project.
Kushagra Vaid from Microsoft spoke about updates to the Cerberus security specification, introduced last year, with the Root of Trust domain extended to peripheral components such as NICs, SSDs and accelerators, and said that it is implemented on all Project Olympus motherboards. He also gave Denali updates, noting that this flexible SSD for cloud-scale applications could be extended to storage/media disaggregation beyond the cloud, as shown below.

He also spoke about a new open source lossless compression algorithm, called Project Zipline, targeted at legacy and modern datasets from the edge to the cloud. This initiative is said to provide always-on data processing with high compression ratios, high throughput and low latency. The release includes Verilog RTL source and a test suite, allowing rapid implementation. One of the use cases is archival storage, but perhaps it could be implemented within storage devices as well (like the Seagate approach).
The 2019 OCP Summit revealed storage solutions supporting machine learning, compression and composable storage infrastructure, along with the widespread use of NVMe and NVMe-oF to create the next generation of data center storage solutions.