Tom’s Hardware

Technology

Anton Shilov

Kioxia preps XL-Flash SSD that's 3x faster than any SSD available — 10 million IOPS drive has peer-to-peer GPU connectivity for AI servers


Kioxia aims to change the storage paradigm with a proposed SSD designed to surpass 10 million input/output operations per second (IOPS) in small-block workloads, the company revealed at its Corporate Strategy Meeting earlier this week. That's three times faster than the peak speeds of many modern SSDs.

One of the performance bottlenecks of modern AI servers is data transfer between storage and GPUs: data is currently routed through the CPU, which significantly increases latency and extends access times.

To reach the performance target, Kioxia is designing a new controller specifically tuned to maximize IOPS, beyond 10 million at 512B block sizes, so that GPUs can access data quickly enough to keep their cores fully utilized at all times. The proposed Kioxia 'AI SSD' is set to use the company's single-level cell (SLC) XL-Flash memory, which boasts read latencies in the range of 3 to 5 microseconds, significantly lower than the 40 to 100 microseconds offered by SSDs based on conventional 3D NAND. Additionally, by storing one bit per cell, SLC offers faster access times and greater endurance, attributes that are crucial for demanding AI workloads.
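To put those latency figures in perspective, Little's law (requests in flight = throughput × latency) gives a rough idea of how much concurrency a drive must sustain to reach 10 million IOPS. The sketch below is a back-of-the-envelope estimate, not a description of Kioxia's design: it assumes that per-request latency equals the media read latency quoted above and ignores controller, interface, and software overhead, so the results are lower bounds. The function name and labels are illustrative.

```python
def required_queue_depth(target_iops: float, latency_us: float) -> float:
    """Little's law: average requests in flight = throughput x latency."""
    return target_iops * latency_us / 1_000_000


TARGET_IOPS = 10_000_000  # target quoted in the article

# Media read latencies quoted in the article, in microseconds.
# Assumption: treated here as the whole per-request latency.
latencies_us = {
    "XL-Flash (low end)": 3,
    "XL-Flash (high end)": 5,
    "Conventional 3D NAND (low end)": 40,
    "Conventional 3D NAND (high end)": 100,
}

for label, latency_us in latencies_us.items():
    depth = required_queue_depth(TARGET_IOPS, latency_us)
    print(f"{label:<32} {latency_us:>4} us -> ~{depth:,.0f} requests in flight")
```

Under these assumptions, XL-Flash would need roughly 30 to 50 requests in flight to sustain 10 million IOPS, while conventional 3D NAND would need roughly 400 to 1,000. That is one way to see why low-latency media makes the target more practical.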

Current high-end datacenter SSDs typically achieve 2 to 3 million IOPS for both 4K and 512-byte random read operations. From a bandwidth perspective, using 4K blocks makes a lot of sense, whereas 512B blocks do not, because each 512B operation moves only one-eighth as much data. However, large language models (LLMs) and retrieval-augmented generation (RAG) systems typically perform small, random accesses to fetch embeddings, parameters, or knowledge base entries. In these scenarios, small block sizes, such as 512B, are more representative of actual application behavior than 4K or larger blocks. It therefore makes more sense to use 512B blocks to meet the latency needs of LLM and RAG workloads, and to use multiple drives to scale bandwidth. Using smaller blocks could also enable more efficient use of memory semantics for access.

It is noteworthy that Kioxia does not disclose which host interface its 'AI SSD' will use, although it does not appear to require a PCIe 6.0 interface from a bandwidth perspective.
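The bandwidth argument can be checked with simple arithmetic: payload throughput is IOPS multiplied by block size. The sketch below uses the IOPS figures quoted in the article. The x4 link width is an assumption for illustration only, since Kioxia has not disclosed the interface, and the link figures are raw rates after 128b/130b encoding, before any protocol overhead.

```python
def payload_gb_per_s(iops: float, block_bytes: int) -> float:
    """Payload throughput in GB/s (decimal) for a given IOPS rate and block size."""
    return iops * block_bytes / 1e9


def pcie_x4_raw_gb_per_s(gt_per_s_per_lane: float) -> float:
    """Raw x4 link rate in GB/s for 128b/130b-encoded PCIe generations (3.0 to 5.0)."""
    lanes = 4  # assumption: the article does not state a link width
    return gt_per_s_per_lane * lanes * (128 / 130) / 8


workloads = {
    "Proposed AI SSD, 10M IOPS @ 512B": (10_000_000, 512),
    "Current SSD, 3M IOPS @ 4K": (3_000_000, 4096),
    "Current SSD, 3M IOPS @ 512B": (3_000_000, 512),
}

links = {
    "PCIe 4.0 x4": pcie_x4_raw_gb_per_s(16),
    "PCIe 5.0 x4": pcie_x4_raw_gb_per_s(32),
}

for label, (iops, block) in workloads.items():
    print(f"{label:<34} {payload_gb_per_s(iops, block):6.2f} GB/s")

for label, rate in links.items():
    print(f"{label:<34} {rate:6.2f} GB/s raw")
```

Under these assumptions, 10 million 512B reads per second come to about 5.1 GB/s of payload, well below the roughly 15.8 GB/s raw rate of a PCIe 5.0 x4 link, whereas 3 million 4K reads per second already move about 12.3 GB/s. This is consistent with the observation that the drive is bound by IOPS rather than by interface bandwidth.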

The 'AI SSD' from Kioxia will also be optimized for peer-to-peer communications between the GPU and SSD, bypassing the CPU for extra performance and lower latency. This points to another reason why Kioxia (and Nvidia) plan to use 512B blocks: GPUs typically operate on cache lines of 32, 64, or 128 bytes internally, and their memory subsystems are optimized for burst access to many small, independent memory locations to keep all the stream processors busy. 512-byte reads therefore align better with GPU designs.

Kioxia's 'AI SSD' is designed to support AI training setups where LLMs require fast, repeated access to massive datasets. Kioxia also envisions it being deployed in AI inference applications, particularly in systems that employ retrieval-augmented generation techniques to enhance generative AI outputs with real-time data (i.e., for reasoning). Low-latency, high-bandwidth storage access is crucial for such machines to ensure both low response times and efficient GPU utilization.

The Kioxia 'AI SSD' is scheduled for release in the second half of 2026.
