FPGA Acceleration for Storage Deduplication
A transparent FPGA-based layer between the CPU and the storage system, dedicated to storage deduplication.
In Short
Traditional deduplication systems use the FPGA as an offload engine and maintain the hash table on the host side. Those designs require extra memory copies and long-latency communication between the host and the FPGA, both of which hurt overall system performance.
To overcome this limitation, our system integrates all page-deduplication logic into the FPGA, so no extra memory copies or host-FPGA communication are needed. The system is implemented entirely in hardware, delivering 12.7 GB/s (100 Gbps) throughput and < 30 us latency (10x faster than previous work). This project was carried out at the Systems Group, ETH.
Deduplication: Algorithm
The deduplication system chunks the input and identifies unique chunks by fingerprinting them with a cryptographically secure hash (typically MD5 or SHA). The system maintains metadata mapping user pages to unique pages. Only unique pages are stored on the SSD, which reduces write traffic and extends the SSD's lifetime.
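To make the algorithm concrete, here is a minimal software sketch of the write path, assuming fixed-size 4 KiB pages and SHA-256 as the fingerprint function; the class and field names are illustrative, not the actual hardware interfaces.

```python
import hashlib

PAGE_SIZE = 4096  # assumed chunk size; the real system deduplicates at page granularity


class DedupStore:
    """Software model of the algorithm: fingerprint each page, store only
    unique pages, and keep a user-page -> unique-page mapping as metadata."""

    def __init__(self):
        self.unique_pages = {}  # fingerprint -> page data (stands in for SSD blocks)
        self.refcount = {}      # fingerprint -> number of user pages referencing it
        self.page_table = {}    # user page address -> fingerprint (metadata)

    def write(self, addr: int, data: bytes) -> bool:
        """Return True if the page was new and actually written to storage."""
        fp = hashlib.sha256(data).digest()  # cryptographic fingerprint
        self._release(addr)                 # drop the reference held by an old version
        self.page_table[addr] = fp
        if fp in self.unique_pages:         # duplicate: only update metadata
            self.refcount[fp] += 1
            return False
        self.unique_pages[fp] = data        # unique: store the page once
        self.refcount[fp] = 1
        return True

    def read(self, addr: int) -> bytes:
        return self.unique_pages[self.page_table[addr]]

    def erase(self, addr: int):
        self._release(addr)
        self.page_table.pop(addr, None)

    def _release(self, addr: int):
        fp = self.page_table.get(addr)
        if fp is None:
            return
        self.refcount[fp] -= 1
        if self.refcount[fp] == 0:          # garbage-collect unreferenced pages
            del self.refcount[fp]
            del self.unique_pages[fp]
```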

System Setup
The system consists of a driver and the FPGA logic, forming a transparent layer inserted between the CPU and the SSD. Neither side is aware of this layer, which handles everything related to page deduplication: chunking, fingerprinting, metadata management, and garbage collection.

Hardware Architecture
On the hardware side, the deduplication logic, DedupCore, is built on top of the FPGA infrastructure Coyote, which provides DMA and RDMA access to the system. A hardware hash table engine maintains page metadata in FPGA memory. An FSM array serves multiple lookup requests in parallel, and a lock table ensures the correctness of those parallel lookups, as sketched below.
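The following Python sketch models how parallel lookups can be kept correct with a per-bucket lock table. Threads stand in for the hardware FSMs; the bucket count, the number of FSMs, and all names are assumptions for illustration only.

```python
import hashlib
import threading

NUM_BUCKETS = 1024   # assumed table size; the real table lives in FPGA memory
NUM_FSMS = 6         # number of parallel lookup engines, modeled here as threads


class HashTableEngine:
    """Chained hash table with a per-bucket lock table, so parallel lookups
    and insertions touching the same bucket cannot race."""

    def __init__(self):
        self.buckets = [[] for _ in range(NUM_BUCKETS)]                 # fingerprint chains
        self.lock_table = [threading.Lock() for _ in range(NUM_BUCKETS)]

    def lookup_or_insert(self, fingerprint: bytes) -> bool:
        """Return True if the fingerprint was already present (duplicate page)."""
        idx = int.from_bytes(fingerprint[:4], "little") % NUM_BUCKETS
        with self.lock_table[idx]:          # lock only this bucket, not the whole table
            if fingerprint in self.buckets[idx]:
                return True
            self.buckets[idx].append(fingerprint)
            return False


def fsm_worker(engine, pages, out, slot):
    """One 'FSM': processes its own stream of pages independently of the others."""
    out[slot] = [engine.lookup_or_insert(hashlib.sha256(p).digest()) for p in pages]


engine = HashTableEngine()
streams = [[bytes([i % 3]) * 4096 for i in range(8)] for _ in range(NUM_FSMS)]
out = [None] * NUM_FSMS
threads = [threading.Thread(target=fsm_worker, args=(engine, s, out, i))
           for i, s in enumerate(streams)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```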

Bloom Filter Deletion Mechanism
We employ a Bloom filter to quickly identify new pages and further accelerate hash table lookups. The challenge is that a standard Bloom filter does not support deletion. Traditional methods use extra memory to enable deletion (e.g., a counting Bloom filter), but they drastically increase the memory footprint (e.g., 16-bit counters require 16x the memory).
In our system, we instead design a per-bucket Bloom filter (each bucket has its own filter) and a reconstruction mechanism that rebuilds a bucket's filter whenever a false positive is detected during lookup. Reconstruction introduces no extra memory accesses, because a false positive already forces the lookup to walk to the end of the bucket's linked list, and it adds no memory space overhead.
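Below is a small Python sketch of the per-bucket Bloom filter and the reconstruction-on-false-positive mechanism; the filter width, the number of hash functions, and the class names are illustrative assumptions rather than the hardware parameters.

```python
import hashlib

BLOOM_BITS = 64   # assumed per-bucket filter width (one machine word)
NUM_HASHES = 2    # assumed number of Bloom hash functions


def bloom_positions(fingerprint: bytes):
    """Derive NUM_HASHES bit positions from the fingerprint."""
    return [int.from_bytes(hashlib.sha256(fingerprint + bytes([i])).digest()[:4], "little")
            % BLOOM_BITS for i in range(NUM_HASHES)]


class Bucket:
    """One hash-table bucket with its own Bloom filter over the chained entries."""

    def __init__(self):
        self.entries = []   # the bucket's linked list, modeled as a Python list
        self.bloom = 0      # per-bucket Bloom filter

    def insert(self, fp: bytes):
        self.entries.append(fp)
        for pos in bloom_positions(fp):
            self.bloom |= 1 << pos

    def delete(self, fp: bytes):
        # Plain deletion: the filter may now over-approximate the bucket, which is
        # safe (only extra false positives) and is repaired lazily by reconstruction.
        self.entries.remove(fp)

    def contains(self, fp: bytes) -> bool:
        if not all(self.bloom >> pos & 1 for pos in bloom_positions(fp)):
            return False                 # definitely a new page: skip the chain walk
        found = fp in self.entries       # walk the chain to the end
        if not found:
            self._reconstruct()          # false positive: rebuild the filter for free,
        return found                     # since the chain was already fully traversed

    def _reconstruct(self):
        self.bloom = 0
        for entry in self.entries:
            for pos in bloom_positions(entry):
                self.bloom |= 1 << pos
```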

Performance
When using 6 FSMs for parallel lookups, we saturate the DMA throughput as long as the requests contain only existing pages. Latency is < 30 us for write requests and < 15 us for erase requests.


The worst case for our system is when the hash table is almost full and all incoming pages are new, so every lookup must walk to the end of its linked list. With the help of the Bloom filter, we still achieve 12.1 GB/s throughput in this worst case without increasing the number of FSMs.

FPGA Floorplan
Our system uses roughly two thirds of the FPGA resources, leaving enough room for further extensions.
