March 25, 2025


Improving Parquet Dedupe on Hugging Face Hub


By yuchenglow and Di Xiao



The Xet team at Hugging Face is working on improving the efficiency of the Hub’s
storage architecture to make it easier and quicker for users to
store and update data and models. As Hugging Face hosts nearly 11PB of datasets
with Parquet files alone accounting for over 2.2PB of that storage,
optimizing Parquet storage is a pretty high priority.

Most Parquet files are bulk exports from various data analysis pipelines
or databases, often appearing as full snapshots rather than incremental
updates. Data deduplication becomes critical for efficiency when users want to
update their datasets on a regular basis. Only by deduplicating can we store
all versions as compactly as possible, without requiring everything to be uploaded
again on every update. In an ideal case, we should be able to store every version
of a growing dataset with only a little more space than the size of its largest version.

Our default storage algorithm uses byte-level Content-Defined Chunking (CDC),
which generally dedupes well over insertions and deletions, but the Parquet layout brings some challenges.
Here we run some experiments to see how some simple modifications behave on
Parquet files, using a 2GB Parquet file with 1,092,000 rows from the FineWeb
dataset and generating visualizations using our dedupe estimator.



Background

Parquet stores a table by splitting it into row groups, each with a fixed
number of rows (for instance, 1000 rows). Each column within the row group is
then compressed and stored:

Parquet Layout
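For reference, here is a minimal sketch of how this layout is controlled when
writing a Parquet file with pyarrow; the table contents and file name are
hypothetical, but the `row_group_size` and `compression` parameters are the
knobs that determine the structure shown above.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical example table; real datasets like FineWeb have far more rows.
table = pa.table({
    "id": list(range(10_000)),
    "text": [f"row {i}" for i in range(10_000)],
})

# Each row group holds a fixed number of rows; every column chunk inside a
# row group is compressed independently.
pq.write_table(
    table,
    "example.parquet",
    row_group_size=1000,
    compression="zstd",
)

# Inspect the resulting layout.
meta = pq.ParquetFile("example.parquet").metadata
print(meta.num_row_groups, "row groups,", meta.row_group(0).num_rows, "rows each")
```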

Intuitively, this means that operations which do not mess with the row
grouping, like modifications or appends, should dedupe pretty well. So let’s
test this out!



Append

Here we append 10,000 new rows to the file and compare the results with the
original version. Green represents all deduped blocks, red represents all
new blocks, and shades in between show different levels of deduplication.

Visualization of dedupe from data appends

We can see that indeed we are able to dedupe nearly the entire file,
with changes seen only at the end of the file. The new file is 99.1%
deduped, requiring only 20MB of additional storage. This matches our
intuition pretty well.
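For those who want to try a similar test, a rough sketch of the append step
might look like the following (using pyarrow, with hypothetical file names;
the actual experiment used the FineWeb file and our dedupe estimator):

```python
import pyarrow as pa
import pyarrow.parquet as pq

original = pq.read_table("fineweb_subset.parquet")  # hypothetical path

# Reuse the first 10,000 rows as the "new" data to append; any rows would do.
new_rows = original.slice(0, 10_000)
appended = pa.concat_tables([original, new_rows])

# Write with the same settings as the original so only the tail differs.
pq.write_table(appended, "fineweb_subset_appended.parquet", row_group_size=1000)
```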



Modification

Given the layout, we would expect row modifications to be pretty
isolated, but this is apparently not the case. Here we make a small
modification to row 10000, and we see that while most of the file does dedupe,
there are many small regularly spaced sections of new data!

Visualization of dedupe from data modifications

A quick scan of the Parquet file format suggests that absolute file offsets
are part of the Parquet column headers (see the structures ColumnChunk and
ColumnMetaData)! This means that any modification is likely to rewrite all the
column headers. So while the data does dedupe well (it is mostly green), we
get new bytes in every column header.

In this case, the new file is only 89% deduped, requiring 230MB of additional
storage.
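You can see these absolute offsets directly in the footer metadata. Here is a
minimal sketch using pyarrow (the file name is a placeholder) that prints the
per-column-chunk offsets which shift whenever earlier bytes in the file move:

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("fineweb_subset.parquet").metadata  # hypothetical path

# Every column chunk records absolute byte offsets into the file, so shifting
# data earlier in the file changes the headers of everything that follows.
for rg in range(min(3, meta.num_row_groups)):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        print(
            f"row group {rg}, column {chunk.path_in_schema}: "
            f"data page offset = {chunk.data_page_offset}, "
            f"dictionary page offset = {chunk.dictionary_page_offset}"
        )
```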



Deletion

Here we delete a row from the middle of the file (note: insertion should have
similar behavior). As this shifts the entire row group layout after that point
(each row group is 1000 rows), we see that while we dedupe the first half of
the file, the rest of the file has completely new blocks.

Visualization of dedupe from data deletion

This is mostly because the Parquet format compresses each column
aggressively. If we turn off compression, we are able to dedupe much more
of the file:

Visualization of dedupe from data deletion without column compression

However, the file sizes are nearly 2x larger if we store the data
uncompressed.
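As a rough sketch, the delete-and-rewrite-uncompressed variant of this test
could look like the following (pyarrow, hypothetical file names):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pq.read_table("fineweb_subset.parquet")  # hypothetical path

# Drop one row from the middle; every row group after it shifts by one row.
middle = table.num_rows // 2
deleted = pa.concat_tables([table.slice(0, middle), table.slice(middle + 1)])

# compression="NONE" exposes the raw column bytes to byte-level CDC,
# at the cost of a file that is roughly 2x larger.
pq.write_table(deleted, "fineweb_subset_deleted_uncompressed.parquet",
               row_group_size=1000, compression="NONE")
```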

Is it possible to have the benefit of dedupe and compression at the same
time?



Content Defined Row Groups

One potential solution is to apply CDC not only at the byte level, but also at
the row level: we split row groups not based on an absolute count (1000 rows),
but on a hash of a provided “key” column. In other words, we split off a row
group whenever the hash of the key column % [target row count] = 0, with some
allowances for a minimum and a maximum row group size.

We hacked up a quick, inefficient experimental demonstration here.
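The actual demonstration lives in the repo linked above; the core idea,
roughly sketched below (the key column name, hash choice, and size bounds are
hypothetical parameters), is to emit a row group boundary whenever the hash of
the key column hits the target modulus, subject to minimum and maximum sizes:

```python
import hashlib

import pyarrow as pa
import pyarrow.parquet as pq

TARGET_ROWS = 1000   # desired average row group size
MIN_ROWS = 500       # lower bound so groups don't get tiny
MAX_ROWS = 2000      # upper bound so groups don't grow unbounded

def boundary(key: str) -> bool:
    # Content-defined boundary: hash of the key column value % target == 0.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % TARGET_ROWS == 0

def write_cdc_row_groups(table: pa.Table, path: str, key_column: str) -> None:
    keys = table.column(key_column).to_pylist()
    with pq.ParquetWriter(path, table.schema, compression="ZSTD") as writer:
        start = 0
        for i, key in enumerate(keys):
            size = i - start + 1
            if size >= MAX_ROWS or (size >= MIN_ROWS and boundary(str(key))):
                # Each write_table call here emits one row group.
                writer.write_table(table.slice(start, size))
                start = i + 1
        if start < table.num_rows:
            writer.write_table(table.slice(start))
```

Because the boundaries depend only on the key values and not on row positions,
deleting or inserting a row only perturbs the row group that contains it; all
subsequent boundaries stay the same.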

With this, we are able to efficiently dedupe across compressed Parquet
files even as we delete a row. Here we clearly see a big red block
representing the rewritten row group, followed by a small change for every
column header.

Visualization of dedupe from data deletion with content defined row groups

Based on these experiments, we could consider improving Parquet file
dedupe-ability in a couple of ways:

  1. Use relative offsets instead of absolute offsets for file structure
    data. This would make the Parquet structures position independent and
    easy to “memcpy” around, although it is an involved file format change that
    is probably difficult to do.
  2. Support content defined chunking on row groups. The format actually
    supports this today as it does not require row groups to be uniformly sized,
    so this can be done with minimal blast radius. Only Parquet format writers
    would have to be updated.

While we will continue exploring ways to improve Parquet storage performance
(e.g., perhaps we could optionally rewrite Parquet files before uploading?
Strip absolute file offsets on upload and restore on download?), we would
love to work with the Apache Arrow project to see if there is interest in
implementing some of these ideas in the Parquet / Arrow code base.

In the meantime, we are also exploring the behavior of our data dedupe process
on other common filetypes. Please do try our dedupe estimator and tell us
about your findings!


