What Happened to 17 Million S3 Objects?

Debugging an S3 migration that mysteriously dropped 17 million objects

April 11, 2025

Last week, I migrated an S3 bucket from one AWS region to another. The original bucket held 174 million objects and around 18TB of data. To set up the migration, I created a new S3 bucket in the target region, configured live replication, then triggered a Batch Operation to copy all existing objects to the new bucket. That’s when things got strange.
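
For context, the live-replication half of that setup looks roughly like this in boto3. Bucket names and the IAM role ARN are placeholders, and delete marker replication is enabled, which matters later:

import boto3

s3 = boto3.client("s3")

# Replicate every object (an empty Filter matches all keys) and include
# delete markers so versioned deletes are copied too. Bucket names and
# the IAM role ARN are placeholders.
s3.put_bucket_replication(
    Bucket="source-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
        "Rules": [
            {
                "ID": "replicate-everything",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},
                "DeleteMarkerReplication": {"Status": "Enabled"},
                "Destination": {"Bucket": "arn:aws:s3:::destination-bucket"},
            }
        ],
    },
)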

A missing 17 million

The batch operation reported 157 million objects copied, 17 million short of the expected 174 million. CloudWatch metrics confirmed the replication count, so it wasn’t just a reporting glitch. There was also a 9.6GB size difference between the two buckets. My first thought was to re-run the batch job. Maybe it had skipped some objects. But the second attempt failed almost immediately, saying there were no remaining objects to replicate.

I double-checked the replication setup. The only objects excluded were those encrypted with SSE-KMS, and as far as I knew, we didn't use that. To be sure, I wrote a script to randomly sample 10,000 objects and check their encryption: none used SSE-KMS. I used a similar script to compare objects between the old and new buckets. Again, 10,000 objects later, I couldn't find a single one that hadn't been copied. If 10% of objects were missing, I should have hit about 1,000 unreplicated ones, but I found none.
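
A rough sketch of the encryption-sampling idea (the bucket name is a placeholder, and this simplified version just samples one key per listed page):

import random
import boto3

s3 = boto3.client("s3")
SOURCE = "source-bucket"  # placeholder

sampled = kms_count = 0
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=SOURCE):
    if not page.get("Contents"):
        continue
    # Grab one random key per listed page, then HEAD it to read the
    # server-side encryption that was applied at write time.
    key = random.choice(page["Contents"])["Key"]
    enc = s3.head_object(Bucket=SOURCE, Key=key).get("ServerSideEncryption")
    if enc == "aws:kms":
        kms_count += 1
    sampled += 1
    if sampled == 10_000:
        break

print(f"{kms_count} of {sampled} sampled objects use SSE-KMS")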

Maybe the 174 million count was wrong?

S3 doesn’t provide a quick way to get an exact object count. To get a definitive answer, I enabled S3 Inventory on both buckets to generate a daily, queryable listing of all objects.
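
Enabling it is a one-off configuration per bucket. A sketch with placeholder names; including all object versions is what makes delete markers show up in the report:

import boto3

s3 = boto3.client("s3")

# Daily inventory of all object versions, delivered as Parquet so
# Athena can query it directly. With IncludedObjectVersions set to
# "All", the report includes an is_delete_marker column.
s3.put_bucket_inventory_configuration(
    Bucket="source-bucket",
    Id="daily-all-versions",
    InventoryConfiguration={
        "Id": "daily-all-versions",
        "IsEnabled": True,
        "IncludedObjectVersions": "All",
        "Schedule": {"Frequency": "Daily"},
        "OptionalFields": ["Size", "ReplicationStatus", "EncryptionStatus"],
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::inventory-destination-bucket",
                "Format": "Parquet",
            }
        },
    },
)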

Lifecycle Rules

While waiting for the inventory report, I revisited the S3 docs. There was a note recommending that lifecycle rules be disabled during batch operations. If objects expire mid-migration, the buckets can go out of sync.
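
Pausing them is only a few lines of boto3. A minimal sketch, with a placeholder bucket name:

import boto3

s3 = boto3.client("s3")
BUCKET = "source-bucket"  # placeholder

# Read the current rules, mark them all Disabled, and write the
# configuration back. Reverse this once the migration is done.
rules = s3.get_bucket_lifecycle_configuration(Bucket=BUCKET)["Rules"]
for rule in rules:
    rule["Status"] = "Disabled"
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET, LifecycleConfiguration={"Rules": rules}
)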

I hadn’t disabled those rules, but surely I would’ve noticed some expired objects during my earlier checks?

This made me think about object versions. I had configured replication to include deletion markers, so versioned deletes should have been copied. A few manual checks showed that object versions were indeed replicated properly.
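
Those manual checks boiled down to listing one key's version history in each bucket and comparing. A sketch, with placeholder names:

import boto3

s3 = boto3.client("s3")
KEY = "some/sample/key"  # placeholder

# List the version history of one key in each bucket and compare.
for bucket in ("source-bucket", "destination-bucket"):
    resp = s3.list_object_versions(Bucket=bucket, Prefix=KEY)
    versions = [v for v in resp.get("Versions", []) if v["Key"] == KEY]
    markers = [m for m in resp.get("DeleteMarkers", []) if m["Key"] == KEY]
    print(f"{bucket}: {len(versions)} versions, {len(markers)} delete markers")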

Then I noticed two specific lifecycle rule options:

  • Clean up failed multipart uploads
  • Delete expired deletion markers

We use multipart uploads extensively, and I’d never really considered what happens to failed ones. Turns out S3 keeps them indefinitely unless you explicitly clean them up.

The same goes for expired deletion markers: these are deletion markers for which every previous version of the object has been deleted. What remains is an empty, dangling marker. This felt hopeful. Maybe expired deletion markers and failed uploads are not replicated.

Multipart Uploads

The S3 API lets you list in-flight multipart uploads. Sure enough, there were uploads dating back five years, and a lot of them. Using a script, I calculated the total size of these in-flight uploads: 9.6GB. The exact size difference between the two buckets.
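
The script was essentially ListMultipartUploads plus ListParts, summing part sizes. A sketch, with a placeholder bucket name:

import boto3

s3 = boto3.client("s3")
BUCKET = "source-bucket"  # placeholder

uploads = parts_bytes = 0
upload_pages = s3.get_paginator("list_multipart_uploads")
part_pages = s3.get_paginator("list_parts")

for page in upload_pages.paginate(Bucket=BUCKET):
    for upload in page.get("Uploads", []):
        uploads += 1
        # Sum the size of every part uploaded so far for this upload.
        for parts in part_pages.paginate(
            Bucket=BUCKET, Key=upload["Key"], UploadId=upload["UploadId"]
        ):
            parts_bytes += sum(p["Size"] for p in parts.get("Parts", []))

print(f"{uploads} in-flight uploads totalling {parts_bytes / 1e9:.1f} GB")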

Running the same script against the new bucket showed zero pending uploads. That solved the size discrepancy, but not the missing objects. There were only about 25,000 in-flight upload parts. That’s nowhere near the 17 million missing objects.

Deletion Markers

That left expired deletion markers as the most likely cause. Since these are effectively tiny metadata entries, they could easily account for a large object count without a corresponding size difference.

I could confirm that ordinary deletion markers were being replicated. But expired ones? There's no direct API to count them, and scanning 170 million objects manually wasn't feasible.

Fortunately, S3 Inventory reports list whether each object is a deletion marker. Perfect.

I queried the data in Athena:

-- Keys with exactly one surviving version, where that version is a
-- delete marker: an expired deletion marker with nothing behind it.
SELECT COUNT(*) AS expired_delete_marker_count
FROM (
  SELECT key
  FROM inventory_table
  GROUP BY key
  HAVING COUNT(*) = 1
    AND MAX(CASE WHEN is_delete_marker = true THEN 1 ELSE 0 END) = 1
) AS expired_keys;

The result? 17 million expired deletion markers. That looked like my missing 17 million. To confirm they hadn't been copied, I checked several of their keys in the new bucket. As expected, they were missing.

That explained everything:

  • 9.6GB of failed multipart uploads not copied
  • 17 million expired deletion markers not replicated

As a final check, I queried for any objects in the source bucket that didn’t have a replication status of "REPLICATED". None found.

Lessons Learned

Potentially missing 10% of objects during an S3 migration is not ideal. But it turns out these were pretty much just dangling metadata entries that had never been cleaned up. S3 will happily retain expired deletion markers and failed multipart uploads forever unless you set up lifecycle rules to remove them. I've now added cleanup rules to the new bucket.
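
The cleanup rules look something like this, with placeholder names:

import boto3

s3 = boto3.client("s3")

# Two rules: abort multipart uploads still incomplete after 7 days,
# and remove delete markers with no object versions behind them.
s3.put_bucket_lifecycle_configuration(
    Bucket="destination-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "abort-stale-multipart-uploads",
                "Status": "Enabled",
                "Filter": {},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            },
            {
                "ID": "expired-delete-marker-cleanup",
                "Status": "Enabled",
                "Filter": {},
                "Expiration": {"ExpiredObjectDeleteMarker": True},
            },
        ]
    },
)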