Migrating 124,000 Files from Dropbox to S3: A Quick Tutorial
We recently completed a data migration project: transferring ~124,000 files from Dropbox to Amazon S3. This wasn't just a simple copy operation, although maybe it should have been. It was a complex, resumable migration system that handled authentication challenges, API rate limits, and network interruptions while maintaining data integrity throughout the process.
Why We Migrated
The decision to migrate from Dropbox to S3 wasn't taken lightly. Several factors drove this choice:
Cost Optimization: S3's lifecycle policies allow automatic transitions to cheaper storage classes. Files transition to Glacier after 30 days ($0.004/GB/month) and Deep Archive after 90 days ($0.001/GB/month). For our 1TB of data, this translates to approximately $12/year after transitions—a significant cost reduction compared to traditional cloud storage.
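The $12/year figure can be sanity-checked with quick arithmetic (assuming 1 TB ≈ 1024 GB and the per-GB prices quoted above; real AWS bills also include request and retrieval fees, which are ignored here):

```python
# Back-of-envelope check of the storage figures above.
total_gb = 1024  # ~1 TB of data

glacier_monthly = total_gb * 0.004       # ~$4.10/month in Glacier
deep_archive_monthly = total_gb * 0.001  # ~$1.02/month in Deep Archive
annual_cost = deep_archive_monthly * 12  # ~$12/year once fully transitioned

print(round(annual_cost, 2))
```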
Scalability: S3's virtually unlimited storage capacity and integration with other AWS services provided better long-term scalability for our growing data needs.
Control: Direct S3 access gives us more granular control over access patterns, backup strategies, and data lifecycle management.
The Technical Challenge
Migrating 124,000+ files presented several technical challenges:
- Scale: Processing over 100,000 files requires robust error handling and progress tracking
- Reliability: Network interruptions and API failures needed graceful handling
- Authentication: Dropbox access tokens expire, requiring token refresh strategies
- Performance: Balancing speed with API rate limits and system resources
- Resumability: The ability to restart from where we left off after interruptions
Our Solution: A Resumable Migration System
We built a Python-based migration system with several key components:
1. File Inventory System
First, we created a comprehensive inventory of all Dropbox files:
def create_inventory(self):
    """Create a comprehensive file inventory."""
    all_files = []

    def process_folder(path=""):
        try:
            result = self.dbx.files_list_folder(path, recursive=True)
            while True:
                for entry in result.entries:
                    if isinstance(entry, dropbox.files.FileMetadata):
                        s3_key = entry.path_lower[1:]  # Remove leading '/'
                        all_files.append({
                            'dropbox_path': entry.path_lower,
                            's3_key': s3_key,
                            'size': entry.size
                        })
                if not result.has_more:
                    break
                result = self.dbx.files_list_folder_continue(result.cursor)
        except Exception as e:
            logger.error(f"Error processing folder {path}: {e}")

    process_folder()
    return {'files': all_files, 'total_count': len(all_files)}
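Recursively listing 124,000 files is itself slow, so it can help to persist the inventory once and reload it on later runs. A minimal sketch (these helpers are an assumption, not part of the original system):

```python
import json

def save_inventory(inventory, path="inventory.json"):
    """Write the dict returned by create_inventory() to disk."""
    with open(path, "w") as f:
        json.dump(inventory, f)

def load_inventory(path="inventory.json"):
    """Reload a previously saved inventory instead of re-listing Dropbox."""
    with open(path) as f:
        return json.load(f)
```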
2. Resumable Migration with Progress Tracking
The core migration system maintains state between runs:
def load_progress(self):
    """Load migration progress from file."""
    if os.path.exists(self.progress_file):
        with open(self.progress_file, 'r') as f:
            data = json.load(f)
        # Convert lists back to sets for fast lookup
        data['completed'] = set(data.get('completed', []))
        data['failed'] = set(data.get('failed', []))
        return data
    return {'completed': set(), 'failed': set(), 'total_files': 0}
def migrate_all(self, limit=None):
    progress = self.load_progress()
    # all_files holds (dropbox_path, s3_key, size) tuples from the inventory
    files_to_migrate = []
    for dbx_path, s3_key, size in all_files:
        if s3_key not in progress['completed']:
            files_to_migrate.append((dbx_path, s3_key, size))
            if limit and len(files_to_migrate) >= limit:
                break
    # Process with a thread pool for parallel uploads
    with ThreadPoolExecutor(max_workers=3) as executor:
        futures = {
            executor.submit(self._migrate_file_safe, dbx_path, s3_key, size): s3_key
            for dbx_path, s3_key, size in files_to_migrate
        }
        for future in as_completed(futures):
            # Handle results and save progress periodically
            if len(progress['completed']) % 10 == 0:
                self.save_progress(progress)
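The post doesn't show save_progress, but its shape follows from load_progress: sets become sorted lists (JSON has no set type), and writing through a temp file plus rename keeps the progress log intact if the process dies mid-write. A sketch under those assumptions:

```python
import json
import os

def save_progress(progress, progress_file="migration_progress.json"):
    """Persist progress atomically; mirror image of load_progress."""
    data = {
        'completed': sorted(progress['completed']),
        'failed': sorted(progress['failed']),
        'total_files': progress.get('total_files', 0),
    }
    tmp = progress_file + '.tmp'
    with open(tmp, 'w') as f:
        json.dump(data, f)
    os.replace(tmp, progress_file)  # atomic rename: no half-written file
```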
3. Intelligent File Processing
We implemented compression for text-based files and multipart uploads for large files:
def _migrate_file(self, dropbox_path, s3_key, file_size):
    # Download from Dropbox
    _, response = self.dbx.files_download(dropbox_path)
    content = response.content

    # Compress text files when it pays off
    should_compress = self._should_compress(dropbox_path, file_size)
    content_encoding = None
    if should_compress:
        compressed = gzip.compress(content)
        if len(compressed) < len(content) * 0.9:  # Only if 10%+ savings
            content = compressed
            content_encoding = 'gzip'

    # Choose upload method based on size
    if len(content) >= self.multipart_threshold:
        self._multipart_upload(s3_key, content, content_encoding)
    else:
        self._simple_upload(s3_key, content, content_encoding)
4. S3 Lifecycle Configuration
We set up automatic cost optimization through S3 lifecycle rules:
def setup_lifecycle_policy(bucket_name):
    lifecycle_config = {
        'Rules': [{
            'ID': 'DropboxMigrationLifecycle',
            'Status': 'Enabled',
            'Filter': {'Prefix': ''},
            'Transitions': [
                {'Days': 30, 'StorageClass': 'GLACIER'},
                {'Days': 90, 'StorageClass': 'DEEP_ARCHIVE'}
            ]
        }]
    }
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration=lifecycle_config
    )
The Migration Process
The actual migration took place over several sessions, processing files in batches:
- Initial Setup: Created file inventory (124,000+ files identified)
- Batch Processing: Ran migration in chunks of 10,000-20,000 files
- Token Management: Refreshed Dropbox access tokens as they expired
- Progress Monitoring: Tracked completion status and handled failures
- Final Push: Completed the last remaining files
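The batch processing above amounts to slicing the pending file list into fixed-size chunks and running the migrator once per chunk. A generic helper might look like this (a sketch, not the post's exact driver):

```python
def chunked(items, size):
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

Usage would be along the lines of `for batch in chunked(files_to_migrate, 10_000): migrate(batch)`, saving progress between batches.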
Results and Lessons Learned
The Good
Robust Error Handling: The resumable design meant that token expirations, network issues, and API rate limits never derailed the entire process.
Cost Efficiency: The automated lifecycle policies immediately began reducing storage costs.
Parallel Processing: Using ThreadPoolExecutor with 3 workers gave us a good balance of throughput without overwhelming either API.
The Challenges
Token Management: Dropbox access tokens expire frequently, requiring intervention to refresh tokens during long-running migrations.
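One mitigation is a wrapper that retries an operation once after refreshing credentials. In the sketch below, AuthError is a stand-in for dropbox.exceptions.AuthError and the refresh callable is whatever re-authenticates; the Dropbox SDK can also refresh access tokens automatically when constructed with a long-lived refresh token:

```python
class AuthError(Exception):
    """Stand-in for dropbox.exceptions.AuthError."""

def with_token_refresh(call, refresh, max_refreshes=1):
    """Run call(); on AuthError, refresh credentials and retry once."""
    for attempt in range(max_refreshes + 1):
        try:
            return call()
        except AuthError:
            if attempt == max_refreshes:
                raise
            refresh()
```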
API Limitations: Some file types (like Dropbox Paper documents) aren't downloadable via the standard API, and can't be transferred directly.
Progress Tracking Overhead: Saving progress every 10 files created some I/O overhead but was essential for resumability.
Memory Management: Large files required careful memory handling to avoid system resource exhaustion.
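For multipart uploads, slicing the payload into parts keeps any single upload buffer bounded; S3 requires every part except the last to be at least 5 MiB. A sketch of the part iterator (assumed, not the post's exact code):

```python
def iter_parts(data, part_size=8 * 1024 * 1024):
    """Yield (part_number, chunk) pairs for a multipart upload.

    S3 part numbers start at 1; all parts but the last must be >= 5 MiB.
    """
    for i in range(0, len(data), part_size):
        yield i // part_size + 1, data[i:i + part_size]
```

A fuller fix would also stream the Dropbox download rather than holding the whole response body in memory, but that changes the download path as well.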
Key Metrics
- Total Files: 124,000+
- Processing Time: Multiple sessions over several hours
- Estimated Annual Storage Cost: ~$12 (after lifecycle transitions)
Technical Takeaways
Design for Resumability: Large-scale data migrations will be interrupted. Build systems that can restart gracefully.
Progress Persistence: Maintain detailed progress logs. The ability to see exactly what's been completed is invaluable.
Batch Processing: Process files in manageable chunks rather than attempting everything at once.
Parallel Processing: Use threading judiciously—too many concurrent operations can overwhelm APIs, too few waste time.
Error Classification: Distinguish between retryable errors (network issues) and permanent failures (unsupported file types).
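That distinction can be encoded directly; the exception types below are illustrative stand-ins, not the post's actual taxonomy:

```python
RETRYABLE = (ConnectionError, TimeoutError)  # transient network failures

def classify_error(exc):
    """Label an exception as worth retrying or permanently failed."""
    return 'retry' if isinstance(exc, RETRYABLE) else 'permanent'
```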
Cost Optimization: Implement lifecycle policies from day one. The savings compound quickly.
Looking Forward
This migration demonstrates the power of thoughtful system design in handling large-scale data operations. The resumable architecture, comprehensive error handling, and automated cost optimization created a robust solution that successfully moved over 100,000 files with minimal manual intervention.
For organizations considering similar migrations, the key is building systems that expect and gracefully handle the inevitable challenges of large-scale data movement. With proper planning and robust error handling, even massive migrations can achieve near-perfect success rates.
The combination of cloud storage flexibility, lifecycle cost optimization, and reliable migration tooling opens new possibilities for data management strategies that balance accessibility, durability, and cost-effectiveness.