Sharding

Sharding is a technique where you split a large dataset into smaller, more manageable pieces, called shards, and store them across multiple databases or machines.

📦 What Is a Shard?

A shard is just a horizontal partition of your data.
Each shard contains a subset of the overall data, and together they form the full dataset.

📚 Example

Imagine a table of 100 million users. Instead of storing all users in one massive database:

Shard 1: Users with ID 1–25 million
Shard 2: Users with ID 25–50 million
Shard 3: Users with ID 50–75 million
Shard 4: Users with ID 75–100 million

Each shard can live on a different machine/database.

🧠 Why Use Sharding?

Performance: Smaller datasets = faster queries.
Scalability: Spread data across multiple machines.
Reliability: If one shard fails, others still work.

🛠️ Common Sharding Strategies

Strategy	How it works	When to use
Range-Based Sharding	Shard by ranges (e.g., user ID 1–1000)	Easy, but can lead to hot shards
Hash-Based Sharding	Use a hash function (e.g., `hash(user_id) % N`)	More even distribution
Geo-Based Sharding	Shard by location (e.g., country, region)	Good for geo-sensitive data
Directory-Based	Use a lookup table to track where data lives	Flexible, but adds overhead

🔥 Common Issues with Sharding

Hotspots: One shard gets more traffic than others.
Rebalancing: Moving data between shards when load changes.
Cross-shard queries: Complex and slow.
Data consistency: Harder to enforce across shards.

💡 Real-World Analogy

Think of sharding like splitting a classroom of 1000 students into 10 rooms of 100 students.
Each room can be managed independently, but together they make up the whole batch.

Gaurav’s Notes

Explorer