YouTube
On-Site Round - System Design (45 min)
Question
Design YouTube.
Areas to cover:
- Video upload/download (chunking, resumable uploads, object storage)
- Recommendation (collaborative filtering + content-based)
- Search (inverted index, ranking, autocomplete)
- Scale (1B+ DAU, hot videos, CDN, sharding)
- Consistency vs availability for view/count metrics
- Transcoding pipeline, thumbnails, notifications
Explanation
This question tests whether you can design a large media platform end-to-end: ingest, process, serve, discover, and measure at massive scale.
A strong answer usually separates the system into five planes:
- Ingestion plane (upload + storage)
- Processing plane (transcoding + thumbnails + metadata)
- Serving plane (video delivery via CDN)
- Discovery plane (search + recommendation)
- Analytics plane (views, engagement, counters)
High-Level Architecture
graph TD
U[User Client] --> A[API Gateway]
A --> UP[Upload Service]
UP --> OBJ[(Object Storage)]
UP --> MQ[(Event Queue)]
MQ --> TR[Transcoding Workers]
TR --> OBJ
TR --> TH[Thumbnail Generator]
TH --> OBJ
TR --> MD[(Metadata DB)]
U --> CDN[CDN/Edge]
CDN --> OBJ
A --> SRCH[Search Service]
SRCH --> IDX[(Inverted Index)]
A --> REC[Recommendation Service]
REC --> FEAT[(User/Video Features)]
A --> CNT[Counter Service]
CNT --> TS[(Time-series/Counter Store)]
MQ --> NOTIF[Notification Service]
Upload / Download
Upload
- Client requests upload session.
- Upload service returns pre-signed chunk URLs.
- Client uploads chunks with retry/resume support.
- After the final commit, the service emits a video_uploaded event.
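The upload steps above can be sketched as a small in-memory session tracker. This is only an illustration under stated assumptions: `UploadSession`, `put_chunk`, `missing_chunks`, and `commit` are hypothetical names, and a real service would hand out pre-signed URLs from the object store and persist session state rather than keep it in memory.

```python
from dataclasses import dataclass, field


@dataclass
class UploadSession:
    """Tracks which chunks of one upload have arrived (hypothetical, in-memory)."""
    video_id: str
    total_chunks: int
    received: set = field(default_factory=set)

    def put_chunk(self, index: int) -> None:
        # Re-sending a chunk after a retry is harmless: set insertion is idempotent.
        if not 0 <= index < self.total_chunks:
            raise ValueError(f"chunk index {index} out of range")
        self.received.add(index)

    def missing_chunks(self) -> list:
        # A resuming client asks for this list and uploads only what is absent.
        return [i for i in range(self.total_chunks) if i not in self.received]

    def commit(self) -> dict:
        # The final commit succeeds only when every chunk is present,
        # then yields the event that would be placed on the queue.
        missing = self.missing_chunks()
        if missing:
            raise RuntimeError(f"cannot commit, missing chunks: {missing}")
        return {"type": "video_uploaded", "video_id": self.video_id}


session = UploadSession(video_id="v123", total_chunks=4)
for i in (0, 1, 1, 3):                 # chunk 1 is retried, chunk 2 never arrived
    session.put_chunk(i)
to_resume = session.missing_chunks()   # what the client must still send
session.put_chunk(2)
event = session.commit()
```

Keeping the received set on the server side is what makes resume possible: the client does not need to remember what it sent.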
Download/Playback
- Player fetches manifest (HLS/DASH).
- Segments served from CDN edge cache.
- CDN misses pull from origin object storage.
Because most segment requests are answered from edge caches, origin load stays low even when a single video becomes very popular.
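To make the cache-miss path concrete, here is a minimal pull-through cache with LRU eviction standing in for a single CDN edge node. `EdgeCache` and `fake_origin` are hypothetical names, and LRU is an assumed eviction policy chosen for simplicity; real CDNs use more elaborate policies and tiered caches.

```python
from collections import OrderedDict


class EdgeCache:
    """Tiny LRU pull-through cache standing in for one CDN edge node."""

    def __init__(self, capacity, origin):
        self.capacity = capacity
        self.origin = origin             # callable: segment key -> bytes
        self.store = OrderedDict()
        self.origin_fetches = 0

    def get(self, key):
        if key in self.store:
            self.store.move_to_end(key)  # mark as most recently used
            return self.store[key]
        # Cache miss: pull from origin object storage, then keep a copy.
        self.origin_fetches += 1
        data = self.origin(key)
        self.store[key] = data
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict the least recently used entry
        return data


def fake_origin(key):
    # Hypothetical origin: returns placeholder bytes for a segment key.
    return f"bytes-of-{key}".encode()


edge = EdgeCache(capacity=2, origin=fake_origin)
for _ in range(1000):                    # one hot segment requested 1000 times
    edge.get("v123/720p/seg0.ts")
```

In this sketch the origin is contacted once for the hot segment, no matter how many viewers request it from the same edge.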
Transcoding Pipeline
- video_uploaded event triggers async transcoding.
- Generate multiple resolutions/bitrates (240p..4K).
- Create streaming manifests and thumbnails.
- Persist metadata (status=ready) and publish a notification event.
Failure handling:
- Idempotent jobs keyed by video_id + profile.
- Dead-letter queue for poison tasks.
- Partial success allowed (serve lower resolutions if high profile fails).
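The three failure-handling rules can be illustrated together in one short sketch. Everything here is assumed for illustration: the profile list, the retry limit, the in-memory `completed` and `dead_letter` structures, and an `encode` stub that always fails for the 4k profile so the failure path is visible.

```python
PROFILES = ["240p", "720p", "4k"]   # illustrative subset of the encoding ladder
MAX_ATTEMPTS = 3                    # assumed retry budget per job

completed = {}        # (video_id, profile) -> output key; the idempotency record
dead_letter = set()   # jobs that exhausted their retries
encode_calls = []     # records every encoder invocation, for inspection


def encode(video_id, profile):
    # Hypothetical encoder stub: the 4k profile always fails in this sketch.
    encode_calls.append((video_id, profile))
    if profile == "4k":
        raise RuntimeError("encoder crashed")
    return f"{video_id}/{profile}/manifest.m3u8"


def run_job(video_id, profile):
    key = (video_id, profile)
    if key in completed:             # duplicate delivery: reuse the earlier result
        return completed[key]
    if key in dead_letter:           # already parked: do not retry automatically
        return None
    for _ in range(MAX_ATTEMPTS):
        try:
            completed[key] = encode(video_id, profile)
            return completed[key]
        except RuntimeError:
            continue
    dead_letter.add(key)             # poison task: park it for inspection
    return None


def handle_video_uploaded(video_id):
    outputs = {p: run_job(video_id, p) for p in PROFILES}
    ready = [p for p, out in outputs.items() if out is not None]
    # Partial success: the video is playable if at least one profile succeeded.
    status = "ready" if ready else "failed"
    return {"video_id": video_id, "status": status, "profiles": ready}


first = handle_video_uploaded("v123")
second = handle_video_uploaded("v123")   # the same event delivered again
```

The second delivery returns the same result without invoking the encoder again, which is the property the idempotency key is meant to provide.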
Search Design
- Index title, tags, channel, and transcript tokens into inverted index.
- Ranking combines lexical relevance + freshness + engagement priors.
- Autocomplete uses prefix index + trending query boosts.
Read path must be low latency; indexing can be eventually consistent.
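A toy version of the index and autocomplete described above might look like the following. The sample titles, the `query_counts` log, and the scoring rule (count of matched query tokens) are assumptions for illustration; freshness and engagement priors are left out, and the linear scan in `autocomplete` stands in for a real prefix index such as a trie.

```python
from collections import defaultdict

# Hypothetical catalog: video id -> title.
videos = {
    1: "python tutorial for beginners",
    2: "advanced python tricks",
    3: "cooking pasta tutorial",
}

# Inverted index: token -> set of video ids whose title contains it.
index = defaultdict(set)
for vid, title in videos.items():
    for token in title.split():
        index[token].add(vid)


def search(query):
    # Score = number of query tokens matched, a stand-in for lexical relevance.
    scores = defaultdict(int)
    for token in query.lower().split():
        for vid in index.get(token, ()):
            scores[vid] += 1
    # Highest score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda v: (-scores[v], v))


# Hypothetical query log with counts, used as the trending boost.
query_counts = {"python tutorial": 50, "python tricks": 20, "pasta recipe": 30}


def autocomplete(prefix, limit=2):
    matches = [q for q in query_counts if q.startswith(prefix)]
    return sorted(matches, key=lambda q: -query_counts[q])[:limit]


results = search("python tutorial")
suggestions = autocomplete("py")
```

The index is built once and only read at query time, which matches the note that indexing can lag while reads stay fast.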
Recommendation Design
Two-stage approach:
- Candidate generation:
  - collaborative filtering (similar users/videos)
  - content-based signals (topic, embeddings, language)
- Ranking:
  - model combines watch history, retention, CTR, recency, diversity
Serving strategy:
- Precompute candidate pools for active users.
- Online rank top-N with fresh context.
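The two-stage split can be sketched with a very small example. The watch histories, feature values, Jaccard-overlap candidate rule, and equal ranking weights are all assumptions made for illustration; a production ranker would be a learned model over many more signals.

```python
# Hypothetical watch histories: user -> set of watched video ids.
history = {
    "alice": {"v1", "v2", "v3"},
    "bob": {"v2", "v3", "v4"},
    "carol": {"v5"},
}

# Hypothetical per-video features used by the ranking stage.
features = {
    "v4": {"ctr": 0.10, "retention": 0.60},
    "v5": {"ctr": 0.30, "retention": 0.20},
}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidates(user):
    # Stage 1: collaborative filtering. Collect unseen videos from users
    # whose histories overlap with this user's history.
    seen = history[user]
    pool = set()
    for other, watched in history.items():
        if other != user and jaccard(seen, watched) > 0:
            pool |= watched - seen
    return pool


def rank(pool, top_n=10):
    # Stage 2: score each candidate. The weights are illustrative, not tuned.
    def score(video_id):
        f = features[video_id]
        return 0.5 * f["ctr"] + 0.5 * f["retention"]
    return sorted(pool, key=score, reverse=True)[:top_n]


recs = rank(candidates("alice"))
```

Stage 1 is cheap and can be precomputed per user, while stage 2 runs online over a much smaller pool, which is the reason for splitting the work.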
Scale (1B+ DAU)
- Store immutable media in object storage; shard metadata by video_id.
- Aggressive CDN for hot videos; multi-layer cache for manifests/metadata.
- Partition queues and processing workers by region/video class.
- Separate control-plane APIs from heavy data-plane traffic.
GenAI assist: classify likely-to-trend videos early and pre-warm CDN/cache tiers before traffic spikes.
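Sharding metadata by video_id, as listed above, can be as simple as hashing the id to a shard number. This sketch assumes a fixed shard count and plain modulo placement; `NUM_SHARDS` and `shard_for` are hypothetical names. Modulo placement moves most keys when the shard count changes, so consistent hashing is a common alternative worth mentioning in an interview.

```python
import hashlib

NUM_SHARDS = 64   # illustrative; a real value would come from capacity planning


def shard_for(video_id: str) -> int:
    # Use a stable hash rather than Python's built-in hash(), which is salted
    # per process, so every service instance maps an id to the same shard.
    digest = hashlib.sha256(video_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


shard = shard_for("v123")
```

Hashing spreads sequentially issued ids across shards, which avoids concentrating newly uploaded videos on a single shard.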
Consistency vs Availability (Views / Counts)
Use split semantics:
- View ingestion path: highly available append (event log).
- Public counters: eventually consistent aggregates (near-real-time).
- Creator analytics: corrected/anti-fraud batch numbers.
This keeps playback and event capture available while accepting slight lag in displayed counts.
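The split semantics can be shown with three small functions over one append-only log. The in-memory list, the `cursor` checkpoint, and the "count each viewer once" rule are assumptions for illustration; real anti-fraud correction is far more involved.

```python
from collections import Counter

event_log = []              # append-only log; stands in for a durable event stream
public_counts = Counter()   # eventually consistent aggregate shown to viewers
cursor = 0                  # how far the aggregator has read into the log


def record_view(video_id, viewer_id):
    # Ingestion only appends, so it stays cheap and highly available.
    event_log.append((video_id, viewer_id))


def aggregate():
    # Runs periodically; public counts lag the log until this catches up.
    global cursor
    for video_id, _ in event_log[cursor:]:
        public_counts[video_id] += 1
    cursor = len(event_log)


def creator_count(video_id):
    # Batch correction: count each viewer once, a crude stand-in for anti-fraud.
    return len({viewer for vid, viewer in event_log if vid == video_id})


for viewer in ["u1", "u2", "u2", "u2"]:
    record_view("v123", viewer)
lagging = public_counts["v123"]   # read before the aggregator has run
aggregate()
```

Before `aggregate()` runs, the public count trails the log; afterwards it shows every recorded view, while the creator-facing number discounts the repeated viewer.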
Notifications
Trigger async fanout when:
- Channel publishes video and notification policy allows.
- Video reaches ready state.
- Notification service applies user preferences and rate limits.
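A minimal fanout that respects the ready state, user preferences, and a per-user rate limit could look like this. The subscriber map, the `muted` set, the window limit, and the event shape are all hypothetical; a real service would fan out asynchronously in batches.

```python
from collections import defaultdict

# Hypothetical subscription store: channel id -> subscriber ids.
subscribers = {"chan1": ["u1", "u2", "u3"]}
# Hypothetical preference store: users who turned notifications off.
muted = {"u2"}
MAX_PER_WINDOW = 2                    # illustrative per-user rate limit
sent_in_window = defaultdict(int)     # user -> notifications sent this window
sent_in_window["u3"] = 2              # u3 has already reached the limit


def fanout(event):
    # Only fan out once the video is playable.
    if event["status"] != "ready":
        return []
    delivered = []
    for user in subscribers.get(event["channel_id"], []):
        if user in muted:
            continue                  # preference check
        if sent_in_window[user] >= MAX_PER_WINDOW:
            continue                  # rate limit check
        sent_in_window[user] += 1
        delivered.append(user)
    return delivered


delivered = fanout({"channel_id": "chan1", "video_id": "v123", "status": "ready"})
```

Only u1 is notified here: u2 has muted notifications and u3 is over the limit for the current window.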
Additional Complication Idea: Copyright Detection
- Compute audio/video fingerprints at upload time and compare against rights-holder reference sets.
- Block, monetize, or allow according to policy, based on match confidence and territory/license rules.
- Re-scan catalog periodically as reference databases and policies evolve.
GenAI assist: use multimodal embeddings to catch transformed near-duplicates that exact fingerprinting can miss.
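The match-then-decide flow can be sketched as two functions. This is heavily simplified: fingerprints are modeled as plain sets of integers, confidence as the fraction of the reference found in the upload, and `policy` is a hypothetical rights-holder rule set. Real systems use robust perceptual fingerprints and much richer license rules.

```python
def match_confidence(upload_hashes, reference_hashes):
    # Fraction of the reference fingerprint that also appears in the upload.
    if not reference_hashes:
        return 0.0
    return len(upload_hashes & reference_hashes) / len(reference_hashes)


def decide(confidence, territory, policy):
    # Below the threshold the upload is treated as unmatched.
    if confidence < policy["min_confidence"]:
        return "allow"
    # Otherwise apply the territory-specific action, or the default.
    return policy["actions"].get(territory, policy["default_action"])


# Hypothetical policy supplied by a rights holder.
policy = {
    "min_confidence": 0.7,
    "actions": {"US": "monetize", "DE": "block"},
    "default_action": "block",
}

confidence = match_confidence({1, 2, 3, 4, 9}, {1, 2, 3, 4, 5})
decision_us = decide(confidence, "US", policy)
decision_de = decide(confidence, "DE", policy)
```

Separating the match score from the policy lookup lets the same match produce different outcomes per territory, and lets policies change without recomputing fingerprints.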
Additional Complication Idea: Abuse, Spam, and Safety
- Run ML + rules moderation on title/description/transcript/thumbnails both before and after publishing.
- Keep trust/risk scores per account and apply rate limits, temporary holds, or stricter review.
- Maintain human-review queues for borderline/high-impact enforcement decisions.
GenAI assist: use LLM/VLM classifiers to generate richer policy labels and reviewer-ready rationale across text, audio, and images.
Additional Complication Idea: Multi-Region Failover
- Use active-active playback with global DNS/load balancing and regional CDNs.
- Keep upload sessions region-local, with cross-region replication and resumable continuation on failover.
- Replicate metadata asynchronously and use regional stickiness/origin fallback for read-after-write gaps.
GenAI assist: use an incident copilot to summarize telemetry and suggest failover/rollback runbook steps to operators.
Additional Complication Idea: Retention, Privacy, and GDPR
- Separate PII from high-volume event data; apply field-level encryption and strict access controls.
- Support delete/export workflows for watch history and user data with auditable completion state.
- Enforce retention windows and downstream deletion propagation to analytics, caches, and backups per policy.
GenAI assist: use entity extraction to auto-detect and classify PII in user-generated content for policy routing.
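Deletion propagation and retention enforcement can be sketched over a few in-memory stores. The store names, row shape, and 90-day window are assumptions for illustration only, not a compliance recommendation; backups and asynchronous propagation are left out.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)   # illustrative window

# Hypothetical downstream stores that each hold copies of watch-history rows.
stores = {
    "analytics": [
        {"user": "u1", "ts": datetime(2024, 1, 1, tzinfo=timezone.utc)},
        {"user": "u2", "ts": datetime(2024, 5, 1, tzinfo=timezone.utc)},
    ],
    "cache": [
        {"user": "u1", "ts": datetime(2024, 5, 20, tzinfo=timezone.utc)},
    ],
}


def delete_user(user):
    # Remove the user's rows from every store and record per-store completion
    # so that the request can be audited afterwards.
    audit = {}
    for name, rows in stores.items():
        before = len(rows)
        rows[:] = [r for r in rows if r["user"] != user]
        audit[name] = {"deleted": before - len(rows), "done": True}
    return audit


def enforce_retention(now):
    # Drop rows older than the retention window in every store.
    for rows in stores.values():
        rows[:] = [r for r in rows if now - r["ts"] <= RETENTION]


audit = delete_user("u1")
enforce_retention(now=datetime(2024, 9, 1, tzinfo=timezone.utc))
```

The per-store audit record is what gives the delete workflow an auditable completion state: a request is finished only when every store reports done.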