Getting Video to Viewers — Streaming, HLS, and Delivery Architecture
How video delivery works end-to-end: progressive download vs streaming, HLS protocol and adaptive bitrate, segment generation with ffmpeg and mediabunny, the complete R2 + mediabunny architecture for a course platform, cost modeling, and how YouTube/Loom/Netflix do it. Ends with a challenge to architect a Loom alternative.
You've made it through the hard parts. You know what containers and codecs are (Tutorial 1), and you know how to transform media files between formats (Tutorial 2). You can take a creator's raw video upload and wrangle it into a clean, web-friendly MP4 with H.264 video and AAC audio.
Now for the question that actually matters: how does that video reach the person who wants to watch it?
The 2GB Problem
Picture this. A course creator on your platform just uploaded a 45-minute lecture on advanced Rails patterns. After processing, it's a tidy MP4 — H.264, AAC, 1080p. Two gigabytes.
A student named Priya opens the course page on her phone in Bangalore. She's on Jio 4G — decent bandwidth most of the time, but she's on a moving bus and the signal keeps dipping. She clicks play.
If you just serve that MP4 file the way you'd serve an image or a CSS file — a single HTTP request, browser downloads the whole thing — here's what happens:
She waits. At 10 Mbps (optimistic for mobile 4G), 2GB takes about 27 minutes to download. The browser won't start playing until it has enough data buffered. She's staring at a spinner.
She can't skip ahead. She already watched the first 20 minutes yesterday. She wants to jump to minute 21. Too bad — the browser has to download everything before that point first.
Her connection dips. The bus enters a tunnel. Download stalls. When it comes back, the connection is slower. The video was encoded at 1080p and needs 5 Mbps minimum. Her connection is now 2 Mbps. Playback stutters and dies.
She burns through her data plan. She only wanted to watch 25 minutes, but the browser downloaded all 45 minutes' worth. 2GB gone.
Every single one of these problems has been solved. The solution is what we call streaming, and understanding it is the difference between building a video feature that works in a demo and one that works in the real world.
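The 27-minute figure is plain arithmetic, and it is worth being able to redo it for other file sizes and connection speeds. A minimal sketch, assuming decimal units (1 GB = 10^9 bytes, 1 Mbps = 10^6 bits per second); the function name `downloadSeconds` is made up for illustration:

```typescript
// Seconds needed to transfer `sizeBytes` over a link that sustains `mbps` megabits per second.
function downloadSeconds(sizeBytes: number, mbps: number): number {
  const bits = sizeBytes * 8;
  return bits / (mbps * 1e6);
}

// The 2GB lecture on a 10 Mbps connection, then on the 2 Mbps connection after the tunnel.
const fullLectureMinutes = downloadSeconds(2e9, 10) / 60; // about 26.7 minutes
const afterTunnelMinutes = downloadSeconds(2e9, 2) / 60;  // about 133 minutes
```

At 2 Mbps the same file takes well over two hours to arrive, which is longer than the lecture itself. That is why playback cannot keep up once the connection degrades.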
Progressive Download: The Naive Approach
Before we get to proper streaming, let's understand what the browser does by default when you point a <video> tag at an MP4 file.
<video src="https://your-cdn.com/lecture.mp4" controls></video>
This triggers a progressive download. The browser sends an HTTP GET request for the file, starts downloading from byte 0, and begins playing as soon as it has enough data buffered. It's basically the same thing as downloading a PDF — linear, front to back.
Progressive download sort of works for short videos. A 30-second product demo? Fine. But it has three fatal problems for anything longer:
Problem 1: Seeking requires downloading. Want to jump to minute 30? The browser needs all the bytes before minute 30 first. Unless...
Problem 2 (and its fix): The moov atom problem. Remember from Tutorial 1 that MP4 files have a table of contents — the container's index listing where every frame lives in the file. In MP4, this index is called the moov atom. By default, ffmpeg and most encoders put the moov atom at the end of the file. This means the browser has to download the entire file before it even knows where anything is.
The fix is called MP4 fast start — moving the moov atom to the beginning of the file:
# ffmpeg: move the moov atom to the front (-c copy rewrites the container without re-encoding)
ffmpeg -i input.mp4 -c copy -movflags +faststart output.mp4
With fast start, the browser downloads the table of contents first, which means it can make HTTP range requests to jump to any byte offset. Now seeking works without downloading everything. This is a must-do if you're serving MP4 files directly — without it, the user experience is genuinely broken.
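If you want to check whether an uploaded file is already fast-start, one approach is to walk its top-level boxes and see whether `moov` comes before `mdat`. The sketch below assumes you have the first chunk of the file in memory as a `Uint8Array`; the helper names `topLevelBoxes` and `isFastStart` are hypothetical, and a production check would want more validation:

```typescript
// Walk the top-level boxes of an MP4 and return their types in file order.
// Each box starts with a 4-byte big-endian size followed by a 4-byte ASCII type.
function topLevelBoxes(bytes: Uint8Array): string[] {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const types: string[] = [];
  let offset = 0;
  while (offset + 8 <= bytes.byteLength) {
    let size = view.getUint32(offset);
    let type = "";
    for (let i = 0; i < 4; i++) {
      type += String.fromCharCode(view.getUint8(offset + 4 + i));
    }
    types.push(type);
    if (size === 1) {
      // A size of 1 means a 64-bit "largesize" follows the type field.
      if (offset + 16 > bytes.byteLength) break;
      size = view.getUint32(offset + 8) * 4294967296 + view.getUint32(offset + 12);
    } else if (size === 0) {
      break; // A size of 0 means the box runs to the end of the file.
    }
    if (size < 8) break; // Malformed header: stop instead of looping forever.
    offset += size;
  }
  return types;
}

// True when the index (moov) appears before the media data (mdat).
function isFastStart(bytes: Uint8Array): boolean {
  const types = topLevelBoxes(bytes);
  const moov = types.indexOf("moov");
  const mdat = types.indexOf("mdat");
  return moov !== -1 && (mdat === -1 || moov < mdat);
}
```

Because `mdat` holds almost all of the bytes, reading only the start of the file is usually enough: in a file that is not fast-start, the walk reaches `mdat` and runs off the end of the buffer without ever seeing `moov`.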
Problem 3: No quality adaptation. The video is encoded at one quality level. If Priya's connection drops from 10 Mbps to 2 Mbps, the player can't switch to a lower-quality version. It just buffers and stutters.
Progressive download with fast start is good enough for short, low-stakes videos. But for a course platform where students watch 45-minute lectures on phones with unreliable connections? You need real streaming.
Streaming: Chunks, Manifests, and Adaptation
The core idea behind modern video streaming is almost comically simple:
Don't serve one big file. Serve many small ones.
Take your 45-minute video and slice it into segments — each one a few seconds long. Create a text file (a manifest or playlist) that lists all the segments and their URLs. The video player reads the manifest, then fetches segments one at a time, playing each one as it arrives.
That's it. That's the fundamental architecture. Everything else is refinement.
┌─────────────────────────────────────────────────────┐
│ 45-minute video │
└─────────────────────────────────────────────────────┘
│
split into
│
▼
┌──────┐┌──────┐┌──────┐┌──────┐┌──────┐ ┌──────┐
│ seg0 ││ seg1 ││ seg2 ││ seg3 ││ seg4 │ ... │seg674│
│ 4sec ││ 4sec ││ 4sec ││ 4sec ││ 4sec │ │ 4sec │
└──────┘└──────┘└──────┘└──────┘└──────┘ └──────┘
│
described by
│
▼
┌──────────────────┐
│ manifest.m3u8 │
│ │
│ seg0.ts 0:00 │
│ seg1.ts 0:04 │
│ seg2.ts 0:08 │
│ seg3.ts 0:12 │
│ ... │
└──────────────────┘
This immediately solves two of Priya's problems. Seeking is instant: want minute 30? The player reads the manifest, calculates that minute 30 starts at segment 450 (1,800 seconds ÷ 4 seconds per segment), and fetches that segment directly, with no need to download anything before it. Network recovery is graceful: if a segment download fails, the player retries just that one segment, not the whole file.
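The segment math is simple enough to write down. A minimal sketch, assuming every segment has the same fixed duration (real players sum the per-segment durations listed in the manifest instead); `segmentCount` and `locate` are illustrative names:

```typescript
// Number of segments needed to cover a video of `durationSec` seconds.
function segmentCount(durationSec: number, segmentSec: number): number {
  return Math.ceil(durationSec / segmentSec);
}

// Which segment holds a given playback position, and how far into that segment it is.
function locate(positionSec: number, segmentSec: number): { index: number; offsetSec: number } {
  const index = Math.floor(positionSec / segmentSec);
  return { index, offsetSec: positionSec - index * segmentSec };
}

const total = segmentCount(45 * 60, 4); // 675 segments: seg0 through seg674
const seekTarget = locate(30 * 60, 4);  // { index: 450, offsetSec: 0 }
```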
But the real magic is what this architecture makes possible: adaptive bitrate streaming.
Adaptive Bitrate: The Reason YouTube Quality Changes Mid-Video
Here's where it gets clever. Instead of encoding your video at just one quality level, you encode it at several:
┌───────────────────────────────────────────────┐
│ Original 1080p video │
└───────────────────────────────────────────────┘
│ │ │
encode at encode at encode at
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ 1080p │ │ 720p │ │ 480p │
│ ~5 Mbps │ │ ~2 Mbps │ │ ~1 Mbps │
└─────────┘ └─────────┘ └─────────┘
│ │ │
segment segment segment
│ │ │
▼ ▼ ▼
675 segments 675 segments 675 segments
Now your manifest file doesn't just list one sequence of segments — it lists all three variants, each with its bandwidth requirement:
┌───────────────────────────────────────┐
│ master.m3u8 │
│ │
│ BANDWIDTH=5000000 → 1080p.m3u8 │
│ BANDWIDTH=2000000 → 720p.m3u8 │
│ BANDWIDTH=1000000 → 480p.m3u8 │
│ │
└───────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│1080p.m3u8│ │720p.m3u8│ │480p.m3u8│
│ │ │ │ │ │
│ seg0.ts │ │ seg0.ts │ │ seg0.ts │
│ seg1.ts │ │ seg1.ts │ │ seg1.ts │
│ seg2.ts │ │ seg2.ts │ │ seg2.ts │
│ ... │ │ ... │ │ ... │
└─────────┘ └─────────┘ └─────────┘
The player starts by estimating the available bandwidth. If Priya's connection looks like 5 Mbps, it picks the 1080p variant and starts fetching segments. After each segment download, it measures how long the download took. If her bus enters that tunnel and bandwidth drops to 1.5 Mbps, the player switches to the 480p variant for the next segment, because at ~1 Mbps it is the highest quality that connection can sustain (720p needs ~2 Mbps). When the signal comes back, it switches back up to 1080p.
This is why YouTube quality dips and recovers mid-video. It's not a bug — it's adaptive bitrate working exactly as designed. The player is constantly measuring real-world bandwidth and choosing the best quality level it can sustain without buffering.
From Priya's perspective: the video plays smoothly the entire time. The quality might drop for a minute in the tunnel, but it never stops. That's the difference between a frustrating experience and one that just works.
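The selection logic can be sketched in a few lines. This is a simplified illustration, not how any particular player implements it: real players smooth their bandwidth estimates over several segments and also look at buffer level. The `Variant` type, `measuredBps`, `pickVariant` and the `headroom` parameter are all assumptions made for this example:

```typescript
interface Variant {
  name: string;
  bandwidth: number; // bits per second the variant needs, as advertised in the manifest
}

// Throughput observed while downloading one segment, in bits per second.
function measuredBps(segmentBytes: number, downloadSec: number): number {
  return (segmentBytes * 8) / downloadSec;
}

// Pick the highest-bandwidth variant that fits within the measured throughput.
// `headroom` scales the estimate down (for example 0.8) to leave a safety margin.
// Falls back to the lowest variant when nothing fits.
function pickVariant(variants: Variant[], throughputBps: number, headroom = 1): Variant {
  if (variants.length === 0) throw new Error("no variants");
  const sorted = [...variants].sort((a, b) => a.bandwidth - b.bandwidth);
  const budget = throughputBps * headroom;
  return sorted.reduce((best, v) => (v.bandwidth <= budget ? v : best));
}

const ladder: Variant[] = [
  { name: "1080p", bandwidth: 5_000_000 },
  { name: "720p", bandwidth: 2_000_000 },
  { name: "480p", bandwidth: 1_000_000 },
];

const goodSignal = pickVariant(ladder, 5_000_000).name; // "1080p"
const inTunnel = pickVariant(ladder, 1_500_000).name;   // "480p"
```

Passing a `headroom` below 1 makes the choice more conservative: with 0.8, a 5 Mbps estimate only budgets 4 Mbps, so the sketch would settle on 720p rather than risk a stall.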
HLS: How This Actually Works in Practice
The protocol that makes all of this happen is HLS — HTTP Live Streaming. Apple created it in 2009, and it has become the dominant streaming protocol on the internet. There's also DASH (Dynamic Adaptive Streaming over HTTP), which does roughly the same thing with different manifest formats, but HLS has won the practical adoption war and is what you'll use.
The beauty of HLS is right there in the name: HTTP. There's no special streaming server. No custom protocol. No WebSocket magic. It's just regular HTTP serving regular files. Your segments are files. Your manifests are files. You put them on any web server — or better yet, on a CDN — and it works.
Let's look at what the actual files look like.
Master playlist (master.m3u8):
```
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2000000,RESOLUTION=1280x720
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1000000,RESOLUTION=854x480
480p/playlist.m3u8
```
Media playlist (1080p/playlist.m3u8):
```
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:4
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:4.000,
seg000.ts
#EXTINF:4.000,
seg001.ts
#EXTINF:4.000,
seg002.ts
...
#EXT-X-ENDLIST
```
That's it. It's a text file. The master playlist says "here are the quality variants available and their bandwidths." Each media playlist says "here are the segments for this quality level, each one is 4 seconds long, fetch them from these URLs." The player does the rest.
The segments themselves are typically MPEG Transport Stream (`.ts`) files or fragmented MP4 (`.m4s` / CMAF) files. Transport Stream is the older format and has wider compatibility; CMAF is newer, more efficient, and what you'd pick for a new project. Both work.
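To make the format concrete, here is a minimal sketch of how a player might read a media playlist. It only understands `#EXTINF` and URI lines and ignores every other tag, so treat it as an illustration of the file format rather than a usable parser; `Segment`, `parseMediaPlaylist` and `segmentAt` are names invented for this example:

```typescript
interface Segment {
  uri: string;
  durationSec: number;
  startSec: number; // cumulative start time within the video
}

// Pair each #EXTINF tag with the URI line that follows it.
function parseMediaPlaylist(text: string): Segment[] {
  const lines = text.split(/\r?\n/).map((l) => l.trim()).filter((l) => l.length > 0);
  if (lines[0] !== "#EXTM3U") throw new Error("not an M3U8 playlist");
  const segments: Segment[] = [];
  let pendingDuration: number | null = null;
  let clock = 0;
  for (const line of lines) {
    if (line.startsWith("#EXTINF:")) {
      // "#EXTINF:4.000,optional title" parses to 4
      pendingDuration = parseFloat(line.slice("#EXTINF:".length));
    } else if (!line.startsWith("#") && pendingDuration !== null) {
      segments.push({ uri: line, durationSec: pendingDuration, startSec: clock });
      clock += pendingDuration;
      pendingDuration = null;
    }
  }
  return segments;
}

// Find the segment that contains a playback position, using the durations from the playlist.
function segmentAt(segments: Segment[], positionSec: number): Segment | undefined {
  return segments.find(
    (s) => positionSec >= s.startSec && positionSec < s.startSec + s.durationSec,
  );
}
```

Summing the listed durations, instead of assuming every segment is exactly 4 seconds, matters in practice because the last segment is usually shorter and encoders do not always hit the target duration exactly.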
### Generating HLS with ffmpeg
Here's how you take a processed MP4 and generate an HLS package with three quality variants:
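```bash
# ffmpeg expects the output directories to exist
mkdir -p 1080p 720p 480p

# -force_key_frames puts a keyframe every 4 seconds in every variant, so each
# segment can be cut at -hls_time and segment boundaries line up across qualities

# Generate 1080p variant
ffmpeg -i lecture.mp4 \
  -vf scale=1920:1080 -c:v libx264 -b:v 5M -c:a aac -b:a 128k \
  -force_key_frames "expr:gte(t,n_forced*4)" \
  -hls_time 4 -hls_playlist_type vod -hls_segment_filename '1080p/seg%03d.ts' \
  1080p/playlist.m3u8

# Generate 720p variant
ffmpeg -i lecture.mp4 \
  -vf scale=1280:720 -c:v libx264 -b:v 2M -c:a aac -b:a 128k \
  -force_key_frames "expr:gte(t,n_forced*4)" \
  -hls_time 4 -hls_playlist_type vod -hls_segment_filename '720p/seg%03d.ts' \
  720p/playlist.m3u8

# Generate 480p variant
ffmpeg -i lecture.mp4 \
  -vf scale=854:480 -c:v libx264 -b:v 1M -c:a aac -b:a 96k \
  -force_key_frames "expr:gte(t,n_forced*4)" \
  -hls_time 4 -hls_playlist_type vod -hls_segment_filename '480p/seg%03d.ts' \
  480p/playlist.m3u8
```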
Then you write a master.m3u8 that points to the three playlists. Upload the whole directory to your storage, and you have a streamable video.
Key flags to understand:
- `-hls_time 4`: target segment duration in seconds (more on choosing this later)
- `-hls_playlist_type vod`: "video on demand" mode, meaning the playlist is complete and won't change (as opposed to live streaming, where new segments keep appearing)
- `-hls_segment_filename`: pattern for naming the segment files
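Writing the master playlist by hand is fine for one video, but you will probably want to generate it. A minimal sketch, assuming the three renditions and directory names used above; `Rendition` and `buildMasterPlaylist` are hypothetical names. The bandwidth values mirror the example master playlist; strictly speaking `BANDWIDTH` should be the peak bitrate of video and audio combined, so in practice you would add the audio bitrate and some margin:

```typescript
interface Rendition {
  bandwidth: number; // bits per second, as advertised to the player
  width: number;
  height: number;
  playlist: string;  // path to the media playlist, relative to master.m3u8
}

// Emit one #EXT-X-STREAM-INF tag plus one URI line per rendition.
function buildMasterPlaylist(renditions: Rendition[]): string {
  const lines = ["#EXTM3U"];
  for (const r of renditions) {
    lines.push(`#EXT-X-STREAM-INF:BANDWIDTH=${r.bandwidth},RESOLUTION=${r.width}x${r.height}`);
    lines.push(r.playlist);
  }
  return lines.join("\n") + "\n";
}

const master = buildMasterPlaylist([
  { bandwidth: 5_000_000, width: 1920, height: 1080, playlist: "1080p/playlist.m3u8" },
  { bandwidth: 2_000_000, width: 1280, height: 720, playlist: "720p/playlist.m3u8" },
  { bandwidth: 1_000_000, width: 854, height: 480, playlist: "480p/playlist.m3u8" },
]);
```

The resulting string is what you would save as `master.m3u8` next to the three variant directories.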
### Generating HLS with mediabunny
mediabunny supports HLS as a first-class output format, which means you can do this client-side in the browser — no server-side ffmpeg required:
import { Output, HlsOutputFormat, PathedTarget, FilePathTarget } from 'mediabunny';
const output = new Output({
format: new HlsOutputFormat({ /* segment duration, codec config */ }),
target: new PathedTarget(
'master.m3u8',
({ path }) => new FilePathTarget(`/output/${path}`),
),
});
// Add your video and audio tracks
output.addVideoTrack(videoSource);
output.addAudioTrack(audioSource);
await output.start();
// Feed in media data...
// mediabunny handles the segmenting and manifest generation
await output.finalize();
The PathedTarget is the key abstraction here. Unlike MP4 where the output is a single file, HLS produces many files — the master playlist, per-variant playlists, and hundreds of segment files. PathedTarget takes a callback that resolves each file path to a target, so you can route segments to disk, to memory, or directly to a cloud storage upload stream.
Reading HLS back works symmetrically. If you need to ingest an existing HLS stream — say, pulling a video from another platform for re-processing:
import { Input, UrlSource, ALL_FORMATS } from 'mediabunny';
const input = new Input({
source: new UrlSource('https://cdn.example.com/master.m3u8'),
formats: ALL_FORMATS,
});
// mediabunny parses the m3u8, resolves segment URLs, and gives you
// unified access to the media data as if it were a single file
mediabunny handles the m3u8 parsing internally — it follows the master playlist to find variant playlists, then fetches segments on demand. You get a clean, unified input regardless of whether the source was a single MP4 or a 600-segment HLS stream.
The Complete Architecture for a Course Platform
Let's put everything together. Here's how video delivery would work end-to-end for a courses feature — from the moment a creator clicks "Upload" to the moment a student watches it on their phone.
CREATOR'S BROWSER CLOUDFLARE R2 STUDENT'S BROWSER
───────────────── ────────────── ─────────────────
1. Creator selects
lecture.mov (2GB)
│
▼
2. mediabunny reads
the file (client-side)
- extracts metadata
- validates format
│
▼
3. Transmux/transcode
to web-friendly MP4
(client-side via
mediabunny)
│
▼
4. Generate quality 5. Upload segments
variants + HLS segments ────► & manifests to R2
(server-side is │
smarter here) │
│
6. Segments + manifests
sitting in R2 as
plain files
│
│ 7. Student clicks play
│ │
│ ▼
│ 8. HLS player fetches
│◄─────────── master.m3u8
│ │
│ ▼
│ 9. Player picks best
│ variant for bandwidth
│ │
│ ▼
│ 10. Fetches segments
│◄─────────── one by one
│ │
│ ▼
│ 11. Plays, adapts quality
│ as bandwidth changes
Let's walk through each step.
Steps 1-3: Upload and normalization. The creator's browser uses mediabunny to read the uploaded file, check its format, and transmux or transcode it into a clean MP4. This is what you learned in Tutorial 2. The key insight: this work happens on the creator's machine, not your server. Their CPU does the heavy lifting.
Step 4: Generating quality variants. This is where it gets interesting. You could do this client-side too — mediabunny can encode multiple quality levels. But for generating 2-3 quality variants of a 45-minute video, server-side processing is probably smarter. The creator's browser would be grinding for a long time, and if they close the tab, you lose everything. A more robust pattern: upload the normalized MP4 once, then kick off a server-side job (using ffmpeg on your server, or a service like Mux or Cloudflare Stream) to generate the HLS package.
Step 5: Upload to R2. Cloudflare R2 is object storage, like S3, but with one critical difference: no egress fees. S3 charges you every time someone downloads a file. For video — where a single popular course might serve terabytes of data per month — egress fees can be devastating. R2 charges $0 for egress. For a bootstrapped course platform, this is a game-changer.
Your HLS package in R2 looks like a directory:
/courses/rails-patterns-101/
master.m3u8
1080p/
playlist.m3u8
seg000.ts
seg001.ts
...
720p/
playlist.m3u8
seg000.ts
seg001.ts
...
480p/
playlist.m3u8
seg000.ts
...
Just files. Plain, boring files sitting in object storage. No streaming server, no special infrastructure.
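"Plain files" still need the right metadata when you upload them, because players and CDNs rely on the `Content-Type` header. The sketch below maps file extensions to commonly used MIME types and lists the object keys for one variant; `contentTypeFor` and `variantKeys` are illustrative helpers, and the actual upload call (an S3-compatible client, for instance) is left out on purpose:

```typescript
// Content-Type to set on each uploaded object, keyed by file extension.
const HLS_CONTENT_TYPES: Record<string, string> = {
  ".m3u8": "application/vnd.apple.mpegurl",
  ".ts": "video/mp2t",
  ".m4s": "video/iso.segment",
  ".mp4": "video/mp4",
};

function contentTypeFor(key: string): string {
  const dot = key.lastIndexOf(".");
  const ext = dot === -1 ? "" : key.slice(dot).toLowerCase();
  return HLS_CONTENT_TYPES[ext] ?? "application/octet-stream";
}

// Object keys for one variant, matching the layout above (seg000.ts, seg001.ts, ...).
function variantKeys(prefix: string, variant: string, segments: number): string[] {
  const keys = [`${prefix}/${variant}/playlist.m3u8`];
  for (let i = 0; i < segments; i++) {
    keys.push(`${prefix}/${variant}/seg${String(i).padStart(3, "0")}.ts`);
  }
  return keys;
}

const keys = variantKeys("courses/rails-patterns-101", "720p", 2);
// ["courses/rails-patterns-101/720p/playlist.m3u8",
//  "courses/rails-patterns-101/720p/seg000.ts",
//  "courses/rails-patterns-101/720p/seg001.ts"]
```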
Steps 6-11: Playback. On the student's side, you need an HLS-capable video player in the browser. The native <video> tag doesn't support HLS on most browsers (Safari is the exception — Apple made the protocol, after all). You use a library like hls.js:
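import Hls from 'hls.js';
const video = document.getElementById('player');
const src = 'https://r2.your-domain.com/courses/rails-patterns-101/master.m3u8';
if (Hls.isSupported()) {
  const hls = new Hls(); // plays HLS via Media Source Extensions
  hls.loadSource(src);
  hls.attachMedia(video);
} else if (video.canPlayType('application/vnd.apple.mpegurl')) {
  video.src = src; // Safari plays HLS natively
}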
That's the entire playback setup. hls.js reads the master manifest, figures out which quality variant matches the current bandwidth, and starts fetching segments. If Priya's connection degrades mid-lecture, hls.js detects it on the next segment fetch and automatically drops to a lower quality variant. If it recovers, back to 1080p.
The cost model
Let's run some rough numbers for a bootstrapped SaaS:
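- Storage (R2): 2GB per course × 3 quality variants × ~1.1x HLS overhead ≈ 7GB per course. This deliberately overestimates, because the 720p and 480p variants are much smaller than the 1080p one. At R2's $0.015/GB/month, 100 courses costs you about $10.50/month in storage.
- Egress (R2): $0 for bandwidth. Literally zero. This is the whole point. (R2 does bill for read requests, but the bytes themselves leave for free.)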
- Compute (variant generation): If you're running ffmpeg on your own server, it's just CPU time. A 45-minute video takes maybe 10-15 minutes to encode three variants on a modest VPS. If you're using a service like Mux, you pay per minute of video processed.
Compare that to S3, where 1TB of egress (a busy month for a popular course) costs $90. At scale, the egress difference between S3 and R2 is the difference between a profitable feature and one that bleeds money.
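These numbers are easy to rerun with your own assumptions. A rough sketch using the prices quoted above (check current pricing before relying on them) and the bitrates from the ffmpeg commands; all function and constant names are made up for this example:

```typescript
const R2_STORAGE_USD_PER_GB_MONTH = 0.015; // figure quoted above
const S3_EGRESS_USD_PER_GB = 0.09;         // implied by "$90 per TB" above

// Size in GB of one variant: (video + audio bitrate) x duration, in decimal units.
function variantSizeGB(videoBps: number, audioBps: number, durationSec: number): number {
  return ((videoBps + audioBps) * durationSec) / 8 / 1e9;
}

function monthlyStorageUSD(gbPerCourse: number, courses: number): number {
  return gbPerCourse * courses * R2_STORAGE_USD_PER_GB_MONTH;
}

function egressUSD(gb: number, usdPerGB: number): number {
  return gb * usdPerGB;
}

// The estimate from the text: 7GB per course, 100 courses.
const upperBound = monthlyStorageUSD(7, 100); // about $10.50

// One 45-minute lecture at the 5M / 2M / 1M ladder with 128k / 128k / 96k audio.
const ladderGB =
  variantSizeGB(5e6, 128e3, 45 * 60) +
  variantSizeGB(2e6, 128e3, 45 * 60) +
  variantSizeGB(1e6, 96e3, 45 * 60); // about 2.8 GB

const s3Egress = egressUSD(1000, S3_EGRESS_USD_PER_GB); // about $90 for 1TB
```

The bitrate-based figure comes out well under 7GB, so the storage line is a comfortable upper bound. Egress is where the two providers actually diverge.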
How the Rest of the Industry Does It
Every major video platform uses the same fundamental architecture. The differences are in scale and optimization, not in concept.
YouTube transcodes every upload into a staggering number of variants — sometimes 10+ quality levels from 144p to 4K/8K. They use both DASH and HLS, serve from their own global edge CDN, and apply per-video encoding optimization (some videos compress better than others, so they adjust bitrates accordingly). But strip away the Google-scale infrastructure and it's the same pattern: encode variants, chunk, manifest, serve over HTTP.
Loom optimizes for a different metric: time-to-playback after recording. When you stop a Loom recording, you want the link to be shareable within seconds, not minutes. So Loom starts processing while you're still recording, generates a low-quality variant first for instant playback, and backfills higher quality variants in the background. Same architecture, different prioritization.
Netflix takes encoding optimization to an extreme with what they call "per-shot encoding." They analyze each scene individually — a dark, static dialogue scene can be encoded at much lower bitrate than a bright action sequence without visible quality loss. The result: better quality at lower bandwidth, which means fewer buffering events, which means lower churn. Same fundamental pattern, with a sophistication layer on top that only makes sense at Netflix-scale.
Zoom recordings are the simplest version. Typically just one or two quality variants, less aggressive segmentation, no per-shot optimization. But still: segments, manifests, CDN. The pattern holds.
The point is: you're not inventing something new. You're implementing a well-understood architecture. The decisions that matter are the ones you make at the margins.
The Decisions You'll Actually Face
When you sit down to build this, here are the choices that matter:
How many quality variants? For a course platform, three is the sweet spot: 480p (mobile on bad connections), 720p (the workhorse — looks good on laptops and phones), and 1080p (desktop and good connections). YouTube encodes a dozen variants because they serve billions of viewers with wildly different devices. You don't need that. Two variants is the minimum for adaptive bitrate to be meaningful; four is likely overkill for your scale.
Segment duration? This is a tradeoff between seeking precision and overhead. Shorter segments (2 seconds) mean the player can adapt to bandwidth changes faster and seeking is more precise, but you get more files (1,350 segments for a 45-minute video at 2-second segments) and more HTTP request overhead. Longer segments (10 seconds) mean fewer files and less overhead, but slower adaptation. The industry standard is 4-6 seconds. Pick 4 and don't overthink it.
Client-side vs server-side processing? Here's a sensible split: use mediabunny client-side for format normalization (transmux that .mov to .mp4, fix the codec if needed) and server-side for variant generation and HLS packaging. Client-side processing is great for lightweight operations, but encoding three quality variants of a long video is CPU-intensive and you don't want to hold a user's browser hostage for 20 minutes. Upload the normalized MP4, process server-side.
Segment format: .ts vs .m4s (CMAF)? MPEG Transport Stream (.ts) is the original HLS segment format and works everywhere. CMAF (.m4s, fragmented MP4) is newer, slightly more efficient (smaller segments, less overhead), and is compatible with both HLS and DASH. If you're starting fresh, CMAF is the better choice. If you need to support very old devices, stick with .ts.
DRM and content protection? For a course platform in its early days, probably skip it. DRM (like Apple FairPlay or Google Widevine) is complex to implement, expensive (licensing fees), and degrades the user experience. The practical reality: anyone determined to pirate your content will screen-record it regardless. Signed URLs with expiration (so only authenticated users can access the segments) give you 90% of the protection at 1% of the complexity. Add DRM later if and when a paying customer demands it.
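To show what "signed URLs with expiration" means mechanically, here is a generic HMAC sketch for Node. It is not R2's own presigned-URL format: the query parameter names, the signing string, and the functions `sign`, `signedUrl` and `verify` are all assumptions for illustration, and something you control (a server or an edge worker) would have to run `verify` before serving each object:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sign "path + expiry" with a secret that only the server knows.
function sign(path: string, expires: number, secret: string): string {
  return createHmac("sha256", secret).update(`${path}\n${expires}`).digest("hex");
}

// Build a URL that is valid until `expires` (Unix time in seconds).
function signedUrl(base: string, path: string, expires: number, secret: string): string {
  return `${base}${path}?expires=${expires}&sig=${sign(path, expires, secret)}`;
}

// Check a request before serving the object: not expired, and signature matches.
function verify(
  path: string,
  expires: number,
  sig: string,
  secret: string,
  nowSec: number,
): boolean {
  if (nowSec > expires) return false;
  const expected = Buffer.from(sign(path, expires, secret), "hex");
  const given = Buffer.from(sig, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return given.length === expected.length && timingSafeEqual(given, expected);
}
```

One wrinkle with HLS: the player resolves segment URIs relative to the playlist URL, and that drops the query string. Every playlist and segment request therefore needs its own way of carrying a valid token, for example a cookie or playlist URIs rewritten per viewer.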
Putting It All Together
Over three tutorials, you've built a mental model of the entire video pipeline:
TUTORIAL 1 TUTORIAL 2 TUTORIAL 3
Containers & Codecs Operations Delivery
───────────────── ────────── ────────
┌─────────────┐ ┌──────────────┐ ┌──────────────────┐
│ Video file │ │ Transmux │ │ HLS packaging │
│ = Container │─────►│ Transcode │─────►│ Multi-variant │
│ + Codec │ │ Mux/Demux │ │ CDN delivery │
│ + Metadata │ │ │ │ Adaptive player │
└─────────────┘ └──────────────┘ └──────────────────┘
"What is it?" "How to change it" "How to serve it"
The raw upload is a container (maybe .mov, maybe .webm) holding encoded streams (maybe H.264, maybe VP9). You transmux or transcode it into a web-friendly format. Then you package it into HLS segments, generate a manifest, upload to a CDN, and let an adaptive player handle the rest.
And the beautiful part: none of this requires exotic infrastructure. An MP4 file. Some .ts segments. A text file listing them. A CDN that serves files over HTTP. A JavaScript library in the browser that fetches segments intelligently. That's the entire stack. The same stack Netflix uses — just without the per-shot encoding PhD team.
Challenge: Architect a Loom Alternative
You now know enough to design the video pipeline for a real product. Here's your challenge:
Design the architecture for a Loom-style screen recording tool. A user clicks "Record," captures their screen + camera + mic, stops recording, and gets a shareable link. A viewer clicks the link and watches the recording with adaptive quality.
Sketch answers to these questions:
Recording format: The browser's MediaRecorder API gives you a WebM file with VP8/VP9 video and Opus audio. Is that your final format, or do you transcode? What are the tradeoffs?
Time-to-link: Loom's magic is that the link works within seconds of stopping the recording. How would you achieve that? Hint: think about what you can do during recording, not just after.
Quality variants: Do you generate multiple quality levels? When? The viewer might click the link 5 seconds after recording stops — they can't wait for a full transcode pipeline.
Storage and delivery: Where do the segments live? How do you serve them? What's the cost model?
The full pipeline diagram: Draw it. Every step from "user clicks Record" to "viewer watches at adaptive quality." Label where mediabunny runs, where ffmpeg runs (if anywhere), where the files live, and what protocol delivers them.
There's no single right answer. But if your architecture handles the Priya-on-a-bus scenario — unreliable connection, instant seeking, quality adaptation — you've internalized the material. The rest is implementation.