What's Actually Inside a Video File?
The foundational mental model for video engineering: containers vs codecs, tracks, and how tools like mediabunny and ffmpeg operate on these two layers. Taught through mediabunny's API and ffmpeg commands, with real product examples (Loom, YouTube, Netflix, CapCut).
You've been building web apps for years. You know what's inside a database row, what's inside an HTTP request, what's inside a JWT. But what's inside a video file? If your answer is "a bunch of images played really fast," you're in good company — and you're about to find out why that answer, while not wrong, misses the architecture that makes modern video actually work.
This is the foundation for everything else in this series. By the end, you'll have a mental model that explains why tools like mediabunny and ffmpeg exist, why the same video can be in five different file formats, and why Netflix spends billions on infrastructure that ultimately just... shows you pictures really fast.
The Naive Model (And Why It Breaks)
Let's start with the mental model you probably have. A video is a flipbook — a sequence of images, played at 24 or 30 frames per second, with some audio layered on top. Simple, right?
Let's do the math. Say you're recording a 1080p screen capture for a feature demo. Each frame is 1920 x 1080 pixels. Each pixel needs 3 bytes (red, green, blue). So one frame is:
1920 × 1080 × 3 bytes = 6,220,800 bytes ≈ 6 MB per frame
At 30 frames per second:
6 MB × 30 fps = 180 MB per second
A 5-minute demo would be 54 GB.
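If you'd rather let the machine do the arithmetic, here is a minimal sketch of the same calculation. The function name rawVideoBytes is made up for this example, and it assumes 8-bit RGB with no audio. Because the text rounds 6.22 MB down to 6 MB per frame, the exact figure comes out closer to 56 GB than 54 GB.

```typescript
// Uncompressed size of a video, assuming 3 bytes per pixel (8-bit RGB) and no audio.
function rawVideoBytes(
  width: number,
  height: number,
  fps: number,
  seconds: number,
  bytesPerPixel = 3,
): number {
  const bytesPerFrame = width * height * bytesPerPixel;
  return bytesPerFrame * fps * seconds;
}

// One frame: 1 fps for 1 second.
const oneFrame = rawVideoBytes(1920, 1080, 1, 1);
// Five minutes at 30 fps.
const fiveMinutes = rawVideoBytes(1920, 1080, 30, 5 * 60);

console.log(`${(oneFrame / 1e6).toFixed(2)} MB per frame`);        // 6.22 MB per frame
console.log(`${(fiveMinutes / 1e9).toFixed(1)} GB for 5 minutes`); // 56.0 GB for 5 minutes
```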
Your users are not uploading 54 GB files to your Rails app. Something else is going on.
That "something else" is the entire field of video engineering, and it starts with a deceptively simple question: how do you make a video file small enough to actually use?
Early engineers in the 1990s faced exactly this problem. They needed to compress video dramatically — from gigabytes to megabytes — while keeping it watchable. And they needed the compressed file to support basic operations: play from the beginning, jump to a specific time, play at different speeds. Oh, and the audio needs to stay perfectly synchronized with the video.
The solution they arrived at has two distinct layers, and understanding the separation between those layers is the single most important concept in this entire tutorial.
The Two-Layer Architecture
Here's the mental model that will carry you through every video-related decision you'll ever make:
┌─────────────────────────────────────────────────────────┐
│ CONTAINER (MP4) │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Video Track [compressed with H.264] │ │
│ │ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Audio Track [compressed with AAC] │ │
│ │ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Metadata duration, resolution, title... │ │
│ └─────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘
There are two completely separate jobs here:
The Codec is the compression algorithm. It takes raw, enormous video data (those 6 MB frames) and crunches it down to something manageable. H.264 can compress that 54 GB demo video down to maybe 50 MB. That's a ~1000x reduction. Codecs are math — clever, beautiful math that exploits the fact that most of a video frame looks almost identical to the previous frame.
The Container is the packaging. It takes the compressed streams — video, audio, maybe subtitles — and bundles them into a single file with a table of contents. The container doesn't care how the video was compressed. It just needs to know where each piece of data lives so it can serve it up when asked.
Think of it like a ZIP file (the container) that holds a JPEG and an MP3 (the codecs). The ZIP format doesn't know anything about image compression or audio encoding. It just stores files and knows how to find them. Similarly, an MP4 container doesn't know how H.264 compression works. It just stores the compressed data and keeps a map of where everything is.
This separation is the reason tools like mediabunny and ffmpeg exist. They operate on these two layers independently. And it's the reason the same concepts keep showing up whether you're building a Loom-style screen recorder, a CapCut-style editor, or a YouTube-style processing pipeline.
Codecs: The Compression Engine
So how does a codec actually shrink video by 1000x? The core insight is redundancy. Video is absurdly redundant.
Think about a screen recording of someone giving a presentation. For most of the video, the slide doesn't change. The speaker's face moves a little. Maybe a cursor blinks. But 95% of each frame is identical to the previous one. Why store the same pixels 30 times per second?
A codec like H.264 exploits this with a technique you'll hear called inter-frame compression. Instead of storing every frame as a complete image, it stores:
Frame 1 (Key Frame): Full image — every pixel stored
Frame 2 (Delta Frame): "Same as Frame 1, except these 200 pixels changed"
Frame 3 (Delta Frame): "Same as Frame 2, except these 150 pixels changed"
Frame 4 (Delta Frame): "Same as Frame 3, except these 180 pixels changed"
...
Frame 30 (Key Frame): Full image again (reset point)
The key frames (also called I-frames) are complete pictures. The delta frames (P-frames and B-frames, if you want to get precise) only describe what changed. Since most frames are deltas, and most deltas are tiny, the total data shrinks enormously.
This has a consequence you'll run into the moment you try to build anything with video: you can't decode an arbitrary frame on its own. If you want to show frame 17, the decoder has to start at the nearest preceding key frame (frame 1 in this case) and replay every delta forward to frame 17. This is why seeking in a video sometimes snaps to a slightly earlier position: the player is jumping to the key frame just before the point you asked for.
Mediabunny calls these "key packets" and "delta packets" in its API — the same concept, same constraint:
Key packets can be decoded directly, independently of other packets.
Delta packets can only be decoded after the packet before it has been decoded.
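To make the constraint concrete, here is a small sketch that works out which packets a decoder has to chew through before it can show a given frame. The type and function names are hypothetical (this is not mediabunny's API), and it assumes the simplest case where each delta depends only on the packet right before it, ignoring B-frame reordering.

```typescript
type PacketKind = 'key' | 'delta';

// Returns the indices of the packets that must be decoded, in order,
// to display the packet at targetIndex.
function packetsNeededFor(kinds: PacketKind[], targetIndex: number): number[] {
  if (targetIndex < 0 || targetIndex >= kinds.length) {
    throw new RangeError(`No packet at index ${targetIndex}`);
  }
  // Walk backwards to the nearest preceding key packet.
  let start = targetIndex;
  while (start > 0 && kinds[start] !== 'key') start--;
  if (kinds[start] !== 'key') {
    throw new Error('Stream does not start with a key packet');
  }
  // Everything from that key packet up to the target has to be decoded.
  const needed: number[] = [];
  for (let i = start; i <= targetIndex; i++) needed.push(i);
  return needed;
}

// Same layout as the example above: key frames at frame 1 and frame 30
// (indices 0 and 29), deltas everywhere else.
const kinds: PacketKind[] = Array.from({ length: 60 }, (_, i) =>
  i % 29 === 0 ? 'key' : 'delta',
);

console.log(packetsNeededFor(kinds, 16).length); // frame 17 needs 17 packets decoded
console.log(packetsNeededFor(kinds, 30));        // frame 31 needs only [29, 30]
```

Notice how much cheaper frame 31 is than frame 17: it sits right after a key frame, so there is almost nothing to replay.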
Here are the codecs you'll actually encounter in the wild:
| Codec | Also called | Era | What it's for |
|---|---|---|---|
| H.264 | AVC | 2003 | The default. Works everywhere. Most video on the web today. |
| H.265 | HEVC | 2013 | Up to roughly 50% smaller files than H.264 at similar quality, but patent-encumbered and not universal. |
| VP9 | — | 2013 | Google's answer to H.265. Used by YouTube. Free. |
| AV1 | — | 2018 | The next generation. Open-source, royalty-free, excellent compression. Slow to encode. |
| AAC | — | 1997 | The standard audio codec. What you hear in most MP4 files. |
| Opus | — | 2012 | Better than AAC at low bitrates. Used in WebRTC, Discord, WhatsApp calls. |
The trend is clear: each generation compresses better but takes more CPU to encode. AV1 is commonly cited as producing files roughly 30% smaller than H.265 or VP9, and around half the size of H.264, at comparable quality, but encoding can take 10–100x longer. This tradeoff drives real product decisions: it's why YouTube can afford AV1 (they encode once, serve billions of times) but a real-time screen recorder probably can't.
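The "encode once, serve many times" argument is really a break-even calculation. Here is a rough sketch of it. The function name and every number passed in are invented for illustration (abstract cost units, not real pricing), so treat it as a way to think about the tradeoff rather than a cost model.

```typescript
// How many views does it take before a slower, better codec pays for itself?
function breakEvenViews(
  extraEncodeCost: number,       // extra one-time cost of the slower encode
  megabytesSavedPerView: number, // smaller file means fewer megabytes delivered per view
  costPerMegabyte: number,       // what delivering one megabyte costs
): number {
  const savingPerView = megabytesSavedPerView * costPerMegabyte;
  if (savingPerView <= 0) return Infinity; // no saving, so it never pays off
  return Math.ceil(extraEncodeCost / savingPerView);
}

// Hypothetical inputs: the slower encode costs 5000 units more,
// saves 20 MB per view, and delivery costs 0.5 units per MB.
console.log(breakEvenViews(5000, 20, 0.5)); // 500
```

A video watched a handful of times never reaches the break-even point, while one watched millions of times passes it almost immediately. A live screen recorder has a second problem this sketch ignores: it can't wait for a slow encode at all.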
Containers: The Packaging
If codecs are the engine, containers are the chassis. The container format answers a completely different set of questions: How do I bundle multiple tracks into one file? How does a player find the audio that goes with the video? How do I jump to the 5-minute mark without reading the entire file?
Here's a more detailed view of what's actually inside a container:
┌─────────────────────────────────────────────────────┐
│ MP4 CONTAINER │
│ │
│ ┌───────────────────────────────────────────────┐ │
│ │ Header / Table of Contents │ │
│ │ • Number of tracks: 3 │ │
│ │ • Duration: 5:23.4 │ │
│ │ • Created: 2026-05-27 │ │
│ │ • Seek index: [0s→byte 0, 10s→byte 48291, │ │
│ │ 20s→byte 91744, ...] │ │
│ └───────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────┐ │
│ │ Track 1: Video │ │
│ │ • Codec: H.264 (avc1.42c032) │ │
│ │ • Resolution: 1920×1080 │ │
│ │ • Frame rate: 30 fps │ │
│ │ • [compressed video data...] │ │
│ └───────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────┐ │
│ │ Track 2: Audio (English) │ │
│ │ • Codec: AAC (mp4a.40.2) │ │
│ │ • Sample rate: 48000 Hz │ │
│ │ • Channels: 2 (stereo) │ │
│ │ • [compressed audio data...] │ │
│ └───────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────┐ │
│ │ Track 3: Audio (Hindi) │ │
│ │ • Codec: AAC (mp4a.40.2) │ │
│ │ • Sample rate: 48000 Hz │ │
│ │ • Channels: 6 (5.1 surround) │ │
│ │ • [compressed audio data...] │ │
│ └───────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────┘
That seek index is important. It's what lets a video player jump to the 2-minute mark without reading the first two minutes of data. The container stores a map: "timestamp X is at byte offset Y in the file." Without this, every seek would mean reading from the beginning.
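A seek index is just a sorted list, so the lookup is a binary search for the last entry at or before the requested time. The sketch below uses the illustrative offsets from the diagram above. The SeekEntry shape and findSeekEntry name are made up for this example, and real containers store this information in their own binary structures.

```typescript
interface SeekEntry {
  timeSeconds: number;
  byteOffset: number;
}

// Finds the last entry whose time is at or before the target.
// Assumes the index is sorted by time; targets before the first
// entry are clamped to the first entry.
function findSeekEntry(index: SeekEntry[], targetSeconds: number): SeekEntry {
  if (index.length === 0) throw new Error('Empty seek index');
  let lo = 0;
  let hi = index.length - 1;
  while (lo < hi) {
    const mid = Math.ceil((lo + hi) / 2);
    if (index[mid]!.timeSeconds <= targetSeconds) {
      lo = mid;     // mid is a candidate, look for a later one
    } else {
      hi = mid - 1; // mid is past the target
    }
  }
  return index[lo]!;
}

const seekIndex: SeekEntry[] = [
  { timeSeconds: 0, byteOffset: 0 },
  { timeSeconds: 10, byteOffset: 48291 },
  { timeSeconds: 20, byteOffset: 91744 },
];

console.log(findSeekEntry(seekIndex, 15).byteOffset); // 48291
```

Seeking to 15 seconds lands on the 10-second entry. The player starts reading at that byte offset and decodes forward to the exact moment you asked for.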
Now here's what makes the container/codec separation click. The same compressed H.264 video data could be placed inside any of these containers:
┌──── .mp4 (Apple, web, most things)
│
H.264 video ─────┼──── .mkv (power users, Plex, archiving)
+ AAC audio │
├──── .mov (Final Cut Pro, Apple ecosystem)
│
└──── .ts (broadcast TV, HLS streaming)
And the same MP4 container can hold video compressed with different codecs:
┌──── H.264 (universal compatibility)
│
.mp4 container ───┼──── H.265 (smaller files, newer devices)
│
└──── AV1 (smallest files, newest devices)
This is exactly like how a .zip file can contain a .jpg or a .png or a .webp — the container doesn't care about the compression format of its contents.
The containers you'll encounter most:
| Container | Extension | Why it exists |
|---|---|---|
| MP4 / ISOBMFF | .mp4, .m4v, .m4a | The universal default. Works in every browser, every device. |
| WebM | .webm | Google's web-optimized container. Pairs with VP9 or AV1. |
| MKV (Matroska) | .mkv | The "hold anything" container. Supports virtually every codec, multiple audio/subtitle tracks. Used by power users and media servers. |
| MOV (QuickTime) | .mov | Apple's format. Essentially MP4's older sibling — very similar internally. |
| Fragmented MP4 | .m4s | MP4 broken into small independent segments. The backbone of adaptive streaming (HLS, DASH). |
That last one — fragmented MP4 — is worth understanding because it's how streaming actually works. A normal MP4 has its seek index at the beginning (or end) of one big file. A fragmented MP4 breaks the video into tiny self-contained chunks, each a few seconds long. The streaming player requests one chunk at a time, and can switch between quality levels on the fly. That's why Netflix can adapt to your bandwidth in real time.
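Here is a rough sketch of the decision a streaming player makes before each segment request. Everything in it is hypothetical: the Rendition shape, the bitrate ladder, and the 0.8 safety factor are illustrative choices, and production players use considerably more elaborate heuristics.

```typescript
interface Rendition {
  label: string;
  bitsPerSecond: number;
}

// Picks the highest-bitrate rendition that fits within a safety margin of
// the measured bandwidth, falling back to the lowest one if nothing fits.
function pickRendition(
  renditions: Rendition[],
  measuredBitsPerSecond: number,
  safetyFactor = 0.8,
): Rendition {
  if (renditions.length === 0) throw new Error('No renditions');
  const sorted = [...renditions].sort((a, b) => a.bitsPerSecond - b.bitsPerSecond);
  const budget = measuredBitsPerSecond * safetyFactor;
  let choice = sorted[0]!;
  for (const rendition of sorted) {
    if (rendition.bitsPerSecond <= budget) choice = rendition;
  }
  return choice;
}

// Which segment holds a given playback time, assuming fixed-length segments.
function segmentIndexAt(timeSeconds: number, segmentSeconds: number): number {
  return Math.floor(timeSeconds / segmentSeconds);
}

// A made-up quality ladder.
const ladder: Rendition[] = [
  { label: '480p', bitsPerSecond: 1_500_000 },
  { label: '720p', bitsPerSecond: 3_000_000 },
  { label: '1080p', bitsPerSecond: 6_000_000 },
];

console.log(pickRendition(ladder, 5_000_000).label); // 720p
console.log(pickRendition(ladder, 1_000_000).label); // 480p
console.log(segmentIndexAt(125, 4));                 // 31
```

Because every rendition is cut into segments at the same timestamps, the player can fetch segment 31 from whichever rendition it just picked and playback continues without a seam.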
Tracks: The Multi-Lane Highway
We've been talking about tracks casually, but they deserve their own moment. A track is a single stream of one media type — video, audio, or subtitles — running along a shared timeline.
Think of it like a multi-lane highway where each lane carries different cargo, but all the lanes run in parallel at the same speed:
Timeline: 0s ──────── 30s ──────── 60s ──────── 90s ────→
Video: ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
Audio EN: ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Audio HI: ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Subtitles: ── Hello ──── World ──── Namaste ──── ─────────
The container's job is to carry the timing information that keeps these tracks synchronized. Every packet of data in every track has a timestamp, so when the player shows the video frame at t=42.033s, it can play the audio sample from the same moment.
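The timestamp math behind that is simple. Here is a sketch assuming a constant 30 fps video track and a 48000 Hz audio track (the function names are made up for this example). Frame 1261 is the one that lands at roughly t=42.033s.

```typescript
// Presentation time of a frame, for a constant frame rate.
function frameTimestamp(frameIndex: number, fps: number): number {
  return frameIndex / fps;
}

// Index of the audio sample that plays at the moment a given frame is shown.
function audioSampleIndexForFrame(
  frameIndex: number,
  fps: number,
  sampleRate: number,
): number {
  return Math.round((frameIndex * sampleRate) / fps);
}

console.log(frameTimestamp(1261, 30).toFixed(3));       // 42.033
console.log(audioSampleIndexForFrame(1261, 30, 48000)); // 2017600
```

At these rates each video frame spans 1600 audio samples, which is why the two tracks can be lined up precisely even though they tick at very different rates.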
This is why you can have a movie with English, Hindi, and Spanish audio — they're just three separate audio tracks in the same container. The video track doesn't change. The player picks which audio track to route to your speakers.
It's also why subtitles are fundamentally different from "text burned onto the video." A subtitle track is just another track in the container, completely independent of the video pixels. You can turn it on, turn it off, switch languages — because it was never part of the video data in the first place.
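If it helps to see the shape of the thing, here is a toy data model of a container and its tracks. The type names, the track list, and the codec labels are illustrative only, not any library's real types. Picking an audio language or toggling subtitles is just selecting a different entry from the list, and the video track is never touched.

```typescript
type MediaKind = 'video' | 'audio' | 'subtitle';

interface TrackInfo {
  id: number;
  kind: MediaKind;
  codec: string;
  language?: string; // ISO 639-2 code such as 'eng' or 'hin'
}

interface ContainerInfo {
  format: string;
  tracks: TrackInfo[];
}

// Returns the first track of the given kind, optionally matching a language.
function selectTrack(
  container: ContainerInfo,
  kind: MediaKind,
  language?: string,
): TrackInfo | undefined {
  return container.tracks.find(
    (track) =>
      track.kind === kind &&
      (language === undefined || track.language === language),
  );
}

const movie: ContainerInfo = {
  format: 'mp4',
  tracks: [
    { id: 1, kind: 'video', codec: 'h264' },
    { id: 2, kind: 'audio', codec: 'aac', language: 'eng' },
    { id: 3, kind: 'audio', codec: 'aac', language: 'hin' },
    { id: 4, kind: 'subtitle', codec: 'webvtt', language: 'eng' },
  ],
};

console.log(selectTrack(movie, 'audio', 'hin')?.id);    // 3
console.log(selectTrack(movie, 'subtitle', 'spa')?.id); // undefined
```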
How This Maps to mediabunny
Here's where the mental model pays off. mediabunny's API is a direct mirror of the architecture we just described. Once you understand containers, codecs, and tracks, the API reads like a sentence.
Opening a file and identifying the container:
import { Input, ALL_FORMATS, BlobSource } from 'mediabunny';
// Create an input — this sets up the demultiplexer
const input = new Input({
  formats: ALL_FORMATS, // support all container formats
  source: new BlobSource(file), // the file to read
});
// What container format is this?
await input.getFormat(); // => Mp4InputFormat
// Full MIME type with codec details
await input.getMimeType(); // => 'video/mp4; codecs="avc1.42c032, mp4a.40.2"'
// container ─┘ codec ─┘ codec ─┘
See it? video/mp4 is the container. avc1.42c032 is the video codec (H.264, Constrained Baseline profile, level 5.0). mp4a.40.2 is the audio codec (AAC-LC). Container and codecs, right there in the MIME type.
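The MIME string is compact enough to pick apart by hand. Here is a minimal sketch that splits the container from the codecs and decodes the three hex bytes after avc1. (profile, constraint flags, level). The helper names parseMimeType and parseAvc1 are made up for this example and are not part of mediabunny.

```typescript
interface ParsedMime {
  container: string;
  codecs: string[];
}

// Splits 'video/mp4; codecs="a, b"' into the container type and codec list.
function parseMimeType(mime: string): ParsedMime {
  const [container = '', ...params] = mime.split(';').map((part) => part.trim());
  const codecsParam = params.find((param) => param.startsWith('codecs='));
  const codecs = codecsParam
    ? codecsParam
        .slice('codecs='.length)
        .replace(/^"|"$/g, '')
        .split(',')
        .map((codec) => codec.trim())
    : [];
  return { container, codecs };
}

interface AvcInfo {
  profileIdc: number;      // 66 = Baseline, 77 = Main, 100 = High
  constraintFlags: number; // 0xC0 together with profile 66 means Constrained Baseline
  level: number;           // level_idc divided by 10, e.g. 50 -> 5.0
}

// Decodes an 'avc1.PPCCLL' codec string (three hex bytes).
function parseAvc1(codec: string): AvcInfo {
  const match = /^avc1\.([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})$/i.exec(codec);
  if (!match) throw new Error(`Not an avc1 codec string: ${codec}`);
  return {
    profileIdc: parseInt(match[1]!, 16),
    constraintFlags: parseInt(match[2]!, 16),
    level: parseInt(match[3]!, 16) / 10,
  };
}

const parsed = parseMimeType('video/mp4; codecs="avc1.42c032, mp4a.40.2"');
console.log(parsed.container);             // video/mp4
console.log(parsed.codecs);                // [ 'avc1.42c032', 'mp4a.40.2' ]
console.log(parseAvc1(parsed.codecs[0]!)); // { profileIdc: 66, constraintFlags: 192, level: 5 }
```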
Inspecting tracks:
// Get all tracks in the container
const tracks = await input.getTracks(); // => InputTrack[]
// Or be specific
const videoTrack = await input.getPrimaryVideoTrack();
const audioTrack = await input.getPrimaryAudioTrack();
// What codec is this track using?
await videoTrack.getCodec(); // => 'avc' (mediabunny's name for H.264)
await videoTrack.getCodecParameterString(); // => 'avc1.42c032' (same string as in the MIME type)
// Track properties — all metadata from the container
await videoTrack.getDisplayWidth(); // => 1920
await videoTrack.getDisplayHeight(); // => 1080
await audioTrack.getSampleRate(); // => 48000
await audioTrack.getNumberOfChannels(); // => 2
Filtering and sorting tracks:
import { desc } from 'mediabunny';
// Get only English audio tracks
const englishAudio = await input.getAudioTracks({
  filter: async track => await track.getLanguageCode() === 'eng',
});
// Get the highest-resolution video track
const bestVideo = await input.getPrimaryVideoTrack({
  sortBy: async track => [
    desc(await track.getDisplayWidth()),
    desc(await track.getBitrate()),
  ],
});
The terminology in the API maps directly to the concepts:
| Concept | mediabunny term | What it does |
|---|---|---|
| Reading a container | Input + BlobSource | Opens and parses the container format |
| Extracting tracks | Demultiplexer (built into Input) | Reads the container, separates tracks |
| Compressed data | EncodedPacket | One chunk of compressed video or audio |
| Decompressed data | VideoSample / AudioSample | Raw frame or audio ready for display |
| Writing a container | Output + Multiplexer | Packages tracks back into a container |
The equivalent in ffmpeg — same concepts, command-line interface:
# Inspect a file's container, tracks, and codecs
ffprobe -show_streams input.mp4
# Output shows you the same information:
# Stream #0:0: Video: h264 (avc1), 1920x1080, 30 fps
# Stream #0:1: Audio: aac (mp4a), 48000 Hz, stereo
# Extract just the audio track (strip the container, re-package)
ffmpeg -i input.mp4 -vn -c:a copy output.aac
# Re-package into a different container (no re-encoding!)
ffmpeg -i input.mp4 -c copy output.mkv
That last command is the "aha" moment. -c copy means "don't touch the codecs — just move the compressed data from the MP4 container to the MKV container." It runs in seconds because there's no compression or decompression happening. It's literally just re-packaging, like moving files from one ZIP archive to another.
How This Maps to Real Products
Now you can see the architecture behind every video product you use.
When a recorder like Loom captures your screen, the browser captures raw frames (samples), encodes them with a codec (H.264 or VP9, for example via WebCodecs), and packages the compressed packets into a container (typically MP4 for H.264, WebM for VP9). That file gets uploaded. The server might then re-encode it at multiple quality levels for adaptive streaming.
When YouTube processes your upload, it uses a demultiplexer to open whatever container you uploaded (MP4, MOV, MKV; YouTube accepts nearly anything). It extracts the video and audio tracks, decodes them back to raw samples, then re-encodes them into multiple versions, for example 1080p AV1, 720p VP9, and 480p H.264, each packaged in fragmented MP4 segments for adaptive streaming. One upload, dozens of output files.
┌─→ 1080p AV1 (.m4s segments)
Your upload ─→ Demux ─→ ───┼─→ 720p VP9 (.m4s segments)
(.mov) Decode ├─→ 480p H.264 (.m4s segments)
└─→ Audio Opus (.m4s segments)
When Netflix serves you a movie, the player requests fragmented MP4 segments a few seconds at a time. If your bandwidth drops, it switches to a lower-quality version mid-stream. The reason this works seamlessly is that every quality level uses the same container format, with segments aligned to the same timestamps and each segment starting on a key frame. The player just swaps which quality level it's pulling segments from.
When a browser-based editor like CapCut trims a clip, it doesn't necessarily re-encode the whole video. If you're cutting at a key frame boundary, it can just rewrite the container's index to point to a different starting position. That's why some trims are instant and others take time — it depends on whether the cut aligns with a key frame.
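The "instant or slow" decision can be sketched in a few lines. This is a hypothetical helper, not how any particular editor implements it. It assumes you already have the key frame timestamps for the video track and only looks at the start of the cut.

```typescript
interface TrimPlan {
  startSeconds: number;
  needsReencode: boolean;
}

// If the cut lands on a key frame (within a small tolerance), the packets
// can be copied as-is. Otherwise the frames around the cut must be re-encoded.
function planTrim(
  keyFrameTimes: number[],
  cutSeconds: number,
  toleranceSeconds = 0.001,
): TrimPlan {
  const aligned = keyFrameTimes.find(
    (time) => Math.abs(time - cutSeconds) <= toleranceSeconds,
  );
  if (aligned !== undefined) {
    return { startSeconds: aligned, needsReencode: false };
  }
  return { startSeconds: cutSeconds, needsReencode: true };
}

// Made-up stream with a key frame every 2 seconds.
const keyFrames = [0, 2, 4, 6];

console.log(planTrim(keyFrames, 4));   // { startSeconds: 4, needsReencode: false }
console.log(planTrim(keyFrames, 4.7)); // { startSeconds: 4.7, needsReencode: true }
```

An editor that wants instant trims at any cost could instead snap the cut back to the preceding key frame, trading frame accuracy for speed.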
This is the power of the two-layer model. Every one of these products is just doing different combinations of the same four operations:
Demux ──→ Decode ──→ [process] ──→ Encode ──→ Mux
(open (decompress) (compress) (write
container) container)
Sometimes you skip the middle. Re-packaging into a different container? Demux → Mux, no decode/encode needed. Trimming at a key frame? Demux → tweak the index → Mux. Transcoding to a smaller codec? The full pipeline. The mental model tells you exactly which steps are needed and why.
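That decision can be written down as a tiny planner. The sketch below is a simplification with made-up names (JobSpec, planPipeline). It only looks at the video track and treats "same codec" as enough to copy packets, which real tools refine with container compatibility checks.

```typescript
type PipelineStep = 'demux' | 'decode' | 'process' | 'encode' | 'mux';

interface JobSpec {
  sourceCodec: string;
  targetCodec: string;
  changesPixels: boolean;   // scaling, cropping, overlays, filters
  cutsOnKeyFrames: boolean; // true when every cut lands on a key frame, or there are no cuts
}

// Decides which of the pipeline stages a job actually needs.
function planPipeline(job: JobSpec): PipelineStep[] {
  const mustReencode =
    job.sourceCodec !== job.targetCodec ||
    job.changesPixels ||
    !job.cutsOnKeyFrames;
  if (!mustReencode) return ['demux', 'mux']; // pure re-packaging
  return job.changesPixels
    ? ['demux', 'decode', 'process', 'encode', 'mux']
    : ['demux', 'decode', 'encode', 'mux'];
}

// MP4 to MKV with the same H.264 data: no decode or encode needed.
console.log(planPipeline({
  sourceCodec: 'h264', targetCodec: 'h264', changesPixels: false, cutsOnKeyFrames: true,
})); // [ 'demux', 'mux' ]

// H.264 to AV1: full transcode.
console.log(planPipeline({
  sourceCodec: 'h264', targetCodec: 'av1', changesPixels: false, cutsOnKeyFrames: true,
})); // [ 'demux', 'decode', 'encode', 'mux' ]
```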
Try This
If you have ffmpeg installed, you already have ffprobe (it ships alongside it). Grab any video file on your machine (a screen recording, a downloaded clip, anything) and run:
ffprobe -hide_banner -show_format -show_streams your_video.mp4
Find these things in the output:
- The container format: look for format_name. Is it mov,mp4,m4a,3gp,3g2,mj2 (the MP4 family) or something else?
- The tracks: each [STREAM] block is a track. How many are there? What types?
- The video codec: look for codec_name in the video stream. Is it h264? hevc? vp9?
- The audio codec: same field in the audio stream. Probably aac.
- Key frame frequency: the stream summary doesn't report this directly (has_b_frames only hints at whether B-frames are in use). To see where the full frames actually are, add -select_streams v:0 -show_entries frame=key_frame,pict_type to the command and count the frames between each key_frame=1.
Then try re-packaging without re-encoding:
ffmpeg -i your_video.mp4 -c copy your_video.mkv
Compare the two files. Same size? (They should be nearly identical — the data inside is the same, only the packaging changed.) Try playing both. Identical? That's the container/codec separation in action.
If you want to try the same thing in mediabunny, open a file in the browser and inspect it:
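import { Input, ALL_FORMATS, BlobSource } from 'mediabunny';

// file is a File or Blob, e.g. picked through an <input type="file"> element
const input = new Input({
  formats: ALL_FORMATS,
  source: new BlobSource(file),
});
console.log('Format:', await input.getFormat());
console.log('MIME:', await input.getMimeType());
console.log('Duration:', await input.computeDuration(), 'seconds');
// Print the codec string of every track in the container
for (const track of await input.getTracks()) {
  console.log(`Track ${track.number} (${track.type}):`,
    await track.getCodecParameterString());
}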
You now know more about what's inside a video file than most web developers ever learn. In the next tutorial, we'll use this mental model to understand the four operations — muxing, demuxing, transcoding, transmuxing — and why picking the right one is the single most important product decision in video processing.