The Four Operations — Muxing, Demuxing, Transcoding, Transmuxing
The four fundamental operations in video processing, taught through the scenario of an iPhone MOV upload needing to play on a Chromebook. Covers when to transmux (fast, lossless) vs transcode (slow, necessary), with mediabunny Conversion API and ffmpeg equivalents. Real product examples: Loom, CapCut, course platforms.
The Problem on Your Desk
You're building a courses feature for Curated Connections. A community leader records a 20-minute tutorial on her iPhone 15, drags the file into your upload form, and hits submit. The file lands on your server: lesson-1.MOV, 1.2 GB.
A student opens the course page on his Chromebook, clicks play, and gets... nothing. A codec error. A blank player. Maybe a cryptic "format not supported" message if you're lucky.
Here's what happened. The iPhone recorded video in H.265 (HEVC) and wrapped it in a MOV container — Apple's preferred combo. The student's Chrome browser on ChromeOS can handle H.264 in an MP4 container perfectly, but its H.265 support is spotty at best. The container is wrong. The codec is wrong. Everything is wrong.
This is not an edge case. This is the case. Every video platform — Loom, YouTube, CapCut, yours — exists in a world where creators produce video in whatever format their device prefers, and viewers consume it on whatever device they own. The gap between those two realities is the entire reason video processing exists.
Closing that gap requires exactly four operations. After this tutorial, you'll know all four, when to reach for each one, and why picking the wrong one costs you either minutes of processing time or hours of debugging.
A Quick Recap
From Tutorial 1, you know that a video file is a container (the ZIP file) holding multiple tracks (the files inside the ZIP). Each track is encoded with a codec — H.264, H.265, AAC, Opus, etc. The container format (MP4, MOV, WebM) is independent of the codecs inside it.
With that foundation, let's talk about what you can actually do with these containers and tracks.
Operation 1: Demuxing
Demultiplexing — pulling the container apart to access the individual tracks inside.
┌─────────────────────────────┐
│        lesson-1.MOV         │
│  ┌────────┐   ┌────────┐    │
│  │ Video  │   │ Audio  │    │
│  │ H.265  │   │  AAC   │    │
│  └────────┘   └────────┘    │
└─────────────┬───────────────┘
              │ DEMUX
              ▼
   ┌────────┐   ┌────────┐
   │ Video  │   │ Audio  │
   │ H.265  │   │  AAC   │
   │ stream │   │ stream │
   └────────┘   └────────┘
Demuxing is unzipping. You're not changing anything about the video or audio data — you're just opening the container so you can see what's inside and work with the pieces individually.
This is the first thing that happens in any video processing pipeline. Before you can convert, trim, resize, or do anything at all, you need to crack open the container and read the tracks.
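In ffmpeg, demuxing starts the moment you point the tool at a file:
# Probe the file: demux and report what's inside
ffmpeg -i lesson-1.MOV
# Output (abridged):
# Stream #0:0: Video: hevc, 3840x2160, 30 fps
# Stream #0:1: Audio: aac, 48000 Hz, stereo
That -i flag triggers demuxing. ffmpeg reads the MOV container's metadata, finds the track index, and reports what codecs and parameters each track uses. With no output file, nothing is processed: ffmpeg prints the stream info, complains that no output file was specified, and exits. For scripted inspection, the companion tool ffprobe (ffprobe lesson-1.MOV) reports the same information without the error.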
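If you want that information in code rather than in a terminal, ffprobe can print it as JSON (ffprobe -v error -print_format json -show_streams lesson-1.MOV). Here is a minimal sketch of summarizing that output. The helper name summarizeTracks is hypothetical, and the sample JSON is hand-written and abridged to the handful of fields the sketch reads, not captured from a real file.

```typescript
// Shape of the ffprobe fields this sketch reads (real output has many more).
interface ProbeStream {
  index: number;
  codec_type: string; // "video", "audio", "subtitle", ...
  codec_name: string; // e.g. "hevc", "h264", "aac"
  width?: number;
  height?: number;
  sample_rate?: string; // ffprobe reports the sample rate as a string
}

interface TrackSummary {
  index: number;
  kind: string;
  codec: string;
  detail: string;
}

// Hypothetical helper: turn ffprobe's JSON text into a compact track list.
function summarizeTracks(ffprobeJson: string): TrackSummary[] {
  const parsed = JSON.parse(ffprobeJson) as { streams?: ProbeStream[] };
  return (parsed.streams ?? []).map((s) => ({
    index: s.index,
    kind: s.codec_type,
    codec: s.codec_name,
    detail:
      s.codec_type === "video"
        ? `${s.width}x${s.height}`
        : s.codec_type === "audio"
          ? `${s.sample_rate} Hz`
          : "",
  }));
}

// Hand-written, abridged sample for illustration only.
const sampleProbeOutput = JSON.stringify({
  streams: [
    { index: 0, codec_type: "video", codec_name: "hevc", width: 3840, height: 2160 },
    { index: 1, codec_type: "audio", codec_name: "aac", sample_rate: "48000" },
  ],
});

const tracks = summarizeTracks(sampleProbeOutput);
```

On a Rails backend you would run ffprobe as a subprocess and hand its stdout to the same kind of parser.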
In mediabunny, this is the Input API:
import { Input, ALL_FORMATS, BlobSource } from 'mediabunny';

const input = new Input({
  source: new BlobSource(file), // the uploaded .MOV
  formats: ALL_FORMATS,
});
// Now you can inspect what's inside.
// Both getters resolve to null if the file has no track of that kind.
const videoTrack = await input.getPrimaryVideoTrack();
const { displayWidth, displayHeight, rotation } = videoTrack;
// displayWidth/displayHeight already account for rotation (e.g. 3840 x 2160);
// rotation is the clockwise angle in degrees (0, 90, 180, or 270)
const audioTrack = await input.getPrimaryAudioTrack();
const { sampleRate } = audioTrack; // 48000
Notice how clean this is. You create an Input, hand it a source, and suddenly you have structured access to every track and its properties. No parsing binary headers yourself. No guessing whether the video is rotated (spoiler: iPhone videos often are, because portrait clips are stored as landscape frames plus a rotation flag).
When you use it: Every time. Demuxing is step zero of every video operation. You can't do anything to a video file without first demuxing it.
Operation 2: Muxing
Multiplexing — taking individual tracks and packaging them into a container.
   ┌────────┐   ┌────────┐
   │ Video  │   │ Audio  │
   │ H.264  │   │  AAC   │
   │ stream │   │ stream │
   └────┬───┘   └───┬────┘
        │           │
        │    MUX    │
        ▼           ▼
┌─────────────────────────────┐
│         output.mp4          │
│  ┌────────┐   ┌────────┐    │
│  │ Video  │   │ Audio  │    │
│  │ H.264  │   │  AAC   │    │
│  └────────┘   └────────┘    │
└─────────────────────────────┘
Muxing is zipping. You take some tracks — video, audio, maybe subtitles — and weave them together into a container format. The container handles synchronization (making sure audio and video stay in lockstep), seeking (jumping to minute 14:30), and metadata (duration, chapter markers, rotation flags).
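The "weaving" part can be pictured as merging per-track packet lists by timestamp, so that the audio and video needed at the same moment sit near each other in the file. The sketch below is conceptual only: the Packet shape and the interleave function are hypothetical, timestamps are simplified to presentation time, and a real muxer also writes headers, indexes, and chunking rules on top of this ordering.

```typescript
// Hypothetical packet shape: one encoded chunk belonging to a track.
interface Packet {
  track: "video" | "audio";
  timestamp: number; // presentation time in seconds
  bytes: number;     // payload size; the payload itself is omitted here
}

// Merge two timestamp-sorted packet lists into one interleaved sequence.
// On a tie, the video packet goes first.
function interleave(video: Packet[], audio: Packet[]): Packet[] {
  const out: Packet[] = [];
  let v = 0;
  let a = 0;
  while (v < video.length || a < audio.length) {
    const takeVideo =
      a >= audio.length ||
      (v < video.length && video[v].timestamp <= audio[a].timestamp);
    out.push(takeVideo ? video[v++] : audio[a++]);
  }
  return out;
}

const videoPackets: Packet[] = [
  { track: "video", timestamp: 0, bytes: 52000 },
  { track: "video", timestamp: 0.033, bytes: 4100 },
];
const audioPackets: Packet[] = [
  { track: "audio", timestamp: 0, bytes: 380 },
  { track: "audio", timestamp: 0.021, bytes: 375 },
  { track: "audio", timestamp: 0.043, bytes: 390 },
];

// Order in which packets would be written to the container.
const muxOrder = interleave(videoPackets, audioPackets).map(
  (p) => `${p.track}@${p.timestamp}`
);
```

Demuxing is the same picture in reverse: read the interleaved sequence and sort each packet back into its track.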
In ffmpeg, muxing happens whenever you specify an output file. The output file's extension tells ffmpeg which container format to use:
# Mux a raw H.264 stream and an AAC stream into an MP4
ffmpeg -i video.h264 -i audio.aac -c copy output.mp4
In mediabunny, this is the Output API:
import {
  Output, Mp4OutputFormat, BufferTarget,
  CanvasSource, AudioBufferSource, QUALITY_HIGH,
} from 'mediabunny';

const output = new Output({
  format: new Mp4OutputFormat(),
  target: new BufferTarget(), // write to memory
});
// Add tracks to the container
const videoSource = new CanvasSource(canvas, {
  codec: 'avc', // mediabunny's name for H.264
  bitrate: QUALITY_HIGH,
});
output.addVideoTrack(videoSource);
const audioSource = new AudioBufferSource({
  codec: 'aac',
  bitrate: QUALITY_HIGH,
});
output.addAudioTrack(audioSource);
await output.start();
// ... feed frames and samples ...
await output.finalize();
const { buffer } = output.target; // your finished MP4
The Output API is the mirror image of Input. Where Input pulls a container apart, Output assembles one. You choose a container format, add tracks with their codecs, feed in the data, and finalize.
When you use it: At the end of every processing pipeline. Something has to put the tracks back into a container for playback. Muxing is the last step.
The Two That Matter Most
Demuxing and muxing are building blocks. They're necessary but not sufficient — they're the "open the box" and "close the box" steps. The interesting question is: what do you do with the tracks while the box is open?
That's where the real decision lives, and it's a decision with enormous consequences for your product's speed, cost, and quality.
Operation 3: Transmuxing
Changing the container without touching the encoded data inside.
┌──────────────────────┐           ┌──────────────────────┐
│     lesson-1.MOV     │           │     lesson-1.mp4     │
│ ┌───────┐ ┌──────┐   │   FAST    │ ┌───────┐ ┌──────┐   │
│ │ H.265 │ │ AAC  │   │  ──────►  │ │ H.265 │ │ AAC  │   │
│ │ video │ │audio │   │   copy    │ │ video │ │audio │   │
│ └───────┘ └──────┘   │           │ └───────┘ └──────┘   │
└──────────────────────┘           └──────────────────────┘
     MOV container                      MP4 container
   same codecs, same data, same quality, different box
Transmuxing is the fast path. You're not decoding or re-encoding anything. You're literally lifting the tracks out of one container and dropping them into another. Like moving documents from a Manila folder to a binder — the documents don't change, only what holds them.
This matters because MOV and MP4 are structurally almost identical. MP4 is built on the ISO Base Media File Format (ISOBMFF), which was itself derived from Apple's QuickTime (MOV) format, so the two store tracks in nearly the same way. Transmuxing between them is practically a metadata rewrite: the bulk of the file's bytes (the actual video and audio data) get copied verbatim.
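To make "structurally almost identical" concrete: both formats are a sequence of nested boxes (QuickTime calls them atoms), and each box starts with a 4-byte big-endian size followed by a 4-byte type code. Here is a minimal sketch of listing the top-level boxes in a byte buffer. The names listTopLevelBoxes, fourCC, and makeBox are hypothetical helpers, and the sample bytes are synthetic rather than a playable file.

```typescript
interface BoxHeader {
  type: string;   // four-character code, e.g. "ftyp", "moov", "mdat"
  offset: number; // byte offset where the box starts
  size: number;   // total box size in bytes, header included
}

function listTopLevelBoxes(data: Uint8Array): BoxHeader[] {
  const view = new DataView(data.buffer, data.byteOffset, data.byteLength);
  const boxes: BoxHeader[] = [];
  let offset = 0;
  while (offset + 8 <= data.byteLength) {
    let size = view.getUint32(offset); // DataView reads big-endian by default
    const type = String.fromCharCode(
      data[offset + 4], data[offset + 5], data[offset + 6], data[offset + 7]
    );
    if (size === 1) {
      // A size of 1 means a 64-bit size follows the type field.
      if (offset + 16 > data.byteLength) break;
      size = view.getUint32(offset + 8) * 4294967296 + view.getUint32(offset + 12);
    } else if (size === 0) {
      // A size of 0 means the box runs to the end of the file.
      size = data.byteLength - offset;
    }
    if (size < 8) break; // malformed header; stop rather than loop forever
    boxes.push({ type, offset, size });
    offset += size;
  }
  return boxes;
}

// Helpers for building a synthetic sample.
function fourCC(text: string): number[] {
  return text.split("").map((c) => c.charCodeAt(0));
}
function makeBox(type: string, payload: number[]): number[] {
  const size = 8 + payload.length;
  const sizeBytes = [(size >>> 24) & 0xff, (size >>> 16) & 0xff, (size >>> 8) & 0xff, size & 0xff];
  return sizeBytes.concat(fourCC(type), payload);
}

// A 16-byte "ftyp" box followed by an empty 8-byte "mdat" box.
const sample = new Uint8Array(
  makeBox("ftyp", fourCC("qt  ").concat([0, 0, 0, 0])).concat(makeBox("mdat", []))
);
const boxTypes = listTopLevelBoxes(sample).map((b) => `${b.type}:${b.size}`);
```

Run against a real MOV and a real MP4, a walker like this would show the same familiar box types (ftyp, moov, mdat), which is why moving tracks between the two is mostly bookkeeping.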
In ffmpeg, the magic flag is -c copy:
# Transmux: MOV → MP4, no re-encoding
ffmpeg -i lesson-1.MOV -c copy lesson-1.mp4
The -c copy tells ffmpeg: "copy the codec data as-is, don't decode or re-encode." On a 1 GB file, this finishes in seconds. Literally seconds. Because it's just rewriting container metadata and copying byte streams.
In mediabunny, transmuxing happens when you connect an Input directly to an Output without any processing in between — the Conversion API handles this automatically when the codecs are compatible:
import {
  Input, Output, Conversion, ALL_FORMATS,
  BlobSource, Mp4OutputFormat, StreamTarget,
} from 'mediabunny';

const input = new Input({
  source: new BlobSource(movFile),
  formats: ALL_FORMATS,
});
const output = new Output({
  format: new Mp4OutputFormat(),
  target: new StreamTarget(writableStream),
});
const conversion = await Conversion.init({ input, output });
await conversion.execute(); // near-instant for transmux
When the input codecs are compatible with the output container, mediabunny's pipeline is smart enough to pass the encoded data through without decoding. The result is the same speed advantage you get with ffmpeg's -c copy.
The key insight: Transmuxing preserves 100% of the original quality because the encoded data is never touched. There is zero generation loss. Zero. The output is bit-for-bit identical in the streams that matter — only the container wrapper changes.
When you use it: Whenever the codecs are already compatible with the target playback environment and you only need to change the container. MOV → MP4 is the classic case. This should be your first choice every time, because it's fast and lossless.
Operation 4: Transcoding
Decoding the tracks and re-encoding them with a different codec or different parameters.
┌──────────────────────┐     ┌──────────────┐     ┌──────────────────────┐
│     lesson-1.MOV     │     │   RAW DATA   │     │     lesson-1.mp4     │
│ ┌───────┐ ┌──────┐   │     │  ┌────────┐  │     │ ┌───────┐ ┌──────┐   │
│ │ H.265 │ │ AAC  │   │     │  │ pixels │  │     │ │ H.264 │ │ AAC  │   │
│ │ video │ │audio │   │────►│  │ frame  │  │────►│ │ video │ │audio │   │
│ │  4K   │ │      │   │ dec │  │   by   │  │ enc │ │ 1080p │ │      │   │
│ └───────┘ └──────┘   │     │  │ frame  │  │     │ └───────┘ └──────┘   │
│                      │     │  └────────┘  │     │                      │
└──────────────────────┘     └──────────────┘     └──────────────────────┘
                    decode every        re-encode every
                    frame to raw        frame from raw
                    pixels              pixels
This is the heavy operation. Every single frame of video gets decoded back to raw pixels, and then every single frame gets compressed again with a new codec or new settings. For a 30fps video that's 20 minutes long, that's 36,000 frames — each one decoded, potentially transformed, and re-encoded.
This is the operation you need for our iPhone-to-Chromebook problem when transmuxing isn't enough. Remember: the iPhone recorded H.265, and Chrome on the Chromebook can't reliably decode H.265. We need H.264. That means we can't just copy the data — we have to actually understand (decode) every frame of H.265 video and re-express (encode) it as H.264.
In ffmpeg:
# Transcode: H.265 → H.264, 4K → 1080p, keep AAC audio
ffmpeg -i lesson-1.MOV \
-c:v libx264 -preset medium -crf 23 \
-vf scale=1920:1080 \
-c:a copy \
lesson-1.mp4
A few things to notice. -c:v libx264 says "re-encode the video as H.264." -vf scale=1920:1080 resizes from 4K to 1080p (this requires transcoding — you can't resize without decoding). -c:a copy says "but leave the audio alone, just copy it." That's the surgical precision of ffmpeg: you can transcode some tracks and transmux others in the same command.
In mediabunny, the Conversion API handles this through its pipeline architecture:
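const input = new Input({
  source: new BlobSource(movFile),
  formats: ALL_FORMATS,
});
const output = new Output({
  format: new Mp4OutputFormat(),
  target: new StreamTarget(writableStream),
});
// mediabunny uses WebCodecs under the hood, so decode + encode
// can be hardware-accelerated where the device supports it
const conversion = await Conversion.init({
  input,
  output,
  // Request H.264 at 1080p. Without these options an H.265 track
  // would be copied into the MP4 unchanged, since MP4 can hold H.265.
  video: { codec: 'avc', width: 1920, height: 1080, fit: 'contain' },
});
await conversion.execute();
The remarkable thing about mediabunny is that this code looks almost identical to the transmux example; the only addition is the video options describing the target. For each track, the library figures out whether it can copy the encoded data or has to transcode, based on the input codec, the output format's requirements, and the options you pass. When it does need to transcode, it leverages the WebCodecs API, the browser's native interface to video encoders and decoders. Where hardware acceleration is available, dedicated media hardware does the heavy lifting instead of the CPU.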
And the performance is striking. mediabunny's benchmarks show 804 frames/sec for a WebM conversion with resize — that's roughly 27x realtime for 30fps video. A 20-minute video processes in under 45 seconds. In the browser. On the client's machine. Without touching your servers.
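The arithmetic behind those numbers, as a small sketch. The 804 frames/sec figure is the benchmark quoted above; real throughput depends on the device, the codecs, and the resolution, so treat the result as an estimate and not a promise. The function names are made up for this example.

```typescript
// Total frames in a clip of the given length.
function frameCount(fps: number, durationSeconds: number): number {
  return Math.round(fps * durationSeconds);
}

// Estimated wall-clock seconds to process a clip at a given throughput.
function estimateProcessingSeconds(frames: number, framesPerSecond: number): number {
  return frames / framesPerSecond;
}

const frames = frameCount(30, 20 * 60);                          // 36000 frames
const processingSeconds = estimateProcessingSeconds(frames, 804); // about 44.8 seconds
const realtimeFactor = 804 / 30;                                  // about 26.8x realtime
```

Swap in a throughput you measured on your own target devices before using this to set user expectations.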
The cost of transcoding: Unlike transmuxing, transcoding always involves some quality loss. You're decompressing and recompressing — it's like saving a JPEG, opening it, and saving it as a JPEG again. Each generation of compression loses a little detail. Good encoder settings minimize this to imperceptible levels, but it's never truly zero. This is called generation loss, and it's why you should only transcode when you genuinely need to.
The Decision Tree
When that iPhone MOV arrives at your upload endpoint, here's how you think through it:
  Upload arrives: lesson-1.MOV (H.265 + AAC)
                       │
                       ▼
         ┌────────────────────────────┐
         │    Can the target play     │
         │    this codec directly?    │
         └──┬──────────────────────┬──┘
            │                      │
           YES                     NO
            │                      │
            ▼                      ▼
  ┌────────────────────┐   ┌───────────────┐
  │       Is the       │   │ TRANSCODE     │
  │     container      │   │ H.265 → H.264 │
  │    compatible?     │   │ (slow, lossy) │
  └──┬──────────────┬──┘   └───────────────┘
     │              │
    YES             NO
     │              │
     ▼              ▼
┌──────────┐  ┌────────────┐
│ NOTHING! │  │ TRANSMUX   │
│ Serve    │  │ MOV → MP4  │
│ as-is    │  │ (fast,     │
└──────────┘  │ lossless)  │
              └────────────┘
For our iPhone-to-Chromebook scenario: Chrome can't reliably play H.265, so we go right. We must transcode. But notice — if the creator had recorded in H.264 (which older iPhones or the "Most Compatible" setting produce), we'd only need to transmux MOV → MP4. Same file size, same quality, done in seconds instead of minutes.
This distinction is the single most important product decision in video processing. Getting it wrong means either:
- Wasting compute by transcoding when transmuxing would suffice (burning money, making users wait)
- Serving broken video by transmuxing when transcoding was needed (broken playback, support tickets)
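The tree itself is small enough to write down directly. This sketch deliberately takes the two yes/no answers as inputs; working out those answers from real codec and container names is the harder part, and it is what the challenge at the end of this tutorial asks you to build. The type and function names here are hypothetical.

```typescript
type Operation = "passthrough" | "transmux" | "transcode";

// The two questions from the decision tree, answered for one playback target.
interface TargetCheck {
  codecPlayable: boolean;     // can the target decode the file's codecs?
  containerPlayable: boolean; // can the target open the file's container?
}

function chooseOperation(check: TargetCheck): Operation {
  // Codec first: if it cannot be decoded, no container change will help.
  if (!check.codecPlayable) return "transcode";
  // Codec is fine, so only the wrapper needs to change.
  if (!check.containerPlayable) return "transmux";
  // Nothing to do: serve the upload as-is.
  return "passthrough";
}

// H.265 in MOV, for a target that cannot decode H.265.
const iphoneHevc = chooseOperation({ codecPlayable: false, containerPlayable: false });
// H.264 in MOV, for a target that wants MP4.
const iphoneAvc = chooseOperation({ codecPlayable: true, containerPlayable: false });
// H.264 in MP4 already.
const alreadyFine = chooseOperation({ codecPlayable: true, containerPlayable: true });
```

Checking the codec before the container matters: a transcode always ends with a fresh mux, so the container question only decides anything when the codec is already playable.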
The WebCodecs API that mediabunny builds on lets you check browser capabilities before deciding your strategy:
// Check what the browser can handle.
// isConfigSupported resolves to an object; read its `supported` flag.
const { supported: canDecodeH265 } = await VideoDecoder.isConfigSupported({
  codec: 'hev1.1.6.L93.B0',
});
const { supported: canEncodeH264 } = await VideoEncoder.isConfigSupported({
  codec: 'avc1.640028',
  width: 1920,
  height: 1080,
});
These WebCodecs capability checks let you make the transmux-vs-transcode decision dynamically, per device. A MacBook with Apple Silicon has hardware H.265 decoding, so you may be able to skip transcoding for those viewers and serve a transmuxed file. A Chromebook that reports no H.265 support needs the transcoded version. This is the kind of per-device optimization that can turn a "video always takes 2 minutes to process" experience into "video is ready instantly" for a large share of your users.
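One way to act on that per-device answer is to keep more than one version of a lesson and serve the first one the viewer's device can decode. The sketch below assumes exactly that setup. Rendition, CodecProbe, pickRendition, and the URLs are all hypothetical; the probe is injected so the logic can run outside a browser, and in a real page it could wrap VideoDecoder.isConfigSupported and return its supported flag.

```typescript
// A stored version of the lesson.
interface Rendition {
  url: string;
  videoCodec: string; // WebCodecs-style codec string
}

// Answers "can this device decode this codec?".
type CodecProbe = (codec: string) => Promise<boolean>;

// Return the first rendition, in preference order, that the device can decode.
async function pickRendition(
  renditions: Rendition[],
  canDecode: CodecProbe
): Promise<Rendition | undefined> {
  for (const r of renditions) {
    if (await canDecode(r.videoCodec)) return r;
  }
  return undefined;
}

const lessonRenditions: Rendition[] = [
  { url: "/videos/lesson-1.hevc.mp4", videoCodec: "hev1.1.6.L93.B0" }, // transmuxed original
  { url: "/videos/lesson-1.avc.mp4", videoCodec: "avc1.640028" },      // transcoded fallback
];

// Stand-in probes for two kinds of device.
const noHevcDevice: CodecProbe = async (codec) => !codec.startsWith("hev1");
const hevcDevice: CodecProbe = async () => true;
```

The trade-off is storage: you only get the instant path for H.265-capable viewers if you are willing to keep the transcoded fallback around for everyone else.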
Everything Else Is a Remix
Once you understand these four operations, you realize that every other video processing task is just a combination of them:
Trimming (cutting a clip from 2:00 to 5:00): Demux, seek to the start point, copy (or re-encode) only the frames in range, mux. If you cut on a keyframe boundary, you can transmux the trimmed segment. If you cut mid-GOP (between keyframes), you need to transcode at least the frames around the cut point.
Resizing (4K → 1080p): Requires transcoding. You must decode each frame to raw pixels, scale the pixel buffer, and re-encode. There's no shortcut — you can't resize compressed data.
Rotating: Same story. Decode, transform, re-encode. Although — fun quirk — many container formats support a rotation metadata flag that tells the player to rotate during display, without actually re-encoding. iPhones use this extensively. So "rotating" might just be a metadata edit, which is even cheaper than transmuxing.
Resampling audio (44.1kHz → 48kHz): Requires transcoding the audio track. The video track can be transmuxed through unchanged.
Adding subtitles: If you're adding a subtitle track (like SRT), that's just muxing — adding a new track to the container. If you're burning subtitles into the video pixels, that's transcoding.
The pattern: if you're changing the encoded data, you're transcoding. If you're only changing the container structure, you're transmuxing. If you're reading, you're demuxing. If you're writing, you're muxing.
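Trimming is the case where that pattern splits within a single job, so it is worth sketching. Given the clip's keyframe times, the function below decides whether a cut can be copied outright or whether the stretch up to the next keyframe has to be re-encoded first. TrimPlan and planTrim are hypothetical names, keyframe times are assumed to be sorted in ascending order, and whole-second timestamps keep the equality check simple.

```typescript
interface TrimPlan {
  mode: "copy" | "re-encode-head";
  copyFrom: number; // time (seconds) from which packets can be copied verbatim
}

function planTrim(startSec: number, keyframeTimes: number[]): TrimPlan {
  // A cut that lands exactly on a keyframe can be stream-copied in full.
  if (keyframeTimes.indexOf(startSec) !== -1) {
    return { mode: "copy", copyFrom: startSec };
  }
  // Otherwise the frames between the cut and the next keyframe depend on
  // frames before the cut, so that head section must be decoded and re-encoded.
  const later = keyframeTimes.filter((t) => t > startSec);
  // With no later keyframe, nothing can be copied at all.
  return { mode: "re-encode-head", copyFrom: later.length > 0 ? later[0] : Infinity };
}

// Keyframes every 2 seconds around the 2:00 mark (illustrative values).
const keyframes = [116, 118, 120, 122, 124];
const onKeyframe = planTrim(120, keyframes); // cut at 2:00
const midGop = planTrim(121, keyframes);     // cut at 2:01
```

Many tools skip the hybrid approach and either snap the cut to the nearest keyframe (fast, slightly imprecise) or re-encode the whole clip (precise, slow).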
How Real Products Use These Operations
Loom records your screen using the browser's MediaRecorder API, which typically produces WebM (VP8/VP9 + Opus). That's encoding plus muxing in one step: raw frames from the screen capture get encoded and packaged in real time. Later, Loom's servers transcode that recording into multiple quality levels (1080p, 720p, 480p) for adaptive streaming. The recording is fast (encode once, mux as you go). The processing is slow (transcode to multiple outputs).
CapCut works in the editing paradigm. When you add a clip to your timeline, it demuxes the source file to access the tracks. When you apply a filter or transition, it decodes the relevant frames, processes them, and holds them ready. When you hit export, it encodes everything and muxes the final output. The entire edit pipeline is: demux → decode → process → encode → mux. Every operation in this tutorial, chained together.
Your courses feature has a simpler pipeline. Creator uploads a file. You demux it to inspect the codecs and resolution. You check whether the target browsers can play those codecs. If yes, you transmux to MP4 (fast path). If no, you transcode to H.264 + AAC in MP4 (slow path). Store the result on R2. Serve it via your CDN. Done.
The key product insight: transmuxing is nearly free, transcoding is expensive. If 70% of your creators upload H.264 video (which is still very common), 70% of your uploads hit the fast path. Only the H.265 uploads (growing as newer iPhones become dominant) need the expensive path. Knowing this ratio changes how you architect your processing pipeline and how much server compute you budget for.
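To see how much that ratio moves your compute budget, here is a back-of-the-envelope sketch. Every input is an assumption you would replace with your own measurements: the upload volume, the 70% fast-path share from the example above, and the per-file costs of 5 seconds per transmux and 120 seconds per transcode are illustrative, not benchmarks.

```typescript
// Expected server-side processing time per day, split by path.
function estimateDailyComputeSeconds(
  uploadsPerDay: number,
  fastPathShare: number,   // fraction of uploads that only need transmuxing
  transmuxSeconds: number, // assumed cost of one transmux
  transcodeSeconds: number // assumed cost of one transcode
): number {
  const fast = uploadsPerDay * fastPathShare * transmuxSeconds;
  const slow = uploadsPerDay * (1 - fastPathShare) * transcodeSeconds;
  return fast + slow;
}

// 100 uploads a day with 70% on the fast path: about 3950 seconds of compute.
const mixedPipeline = estimateDailyComputeSeconds(100, 0.7, 5, 120);
// The same day if every upload were transcoded: 12000 seconds.
const transcodeEverything = estimateDailyComputeSeconds(100, 0, 5, 120);
```

With these made-up numbers, routing compatible uploads to the fast path cuts compute to roughly a third, and almost all of what remains is the transcode share.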
And here's where client-side processing with mediabunny gets really interesting. At 804 frames/sec with hardware acceleration, a lot of transcoding work can happen in the user's browser before the file ever touches your server. The creator uploads, their browser transcodes to H.264 + MP4, and what arrives at your server is already in the universal format. Zero server compute cost. The creator's GPU does the work. You just store and serve the result.
Challenge: Build the Decision Logic
Write a function — in Ruby for your Rails backend, or TypeScript if you're thinking client-side — that takes a video file's codec and container information and returns the minimum operation needed:
- "passthrough": the file is already H.264 + AAC in MP4. Serve it directly.
- "transmux": the codecs are compatible but the container is wrong (e.g., H.264 in MOV). Repackage without re-encoding.
- "transcode": the codecs aren't compatible with browser playback (e.g., H.265 video). Full decode and re-encode required.
Think about: what codecs does Chrome universally support? What about Safari? What if you want to serve VP9 in WebM to Chrome users and H.264 in MP4 as fallback? How does this decision tree change when you add adaptive bitrate streaming to the mix?
In the next tutorial, we'll tackle exactly that — streaming and adaptive bitrate, where a single upload becomes multiple versions at different quality levels, and the player switches between them on the fly based on the viewer's network speed. That's where all four operations come together in a production pipeline.