Jobs & Background Work: The Thinnest Thread Boundary
A first-principles foundations tutorial on Active Job in Campfire: why background work exists, why the trigger altitude (after_create_commit, not after_save) is the load-bearing decision, and why every Campfire job is a two-line thunk that delegates to a model method. Grounded in real file:line evidence from the send-a-message fan-out, the webhook delivery, the ban content sweep, and the web-push thread pool that escapes the Rails executor.
You post a message in a room with fifty people. Marking it sent is instant — one INSERT. But fifty people need to know. Some are offline and need an OS push notification, which means talking to Apple's and Google's push gateways over the network, one HTTPS call per device, any of which can hang for seconds or just fail.
Here's the version you'd vibe-code first, the honest one — you'd do it right where the message gets created, because that's where you're standing:
class Message < ApplicationRecord
  after_create :notify_everyone

  private

  def notify_everyone
    room.memberships.where.not(user: creator).each do |membership|
      membership.update!(unread_at: Time.current)  # N queries, one per member
      PushNotifier.deliver(membership.user, self)  # ...and a blocking network call each
    end
  end
end
Two separate bugs are baked in here, and they're the two bugs this whole tutorial exists to design out.
The first is speed: the sender's request now blocks while you make fifty sequential calls to flaky push providers. One slow gateway and the person who hit Enter watches a spinner for eight seconds.
The work is slow and flaky and fans out — exactly the profile of work that does not belong on the request thread.
The second is subtler and worse. after_create fires inside the database transaction, before the row is durable. If anything later in that transaction rolls the message back, you've already pinged fifty phones about a message that no longer exists: the ghost row. (We don't unpack the transaction lifecycle here; that is the job of F1: the callback lifecycle, and the altitude philosophy belongs to P9: Put Work at Its Right Altitude. Here we only need that _commit means after-durable.)
This tutorial is about the machinery that fixes the first problem — moving work off the request thread — and the one-word discipline that fixes the second. By the end you'll be able to read Campfire's entire asynchronous surface in about a dozen lines.
The mechanics, from first principles
Active Job is a uniform interface, not a queue
Rails separates "I want this to run later" from "here's the queue technology that runs it." That separation is Active Job: you write a class with a perform method and call .perform_later, and Active Job hands the work to whatever backend is configured.
class HardWorkJob < ApplicationJob
  def perform(record)
    record.do_the_slow_thing
  end
end

HardWorkJob.perform_later(some_record)  # enqueue now, run later on a worker
HardWorkJob.perform_now(some_record)    # run synchronously, right here
The backend is swappable and your job code never changes. Campfire configures Resque (Redis-backed) in production (production.rb:95: config.active_job.queue_adapter = :resque); another app might use Solid Queue or Sidekiq. That's the point of the interface — perform_later is the only seam you touch.
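To make the "interface, not a queue" idea concrete, here is a minimal plain-Ruby sketch of the same separation. It does not use Active Job: TinyJob, InlineAdapter and ArrayQueueAdapter are hypothetical names invented for illustration, and a real adapter does far more (serialization, retries, scheduling).

```ruby
# Plain-Ruby sketch of "uniform interface, swappable backend".
# TinyJob, InlineAdapter and ArrayQueueAdapter are hypothetical names,
# not Active Job classes.

class InlineAdapter
  # Runs the job immediately on the calling thread.
  def enqueue(job_class, args)
    job_class.new.perform(*args)
  end
end

class ArrayQueueAdapter
  attr_reader :pending

  def initialize
    @pending = []
  end

  # Stores the job as data; nothing runs until a worker drains the queue.
  def enqueue(job_class, args)
    @pending << [job_class, args]
  end

  # Stand-in for a worker process.
  def drain
    @pending.shift(@pending.size).each { |job_class, args| job_class.new.perform(*args) }
  end
end

class TinyJob
  class << self
    attr_writer :queue_adapter

    # Subclasses fall back to the adapter configured on the base class.
    def queue_adapter
      @queue_adapter || (superclass.respond_to?(:queue_adapter) ? superclass.queue_adapter : nil)
    end

    def perform_later(*args)
      queue_adapter.enqueue(self, args)
    end

    def perform_now(*args)
      new.perform(*args)
    end
  end
end

LOG = []

class HardWorkJob < TinyJob
  def perform(name)
    LOG << "worked on #{name}"
  end
end

TinyJob.queue_adapter = ArrayQueueAdapter.new
HardWorkJob.perform_later("report")   # stored as data, nothing has run yet
TinyJob.queue_adapter.drain           # the stand-in worker runs it now
```

Swapping TinyJob.queue_adapter for an InlineAdapter changes when the work runs without touching HardWorkJob, which is the property the text attributes to perform_later.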
perform_later serializes the record, not an id
Here's the mechanic people get wrong. When you enqueue a job, its arguments have to be written to the queue as data and read back on a worker thread that may be a different process entirely. You cannot serialize a live Ruby object onto Redis.
The naive instinct is to pass a primitive and re-find it:
PushMessageJob.perform_later(message.id)  # pass the id...

class PushMessageJob < ApplicationJob
  def perform(message_id)
    message = Message.find(message_id)  # ...and re-find it. (And handle the
    message.push                        # case where it was deleted meanwhile.)
  end
end
You don't have to. Active Job integrates with GlobalID: pass an Active Record object and it's serialized as a URI like gid://campfire/Message/123, then rehydrated into a real Message before perform runs. So you pass the record:
PushMessageJob.perform_later(message)  # pass the record itself

class PushMessageJob < ApplicationJob
  def perform(message)  # already a Message, already loaded
    message.push
  end
end
"If it just re-finds the row anyway, what did I gain?" Three things. First, the call site reads in the domain's nouns (perform_later(self, message), not perform_later(self.id, message.id)). Second, the re-find and the find-failure handling are framework concerns, not yours: discard_on ActiveJob::DeserializationError (commented out at application_job.rb:6 as the available knob) is how you say "if the record vanished before the worker got to it, drop the job." Third, you never write Model.find(params) boilerplate that drifts. The id is still the thing on the wire; you just never type it.
The two products even share the exact generator-default comment for discard_on. Where Campfire leaves it commented as documented intent, Fizzy turns the same line on per job (notify_recipients_job.rb:2, push_notification_job.rb:2), because its jobs genuinely race a record disappearing out from under them. Same idiom, dialed to each app's need.
That's the whole mechanism. An interface, a _later suffix, and records-as-GlobalIDs. Everything distinctive about Campfire is what it refuses to put inside perform.
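The round trip described above can be sketched in plain Ruby. This is only an illustration of the idea, not the globalid gem: FakeMessage, to_gid, locate, serialize_args, deserialize_args and RecordVanished are all hypothetical stand-ins, and a constant hash plays the part of the database table.

```ruby
# Plain-Ruby sketch of records-as-GlobalIDs. Every name here is a
# hypothetical stand-in, not the globalid gem's API.
require "json"

class RecordVanished < StandardError; end

class FakeMessage
  STORE = {}   # pretend database table, keyed by id

  attr_reader :id, :body

  def initialize(id, body)
    @id = id
    @body = body
    STORE[id] = self
  end

  def self.find(id)
    STORE.fetch(id) { raise RecordVanished, "FakeMessage #{id} is gone" }
  end

  # What goes on the wire: a URI naming the app, the class and the id.
  def to_gid
    "gid://campfire/#{self.class.name}/#{id}"
  end
end

# Turns the URI back into a loaded record by calling Class.find(id).
def locate(gid)
  class_name, id = gid.delete_prefix("gid://campfire/").split("/")
  Object.const_get(class_name).find(Integer(id))
end

# Enqueue side: only strings are serialized, never the live object.
def serialize_args(*records)
  JSON.generate(records.map(&:to_gid))
end

# Worker side: rehydrate before perform; a vanished record raises.
def deserialize_args(payload)
  JSON.parse(payload).map { |gid| locate(gid) }
end

message = FakeMessage.new(123, "hello")
payload = serialize_args(message)              # a JSON array holding one gid:// string
rehydrated = deserialize_args(payload).first   # a FakeMessage again
```

In this sketch a deleted record surfaces as RecordVanished during deserialization, before any job body runs. That is the single place where a "drop the job" policy can be attached, which is the role the text assigns to discard_on.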
The 37signals way
The job class is a two-line thunk
Open every job in the app. Here is Room::PushMessageJob in its entirety (push_message_job.rb:1-5):
class Room::PushMessageJob < ApplicationJob
  def perform(room, message)
    Room::MessagePusher.new(room:, message:).push
  end
end
Now Bot::WebhookJob (webhook_job.rb:1-5):
class Bot::WebhookJob < ApplicationJob
  def perform(bot, message)
    bot.deliver_webhook(message)
  end
end
And RemoveBannedContentJob (remove_banned_content_job.rb:1-5):
class RemoveBannedContentJob < ApplicationJob
  def perform(user)
    user.remove_banned_content
  end
end
Three jobs. Each is two real lines, and each does the same thing: receive rehydrated records, then immediately delegate to a method on a model (or a PORO that wraps the models). The job is the thinnest thread boundary — it exists only to be the place where "now" becomes "later." There is no logic in it to test, because there's no logic in it at all.
This isn't a Campfire habit — it's the house rule, written down. Fizzy's NotifyRecipientsJob is the same perform(record) { record.verb } thunk to the byte (notify_recipients_job.rb:1-6: def perform(notifiable); notifiable.notify_recipients; end), and 37signals state the law outright in their style guide: "we write shallow job classes that delegate the logic itself to domain models" (STYLE.md:185-187).
That's a deliberate discipline. The real work (who gets a push, how the webhook payload is built, which messages to destroy) all lives on the model, where it's reachable synchronously. You can call bot.deliver_webhook(message) straight from a console or a test with no queue running at all. The async-ness is a property of how it's invoked, layered on at the call site, not a property baked into the work.
"Why not just put the real logic in perform? It runs in the background either way." Because then the only way to run it is to enqueue it, which means the only way to test it is to drain a queue, and the only way to reuse it is to enqueue again. A bot's webhook reply rides the exact same deliver_webhook whether it came from a real message or a test: webhook.rb:60 calls room.messages.create!(...).broadcast_create synchronously from inside the delivery. Logic on the model stays composable; logic in a job is stranded behind the queue.
The _later/plain-method pair, with the guard on the wrapper
If the job is a dumb thunk, where does the "should this even run?" decision go? Watch the pairing in User::Bot (user/bot.rb:51-57):
def deliver_webhook_later(message)
  Bot::WebhookJob.perform_later(self, message) if webhook  # ⚠ guard lives HERE
end

def deliver_webhook(message)
  webhook.deliver(message)  # the work, no guard
end
Two methods, same shape across the codebase. Bannable does it identically (user/bannable.rb:19-28):
def remove_banned_content_later
  RemoveBannedContentJob.perform_later(self)
end

def remove_banned_content
  messages.each do |message|
    message.destroy
    message.broadcast_remove
  end
end
The convention: a plain method does the work, and a _later sibling owns both the enqueue and any guard about whether to enqueue.
Fizzy reaches for the identical pairing — Webhook::Delivery has a deliver_later that is just Webhook::DeliveryJob.perform_later(self) next to a deliver that does the real network work (webhook/delivery.rb:29-39) — and STYLE.md:189 names the _later suffix as the house signal for "this method enqueues a job." Notice where the if webhook check sits — on the wrapper, before the job is ever created. The job never re-checks if webhook defensively, because by the time perform runs, the decision was already made: a job that shouldn't run was never enqueued. A bot with no webhook produces zero queue rows, not a queue row that wakes up only to no-op.
The contrast: the naive version buries the guard inside perform (return unless bot.webhook), so you enqueue a job, occupy a worker, deserialize the records, and then decide there was nothing to do — and the same guard gets copy-pasted into every job because there's no shared place for it. Here the guard has exactly one home, on the named seam, and the job stays a two-line thunk you never have to read.
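A small plain-Ruby sketch shows why the guard's position matters for what lands in the queue. WEBHOOK_QUEUE, SketchBot, SketchWebhook and run_webhook_job are hypothetical stand-ins for the queue, the models and the job; nothing here is Campfire's code.

```ruby
# Plain-Ruby sketch of the _later / plain-method pair.
# All names are hypothetical stand-ins for the queue, models and job.

WEBHOOK_QUEUE = []   # each entry is one "queue row"

SketchWebhook = Struct.new(:url) do
  def deliver(message)
    "POST #{url}: #{message}"
  end
end

class SketchBot
  attr_reader :webhook

  def initialize(webhook: nil)
    @webhook = webhook
  end

  # Wrapper: owns the enqueue and the guard.
  def deliver_webhook_later(message)
    WEBHOOK_QUEUE << [self, message] if webhook
  end

  # Plain method: does the work, no guard, callable synchronously.
  def deliver_webhook(message)
    webhook.deliver(message)
  end
end

# The "job": a thunk that only delegates.
def run_webhook_job(bot, message)
  bot.deliver_webhook(message)
end

silent_bot = SketchBot.new
chatty_bot = SketchBot.new(webhook: SketchWebhook.new("https://example.test/hook"))

silent_bot.deliver_webhook_later("hi")   # guard fails: nothing is enqueued
chatty_bot.deliver_webhook_later("hi")   # exactly one row is enqueued

results = WEBHOOK_QUEUE.map { |bot, message| run_webhook_job(bot, message) }
```

Because the check runs before the row exists, run_webhook_job never has to ask whether a webhook is present, and the plain deliver_webhook stays directly callable in a test.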
The trigger altitude is the load-bearing line
Everything above is how the job is shaped. The most important decision is where it gets pulled. Recall the cold open's after_create. Campfire enqueues the push fan-out from after_create_commit instead (message.rb:12):
after_create_commit -> { room.receive(self) }
And room.receive is where the in-band/out-of-band split actually happens (room.rb:46-49):
def receive(message)
  unread_memberships(message)  # cheap, must be durable → runs IN-BAND
  push_later(message)          # slow, flaky, fans out → goes to a JOB
end
The two private methods it calls (room.rb:68-73):
def unread_memberships(message)
  memberships.visible.disconnected.where.not(user: message.creator).update_all(unread_at: message.created_at, updated_at: Time.current)
end

def push_later(message)
  Room::PushMessageJob.perform_later(self, message)
end
Two altitudes, drawn cleanly. The "who needs an unread badge?" decision is cheap and must be correct before the response returns, so it runs in-band as one bulk update_all: no loop, no N+1, the database does the filtering. The OS-push fan-out is slow and flaky, so it crosses the sync/async line into a job via push_later. That fan-out logic (the visible.disconnected scopes, the per-involvement push queries) is owned by Room::MessagePusher and detailed in C1: The Line Where Saving Becomes a Product; here the point is only the seam: cheap-and-durable stays, slow-and-flaky leaves.
Now the _commit part, which is the actual subject. The trigger is after_create_commit, not after_create, and that is what designs out the ghost row. perform_later enqueues the push job only after the row is durably committed. If the transaction rolls back, the callback never fires, so the job is never enqueued, so no phone ever gets pinged about a message that didn't survive. The naive after_create fires inside the transaction — it would enqueue a job for a row a rollback is about to erase, and the worker, running later, would happily push a notification for the ghost.
Fizzy reaches for the same _commit suffix on the same kind of work: its Notifiable concern enqueues recipient notifications from after_create_commit :notify_recipients_later (concerns/notifiable.rb:7), so a rolled-back card can't notify anyone. The choice of _commit over plain after_create is the 37signals reflex in both products.
"Isn't the worker running 'later' enough? Surely the rollback finishes before the job runs." No, and this is the trap. Enqueuing from inside the transaction races the rollback. The job row can land in the queue, get picked up by a worker, and Message.find either finds nothing or, worse, the rollback hasn't propagated yet and you push for a row about to vanish. _commit removes the race entirely by not enqueuing until durable.
The altitude of the trigger is not a detail; it's the correctness boundary. Count the edge cases this line absorbs for free: _commit absorbs the entire phantom-notification bug class in one suffix.
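The difference between the two trigger altitudes can be simulated without a database. The sketch below is a deliberately tiny model of a transaction, assuming only that commit-time hooks run on success and are dropped on rollback; SketchTransaction, ENQUEUED and both helper methods are hypothetical, and real Active Record transactions are much richer.

```ruby
# Plain-Ruby simulation of callback altitude. SketchTransaction and
# ENQUEUED are hypothetical stand-ins, not Active Record.

ENQUEUED = []

class SketchTransaction
  def initialize
    @after_commit = []
  end

  # Registers work to run only if the transaction commits.
  def after_commit(&block)
    @after_commit << block
  end

  # Runs the block; commits on success, rolls back on any error.
  def self.run
    tx = new
    yield tx
    tx.commit
    :committed
  rescue StandardError
    :rolled_back   # registered after_commit hooks are dropped, never run
  end

  def commit
    @after_commit.each(&:call)
  end
end

# after_create altitude: enqueue immediately, inside the transaction.
def create_with_after_create(_tx, id, fail_later:)
  ENQUEUED << [:after_create, id]
  raise "later step failed" if fail_later
end

# after_create_commit altitude: the enqueue is deferred until commit.
def create_with_after_create_commit(tx, id, fail_later:)
  tx.after_commit { ENQUEUED << [:after_create_commit, id] }
  raise "later step failed" if fail_later
end

SketchTransaction.run { |tx| create_with_after_create(tx, 1, fail_later: true) }
SketchTransaction.run { |tx| create_with_after_create_commit(tx, 2, fail_later: true) }
SketchTransaction.run { |tx| create_with_after_create_commit(tx, 3, fail_later: false) }
```

After these three runs ENQUEUED holds an entry for row 1, which was rolled back (the ghost), nothing for row 2, and an entry for row 3, which committed.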
Work that must escape the Rails executor lives in lib/
There's one more altitude, below the job. A background job still runs inside Rails — inside the framework's executor, which wraps each unit of work to manage database connections, reloading, and such. For the web-push fan-out, 37signals wanted a long-lived thread pool with persistent HTTP connections that does not want to be wrapped per-delivery. So that machinery lives in lib/, not app/, and the very first line says why (web_push/pool.rb:1):
# This is in lib so we can use it in a thread pool without the Rails executor
class WebPush::Pool
The discipline that makes this safe is in deliver_later (web_push/pool.rb:25-36):
def deliver_later(payload, subscription)
  # Ensure any AR operations happen before we post to the thread pool
  notification = subscription.notification(**payload)
  subscription_id = subscription.id

  delivery_pool.post do
    deliver(notification, subscription_id)
  rescue Exception => e
    Rails.logger.error "Error in WebPush::Pool.deliver: #{e.class} #{e.message}"
  end
rescue Concurrent::RejectedExecutionError
end
Read the comment, then the order of the lines. Every Active Record read — building the notification, grabbing the subscription.id — happens before delivery_pool.post. By the time work is handed to a raw thread, it's holding plain data (notification, an integer id), never a live record that would expect a database connection checked out from a pool the executor manages. The threads touch the network; they never touch the ORM.
The contrast: the naive version posts the subscription record itself into the thread and lets each thread lazily query through it — and now fifty raw threads are racing for database connections outside Rails' connection management, throwing connection pool exhausted under load in a way that's maddening to reproduce. Campfire's rule — do all the AR reads, then post primitives to the pool — makes that impossible by construction, and confines the whole unusual pattern to one file in lib/ so the rest of the app never has to think about executors at all.
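The "reads first, primitives into the pool" rule can be sketched with nothing but Ruby's built-in Thread and Queue. This is not the source's pool: SketchPool, SketchSubscription and DELIVERED are hypothetical names, and a struct stands in for the Active Record subscription.

```ruby
# Plain-Ruby sketch of "do the reads first, post only primitives".
# SketchPool and SketchSubscription are hypothetical stand-ins.

class SketchPool
  def initialize(size)
    @jobs = Queue.new
    @workers = Array.new(size) do
      Thread.new do
        while (job = @jobs.pop)   # nil is the shutdown signal
          begin
            job.call
          rescue StandardError => e
            warn "delivery failed: #{e.class} #{e.message}"
          end
        end
      end
    end
  end

  def post(&block)
    @jobs << block
  end

  # Sends one stop signal per worker, then waits for them to finish.
  def shutdown
    @workers.size.times { @jobs << nil }
    @workers.each(&:join)
  end
end

# Stand-in for a record whose attributes would need a database connection.
SketchSubscription = Struct.new(:id, :endpoint) do
  def notification(title:, body:)
    { endpoint: endpoint, title: title, body: body }.freeze
  end
end

DELIVERED = Queue.new   # thread-safe sink so the result can be inspected

def deliver_later(pool, payload, subscription)
  # All "record" reads happen here, on the calling thread.
  notification = subscription.notification(**payload)
  subscription_id = subscription.id

  # The posted block uses only a frozen hash and an integer, never the record.
  pool.post { DELIVERED << [subscription_id, notification[:title]] }
end

pool = SketchPool.new(2)
subscription = SketchSubscription.new(42, "https://push.example.test")
deliver_later(pool, { title: "New message", body: "hi" }, subscription)
pool.shutdown
```

The worker threads in this sketch never call a method on the subscription, so nothing they do depends on record state or a database connection.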
Resumable & resilient jobs (advanced)
This section is the frontier — skip it on a first read. The core lesson above (the job is the thinnest thread boundary) stands on its own. What follows is what a thin thunk grows into when the work it delegates to is genuinely long-running or genuinely flaky, and it's all in Fizzy, because Campfire's fan-outs are short enough to never need it.
Everything above assumed the job either finishes or doesn't. But some work is long — a webhook dispatch that walks every active webhook for an event, one blocking HTTPS call each. If a deploy restarts the worker halfway through, a naive job re-runs perform from the top and re-fires every webhook it already delivered. The honest Rails fix you'd reach for is a delivered? flag per webhook and a .where(delivered: false) — a column to maintain, a write per delivery, and a new way to get it wrong.
Fizzy doesn't keep that flag. It makes the job itself resumable with ActiveJob::Continuable (event/webhook_dispatch_job.rb:10-17):
def perform(event)
  step :dispatch do |step|
    Webhook.active.triggered_by(event).find_each(start: step.cursor) do |webhook|
      webhook.trigger(event)
      step.advance! from: webhook.id
    end
  end
end
Read what step buys. find_each(start: step.cursor) begins the scan where the cursor last pointed, and step.advance! from: webhook.id moves the cursor forward after each successful trigger. The framework persists that cursor. If the worker dies after webhook 40 of 100, the retry doesn't restart at 1 — it resumes at 41, because the cursor survived the crash as job state. The "which ones did I already do?" bookkeeping that you'd have hand-rolled as a database column is now a property of the job's position, not a property you store on every record. (And notice it's still a thin thunk under the resumption machinery: the real work is webhook.trigger(event), one verb on the model — the step/cursor scaffolding wraps a delegating call, it doesn't replace it.)
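The cursor idea is easy to see in a plain-Ruby simulation. CursorStep, WorkerDied, TRIGGERED and dispatch are hypothetical names, this is not the ActiveJob::Continuable API, and the "persisted" cursor is just a local variable that survives the simulated crash.

```ruby
# Plain-Ruby sketch of cursor-based resumption. All names are hypothetical.

class WorkerDied < StandardError; end

class CursorStep
  attr_reader :cursor

  def initialize(cursor = nil)
    @cursor = cursor
  end

  # Records that everything up to and including `from` is done.
  def advance!(from:)
    @cursor = from + 1
  end
end

TRIGGERED = []

# Walks ids in order starting at the cursor; `die_after` simulates a crash.
def dispatch(webhook_ids, step, die_after: nil)
  webhook_ids.sort.each do |id|
    next if step.cursor && id < step.cursor   # already handled before the crash
    TRIGGERED << id                            # stand-in for webhook.trigger(event)
    step.advance!(from: id)
    raise WorkerDied if id == die_after
  end
end

ids = (1..5).to_a
step = CursorStep.new

begin
  dispatch(ids, step, die_after: 3)   # the worker "dies" after webhook 3
rescue WorkerDied
  saved_cursor = step.cursor          # a framework would persist this with the job
end

dispatch(ids, CursorStep.new(saved_cursor))   # the retry resumes at 4
```

In this sketch each id is triggered once across both runs. A crash that lands between the trigger and the advance would still repeat that single item, so the delegated verb should tolerate being called twice.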
The other half of resilience is knowing which failures to retry. A flaky push gateway or an unreachable SMTP server is transient — retry it. A 550 Unknown user is permanent — retrying it forever is just noise. Fizzy's SmtpDeliveryErrorHandling concern classifies them by SMTP reply code (smtp_delivery_error_handling.rb:4-35):
included do
  # Retry delivery to possibly-unavailable remote mailservers.
  retry_on Net::OpenTimeout, Net::ReadTimeout, Socket::ResolutionError, wait: :polynomially_longer

  # Net::SMTPServerBusy is SMTP error code 4xx, a temporary error.
  # Common one we've seen is 452 4.3.1 Insufficient system storage.
  # Patiently retry.
  retry_on Net::SMTPServerBusy, wait: :polynomially_longer

  # SMTP error 50x.
  rescue_from Net::SMTPSyntaxError do |error|
    case error.message
    when /\A501 5\.1\.3/
      # Ignore undeliverable email addresses.
      Sentry.capture_exception error, level: :info if Fizzy.saas?
    else
      raise
    end
  end

  # SMTP error 5xx except 50x and 53x.
  #   * 550 5.1.1: Unknown users
  #   * 552 5.6.0: Message/headers too large
  rescue_from Net::SMTPFatalError do |error|
    case error.message
    when /\A550 5\.1\.1/, /\A552 5\.6\.0/, /\A555 5\.5\.4/
      Sentry.capture_exception error, level: :info if Fizzy.saas?
    else
      raise
    end
  end
end
This is operational scar tissue turned into reusable code. The timeouts and the 4xx server-busy get retry_on with polynomial backoff — patient, because the remote will probably recover. The permanent 5xx errors get rescue_from, and the body switches on the reply-code prefix (/\A550 5\.1\.1/, /\A552 5\.6\.0/): a known-permanent code is logged-and-swallowed so it never poisons the retry queue, and anything not matched is re-raised (else raise) so a failure mode they haven't seen yet doesn't get silently eaten. Each # Common one we've seen is... comment is a bug report fossilized into the classifier.
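The three-way split (retry, swallow, re-raise) can be written as a small pure function. The sketch below is plain Ruby with made-up error classes, not Net::SMTP or Active Job; classify_failure, PERMANENT_REPLIES and the Sketch* classes are hypothetical names, and only two of the source's reply-code patterns are reused.

```ruby
# Plain-Ruby sketch of classifying failures before deciding what to do.
# The error classes and classify_failure are hypothetical stand-ins.

class SketchTimeout < StandardError; end
class SketchServerBusy < StandardError; end   # plays the role of a 4xx reply
class SketchFatalError < StandardError; end   # plays the role of a 5xx reply

PERMANENT_REPLIES = [/\A550 5\.1\.1/, /\A552 5\.6\.0/].freeze

# Returns :retry, :discard or :raise for a given error.
def classify_failure(error)
  case error
  when SketchTimeout, SketchServerBusy
    :retry       # transient: try again later
  when SketchFatalError
    if PERMANENT_REPLIES.any? { |pattern| pattern.match?(error.message) }
      :discard   # known-permanent: log it and drop the job
    else
      :raise     # a 5xx reply not on the list: let it surface
    end
  else
    :raise       # anything unrecognized: let it surface
  end
end

outcomes = [
  SketchTimeout.new("read timeout"),
  SketchServerBusy.new("452 4.3.1 Insufficient system storage"),
  SketchFatalError.new("550 5.1.1 Unknown user"),
  SketchFatalError.new("554 5.7.1 Rejected"),
].map { |error| classify_failure(error) }
```

The important branch is the last one in each case: anything the list does not name explicitly falls through to :raise instead of being silently dropped.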
The contrast: the naive job wraps the whole delivery in one rescue => e; retry and treats every failure the same, so a permanent 550 Unknown user retries on the same backoff schedule as a transient timeout, burning worker time forever on mail that will never deliver, while a genuinely novel error gets swallowed by the same blanket rescue and never surfaces. Fizzy's split (retry_on for transient, rescue_from + reply-code prefix for permanent, else raise for the unknown) is the difference between a retry policy and a prayer.
"Isn't this exactly the in-perform logic the whole tutorial told me to avoid?" No, and the distinction is the point. retry_on/rescue_from/step are declarative job-framework concerns (when to re-run, where to resume), not domain logic. The domain work is still one verb on a model (webhook.trigger(event)). The thin-thunk rule was about not stranding business logic behind the queue; configuring how the queue itself behaves on failure is precisely what the job class is for.
The whole async surface, in a dozen lines
Step back and count what you can now read. Three job files at five lines each. Three _later wrappers that each enqueue exactly one of them. One trigger (after_create_commit), one split (receive), one lib/ pool for the work that escapes Rails entirely. That's the complete asynchronous behavior of a production chat app — and you can hold all of it in your head at once, precisely because no job contains any logic worth reading. The logic is on the models, synchronously testable; the jobs are just the thread boundary, and the trigger's altitude is where correctness lives.
Scheduled work is the same rich-model-verb doctrine
One more altitude Campfire never demonstrates: time-triggered work. Some jobs aren't pulled by a save at all — they fire on a schedule (clean up stale links, postpone idle cards). Fizzy runs these through Solid Queue's config/recurring.yml, and the striking thing is that the schedule file holds no logic — each task is a one-liner that names a schedule and calls a domain verb (recurring.yml:5-51): auto_postpone_all_due is just command: "Card.auto_postpone_all_due" on every hour at minute 50, cleanup_magic_links is command: "MagicLink.cleanup", deliver_bundled_notifications is command: "Notification::Bundle.deliver_all_later". The cron entry is the exact same thinnest-thread-boundary as a perform_later thunk — it owns only when, and the model owns what (Card.auto_postpone_all_due is a real method you can call from a console). The staggered minutes (minute 50, minute 12, minute 20) just spread load off the top of the hour; the doctrine underneath is identical to every job above.
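The "schedule owns when, model owns what" split can be sketched in plain Ruby. The real file is YAML read by Solid Queue; the table below is only a Ruby stand-in, and SketchCard, SketchMagicLink, RECURRING_TASKS and run_due_tasks are hypothetical. The minute for the cleanup entry is chosen purely for illustration, since the text does not say which task runs at which of the other minutes.

```ruby
# Plain-Ruby stand-in for a recurring-task table. Each entry owns only
# "when" and names a model verb for "what". All names are hypothetical.

module SketchCard
  RAN = []

  # A real domain verb: callable from a console, no scheduler required.
  def self.auto_postpone_all_due
    RAN << :auto_postpone_all_due
  end
end

module SketchMagicLink
  RAN = []

  def self.cleanup
    RAN << :cleanup
  end
end

RECURRING_TASKS = {
  auto_postpone_all_due: { minute: 50, command: -> { SketchCard.auto_postpone_all_due } },
  cleanup_magic_links:   { minute: 12, command: -> { SketchMagicLink.cleanup } },   # minute is illustrative
}.freeze

# Stand-in for the scheduler tick: run whatever is due this minute.
def run_due_tasks(minute)
  RECURRING_TASKS.each_value { |task| task[:command].call if task[:minute] == minute }
end

run_due_tasks(50)   # only the postpone task is due
```

Each entry here is as thin as a job thunk: removing the table loses the timing, but the verbs on the modules keep working when called directly.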
This is the foundation under P9: Put Work at Its Right Altitude, the principle that every unit of work has a right altitude (in-band vs out-of-band) and the seam between them should be the thinnest possible thread boundary. It also serves P1: The Model Owns Its Consequences: the model owns what happens (room.receive, deliver_webhook), and the job owns only where it happens (on another thread). The model says what; the job boundary says where.
Carrying request context into the background (advanced)
Advanced, and the home for this idea. It lives here in F5, but it leans on two principles whose deep treatment is elsewhere: P3: Security Is the Shape of Your Data Access and P9: Put Work at Its Right Altitude. The lesson is the join between them: the altitude seam must also carry tenancy across the job boundary.
Campfire is single-tenant, so a job never has to ask "whose data am I working on?" — there's one world. Fizzy is multi-tenant: every request runs inside a Current.account, and a board, card, or webhook is only meaningful relative to that account. Now cross the job boundary. The web request that enqueued the job had Current.account set by the middleware. The worker that runs perform later is a different process with Current.account blank. The honest Rails instinct is to thread the account through by hand — add an account_id argument to every job, pass it at every call site, and re-establish it as the first line of every perform. That's a parameter you can forget on exactly one job, and the bug it produces — a job quietly reading another tenant's data — is the worst kind.
Fizzy refuses to make tenancy a job argument. It makes it an ambient property of every job, via the AccountTenanted concern (account_tenanted.rb:15-46):
def initialize(...)
  super
  @account = Current.account
end

def serialize
  super.merge({ "account" => @account&.to_gid })
end

def deserialize(job_data)
  super
  @account_gid = job_data["account"]
end

private

def with_account_context(&block)
  resolve_account!

  if account.present?
    Current.with_account(account, &block)
  else
    yield
  end
end

def resolve_account!
  if @account_gid
    @account = GlobalID::Locator.locate(@account_gid)
  end
rescue ActiveRecord::RecordNotFound
  raise ActiveJob::DeserializationError
end
Trace the account across the wire. At enqueue time, initialize snapshots Current.account off the live request, and serialize writes it onto the job payload as a GlobalID — the same gid:// trick that serializes a record argument, applied to the ambient account. At run time on the worker, deserialize reads the GID back, and an around_perform :with_account_context (wired in the prepended do block above this snippet) re-establishes Current.with_account(account) around the whole perform. So inside any job, Current.account is the same account the request had — without that job ever declaring an account parameter. The account rides the queue the way the record does, and the deliberate rescue ... raise ActiveJob::DeserializationError means a deleted account drops the job through the same discard_on door as any vanished record.
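The same snapshot, serialize, restore sequence can be traced in a plain-Ruby sketch. It is not Fizzy's concern: SketchCurrent, TenantedJob, ReportJob, ACCOUNTS and SEEN are hypothetical, a hash lookup stands in for GlobalID, and execute plays the part of the worker plus the around_perform wrapper.

```ruby
# Plain-Ruby sketch of carrying ambient context across a job boundary.
# All names are hypothetical; a hash lookup stands in for GlobalID.

module SketchCurrent
  class << self
    attr_accessor :account

    # Sets the account for the duration of the block, then restores it.
    def with_account(account)
      previous = self.account
      self.account = account
      yield
    ensure
      self.account = previous
    end
  end
end

ACCOUNTS = { 1 => "acme", 2 => "globex" }   # pretend accounts table

class TenantedJob
  def initialize(*args)
    @args = args
    @account_id = ACCOUNTS.key(SketchCurrent.account)   # snapshot at enqueue time
  end

  # What gets written to the queue: the arguments plus the ambient account.
  def serialize
    { "args" => @args, "account" => @account_id }
  end

  # Worker side: restore the account around perform.
  def self.execute(job_data)
    account = ACCOUNTS.fetch(job_data["account"])
    SketchCurrent.with_account(account) { new.perform(*job_data["args"]) }
  end
end

SEEN = []

class ReportJob < TenantedJob
  def perform(name)
    SEEN << [name, SketchCurrent.account]   # no account parameter anywhere
  end
end

payload = SketchCurrent.with_account("acme") { ReportJob.new("weekly").serialize }
SketchCurrent.account = nil    # the worker process starts with no account set
ReportJob.execute(payload)
```

ReportJob#perform never mentions an account, yet it observes "acme" while it runs, and the ambient value is nil again afterwards. In this sketch a missing account makes ACCOUNTS.fetch raise before perform starts, loosely mirroring the deserialization failure path described above.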
The second move is what makes it total. ApplicationJob does prepend AccountTenanted (application_job.rb:2) so every app job is tenanted — but framework jobs (ActionMailer's delivery job, Turbo's broadcast jobs) don't inherit from ApplicationJob, and they enqueue work too. So an initializer prepends the same concern onto Rails' own internal jobs (initializers/active_job.rb:3-15):
ActiveSupport.on_load(:active_job) do
  self.enqueue_after_transaction_commit = true
end

ActiveSupport.on_load(:action_mailer) do
  ActionMailer::MailDeliveryJob.prepend AccountTenanted
end

Rails.application.config.after_initialize do
  Turbo::Streams::ActionBroadcastJob.prepend AccountTenanted
  Turbo::Streams::BroadcastJob.prepend AccountTenanted
  Turbo::Streams::BroadcastStreamJob.prepend AccountTenanted
end
Now even a mailer enqueued by Rails, or a Turbo broadcast fired by a model callback, runs inside the right tenant — the concern blankets work the app never explicitly wrote as a job. And the first line closes the loop back to this tutorial's other obsession: enqueue_after_transaction_commit = true makes all Active Job enqueues wait for the surrounding transaction to commit — the same _commit means after-durable guarantee that after_create_commit gave one callback, now applied globally so no job is ever enqueued for a row a rollback is about to erase.
The contrast: the by-hand version threads account_id through every job signature and re-sets Current.account at the top of every perform, until the one job that forgets, which then reads across tenant boundaries with no error to announce it. Fizzy's concern makes forgetting impossible: tenancy is serialized and re-established by the boundary itself, for app jobs and framework jobs alike. The altitude seam (see P9: Put Work at Its Right Altitude) doesn't just move work off the request thread; it carries the request's identity with it.
Key Takeaways — Patterns to Steal
- When you enqueue, don't pass message.id and then re-find the row inside the worker (and hand-roll the "what if it was deleted" branch every time). Pass the record itself: Active Job's GlobalID serializes it as a URI like gid://campfire/Message/123 and rehydrates a real, loaded model before perform runs. Campfire writes perform_later(self, message) in domain nouns (room.rb:73) and leaves the vanished-record case to the framework's discard_on ActiveJob::DeserializationError knob (commented out at application_job.rb:6) instead of a Model.find in every job.
- Don't put the real work inside perform just because it runs in the background; that strands the logic behind the queue, where the only way to test or reuse it is to drain a queue. Make the job a two-line thunk that immediately delegates to a method on a model or a PORO. Room::PushMessageJob#perform is Room::MessagePusher.new(room:, message:).push and nothing else (push_message_job.rb:2-4); the real logic lives on the model where you can call it straight from a console with no worker running.
- Don't bury the "should this even run?" check inside perform with a return unless ...; you'd occupy a worker and deserialize records only to discover there was nothing to do, and the same guard gets copy-pasted into every job. Pair a plain method that just works with a _later sibling that owns both the enqueue and the guard. deliver_webhook_later carries the if webhook check before the job is ever created while deliver_webhook runs unguarded (user/bot.rb:51-57), so a bot with no webhook produces zero queue rows instead of a row that wakes up to no-op.
- When a callback enqueues work that reaches outside the database, don't trigger it from plain after_create. That fires inside the transaction, so a rollback can erase the row after you've already queued a push for it, and the worker happily notifies fifty phones about a message that no longer exists. Reach for after_create_commit instead: _commit means after-durable, so the job is enqueued only once the row survives. Campfire's after_create_commit -> { room.receive(self) } (message.rb:12) absorbs the entire ghost-row bug class in one suffix.
- Don't let every consequence of saving ride the same altitude: looping over members to set each unread badge AND firing the slow push from one place blocks the sender's request on flaky gateways. Split the work inside a domain method: cheap-and-must-be-durable stays in-band, slow-and-flaky crosses into a job. Campfire's room.receive runs unread_memberships as a single bulk update_all synchronously (no N+1 loop; the database does the filtering) and hands push_later off to Room::PushMessageJob (room.rb:46-49, 68-73).
- When you post work to a raw thread pool, don't hand a live Active Record object into the thread and let each thread lazily query through it; fifty threads then race for connections outside Rails' connection management and throw connection pool exhausted under load. Do every AR read first, then post only primitives. WebPush::Pool#deliver_later builds subscription.notification and grabs subscription.id before delivery_pool.post, so the threads hold plain data and touch only the network, never the ORM (web_push/pool.rb:25-36).