Why Cron Jobs Fail in Production (and What Teams Do About It)
Timezones, overlapping runs, silent failures, and the small mistakes that cause big outages.
Cron is one of the oldest tools in the Unix toolbox, and for good reason — you write a schedule, point it at a command, and it just runs. Quietly, repeatedly, for years.
Then production happens.
A billing sync runs twice. A cleanup job silently stops running. A report goes stale for a week before anyone notices. A “simple” script that works fine in your terminal fails under cron because it can’t find python, can’t see your environment variables, or happily writes output straight into the void.
Cron is extremely literal. Most production failures come down to teams assuming it’s doing more than it actually is.
What cron actually is
Cron is not a workflow engine. It does not understand dependencies, concurrency, retries, backoff, idempotency, or “make sure it runs exactly once.” It is basically:
- At a matching minute, start a process using a minimal environment.
- If it starts, cron considers its job done.
- If it fails, cron usually doesn’t “handle it” unless you explicitly do.
That mental model explains most surprises. Cron triggers processes; your code must handle everything else.
Failure mode #1: Timezones, DST, and “the day got weird”
Timezones
Cron schedules are interpreted in the server’s timezone (or the cron daemon’s configured timezone). Problems show up when:
- Your business logic uses a different timezone than the host.
- Servers in different regions run the same schedule (common in multi-region setups).
- A container image assumes UTC, but the host uses local time (or the reverse).
What teams do about it
- Standardize on UTC for servers and schedules when possible.
- If business time matters (e.g., “9 a.m. Tokyo”), run the job in that timezone intentionally and document it.
- Log timestamps with timezone offsets (ISO 8601) so debugging isn’t guesswork.
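Logging with explicit offsets is a one-liner if GNU `date` is available (the `--iso-8601` flag is a GNU extension; the timezone values are just examples):

```shell
# Emit ISO 8601 timestamps with an explicit offset (GNU date).
TZ=UTC date --iso-8601=seconds          # e.g. 2024-03-10T02:30:00+00:00
TZ=Asia/Tokyo date --iso-8601=seconds   # e.g. 2024-03-10T11:30:00+09:00
```

With the offset in every log line, "which 02:30 was that?" stops being a debugging question.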
Daylight Saving Time (DST)
DST creates two classic cron bugs:
- Spring forward: a whole hour “doesn’t exist,” so some schedules never match.
- Fall back: an hour repeats, so some schedules match twice.
If you run “daily at 02:30” in a DST-observing timezone, there will be days where 02:30 is skipped and days where it happens twice.
What teams do about it
- Avoid scheduling in the “DST danger window” (typically around 01:00–03:00 local time).
- If you must run at local business time, make jobs idempotent (safe to run twice).
- For “must run once” semantics, move to a scheduler with uniqueness guarantees, or implement a run ledger (see overlapping runs below).
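If a job must fire at local business time, some cron daemons support a per-crontab timezone. This is an extension, not standard cron, so verify your daemon supports it before relying on it:

```
# crontab fragment. CRON_TZ is an extension (cronie supports it; many other
# daemons do not). The job path is a placeholder.
CRON_TZ=Asia/Tokyo
0 9 * * 1-5 /opt/app/bin/daily-report
```

Note that `CRON_TZ` typically affects only when the schedule fires; the job process still inherits the host timezone unless you also set `TZ`.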
Failure mode #2: Overlapping executions (double runs, race conditions, data corruption)
Cron does not care if a previous run is still running.
If you schedule something every 5 minutes and it sometimes takes 7 minutes, you get:
- Two copies running at once
- Concurrency bugs
- Duplicate emails / double billing / double inserts
- Locks held longer than expected
- Load spikes right when things are already slow
What teams do about it
- Add a lock:
  - Use `flock` (simple and effective on a single host)
  - Use a lock file with PID checks (less reliable)
  - Use a distributed lock (Redis/Postgres) if multiple machines might run the job
- Make jobs idempotent:
  - Use unique keys (e.g., “invoice id + period”)
  - Upsert instead of insert
  - Record “processed” markers
- Use a “run ledger”:
  - Store the last successful run time in a DB
  - Query work based on “since last run” windows
  - Handle gaps and retries safely
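A minimal file-based sketch of the run-ledger idea (a real implementation would keep this row in a database; `process_events` and the state path are hypothetical):

```shell
#!/bin/sh
# Sketch: process a "since last run" window, advancing the marker only on success.
STATE=/var/lib/myjob/last_run          # placeholder path
now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
since=$(cat "$STATE" 2>/dev/null || echo 1970-01-01T00:00:00Z)

# process_events is a hypothetical command; substitute your job here.
if process_events --since "$since" --until "$now"; then
    echo "$now" > "$STATE"             # advance only after success
fi
```

Because the marker advances only on success, a failed run is retried over the same window on the next tick, which is exactly why the job must be idempotent.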
A simple single-host pattern looks like:
- Schedule runs every minute
- Job exits immediately if it can’t acquire the lock
- Job does the work and releases the lock
Editing cron is harder than it should be
Cron syntax is compact, but it’s not friendly:
- `*/5 * * * *` is obvious to some people and cryptic to others
- “Every weekday at 9” vs “every 9 minutes” mistakes happen
- Month/day-of-week interactions are confusing
- Small typos become production outages
If your team edits cron by hand, you’re relying on tribal knowledge and perfect attention.
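Two expressions that look alike but mean very different things:

```
0 9 * * 1-5    # 09:00 on weekdays
*/9 * * * *    # every 9 minutes, all day long, not “at 9”
```

One character in the wrong field turns a daily report into a job that hammers your database all day.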
If you want a safer way to generate and review schedules, you can use our cron editor here: Cron Editor. Many teams only discover cron mistakes during incidents; visual review before deployment catches most of them. The editor lets you build cron expressions and verify which times they actually produce before you deploy.
Failure mode #3: Silent failures
A silent failure means cron did run… but the job didn’t succeed.
Cron will happily start your command even if the command fails instantly. Many failures are “silent” because output goes nowhere unless you capture it.
Common silent-failure causes:
- Exit code non-zero with no alerting
- Output not logged (stdout/stderr discarded or emailed to an unmonitored mailbox)
- Script returns success even when it partially fails
- Dependencies down (DB, API) and you don’t retry or alert
What teams do about it
- Always log somewhere you actually read:
- Write to syslog/journald
- Append to a rotated log file
- Ship logs to a log service
- Alert on failures:
- Check exit codes
- Send to a webhook (Slack, PagerDuty, email you monitor)
- Add health checks:
- “Job ran successfully in the last X minutes/hours”
- A heartbeat ping on success (or even at start/end)
If you run critical jobs, “no news” is not good news unless you explicitly verify success.
Failure mode #4: Environment differences (it worked in my shell)
Cron jobs often fail because cron runs with a minimal environment. Typical surprises:
- `PATH` is different (cron can’t find `node`, `python`, `bash`, `psql`, etc.)
- `HOME` isn’t what you expect (config files not found)
- Different shell (`sh` vs `bash`)
- Missing environment variables (API keys, DB URLs)
- Different working directory (relative paths break)
- Permissions differ (cron runs as a different user)
What teams do about it
- Use absolute paths:
  - `/usr/bin/python3` not `python`
  - `/opt/app/bin/run_job` not `./run_job`
- Set an explicit environment:
  - Export required variables in the script (or source a known env file)
  - Set `PATH` explicitly at the top
- Use a wrapper script:
  - One entrypoint that sets env, `cd`s to the right directory, and runs the job
- Run the job the way cron runs it:
  - Test with a stripped environment to catch assumptions early
In production, the environment is part of your code.
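One way to run a job the way cron runs it is to strip the environment with `env -i` and supply only what cron would; the variables and the job path here are illustrative:

```shell
# Reproduce cron's sparse environment before deploying a job.
# /opt/app/bin/run_job is a placeholder.
env -i HOME="$HOME" PATH=/usr/bin:/bin SHELL=/bin/sh \
    /bin/sh -c '/opt/app/bin/run_job'
```

If the job survives this, it's far less likely to break on the difference between your login shell and cron's environment.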
Failure mode #5: Host churn, containers, and “cron lives on one machine”
Cron is host-local. That’s great until it isn’t:
- You deploy to multiple instances and accidentally run the job N times
- You replace instances and the crontab doesn’t follow
- Containers restart and cron isn’t running inside them
- A node dies and scheduled jobs disappear with it
What teams do about it
- Treat cron configuration as code:
- Keep schedules in repo
- Deploy them via automation (not manual edits)
- Use a single scheduler host for “singleton” jobs, or implement distributed locking
- Consider managed scheduling or orchestration when jobs must survive host replacement
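Treating cron as code often means shipping a file into `/etc/cron.d/` from the repository rather than hand-editing crontabs. Most system cron daemons read that directory; note the extra user field in each line. Paths and the `app` user are placeholders:

```
# /etc/cron.d/myapp, deployed by CI and never edited by hand.
SHELL=/bin/sh
PATH=/usr/sbin:/usr/bin:/sbin:/bin
*/5 * * * * app /opt/app/bin/myjob-wrapper.sh
```

A schedule that lives in the repo survives instance replacement and shows up in code review, which a hand-edited crontab never does.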
When cron is the wrong tool
Cron is great for simple, low-to-moderate frequency tasks on a single machine. It’s a bad fit when you need guarantees cron doesn’t provide.
Cron is often the wrong choice for:
High-frequency jobs
If you need sub-minute scheduling or constant background processing, cron becomes a noisy loop. Better options:
- A worker process reading from a queue
- A long-running service with backoff and retries
- Event-driven triggers
Distributed workloads
If multiple nodes might run the same job, you need coordination:
- A distributed scheduler
- A queue + workers model
- A workflow engine
- A database-driven “claim work” pattern
Dependency-heavy pipelines
If job B must run only after job A succeeds, cron alone is fragile:
- You’ll end up re-implementing a workflow engine in shell scripts
- Failures become hard to reason about and recover from
Cron can trigger workflows, but it shouldn’t be your workflow system.
What production teams usually standardize on
If you stick with cron, the teams that avoid outages typically add a few non-negotiables:
- Locking to prevent overlap
- Idempotency so retries and duplicates don’t hurt
- Logging that is centralized and searchable
- Alerting on failure and on “didn’t run”
- Explicit environment (paths, variables, working directory)
- Run visibility (start/end, duration, exit code)
This turns cron from “fire and forget” into “scheduled execution with guardrails.”
Closing
Cron is brutally literal. Production failures happen when teams assume it’s forgiving — that it’ll detect a running job and wait, or notice an exit code and retry, or do anything at all besides “start the next scheduled process.”
Treat it as a dumb trigger and design your jobs for reality: timezones, overlap, missing environment, and silent failure. Do that, and cron is actually one of the more dependable pieces of infrastructure you’ll run.
Written by the Infra Atlas author
I work on infrastructure and software systems across layers: writing code, shipping products, and dealing with the practical trade-offs of hosting, memory, and network behavior in production. When this site says it covers “layer 3 to layer 9,” it’s half a joke and half a truth: from routing and packets, up through operating systems, applications, and the human decisions that actually cause outages.
Infra Atlas is a collection of field notes from that work. Some pages may include affiliate or referral links as a low-key way to support the site. Think of it as buying me a coffee while I write about why systems behave the way they do.