Retries & Dead Letters¶
taskito automatically retries failed tasks with exponential backoff and moves permanently failed jobs to a dead letter queue.
Retry Policy¶
Configure retries at the task level:
import requests

@queue.task(max_retries=5, retry_backoff=2.0)
def flaky_api_call(url):
    response = requests.get(url)
    response.raise_for_status()
    return response.json()
| Parameter | Default | Description |
|---|---|---|
| max_retries | 3 | Maximum retry attempts before DLQ |
| retry_backoff | 1.0 | Base delay in seconds for exponential backoff |
Backoff Formula¶
- base_delay = retry_backoff (in seconds)
- max_delay = 300 seconds (5 minutes)
- jitter = random 0–500 ms to prevent thundering herd
Example with retry_backoff=2.0:
| Attempt | Delay |
|---|---|
| 1st retry | ~2s |
| 2nd retry | ~4s |
| 3rd retry | ~8s |
| 4th retry | ~16s |
| 5th retry | ~32s |
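A minimal sketch of how that delay could be computed, assuming the delay doubles with each attempt as the table suggests; this is an illustration, not taskito's actual implementation:

import random

MAX_DELAY = 300.0  # cap: 5 minutes

def backoff_delay(retry_backoff: float, attempt: int) -> float:
    """Rough delay before the n-th retry (attempt starts at 1)."""
    delay = min(retry_backoff * (2 ** (attempt - 1)), MAX_DELAY)
    return delay + random.uniform(0.0, 0.5)  # 0-500 ms jitter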
Exception Filtering¶
Control which exceptions trigger retries with retry_on and dont_retry_on:
@queue.task(
    max_retries=5,
    retry_on=[ConnectionError, TimeoutError],
    dont_retry_on=[ValueError],
)
def fetch_data(url):
    response = requests.get(url)
    response.raise_for_status()
    return response.json()
| Parameter | Description |
|---|---|
| retry_on | Whitelist — only retry on these exception types. All others skip straight to DLQ. |
| dont_retry_on | Blacklist — never retry on these exception types, even if retries remain. |
If neither is set, all exceptions trigger retries (default behavior).
Note
retry_on and dont_retry_on are mutually exclusive in practice — if retry_on is set, only those exceptions are retried regardless of dont_retry_on.
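To make that precedence concrete, here is a small sketch of the decision; the helper name should_retry is made up for illustration and is not part of taskito's API:

def should_retry(exc, retry_on=None, dont_retry_on=None):
    """Illustrative decision helper mirroring the precedence described above."""
    if retry_on is not None:
        # Whitelist wins: only listed exception types are retried.
        return isinstance(exc, tuple(retry_on))
    if dont_retry_on is not None and isinstance(exc, tuple(dont_retry_on)):
        # Blacklisted types go straight to the DLQ.
        return False
    # Default: every exception is retried while attempts remain.
    return True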
Retry Flow¶
flowchart TD
A["Task Execution"] --> B{Success?}
B -->|Yes| C["Status: Complete<br/>Store result"]
B -->|No| D["Record error in<br/>job_errors table"]
D --> SR{"Exception passes<br/>retry_on / dont_retry_on?"}
SR -->|No| I["Move to Dead Letter Queue<br/>Status: Dead"]
SR -->|Yes| E{"retry_count < max_retries?"}
E -->|Yes| F["Calculate backoff delay"]
F --> G["Status: Pending<br/>retry_count += 1"]
G --> H["Wait for scheduled time"]
H --> A
E -->|No| I
Dead Letter Queue¶
Jobs that exhaust all retries are moved to the DLQ for inspection and manual replay.
Inspect Dead Letters¶
# List the 10 most recent dead letters
dead = queue.dead_letters(limit=10, offset=0)

for d in dead:
    print(f"Job: {d['original_job_id']}")
    print(f"Task: {d['task_name']}")
    print(f"Error: {d['error']}")
    print(f"Retries: {d['retry_count']}")
    print()
Replay Dead Letters¶
Config preservation
Replayed jobs preserve the original job's priority, max_retries, timeout, and result_ttl settings. You don't need to re-specify them — the DLQ stores the full configuration.
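A hypothetical replay loop is sketched below. The replay_dead() method name is an assumption, not confirmed API (check the API reference for the actual call); dead_letters() and its fields come from the inspection example above.

# Hypothetical sketch -- replay_dead() is an assumed method name, not confirmed API.
for d in queue.dead_letters(limit=10, offset=0):
    queue.replay_dead(d["original_job_id"])  # re-enqueues with the original job's config
    print(f"Replayed {d['original_job_id']} ({d['task_name']})")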
Purge Old Dead Letters¶
# Delete dead letters older than 24 hours
deleted = queue.purge_dead(older_than=86400)
print(f"Purged {deleted} dead letter(s)")
Error History¶
Every failed attempt is recorded with the error message. Access the full history via job.errors:
@queue.task(max_retries=3)
def unreliable():
    raise ConnectionError("timeout")

job = unreliable.delay()

# After the job fails and retries...
for error in job.errors:
    print(f"Attempt {error['attempt']}: {error['error']}")
# Attempt 0: timeout
# Attempt 1: timeout
# Attempt 2: timeout
Each error entry contains:
| Field | Type | Description |
|---|---|---|
| id | str | Unique error record ID |
| job_id | str | The job this error belongs to |
| attempt | int | Attempt number (0-indexed) |
| error | str | Error message |
| failed_at | int | Timestamp in milliseconds |
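Because failed_at is a millisecond timestamp, converting it for display takes one extra step; a short sketch using the fields above:

from datetime import datetime, timezone

for error in job.errors:
    failed = datetime.fromtimestamp(error["failed_at"] / 1000, tz=timezone.utc)
    print(f"Attempt {error['attempt']} failed at {failed.isoformat()}: {error['error']}")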
Timeout Reaping¶
If a task exceeds its timeout, the scheduler automatically detects it (checking every ~5 seconds) and treats it as a failure — triggering the retry/DLQ logic.
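As a hedged example, assuming the timeout is set per task in seconds like the other parameters on this page (confirm the exact parameter in the API reference):

import time

@queue.task(max_retries=3, timeout=10)  # assumption: per-task timeout in seconds
def slow_job():
    time.sleep(60)  # exceeds the timeout; the reaper marks the run failed
                    # on its next ~5s sweep and the normal retry/DLQ flow applies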