# Bull: Federation Background Job Queue

## Overview
The bull module provides a Redis-backed background job processing system for the federation layer. It prevents long-running network operations — such as delivering activity payloads to remote servers or probing server health — from blocking the main request/response cycle.
Built on BullMQ, it exposes two queues and their corresponding workers:
- Federation delivery queue: encrypts and delivers activity payloads (follows, posts, inserts, unfollows) to remote servers, processes acknowledgment responses, and performs automatic cleanup of delivery records.
- Health-check queue: periodically probes servers marked as unhealthy, re-classifying them as healthy when they respond successfully, with an increasing delay between attempts and a configurable retry limit.
Workers are designed to be started once at application bootstrap and run for the lifetime of the process.
## Interfaces and Types

### `FederationDeliveryJob`

```typescript
interface FederationDeliveryJob {
  deliveryJobId: string;
  targetUrl: string;
  serverUrl: string;
  payload: string;
}
```
Defines the data contract for a federation delivery job.

| Field | Type | Description |
|---|---|---|
| `deliveryJobId` | `string` | Primary key of the corresponding row in the `deliveryJobs` database table. |
| `targetUrl` | `string` | Full URL of the remote server's federation inbox endpoint. |
| `serverUrl` | `string` | Origin URL of the target server (used for registry lookups and blacklisting). |
| `payload` | `string` | Serialized JSON string containing the activity method and associated data. |

### `HealthCheckJob`

```typescript
interface HealthCheckJob {
  serverUrl: string;
}
```
Defines the data contract for a health-check job.

| Field | Type | Description |
|---|---|---|
| `serverUrl` | `string` | The remote server URL to probe for health. |

## Exported Symbols and Functions

### `DELIVERY_QUEUE_NAME`

```typescript
const DELIVERY_QUEUE_NAME = 'federation-delivery';
```

The Redis queue name for federation delivery jobs.

### `HEALTH_CHECK_QUEUE_NAME`

```typescript
const HEALTH_CHECK_QUEUE_NAME = 'federation-health-check';
```

The Redis queue name for health-check jobs.
### `getRedisConnection()`

```typescript
function getRedisConnection(): Redis
```

Returns a singleton Redis connection (via `ioredis`) configured with `maxRetriesPerRequest: null`, as required by BullMQ. The connection URL is read from the `REDIS_URL` environment variable. Intended for queue producers (`getFederationQueue`, `getHealthCheckQueue`).

Throws if `REDIS_URL` is not set.
### `getRedisWorkerConnection()`

```typescript
function getRedisWorkerConnection(): Redis
```

Returns a separate singleton Redis connection dedicated to BullMQ `Worker` instances. Keeping worker connections distinct from producer connections prevents back-pressure on queue-enqueue operations when workers are under high load.

Throws if `REDIS_URL` is not set.
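The producer/worker split can be pictured as two independent lazy singletons that read the same `REDIS_URL`. This is a minimal sketch, not the module's code: `makeConnectionGetter`, `ConnectionLike`, and `createClient` are hypothetical names, and `createClient` stands in for constructing the `ioredis` client so the example runs without Redis.

```typescript
// Hypothetical stand-in for an ioredis client; the real module uses ioredis.
interface ConnectionLike {
  readonly url: string;
}

type Env = Record<string, string | undefined>;

// Builds a getter that creates one connection on first call and returns
// that same instance on every later call (a module-level singleton).
function makeConnectionGetter<T>(create: (url: string) => T, env: Env): () => T {
  let instance: T | undefined;
  return () => {
    if (instance === undefined) {
      const url = env.REDIS_URL;
      if (!url) {
        throw new Error("REDIS_URL is not set");
      }
      instance = create(url);
    }
    return instance;
  };
}

// Two independent getters: one for queue producers, one for workers.
const env: Env = { REDIS_URL: "redis://localhost:6379" };
const createClient = (url: string): ConnectionLike => ({ url });
const getProducerConnection = makeConnectionGetter(createClient, env);
const getWorkerConnection = makeConnectionGetter(createClient, env);
```

Each getter memoizes its own instance, so producers and workers never share a socket even though both are configured from the same environment variable.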
### `getFederationQueue()`

```typescript
function getFederationQueue(): Queue<FederationDeliveryJob>
```

Returns a singleton `Queue<FederationDeliveryJob>` instance backed by the `federation-delivery` queue.
Default job options:

| Option | Value | Rationale |
|---|---|---|
| `attempts` | `5` | Up to 5 total attempts (the initial run plus 4 retries) before the job is marked as failed. |
| `backoff` | exponential, 5s | The delay doubles on each retry: 5s, 10s, 20s, 40s. |
| `removeOnComplete` | `{ age: 86400 }` | Completed jobs are pruned after 24 hours (86,400 s). |
| `removeOnFail` | `{ age: 604800 }` | Failed jobs are retained for 7 days (604,800 s) for diagnostics. |

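BullMQ's built-in `exponential` backoff (without jitter) waits `delay * 2^(n - 1)` before retry *n*, which is where the 5s, 10s, 20s, 40s schedule comes from. The sketch below restates the table as an options object and reproduces that arithmetic; `retryDelays` is a hypothetical helper, and the module's actual options literal may be written differently.

```typescript
// Mirrors the default job options table; the module's literal may differ.
const defaultJobOptions = {
  attempts: 5, // total attempts: 1 initial run + 4 retries
  backoff: { type: "exponential" as const, delay: 5_000 }, // base delay in ms
  removeOnComplete: { age: 86_400 }, // seconds (24 h)
  removeOnFail: { age: 604_800 }, // seconds (7 days)
};

// Delay in ms before each retry under exponential backoff without jitter:
// retry n (1-based) waits baseDelayMs * 2^(n - 1).
function retryDelays(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(Math.round(2 ** (retry - 1) * baseDelayMs));
  }
  return delays;
}

const schedule = retryDelays(defaultJobOptions.attempts, defaultJobOptions.backoff.delay);
console.log(schedule.map((ms) => `${ms / 1000}s`).join(", "));
// prints "5s, 10s, 20s, 40s"
```

Because `attempts` counts the first run, five attempts yield only four backoff delays.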
### `getHealthCheckQueue()`

```typescript
function getHealthCheckQueue(): Queue<HealthCheckJob>
```

Returns a singleton `Queue<HealthCheckJob>` instance backed by the `federation-health-check` queue. No custom default job options are applied.
### `scheduleHealthCheck()`

```typescript
function scheduleHealthCheck(serverUrl: string, attempt: number): Promise<void>
```
Schedules a delayed health-check job for a remote server.
Parameters:

| Parameter | Type | Description |
|---|---|---|
| `serverUrl` | `string` | The remote server URL to check. |
| `attempt` | `number` | Zero-based attempt counter; used to compute the delay and job ID. |

Internal logic:
- Computes the delay as `(5 + attempt * 10)` minutes, i.e. 5, 15, and 25 minutes for attempts 0, 1, and 2.
- Derives a deterministic job ID from the server URL using SHA-256 (first 16 hex characters), which avoids collisions between URLs that differ only in non-alphanumeric characters.
- Adds a single-shot job (auto-removed on completion or failure) to the health-check queue.

Returns: `Promise<void>`
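The delay formula and the hashed job ID can be sketched with Node's built-in `crypto` module. This is a minimal sketch under assumptions: `healthCheckDelayMs`, `healthCheckJobId`, and the `health-<hash>-<attempt>` ID layout are hypothetical, since this README only specifies SHA-256 truncated to 16 hex characters and says the attempt counter feeds into the job ID.

```typescript
import { createHash } from "node:crypto";

const MINUTE_MS = 60_000;

// Delay before health-check attempt `attempt` (zero-based): (5 + attempt * 10) minutes.
function healthCheckDelayMs(attempt: number): number {
  return (5 + attempt * 10) * MINUTE_MS;
}

// Deterministic job ID from the first 16 hex chars of SHA-256(serverUrl).
// The "health-<hash>-<attempt>" layout is an assumption for illustration.
function healthCheckJobId(serverUrl: string, attempt: number): string {
  const hash = createHash("sha256").update(serverUrl).digest("hex").slice(0, 16);
  return `health-${hash}-${attempt}`;
}

// Stripping non-alphanumerics would map both URLs to "httpsaexamplecom";
// hashing keeps their job IDs distinct.
const a = healthCheckJobId("https://a.example.com", 0);
const b = healthCheckJobId("https://a-example.com", 0);
console.log(a !== b); // true
console.log(healthCheckDelayMs(0) / MINUTE_MS, healthCheckDelayMs(1) / MINUTE_MS); // 5 15
```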
### `startFederationWorker()`

```typescript
function startFederationWorker(): {
  deliveryWorker: Worker<FederationDeliveryJob>;
  healthCheckWorker: Worker<HealthCheckJob>;
}
```

Creates and returns a pair of BullMQ workers that process the federation delivery and health-check queues. This function is idempotent: subsequent calls return the same worker instances.

Workers use a dedicated Redis connection via `getRedisWorkerConnection()`, separate from the connection used by queue producers (`getRedisConnection()`). This keeps heavy worker traffic from delaying the enqueue operations that producers issue on their own connection.
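Idempotency of this kind is typically a module-level memo: the first call builds both workers and caches them, and later calls return the cache. The sketch assumes a hypothetical `createWorkers` factory in place of the two `new Worker(...)` calls, and `startFederationWorkerSketch` is an illustrative name, so it runs without Redis.

```typescript
// Hypothetical worker shape; the real module returns BullMQ Worker instances.
interface WorkerPair {
  deliveryWorker: { queue: string; concurrency: number };
  healthCheckWorker: { queue: string; concurrency: number };
}

let createdCount = 0;

// Stand-in for constructing the two BullMQ workers on the worker connection.
function createWorkers(): WorkerPair {
  createdCount++;
  return {
    deliveryWorker: { queue: "federation-delivery", concurrency: 10 },
    healthCheckWorker: { queue: "federation-health-check", concurrency: 3 },
  };
}

let workers: WorkerPair | undefined;

// Idempotent starter: builds the workers once, then returns the cached pair.
function startFederationWorkerSketch(): WorkerPair {
  if (workers === undefined) {
    workers = createWorkers();
  }
  return workers;
}

const first = startFederationWorkerSketch();
const second = startFederationWorkerSketch();
console.log(first === second, createdCount); // true 1
```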
Delivery worker configuration:

| Option | Value |
|---|---|
| `concurrency` | `10` |

Health-check worker configuration:

| Option | Value |
|---|---|
| `concurrency` | `3` |

Lifecycle events:

| Worker | Event | Behavior |
|---|---|---|
| Delivery | `ready` | Logs a confirmation that the worker is connected to Redis. |
| Delivery | `failed` | Logs the job ID, method, target URL, attempt count, remaining retries, and error. |
| Delivery | `completed` | Deletes the corresponding `deliveryJobs` database row. |
| Delivery | `error` | Logs a generic worker-level error to the console. |
| Health-check | `ready` | Logs a confirmation that the worker is connected to Redis. |
| Health-check | `failed` | Logs the job ID and error message. |
| Health-check | `error` | Logs a generic worker-level error to the console. |

Returns: `{ deliveryWorker, healthCheckWorker }`
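The delivery worker's lifecycle handlers can be sketched against a plain Node `EventEmitter`, which stands in for the BullMQ `Worker` here (BullMQ emits `completed` with the job and `failed` with the job, possibly undefined, and the error). `JobLike`, `deleteDeliveryJobRow`, `wireDeliveryEvents`, and the log wording are hypothetical, and the sketch logs a subset of the fields the real `failed` handler reports.

```typescript
import { EventEmitter } from "node:events";

// Subset of a BullMQ Job used by these handlers (hypothetical shape).
interface JobLike {
  id?: string;
  attemptsMade: number;
  opts: { attempts?: number };
  data: { deliveryJobId: string; targetUrl: string };
}

const deletedRows: string[] = [];
const logLines: string[] = [];

// Hypothetical stand-in for deleting the deliveryJobs row by primary key.
async function deleteDeliveryJobRow(deliveryJobId: string): Promise<void> {
  deletedRows.push(deliveryJobId);
}

// Attaches lifecycle listeners to a worker-like event emitter.
function wireDeliveryEvents(worker: EventEmitter): void {
  worker.on("ready", () => {
    logLines.push("delivery worker connected to Redis");
  });
  worker.on("completed", (job: JobLike) => {
    // Clean up the tracking row once delivery has succeeded.
    deleteDeliveryJobRow(job.data.deliveryJobId).catch((err: unknown) => {
      logLines.push(`cleanup failed: ${String(err)}`);
    });
  });
  worker.on("failed", (job: JobLike | undefined, err: Error) => {
    if (!job) return; // BullMQ can emit "failed" without a job reference
    const remaining = (job.opts.attempts ?? 1) - job.attemptsMade;
    logLines.push(
      `job ${job.id} -> ${job.data.targetUrl} failed on attempt ${job.attemptsMade} (${remaining} left): ${err.message}`,
    );
  });
  worker.on("error", (err: Error) => {
    logLines.push(`delivery worker error: ${err.message}`);
  });
}

// Usage with a plain EventEmitter in place of a BullMQ Worker.
const worker = new EventEmitter();
wireDeliveryEvents(worker);
const job: JobLike = {
  id: "1",
  attemptsMade: 2,
  opts: { attempts: 5 },
  data: { deliveryJobId: "abc-123", targetUrl: "https://remote.example.com/inbox" },
};
worker.emit("failed", job, new Error("HTTP 503"));
worker.emit("completed", job);
```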
### `processFederationDelivery(job)`

```typescript
function processFederationDelivery(job: Job<FederationDeliveryJob>): Promise<void>
```
The core processor for federation delivery jobs. Executed by the delivery worker for each queue entry.
Processing steps:
- Method validation: Parses the `payload` JSON and validates that the result is an object with a string-typed `method` field. This guards against JSON values such as `null`, `42`, or `"str"`, which pass `JSON.parse` but are not objects (reading `.method` on `null` throws a `TypeError`). If `method` is missing, non-string, or not one of `FEDERATE`, `FEDERATE_POST`, `INSERT`, `UNFOLLOW`, the job fails immediately with an `UnrecoverableError` and its `deliveryJobs` row is deleted.
- Blacklist check: Queries the `blacklistedServers` table. If the `serverUrl` is blacklisted, the job is dropped with an `UnrecoverableError` and the row is cleaned up.
- Key resolution: Looks up the target server in the `serverRegistry` table. If the server is not yet registered, automatic discovery is attempted via `discoverAndRegister()`. If discovery fails, a retryable error is thrown.
- Encryption: Encrypts the payload with the target server's `encryptionPublicKey` (base64-decoded into a `Uint8Array`) via `encryptPayload()`.
- Database update: Sets `lastAttemptedAt` and increments `attempts` on the delivery job record.
- HTTP delivery: Validates that `BETTER_AUTH_URL` is set (throws `UnrecoverableError` if missing). Signs the original plaintext payload with the local server's signing key and sends the encrypted payload via `federationFetch()` with a 15-second timeout and proxy fallback. A non-OK response throws a retryable error.
- Ack parsing: Attempts to parse the response body as JSON (throws `UnrecoverableError` on a non-JSON response), then looks for a `PROXY_RESPONSE` acknowledgment nested under `responseBody.payload`. If no acknowledgment is present, the job fails with `UnrecoverableError`.
- Ack dispatch: Routes the acknowledgment to a job-name-specific handler (e.g. `deliver-follow` → `handleFollowAck`). If no handler is registered, the ack is silently ignored.
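The method-validation step can be sketched as a parse followed by a type guard. `UnrecoverableError` is defined locally as a stand-in for the class BullMQ exports, and `parseDeliveryMethod` and `isDeliveryMethod` are hypothetical names; the allowed set is the one listed above.

```typescript
// Local stand-in for BullMQ's UnrecoverableError (import it from "bullmq" in real code).
class UnrecoverableError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "UnrecoverableError";
  }
}

const ALLOWED_METHODS = ["FEDERATE", "FEDERATE_POST", "INSERT", "UNFOLLOW"] as const;
type DeliveryMethod = (typeof ALLOWED_METHODS)[number];

function isDeliveryMethod(value: string): value is DeliveryMethod {
  return (ALLOWED_METHODS as readonly string[]).includes(value);
}

// Parses the serialized payload and returns its method, or throws UnrecoverableError.
function parseDeliveryMethod(payload: string): DeliveryMethod {
  let parsed: unknown;
  try {
    parsed = JSON.parse(payload);
  } catch {
    throw new UnrecoverableError("Malformed payload JSON");
  }
  // Rejects null and non-object JSON values (numbers, strings, booleans)
  // before any property access.
  if (typeof parsed !== "object" || parsed === null) {
    throw new UnrecoverableError("Payload is not a JSON object");
  }
  const method = (parsed as { method?: unknown }).method;
  if (typeof method !== "string" || !isDeliveryMethod(method)) {
    throw new UnrecoverableError(`Invalid method: ${String(method)}`);
  }
  return method;
}
```

For example, `parseDeliveryMethod('{"method":"FEDERATE"}')` returns `"FEDERATE"`, while `"null"`, `"42"`, and `'{"method":"DELETE"}'` all throw.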
Throws:

- `UnrecoverableError`: malformed payload, missing or non-string method, invalid method, blacklisted server, missing `BETTER_AUTH_URL`, non-JSON response, or missing acknowledgment.
- `Error`: auto-discovery failure, HTTP delivery failure, or network/fetch error (retryable by BullMQ).
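Ack parsing and dispatch amount to extracting the ack from the response and looking up a handler by BullMQ job name. In this sketch the ack is assumed to be an object with `method: 'PROXY_RESPONSE'` under `responseBody.payload`; that field layout, the `AckPayload` shape, `AckHandler`, `extractAck`, and `dispatchAck` are assumptions for illustration, and the handler signature is simplified. Only the `deliver-follow` → `handleFollowAck` mapping comes from this README.

```typescript
// Local stand-in for BullMQ's UnrecoverableError.
class UnrecoverableError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "UnrecoverableError";
  }
}

// Assumed ack shape; the real AckPayload type may carry more fields.
interface AckPayload {
  method: "PROXY_RESPONSE";
  signature?: string;
  data?: unknown;
}

// Simplified handler signature (the real handlers take more arguments).
type AckHandler = (ack: AckPayload, serverUrl: string) => Promise<void>;

// Parses the HTTP response text and pulls the PROXY_RESPONSE ack out of it.
function extractAck(responseText: string): AckPayload {
  let responseBody: unknown;
  try {
    responseBody = JSON.parse(responseText);
  } catch {
    throw new UnrecoverableError("Non-JSON response from remote");
  }
  const payload = (responseBody as { payload?: unknown } | null)?.payload;
  if (
    typeof payload !== "object" ||
    payload === null ||
    (payload as { method?: unknown }).method !== "PROXY_RESPONSE"
  ) {
    throw new UnrecoverableError("Missing PROXY_RESPONSE acknowledgment");
  }
  return payload as AckPayload;
}

const handled: string[] = [];

// Handler table keyed by job name; a real table would map to handleFollowAck.
const ackHandlers: Record<string, AckHandler | undefined> = {
  "deliver-follow": async (_ack, serverUrl) => {
    handled.push(`follow ack from ${serverUrl}`);
  },
};

async function dispatchAck(jobName: string, ack: AckPayload, serverUrl: string): Promise<void> {
  const handler = ackHandlers[jobName];
  if (!handler) return; // no handler registered: the ack is silently ignored
  await handler(ack, serverUrl);
}
```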
### `processHealthCheck(job)`

```typescript
function processHealthCheck(job: Job<HealthCheckJob>): Promise<void>
```
The core processor for health-check jobs. Executed by the health-check worker.
Processing steps:
- Server lookup: Queries the `serverRegistry` table. If the server is not found or is already marked healthy, the job exits early.
- Threat-policy check: If the server has an `unhealthyReason`, the corresponding threat policy is consulted via `getThreatPolicy()`. If that policy does not mark the reason as `directHealthCheckable`, the job skips further processing.
- Probe: Sends an HTTP `GET` to `<serverUrl>/discover` with an 8-second timeout.
- Success: If the response is OK, the server is marked healthy via `markServerHealthy()` and the job completes.
- Failure: On an HTTP error or network exception, the attempt counter is incremented atomically in the database with a `sql` fragment (`healthCheckAttempts + 1`), avoiding read-modify-write races between concurrent worker instances. If fewer than `MAX_HEALTH_CHECK_ATTEMPTS` (5) attempts have been made, a follow-up health-check job is scheduled with a longer delay. Once the limit is reached, a warning is logged and no further checks are scheduled.

Returns: `Promise<void>`
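The branching in `processHealthCheck` can be sketched as two pure decision functions, which keeps the probe and the database writes out of the logic. `ServerRow`, `ThreatPolicy`, `shouldProbe`, and `afterProbe` are hypothetical names, the `getThreatPolicy` signature is assumed, and `attemptsAfterIncrement` stands for the counter value after the atomic `healthCheckAttempts + 1` update.

```typescript
const MAX_HEALTH_CHECK_ATTEMPTS = 5;

// Hypothetical subset of a serverRegistry row.
interface ServerRow {
  healthy: boolean;
  unhealthyReason?: string;
}

// Assumed shape of the policy returned by getThreatPolicy().
interface ThreatPolicy {
  directHealthCheckable: boolean;
}

type HealthCheckAction = { kind: "skip"; reason: string } | { kind: "probe" };

// Server lookup and threat-policy check: decide whether to probe at all.
function shouldProbe(
  server: ServerRow | undefined,
  getThreatPolicy: (reason: string) => ThreatPolicy,
): HealthCheckAction {
  if (!server) return { kind: "skip", reason: "not in registry" };
  if (server.healthy) return { kind: "skip", reason: "already healthy" };
  if (server.unhealthyReason && !getThreatPolicy(server.unhealthyReason).directHealthCheckable) {
    return { kind: "skip", reason: "policy forbids direct health checks" };
  }
  return { kind: "probe" };
}

type ProbeOutcome = "mark-healthy" | "reschedule" | "give-up";

// Success and failure handling after the GET <serverUrl>/discover probe.
function afterProbe(probeOk: boolean, attemptsAfterIncrement: number): ProbeOutcome {
  if (probeOk) return "mark-healthy";
  return attemptsAfterIncrement < MAX_HEALTH_CHECK_ATTEMPTS ? "reschedule" : "give-up";
}
```

With the limit of 5, a fourth consecutive failure still reschedules, and the fifth stops the checks with a warning.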
### `handleFollowAck(ackPayload, serverUrl, cachedServerPublicKey, deliveryJobId, jobId)`

```typescript
function handleFollowAck(
  ackPayload: AckPayload,
  serverUrl: string,
  cachedServerPublicKey: string | undefined,
  deliveryJobId: string,
  jobId: string | undefined,
): Promise<void>
```

Processes the acknowledgment (`PROXY_RESPONSE`) for a `deliver-follow` job.
Parameters:

| Parameter | Type | Description |
|---|---|---|
| `ackPayload` | `AckPayload` | The acknowledgment payload containing the signature and decrypted data. |
| `serverUrl` | `string` | Origin URL of the remote server. |
| `cachedServerPublicKey` | `string \| undefined` | The server's signing public key, if already known from the registry at delivery time. |
| `deliveryJobId` | `string` | ID of the delivery job record, used for cleanup. |
| `jobId` | `string \| undefined` | BullMQ job ID, used for diagnostic logging. |

Internal logic:
- Parses the decrypted payload against `FollowEnvelopeSchema`. Invalid payloads cause an `UnrecoverableError` and delete the delivery job record.
- Resolves the remote server's signing public key, querying the database if it was not cached during the delivery phase. If no key is available, an `UnrecoverableError` is thrown.
- Verifies the cryptographic signature on the acknowledgment. A failed signature check throws `UnrecoverableError`.
- Looks up the local `follows` row matching `followerId`, `followingId`, and `followerServerUrl`. If no matching row exists, the ack is silently ignored (the remote acknowledged a follow this node does not know about).
- If the remote `accepted` is explicitly `false`, the local follow record is updated with `acknowledged: true` (the remote explicitly rejected the follow).
- If the remote `accepted` is explicitly `true`, the local `accepted` column is updated to `true`.
- If `accepted` is `undefined`/`null`, the local follow record is updated with `acknowledged: true` only (the remote acknowledged receipt without indicating an acceptance state).

Throws:

- `UnrecoverableError`: invalid follow payload, missing signing public key, or signature verification failure.
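The three `accepted` branches reduce to choosing which column to update on the matching `follows` row. The sketch below is a hypothetical pure helper (`followUpdateFor`, `FollowAck`, and `FollowUpdate` are illustrative names); the row lookup on `followerId`, `followingId`, and `followerServerUrl` is represented by the `localRowExists` flag, and `null` means the ack is ignored.

```typescript
// Only the field this decision depends on (hypothetical shape).
interface FollowAck {
  accepted?: boolean | null;
}

// Column update applied to the local follows row.
type FollowUpdate = { acknowledged: true } | { accepted: true };

// Maps the remote's `accepted` value to the local update, or returns null
// when no matching local follow exists (the ack is then silently ignored).
function followUpdateFor(ack: FollowAck, localRowExists: boolean): FollowUpdate | null {
  if (!localRowExists) return null;
  if (ack.accepted === true) return { accepted: true }; // remote accepted the follow
  // Explicit rejection (false) and no decision (undefined/null) both only record receipt.
  return { acknowledged: true };
}
```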
## Usage Example

```typescript
// app/bootstrap.ts
import { startFederationWorker, getFederationQueue, scheduleHealthCheck } from '@/lib/bull';

// ────────────────────────────────────────────
// Start workers at application bootstrap
// ────────────────────────────────────────────
const workers = startFederationWorker();
// workers.deliveryWorker: processes the federation-delivery queue
// workers.healthCheckWorker: processes the federation-health-check queue

// ────────────────────────────────────────────
// Enqueue a federation delivery job
// ────────────────────────────────────────────
const queue = getFederationQueue();
await queue.add('deliver-follow', {
  deliveryJobId: 'abc-123',
  targetUrl: 'https://remote.example.com/inbox',
  serverUrl: 'https://remote.example.com',
  payload: JSON.stringify({
    method: 'FEDERATE',
    // ... activity data
  }),
});

// ────────────────────────────────────────────
// Schedule a delayed health check
// ────────────────────────────────────────────
await scheduleHealthCheck('https://remote.example.com', 0);
// Attempt 0 runs in ~5 minutes; attempt n waits (5 + n * 10) minutes.
```
## Error Handling

### Unrecoverable Errors (BullMQ `UnrecoverableError`)

Jobs that throw `UnrecoverableError` are immediately marked as failed and are not retried, even if the queue's `attempts` option is greater than 1.

| Scenario | Thrown From | Description |
|---|---|---|
| Malformed payload JSON | `processFederationDelivery` | The job payload cannot be parsed as valid JSON. |
| Missing or non-string method | `processFederationDelivery` | The `method` field is missing, not a string, or not in the allowed set. |
| Blacklisted target server | `processFederationDelivery` | The target server is in the `blacklistedServers` table. |
| Missing `BETTER_AUTH_URL` | `processFederationDelivery` | The environment variable is not set, so federation requests cannot be sent. |
| Non-JSON response from remote | `processFederationDelivery` | The remote returned a 200 OK with a non-JSON body. |
| Missing acknowledgment | `processFederationDelivery` | The remote response does not contain a `PROXY_RESPONSE` payload. |
| Invalid follow ack payload | `handleFollowAck` | The decrypted payload fails `FollowEnvelopeSchema` validation. |
| Missing signing public key | `handleFollowAck` | The server has no `publicKey` in the registry to verify the ack signature. |
| Signature verification failure | `handleFollowAck` | The cryptographic signature on the ack does not match. |

### Retryable Errors

Jobs that throw a regular `Error` are returned to the queue and retried according to the queue's `backoff` configuration until the `attempts` limit is reached.

| Scenario | Thrown From | Description |
|---|---|---|
| Auto-discovery failure | `processFederationDelivery` | The server is not in the registry and `discoverAndRegister` throws an error other than `DiscoveryError`. |
| HTTP delivery failure | `processFederationDelivery` | The remote endpoint returns a non-OK HTTP status code. |
| Network / fetch error | `processFederationDelivery` | `federationFetch` throws due to a timeout, DNS failure, etc. |

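Taken together, the two tables reduce to a small rule for whether another attempt is scheduled after a failure. The sketch restates it with a local `UnrecoverableError` stand-in and a hypothetical `willRetry` helper; `attemptsMade` counts attempts including the one that just failed, and `maxAttempts` is the queue's `attempts` option (5 for the delivery queue).

```typescript
// Local stand-in for BullMQ's UnrecoverableError.
class UnrecoverableError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "UnrecoverableError";
  }
}

// True if another attempt would be scheduled after this failure.
// Checks the error name so the sketch also works when classes are transpiled to ES5.
function willRetry(err: Error, attemptsMade: number, maxAttempts: number): boolean {
  if (err.name === "UnrecoverableError") return false; // never retried
  return attemptsMade < maxAttempts; // regular Error: retry until attempts are used up
}
```

For example, a blacklisted server fails on the first attempt with no retry, while an HTTP 503 is retried until the fifth attempt fails.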
### Silent Skips (No Error)

| Scenario | Location | Description |
|---|---|---|
| Unhealthy reason not checkable | `processHealthCheck` | The server's threat policy forbids direct health checks. |
| Server already healthy | `processHealthCheck` | The server is already marked healthy in the registry. |
| Server not in registry | `processHealthCheck` | The server was removed or never registered. |
| Unknown follow ack | `handleFollowAck` | The local `follows` table has no matching row for the acknowledged follow. |

### Worker-Level Errors
Worker-level errors (e.g. Redis connection loss) are emitted via the worker's error event and logged to the console. These do not affect individual jobs; BullMQ will re-establish the connection automatically.