# Self-Hosted Issues
This page covers runtime symptoms specific to self-hosted Flagsmith deployments. For initial setup problems (health checks, database connectivity, frontend DNS), see the deployment troubleshooting guide.
## Task processor is not running jobs
Tasks are queueing up but never being processed. Symptoms include: webhooks not firing, audit logs not being written, or analytics data not appearing.
### Common causes
- Task processor container is not running. The task processor is a separate service that must be started alongside the API. Check that a container with the `run-task-processor` command is running (`docker ps` or your orchestrator's pod list); a quick check is sketched after this list.
- `TASK_RUN_METHOD` not set to `TASK_PROCESSOR`. If this environment variable is not set on the API container, Flagsmith runs tasks in an unmanaged background thread inside the API process instead of sending them to the processor. The processor will have nothing to pick up.
- Database connectivity from the processor. The task processor must be able to reach the same database as the API (or a dedicated task processor database if you have configured one). Check `DATABASE_URL` and `TASK_PROCESSOR_DATABASE_URL`.
- Sleep interval too high. The `TASK_PROCESSOR_SLEEP_INTERVAL_MS` environment variable controls how often each worker thread checks for new tasks. The default is 500 ms. If this has been raised significantly, tasks will appear to be delayed.
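A quick way to check the first two causes from the Docker host, as a sketch; the container name `flagsmith-api` is a placeholder, so substitute the names from your own Compose file or manifests.

```bash
# Is a task processor container running? Look for the run-task-processor command.
docker ps --format '{{.Names}}\t{{.Command}}' | grep run-task-processor

# Is the API container configured to hand tasks to the processor?
# ("flagsmith-api" is a placeholder for your API container name.)
docker exec flagsmith-api env | grep TASK_RUN_METHOD
# Expected output: TASK_RUN_METHOD=TASK_PROCESSOR
```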
### Steps to resolve
- Verify the task processor container is running and check its logs for errors.
- Confirm that `TASK_RUN_METHOD=TASK_PROCESSOR` is set on the API container.
- Check that `DATABASE_URL` (and `TASK_PROCESSOR_DATABASE_URL` if using a separate database) is correct and reachable from the processor container.
- Review the processor configuration:
| Environment variable | Default | Description |
|---|---|---|
| `TASK_PROCESSOR_SLEEP_INTERVAL_MS` | 500 | Milliseconds between polling for new tasks |
| `TASK_PROCESSOR_NUM_THREADS` | 5 | Worker threads per processor instance |
| `TASK_PROCESSOR_GRACE_PERIOD_MS` | 20000 | Time before a task is considered stuck |
| `TASK_PROCESSOR_QUEUE_POP_SIZE` | 10 | Tasks retrieved per polling iteration |
- Check the monitoring endpoint at `GET /processor/monitoring`. It returns the number of tasks waiting in the queue. A consistently growing number indicates the processor is not keeping up; an example request is sketched below.
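As an illustration, polling the monitoring endpoint with `curl`, assuming the API is listening on `localhost:8000`; the exact response body may vary between releases.

```bash
# Watch the task queue size over time; a steadily growing value means the
# processor is falling behind.
for i in 1 2 3; do
  curl -s http://localhost:8000/processor/monitoring
  echo
  sleep 5
done
```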
Related documentation: Asynchronous Task Processor
## Database migration failures on upgrade
After upgrading the Flagsmith API image, the container fails to start with a migration error.
### Common causes
- Skipped versions. Flagsmith migrations are designed to be applied sequentially. If you jump from a much older version to the latest, an intermediate migration may fail because it expects a schema state that was never reached.
- Concurrent migration attempts. If multiple API containers start simultaneously and all attempt to run migrations, they can deadlock or conflict. Ensure only one container runs migrations at a time (use an init container or a separate migration job; a sketch follows this list).
- Insufficient database permissions. The database user must have permission to create, alter, and drop tables and indexes. Read-only replicas will always fail migrations.
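One way to run migrations from a single container before the API scales out, sketched for Docker Compose. The service names (`api`, `db`) and the ability to override the container command with `python manage.py migrate` are assumptions about your setup; on Kubernetes, an init container or a one-off Job plays the same role.

```bash
# Start only the database, apply migrations once, then start the API replicas.
docker compose up -d db
docker compose run --rm api python manage.py migrate
docker compose up -d api
```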
### Steps to resolve
- Read the full traceback in the container logs to identify which migration failed and why.
- If you skipped versions, consider upgrading incrementally through intermediate releases.
- If you need to roll back, follow the rollback procedure. For versions v2.151.0 and later, use:
```bash
python manage.py rollback_migrations_applied_after "<datetime of previous deployment>"
```
- If concurrent containers caused a conflict, restart with a single replica, let migrations complete, then scale back up.
Rolling back migrations may result in data loss if new models or fields were added. Always take a full database backup before attempting a rollback.
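For example, a full logical backup with `pg_dump`, assuming a PostgreSQL database; the connection string is a placeholder for your own credentials and host.

```bash
# Take a restorable backup before any rollback attempt.
pg_dump "postgresql://flagsmith:password@db-host:5432/flagsmith" \
  --format=custom \
  --file="flagsmith-pre-rollback-$(date +%Y%m%d%H%M).dump"
```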
Related documentation: Upgrades and Rollbacks
## Intermittent 502s from the API container
The API returns 502 Bad Gateway sporadically. The container is running and most requests succeed.
### Common causes
- Worker processes crashing. Flagsmith's API runs behind Gunicorn. If a worker runs out of memory or hits an unhandled exception, Gunicorn kills and restarts it. Requests in flight during the restart receive a 502 from the reverse proxy.
- Too few workers. The default Gunicorn worker count may not be enough for your traffic. If all workers are busy, new connections queue at the proxy and may time out.
- Request timeout mismatch. If Gunicorn's `--timeout` is longer than your reverse proxy's upstream timeout, the proxy will cut the connection before Gunicorn does, resulting in a 502.
- Database connection exhaustion. If the API and task processor share a connection pool and traffic spikes, the database may reject new connections. This typically shows as a `502` to the client and an `OperationalError: connection to server ...` in the API logs; a quick connection-count check is sketched after this list.
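To check for connection exhaustion, one option is comparing open connections against the server limit, assuming PostgreSQL; the connection string below is a placeholder.

```bash
# How close are we to the connection limit?
psql "postgresql://flagsmith:password@db-host:5432/flagsmith" -c \
  "SELECT count(*) AS open_connections,
          current_setting('max_connections') AS max_connections
   FROM pg_stat_activity;"
```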
### Steps to resolve
- Check the API container's logs for `[CRITICAL] WORKER TIMEOUT` messages from Gunicorn or `OperationalError` exceptions from Django.
- If workers are timing out, consider raising `GUNICORN_TIMEOUT` (default 30 s) or `GUNICORN_WORKERS` (default 3). See Flagsmith's Docker environment variables for the full list, or use `GUNICORN_CMD_ARGS` to pass arbitrary Gunicorn flags. Example settings are sketched after this list.
- Ensure your reverse proxy's upstream timeout is equal to or greater than Gunicorn's timeout.
- Monitor database connection usage. If connections are exhausted, increase `CONN_MAX_AGE` or add a connection pooler such as PgBouncer.
- If memory is the bottleneck, raise the container's memory limit or switch Gunicorn to `--worker-class gevent` to reduce per-worker memory usage.
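A minimal sketch of the checks and settings above, assuming a Docker Compose service named `api`; the values shown are illustrative starting points, not recommendations.

```bash
# Look for worker restarts and database errors in the API logs.
docker compose logs api | grep -E "WORKER TIMEOUT|OperationalError"

# Example environment overrides for the api service (docker-compose.yml or an env file):
#   GUNICORN_WORKERS=6
#   GUNICORN_TIMEOUT=60
#   GUNICORN_CMD_ARGS="--worker-class gevent"
```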
Related documentation: Caching Strategies • Asynchronous Task Processor