# Self-Hosted Issues
This page covers runtime symptoms specific to self-hosted Flagsmith deployments. For initial setup problems (health checks, database connectivity, frontend DNS), see the deployment troubleshooting guide.
## Task processor is not running jobs
Tasks are queueing up but never being processed. Symptoms include: webhooks not firing, audit logs not being written, or analytics data not appearing.
### Common causes
- Task processor container is not running. The task processor is a separate service that must be started alongside the API. Check that a container with the `run-task-processor` command is running (`docker ps` or your orchestrator's pod list); a quick check is sketched after this list.
- `TASK_RUN_METHOD` not set to `TASK_PROCESSOR`. If this environment variable is not set on the API container, Flagsmith runs tasks in an unmanaged background thread inside the API process instead of sending them to the processor. The processor will have nothing to pick up.
- Database connectivity from the processor. The task processor must be able to reach the same database as the API (or a dedicated task processor database if you have configured one). Check `DATABASE_URL` and `TASK_PROCESSOR_DATABASE_URL`.
- Sleep interval too high. The `TASK_PROCESSOR_SLEEP_INTERVAL_MS` environment variable controls how often each worker thread checks for new tasks. The default is 500 ms. If this has been raised significantly, tasks will appear to be delayed.
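A quick way to check the first two causes from the Docker host, as a sketch; the container name `flagsmith-api` is a placeholder, so substitute the names from your own Compose file or manifests.

```bash
# Is a task processor container running? Look for the run-task-processor command.
docker ps --format '{{.Names}}\t{{.Command}}' | grep run-task-processor

# Is the API container configured to hand tasks to the processor?
# ("flagsmith-api" is a placeholder for your API container name.)
docker exec flagsmith-api env | grep TASK_RUN_METHOD
# Expected output: TASK_RUN_METHOD=TASK_PROCESSOR
```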
### Steps to resolve
- Verify the task processor container is running and check its logs for errors.
- Confirm that `TASK_RUN_METHOD=TASK_PROCESSOR` is set on the API container.
- Check that `DATABASE_URL` (and `TASK_PROCESSOR_DATABASE_URL` if using a separate database) is correct and reachable from the processor container.
- Review the processor configuration:
| Environment variable | Default | Description |
|---|---|---|
| `TASK_PROCESSOR_SLEEP_INTERVAL_MS` | 500 | Milliseconds between polling for new tasks |
| `TASK_PROCESSOR_NUM_THREADS` | 5 | Worker threads per processor instance |
| `TASK_PROCESSOR_GRACE_PERIOD_MS` | 20000 | Time before a task is considered stuck |
| `TASK_PROCESSOR_QUEUE_POP_SIZE` | 10 | Tasks retrieved per polling iteration |
- Check the monitoring endpoint at `GET /processor/monitoring`. It returns the number of tasks waiting in the queue. A consistently growing number indicates the processor is not keeping up; an example request is sketched below.
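As an illustration, polling the monitoring endpoint with `curl`, assuming the API is listening on `localhost:8000`; the exact response body may vary between releases.

```bash
# Watch the task queue size over time; a steadily growing value means the
# processor is falling behind.
for i in 1 2 3; do
  curl -s http://localhost:8000/processor/monitoring
  echo
  sleep 5
done
```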
Related documentation: Asynchronous Task Processor
## Database migration failures on upgrade
After upgrading the Flagsmith API image, the container fails to start with a migration error.
### Common causes
- Skipped versions. Flagsmith migrations are designed to be applied sequentially. If you jump from a much older version to the latest, an intermediate migration may fail because it expects a schema state that was never reached.
- Concurrent migration attempts. If multiple API containers start simultaneously and all attempt to run migrations, they can deadlock or conflict. Ensure only one container runs migrations at a time (use an init container or a separate migration job; a sketch follows this list).
- Insufficient database permissions. The database user must have permission to create, alter, and drop tables and indexes. Read-only replicas will always fail migrations.
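One way to run migrations from a single container before the API scales out, sketched for Docker Compose. The service names (`api`, `db`) and the ability to override the container command with `python manage.py migrate` are assumptions about your setup; on Kubernetes, an init container or a one-off Job plays the same role.

```bash
# Start only the database, apply migrations once, then start the API replicas.
docker compose up -d db
docker compose run --rm api python manage.py migrate
docker compose up -d api
```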
### Steps to resolve
- Read the full traceback in the container logs to identify which migration failed and why.
- If you skipped versions, consider upgrading incrementally through intermediate releases.
- If you need to roll back, follow the rollback procedure. For versions v2.151.0 and later, use:
```bash
python manage.py rollback_migrations_applied_after "<datetime of previous deployment>"
```
- If concurrent containers caused a conflict, restart with a single replica, let migrations complete, then scale back up.
Rolling back migrations may result in data loss if new models or fields were added. Always take a full database backup before attempting a rollback.
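For example, a full logical backup with `pg_dump`, assuming a PostgreSQL database; the connection string is a placeholder for your own credentials and host.

```bash
# Take a restorable backup before any rollback attempt.
pg_dump "postgresql://flagsmith:password@db-host:5432/flagsmith" \
  --format=custom \
  --file="flagsmith-pre-rollback-$(date +%Y%m%d%H%M).dump"
```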
Related documentation: Upgrades and Rollbacks
## Intermittent 502s from the API container
The API returns 502 Bad Gateway sporadically. The container is running and most requests succeed.
### Common causes
- Worker processes crashing. Flagsmith's API runs behind Gunicorn. If a worker runs out of memory or hits an unhandled exception, Gunicorn kills and restarts it. Requests in flight during the restart receive a 502 from the reverse proxy.
- Too few workers. The default Gunicorn worker count may not be enough for your traffic. If all workers are busy, new connections queue at the proxy and may time out.
- Request timeout mismatch. If Gunicorn's `--timeout` is longer than your reverse proxy's upstream timeout, the proxy will cut the connection before Gunicorn does, resulting in a 502.
- Database connection exhaustion. If the API and task processor share a connection pool and traffic spikes, the database may reject new connections. This typically shows as a `502` to the client and an `OperationalError: connection to server ...` in the API logs; a quick connection-count check is sketched after this list.
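To check for connection exhaustion, one option is comparing open connections against the server limit, assuming PostgreSQL; the connection string below is a placeholder.

```bash
# How close are we to the connection limit?
psql "postgresql://flagsmith:password@db-host:5432/flagsmith" -c \
  "SELECT count(*) AS open_connections,
          current_setting('max_connections') AS max_connections
   FROM pg_stat_activity;"
```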
### Steps to resolve
- Check the API container's logs for `[CRITICAL] WORKER TIMEOUT` messages from Gunicorn or `OperationalError` exceptions from Django.
- If workers are timing out, consider raising `GUNICORN_TIMEOUT` (default 30 s) or `GUNICORN_WORKERS` (default 3). See Flagsmith's Docker environment variables for the full list, or use `GUNICORN_CMD_ARGS` to pass arbitrary Gunicorn flags. Example settings are sketched after this list.
- Ensure your reverse proxy's upstream timeout is equal to or greater than Gunicorn's timeout.
- Monitor database connection usage. If connections are exhausted, increase `CONN_MAX_AGE` or add a connection pooler such as PgBouncer.
- If memory is the bottleneck, raise the container's memory limit or switch Gunicorn to `--worker-class gevent` to reduce per-worker memory usage.
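A minimal sketch of the checks and settings above, assuming a Docker Compose service named `api`; the values shown are illustrative starting points, not recommendations.

```bash
# Look for worker restarts and database errors in the API logs.
docker compose logs api | grep -E "WORKER TIMEOUT|OperationalError"

# Example environment overrides for the api service (docker-compose.yml or an env file):
#   GUNICORN_WORKERS=6
#   GUNICORN_TIMEOUT=60
#   GUNICORN_CMD_ARGS="--worker-class gevent"
```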
Related documentation: Caching Strategies • Asynchronous Task Processor