The setup: a FastAPI backend on Cloud Run
Columnly's backend is a FastAPI application deployed on Google Cloud Run in asia-south1. It handles LLM routing, data analysis pipelines, billing webhooks, and a SQLite-backed session store. Nothing exotic. The sort of architecture you've seen a hundred times.
What we hadn't seen was how a single misconfigured service account could cascade into three completely different error messages, each pointing you in the wrong direction.
Error 1: The 503 that isn't a 503
The first sign something was wrong: Cloud Run returned HTTP 503 on every request, immediately after deploy. The container was starting — health checks passed — but the moment a real request came in, 503.
The instinctive fix is to look at container logs. But the logs showed nothing. The container started, uvicorn bound to port 8080, FastAPI initialised — all clean. The 503 was happening before the request even reached the application.
We tried increasing the Cloud Run concurrency limit, increasing the request timeout, changing the port, and redeploying with a fresh image. All of these are reasonable guesses for a 503, and all of them were wrong.
The actual cause: the service account attached to the Cloud Run service did not have permission to pull the container image from Artifact Registry. Cloud Run was starting the container from a cached previous image, passing health checks, then failing to serve traffic because the actual running image was one deploy behind. The new code was never running.
The fix
Grant the Cloud Run service account the roles/artifactregistry.reader role on the specific Artifact Registry repository:
gcloud artifacts repositories add-iam-policy-binding columnly-backend \
  --location=asia-south1 \
  --member="serviceAccount:YOUR-SA@PROJECT.iam.gserviceaccount.com" \
  --role="roles/artifactregistry.reader"
Simple. One command. But you would never guess this from a 503 error message, because nothing in the error chain mentions permissions or image pulling.
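If you want to confirm the binding before the next deploy rather than after the next 503, the repository's IAM policy can be inspected directly. A quick check, using the same repository and service-account placeholders as the command above:

```shell
# Sanity check: list the repository's IAM bindings. After the fix, the
# service account should appear under roles/artifactregistry.reader.
gcloud artifacts repositories get-iam-policy columnly-backend \
  --location=asia-south1 \
  --format="yaml(bindings)"
```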
Error 2: The SQLite path that works locally and fails in prod
With the image pull fixed, the service came up properly. New error: the application was crashing on startup with a sqlite3.OperationalError: unable to open database file.
Locally, the database path was ./data/memo.db, relative to the working directory. In the container, the working directory is /app, and /app/data/ didn't exist. And even if it had, it wouldn't have helped for long: Cloud Run's container filesystem is an in-memory overlay, so anything written to it is ephemeral and counts against the instance's memory.
We fixed this in two steps. First, mount a persistent volume:
- name: 'gcr.io/cloud-builders/gcloud'
  args:
    - run
    - deploy
    - columnly-backend
    - --add-volume=name=memo-db,type=cloud-storage,bucket=columnly-memo-db
    - --add-volume-mount=volume=memo-db,mount-path=/data
    - --set-env-vars=COLUMNLY_MEMO_DB=/data/memo.db
Second, update the connection code to use an absolute path from the environment variable:
import os
import sqlite3

def get_db_path() -> str:
    # Absolute path from the environment; /tmp fallback for local runs.
    return os.environ.get("COLUMNLY_MEMO_DB", "/tmp/memo.db")

def get_connection():
    db_path = get_db_path()
    # The parent directory may not exist on first boot.
    os.makedirs(os.path.dirname(db_path), exist_ok=True)
    return sqlite3.connect(db_path, check_same_thread=False)
Never use relative paths for anything that persists state in a containerised environment. If it touches disk, it must be an absolute path from an environment variable. Always. The os.makedirs(..., exist_ok=True) guard is also non-negotiable — the directory may not exist on first boot.
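To make the behaviour concrete, here is a self-contained sketch of the same helpers being exercised end to end. The temporary directory stands in for the volume mount; the env var name mirrors the snippet above:

```python
import os
import sqlite3
import tempfile

def get_db_path() -> str:
    # Absolute path from the environment; /tmp fallback for local runs.
    return os.environ.get("COLUMNLY_MEMO_DB", "/tmp/memo.db")

def get_connection():
    db_path = get_db_path()
    # The parent directory may not exist on first boot.
    os.makedirs(os.path.dirname(db_path), exist_ok=True)
    return sqlite3.connect(db_path, check_same_thread=False)

# Point the app at a writable location, as the volume mount does in prod.
os.environ["COLUMNLY_MEMO_DB"] = os.path.join(tempfile.mkdtemp(), "data", "memo.db")
conn = get_connection()
conn.execute("CREATE TABLE IF NOT EXISTS memo (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO memo (body) VALUES (?)", ("hello",))
row = conn.execute("SELECT body FROM memo").fetchone()
conn.close()
```

Because the directory is created on demand and the path comes from the environment, the same code runs unmodified locally, in CI, and behind the Cloud Run volume mount.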
Error 3: The Stripe webhook that silently ate every event
With the service running, billing stopped working. Stripe webhooks were being delivered — the Stripe dashboard showed 200 OK responses — but no subscription updates were being written to the database. Users who paid weren't being upgraded.
The root cause took an embarrassingly long time to find: the webhook signature verification was passing, but the event type comparison used underscores where Stripe's event types use dots.
# Wrong — this never matches
if event["type"] == "checkout_session_completed":
    ...

# Stripe actually sends dotted event types
if event["type"] == "checkout.session.completed":
    ...
The webhook handler was silently catching all events, returning 200 (so Stripe stopped retrying), and doing nothing with them. The fix was a two-character change, replacing the underscores with dots, but finding it required re-reading the Stripe documentation for the third time and finally noticing the difference.
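A more defensive shape for the handler is a dispatch table keyed on the dotted event types, with anything unhandled made visible instead of silently swallowed. A minimal sketch, with hypothetical handler names:

```python
import logging

logger = logging.getLogger("billing")

def handle_checkout_completed(event: dict) -> None:
    # Hypothetical: upgrade the user tied to this checkout session.
    ...

# Stripe event types use dots: "checkout.session.completed",
# never "checkout_session_completed".
HANDLERS = {
    "checkout.session.completed": handle_checkout_completed,
}

def dispatch(event: dict) -> bool:
    """Route a verified Stripe event; return False if it went unhandled."""
    handler = HANDLERS.get(event["type"])
    if handler is None:
        # Still return 200 to Stripe, but make the no-op visible in logs.
        logger.warning("unhandled Stripe event type: %s", event["type"])
        return False
    handler(event)
    return True
```

A typo in a dict key here still fails, but it fails loudly: every delivered event shows up in the logs as unhandled, instead of vanishing behind a 200.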
The most expensive bugs are the ones that return 200 OK and do nothing. A hard crash is easy to find. Silent success is not.
What we'd do differently
Three stacked errors, three different root causes, each requiring a different debugging approach. In hindsight, all three were preventable:
- Permissions: Run a pre-deploy IAM audit script. If the service account is missing any role the service needs, fail the build, not the deployment.
- Filesystem: Add a container startup test that writes to every path the application uses. If it fails, the deployment fails — not the first live request.
- Webhooks: Write a test that replays a real Stripe event payload against your handler before any deploy that touches billing code. Stripe provides test payloads for every event type.
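The filesystem check in particular is cheap to script. A minimal sketch of what a container entrypoint could run before starting uvicorn (the path list is hypothetical; adjust it to your service): write a probe file to every directory the application needs, and exit non-zero on any failure so the deploy fails instead of the first live request.

```python
import os
import sys
import tempfile

# Every directory the application writes to; adjust to your service.
REQUIRED_WRITABLE_DIRS = ["/tmp"]

def check_writable(dirs: list[str]) -> list[str]:
    """Return the subset of directories that could not be written to."""
    failures = []
    for d in dirs:
        try:
            os.makedirs(d, exist_ok=True)
            # NamedTemporaryFile deletes the probe file on close.
            with tempfile.NamedTemporaryFile(dir=d):
                pass
        except OSError:
            failures.append(d)
    return failures

if __name__ == "__main__":
    failed = check_writable(REQUIRED_WRITABLE_DIRS)
    if failed:
        print(f"startup check failed, unwritable: {failed}", file=sys.stderr)
        sys.exit(1)
```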
The pattern across all three is the same: the error surfaced as late as possible, in the most obscure way possible. The fix for each was to move the failure earlier — to build time, to startup, to test time — so it could never reach production silently.
The one thing worth remembering
Google Cloud's error messages are written for someone who already knows what's wrong. A 503 from a missing IAM role, a startup crash from a missing directory, a silent no-op from a wrong string constant — none of these messages tell you what actually caused them. You have to already know where to look.
The fastest debugging tool we've found for Cloud Run is not the logs explorer. It's deploying a minimal version of the service — just startup code, no business logic — and adding one piece at a time until the error appears. Tedious, but it works every time.