First to Site
Release 3.10

Google Token Self-Healing (Permit Ready + DA Ready)

Proactive token refresh cron + kernel.exception auto-recovery + 6h-deduped ops alert closes the Trello 1759 incident loop

Overview

For months, the DA Ready and Permit Ready Google API automations would silently stop processing inbound mail every few weeks. The recipe was always the same: someone noticed orders weren't landing, hit /api/google/auth (or /api/da-google/auth) in a browser, completed the OAuth consent screen, and the pipeline resumed. Trello 1759 captured this. Release 3.10 closes the loop.

Root cause (audit)

The audit on Trello 1782 found two compounding defects:

  1. Refresh response written verbatim. AuthService::initializeToken() wrote the OAuth refresh response straight to the on-disk token file. Google omits refresh_token from the response body on subsequent refreshes, so the file got rewritten without it. The next refresh attempt had no refresh_token and 401'd.
  2. Deploy clobber. The deploy script copied token.json from ~/config/<env>/google-api-prod/ into every new release dir, but never copied token_da.json back. So every deploy reset Permit Ready to a stale token (from the persistent config dir) and left DA Ready entirely without a token file in the new release.

Full audit detail (with on-disk evidence including an 82-byte truncated token from a non-atomic write race) lives in ResolutionLibrary/prod-google-da-permit-token-refresh-audit-2026-04-26.md.

Fix - three layers

1. Proactive refresh cron

A new Symfony command, app:google:refresh-tokens, runs every 30 minutes via crontab on fts1:

*/30 * * * * /home/ftsuser/live/app/bin/console app:google:refresh-tokens >> /home/ftsuser/logs/google-token-refresh.log 2>&1

The command iterates both pipelines (Permit Ready and DA Ready), checks each token's expiry against a configurable threshold (default 600s), and refreshes only if stale. Refreshes go through new AuthService::refreshIfStale() and AuthServiceDA::refreshIfStale() helpers that:

  • read the existing on-disk token,
  • merge the refresh response onto it (preserving refresh_token),
  • atomically rewrite via tempfile + rename.

2. Kernel.exception auto-recovery

A new App\EventSubscriber\GoogleTokenFailureSubscriber listens on kernel.exception and matches:

  • requests to /api/google/* or /api/da-google/*,
  • exceptions whose class lives under Google\ or whose message contains token-shaped vocabulary.

On match, it dispatches an async RefreshGooglePipelineMessage so the messenger worker tries refreshIfStale() in the background. If a valid refresh_token still exists on disk, the next request succeeds.

3. Deduped ops alert

The same subscriber also dispatches an alert email through the existing EmailNotificationMessage async path. Dedup uses cache.app with key google_token_alert.<pipeline> and a 6-hour TTL, so a noisy day produces one email per pipeline, not dozens. The cron command shares the same dedup key on its own failure paths, so HTTP and cron failures coalesce into one alert window. Recipient defaults to [email protected]; override via OPS_NOTIFICATION_EMAIL in app/.env.local.

Behaviour after the fix

ScenarioBeforeAfter
Token within 10 min of expiry, cron tick fires(no proactive refresh)Refreshes silently, refresh_token preserved
Token already expired, request hits /api/google/process500 with Invalid token format; pipeline halts500 once for the in-flight request, async refresh dispatched, next request succeeds
Same failure repeats within 6hAlert email per failureSingle alert email; subsequent failures suppressed
Deploy shipstoken.json reset to whatever was last written into ~/config/<env>/; token_da.json deletedBoth files persist across deploys via the new Step 3a back-sync in the release script

Verification

End-to-end verified in production: the proactive cron showed da_ready: refreshed (expires_in=3599s) on the first tick after the new code shipped, the Permit Ready pipeline recovered after a one-time /auth reseed (the persistent refresh_token had been revoked at some point and could not be merged back), and Mailhog confirmed the alert email layout in the local dev environment. Subsequent two requests inside the 6h window suppressed correctly.

  • Trello 1759 - parent incident.
  • Trello 1782 - audit report.
  • Trello 1783 - proactive refresh implementation.
  • Trello 1784 - auto-recovery + alerting.
  • Git tags v3.12.0 (proactive refresh + deploy-script fix) and v3.13.0 (auto-recovery + alerting).