Google Token Self-Healing (Permit Ready + DA Ready)
Proactive token refresh cron + kernel.exception auto-recovery + 6h-deduped ops alert closes the Trello 1759 incident loop
Overview
For months, the DA Ready and Permit Ready Google API automations would silently stop processing inbound mail every few weeks. The recipe was always the same: someone noticed orders weren't landing, hit /api/google/auth (or /api/da-google/auth) in a browser, completed the OAuth consent screen, and the pipeline resumed. Trello 1759 captured this. Release 3.10 closes the loop.
Root cause (audit)
The audit on Trello 1782 found two compounding defects:
- Refresh response written verbatim.
AuthService::initializeToken()wrote the OAuth refresh response straight to the on-disk token file. Google omitsrefresh_tokenfrom the response body on subsequent refreshes, so the file got rewritten without it. The next refresh attempt had no refresh_token and 401'd. - Deploy clobber. The deploy script copied
token.jsonfrom~/config/<env>/google-api-prod/into every new release dir, but never copiedtoken_da.jsonback. So every deploy reset Permit Ready to a stale token (from the persistent config dir) and left DA Ready entirely without a token file in the new release.
Full audit detail (with on-disk evidence including an 82-byte truncated token from a non-atomic write race) lives in ResolutionLibrary/prod-google-da-permit-token-refresh-audit-2026-04-26.md.
Fix - three layers
1. Proactive refresh cron
A new Symfony command, app:google:refresh-tokens, runs every 30 minutes via crontab on fts1:
*/30 * * * * /home/ftsuser/live/app/bin/console app:google:refresh-tokens >> /home/ftsuser/logs/google-token-refresh.log 2>&1The command iterates both pipelines (Permit Ready and DA Ready), checks each token's expiry against a configurable threshold (default 600s), and refreshes only if stale. Refreshes go through new AuthService::refreshIfStale() and AuthServiceDA::refreshIfStale() helpers that:
- read the existing on-disk token,
- merge the refresh response onto it (preserving
refresh_token), - atomically rewrite via
tempfile + rename.
2. Kernel.exception auto-recovery
A new App\EventSubscriber\GoogleTokenFailureSubscriber listens on kernel.exception and matches:
- requests to
/api/google/*or/api/da-google/*, - exceptions whose class lives under
Google\or whose message contains token-shaped vocabulary.
On match, it dispatches an async RefreshGooglePipelineMessage so the messenger worker tries refreshIfStale() in the background. If a valid refresh_token still exists on disk, the next request succeeds.
3. Deduped ops alert
The same subscriber also dispatches an alert email through the existing EmailNotificationMessage async path. Dedup uses cache.app with key google_token_alert.<pipeline> and a 6-hour TTL, so a noisy day produces one email per pipeline, not dozens. The cron command shares the same dedup key on its own failure paths, so HTTP and cron failures coalesce into one alert window. Recipient defaults to [email protected]; override via OPS_NOTIFICATION_EMAIL in app/.env.local.
Behaviour after the fix
| Scenario | Before | After |
|---|---|---|
| Token within 10 min of expiry, cron tick fires | (no proactive refresh) | Refreshes silently, refresh_token preserved |
Token already expired, request hits /api/google/process | 500 with Invalid token format; pipeline halts | 500 once for the in-flight request, async refresh dispatched, next request succeeds |
| Same failure repeats within 6h | Alert email per failure | Single alert email; subsequent failures suppressed |
| Deploy ships | token.json reset to whatever was last written into ~/config/<env>/; token_da.json deleted | Both files persist across deploys via the new Step 3a back-sync in the release script |
Verification
End-to-end verified in production: the proactive cron showed da_ready: refreshed (expires_in=3599s) on the first tick after the new code shipped, the Permit Ready pipeline recovered after a one-time /auth reseed (the persistent refresh_token had been revoked at some point and could not be merged back), and Mailhog confirmed the alert email layout in the local dev environment. Subsequent two requests inside the 6h window suppressed correctly.
Related
Release 3.10 Overview
Google API token self-healing, SPEAR Milestone Whitespace Report, autocomplete search sanitisation, Reference Documents UI restack, async failed-queue hardening, and ordering-portal v3 contacts wiring
Deploy script persists DA tokens across releases
release-prod-safe.ts auto-syncs token_da.json + credentials_da.json across releases via a new Step 3a, stopping the every-deploy DA token wipeout