Observability in Every Project, Live Push in the Browser, and a Lesson About Firebase Configs
Three days after the closed-alpha invitation comes what you might call "the second layer": not new features, but the instrumentation and reliability you need to see what happens when real people are looking at the app — and to fix it before it becomes a problem. The whole thing was a mix of fifteen new Sentry projects, a completely new real-time layer, two rounds of "why does page X come back as 500," and an honest self-correction about what actually counts as a secret and what doesn't.
Sentry in every project, Clarity + GA4 + Firebase on top
Until now we had no error-tracking layer outside the Aspire OpenTelemetry exports. That was enough as long as the app ran on my machine; once real reader sessions hit Azure Container Apps, it isn't enough anymore. Sentry is now built into every service: Api, Web, Web-Browser (a separate project for the JavaScript bundle), the four AI pipelines (Notification / Image / Audio / Content), the three webhook hosts (Stripe / Apple / Google), the four initialization workers (Secrets / Account / Data / Products), and the MAUI app. Fifteen logical services, fifteen separate Sentry projects.
Each project gets its own DSN, stored in the outastory-secrets Key Vault and injected into the services at runtime. Development/test/production environments are distinguished via the Sentry:Environment tag rather than separate projects — that keeps release-health views and alert rules compact.
Product analytics sit alongside it:
- Microsoft Clarity on the web app for session replay and heatmaps. Configured privacy-first (text masked, media blocked); only kicks in if the reader explicitly consents to the analytics cookie via the CookieYes banner.
- Google Analytics 4 (G-BDZS2837MZ) with Consent Mode v2: all four parameters (ad_storage, ad_user_data, ad_personalization, analytics_storage) default to denied before consent. Clicking "Accept" in the CookieYes banner flips them to granted; "Reject" or "Withdraw" flips them back to denied. That's the EEA-compliant path since March 2024.
- Firebase Analytics on the mobile apps for Android and iOS, with the same opt-in pattern. On iOS, Apple's ATT dialog is shown on the first analytics consent; whoever chooses "Ask App Not to Track" stays with us for Sentry crashes only (legitimate interest), but Firebase stays off.
- Measurement Protocol for server-side events: when the Stripe webhook confirms a subscription purchase, the backend can explicitly send that event to the Firebase Analytics property, attributed to the mobile user. That's a separate secret; the api_key values in the Firebase config files are not secrets (more on that below).
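The Consent Mode v2 flow from the list above can be sketched in a few lines. This is a minimal sketch, not the app's actual bootstrap code: `dataLayer` and `gtag` are local stand-ins for the globals the standard gtag snippet defines in the browser, and `onBannerDecision` is a hypothetical hook for whatever callback the consent banner exposes.

```javascript
// Local stand-ins for the globals the standard gtag snippet defines in the browser.
const dataLayer = [];
function gtag() {
  // gtag.js expects the raw `arguments` object to be pushed onto dataLayer.
  dataLayer.push(arguments);
}

const CONSENT_KEYS = ["ad_storage", "ad_user_data", "ad_personalization", "analytics_storage"];

// Build a consent object that sets all four parameters to the same state.
function consentState(value) {
  return Object.fromEntries(CONSENT_KEYS.map((key) => [key, value]));
}

// Must be queued before any measurement call: everything starts as "denied".
gtag("consent", "default", consentState("denied"));

// Hypothetical hook: call this from whatever callback the consent banner exposes.
function onBannerDecision(accepted) {
  gtag("consent", "update", consentState(accepted ? "granted" : "denied"));
}

onBannerDecision(true);  // reader clicks "Accept"
onBannerDecision(false); // reader later withdraws consent
```

The important part is the ordering: the `default` command has to be queued before anything else touches the tag, otherwise the first hit goes out with whatever the tag assumes.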
Live notifications via Azure SignalR
The notifications table got its own poller a couple of weeks back: every 45 seconds, the web app fetches the new rows and shows them as a toast. That's correct, but it feels sluggish. Real live notifications need a push channel — and that's exactly the moment where Azure SignalR Service becomes worthwhile.
Two things solved at once:
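- Blazor Server circuits get offloaded onto Azure SignalR in publish mode. Without this step, every signed-in reader hangs off a sticky WebSocket on exactly one web Container App replica, and horizontal scaling turns into roulette. With Azure SignalR in front, the service holds the client connections and routes each reader to the replica that owns their circuit, so the ingress no longer needs sticky sessions and we can add replicas without an open reader losing their scroll position. (Circuit state itself still lives in the replica's memory, so removing a replica still ends the circuits it hosts.)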
- A dedicated NotificationsHub ([Authorize], keyed per-user group on the Auth0 sub) hangs alongside the Blazor circuits on the same Azure SignalR service. When the NotificationProcessor (the Azure Function that turns Service Bus events into notifications) writes a new row, it additionally dispatches a live event directly to the group user:{sub}. Every open tab of the reader picks up the event via the browser SignalR client and immediately shows a toast, without waiting for the 45-second poller.
The 45s poller was deliberately not removed. It stays as a fallback: if Azure SignalR has an outage, if the reader is on a restrictive network, if the browser client for whatever reason can't hold the connection — the poller still gets the notification through. When the hub is active, the poller automatically falls back to a 5-minute interval to conserve quota; as soon as the hub goes quiet, it goes back to 45 seconds.
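The interval switch can be sketched roughly like this. It is a minimal sketch under assumptions: `createPoller`, `isHubConnected`, and `fetchNotifications` are hypothetical names, and the real web app may detect hub health differently.

```javascript
const FAST_POLL_MS = 45 * 1000;     // hub unavailable: poll every 45 seconds
const SLOW_POLL_MS = 5 * 60 * 1000; // hub healthy: poll every 5 minutes

// Pick the delay before the next poll from the current hub state.
function nextPollDelayMs(hubConnected) {
  return hubConnected ? SLOW_POLL_MS : FAST_POLL_MS;
}

// Hypothetical poller: re-reads the hub state before scheduling each run,
// so a dropped connection tightens the interval on the next cycle.
function createPoller({ isHubConnected, fetchNotifications, schedule = setTimeout }) {
  let stopped = false;
  function tick() {
    if (stopped) return;
    fetchNotifications();
    schedule(tick, nextPollDelayMs(isHubConnected()));
  }
  return {
    start() { tick(); },
    stop() { stopped = true; },
  };
}
```

One limitation of this shape: if the hub drops right after a slow cycle was scheduled, the next poll is still up to five minutes away. A fuller version would also trigger an immediate poll from the client's disconnect event.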
Locally (Aspire Testing, dev boxes without an Azure SignalR account), the whole thing falls back to the native in-process WebSocket transport via a builder.Configuration.GetConnectionString(...) check. The NotificationProcessor gets a NullLiveNotificationDispatcher, whose DispatchAsync silently no-ops. No code path requires you to have an Azure SignalR account just to run dotnet run --project AppHost.
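The null-object fallback is easy to picture in a few lines. The real code is C#; this is a JavaScript sketch of the same shape, where `HubLiveNotificationDispatcher`, `createDispatcher`, and the `sendToGroup` callback are hypothetical names standing in for the actual hub call.

```javascript
// No-op implementation used when no Azure SignalR connection string is configured.
class NullLiveNotificationDispatcher {
  async dispatchAsync(_userSub, _notification) {
    // Intentionally does nothing; the poller still delivers the notification.
  }
}

// Hypothetical real dispatcher: `sendToGroup` stands in for the actual hub call.
class HubLiveNotificationDispatcher {
  constructor(sendToGroup) {
    this.sendToGroup = sendToGroup;
  }
  async dispatchAsync(userSub, notification) {
    // Group naming follows the per-user convention: user:{sub}.
    await this.sendToGroup(`user:${userSub}`, notification);
  }
}

// Choose the implementation once at startup from the configured connection string.
function createDispatcher(connectionString, sendToGroup) {
  return connectionString
    ? new HubLiveNotificationDispatcher(sendToGroup)
    : new NullLiveNotificationDispatcher();
}
```

Callers never branch on "is SignalR configured"; they always call `dispatchAsync`, and the local setup simply gets the variant that does nothing.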
Three regressions Playwright found — and what they taught us
The test suite started failing on 27 of 288 Playwright tests right after the first Sentry rollout. Three unrelated bugs, all with the same trap: looked at individually, they look like "flaky CI tests"; seen in sequence, they're real production bugs.
/ and /subscription returned 500
An anonymous visit to the home page and a visit to the subscription page as a signed-in user both returned HTTP 500. Other home-page tests in the same suite passed at the same time, so this wasn't a global breakage but a failure in specific code paths.
Root cause: OutaStoryApiClient had five different GetFromJsonAsync calls (stories, authors, last-chance, categories, subscription-plans) that let HttpRequestException fly uncaught out of OnInitializedAsync during the SSR prerender. One API hiccup → exception → Blazor's SSR pipeline returns 500. Fix: try/catch around every individual call, returning an empty collection on failure. Matches the defensive pattern GetAudioStoriesAsync / GetSubscriptionStatusAsync already had.
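The defensive pattern itself is small. The actual client is C# around GetFromJsonAsync; the following is a JavaScript sketch of the same idea, with `getListOrEmpty`, `loadHomePage`, the injected `fetchJson`, and the paths all being hypothetical.

```javascript
// Wrap one JSON fetch so a transport or HTTP failure degrades to an empty list
// instead of throwing out of the page's initialization.
async function getListOrEmpty(fetchJson, path, log = console.warn) {
  try {
    const items = await fetchJson(path);
    return Array.isArray(items) ? items : [];
  } catch (err) {
    // Log instead of swallowing silently, so the failure still reaches error tracking.
    log(`GET ${path} failed, rendering with an empty list: ${err.message}`);
    return [];
  }
}

// Hypothetical page initializer: one failing call no longer takes down the page.
async function loadHomePage(fetchJson, log) {
  const [stories, categories] = await Promise.all([
    getListOrEmpty(fetchJson, "/stories", log),
    getListOrEmpty(fetchJson, "/categories", log),
  ]);
  return { stories, categories };
}
```

Putting the try/catch in one helper rather than around each call site is what makes it cheap to cover the whole interface at once.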
First round: five methods. Second round (once the next tests hit the same class from other routes): nine more methods (GetHelpTopicsAsync, GetAuthorByUserIdAsync, GetStoriesByAuthorAsync, GetStoriesAsync, GetNotificationsAsync, GetPromotionsAsync, GetCommentsAsync, SearchStoriesAsync, SearchAsync). What I learned from this: once you decide to go defensive, apply it to the entire interface in one pass instead of fixing each case individually as it shows up. The 5+9 split wasn't a plan, it was two reactions.
The German locale "slipped" across requests
LocalizationTests.CultureCookie_German_* failed with html[lang]="en", even though the .AspNetCore.Culture cookie clearly said de-DE. At first it looked like a client-side bug: culture.js was overwriting the cookie based on navigator.language — and headless Chromium reports "en-US" regardless of what the server says. Fix: the read order in getCulture() was changed to localStorage → .AspNetCore.Culture cookie → navigator.language; in the cookie branch, the cookie is not written back (the server is authoritative there).
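The new read order can be sketched as follows. This is not the actual culture.js: the `"culture"` storage key, the returned `source` field, and the way inputs are passed in are assumptions made so the sketch runs outside a browser. The `c=...|uic=...` cookie format is the one ASP.NET Core's cookie culture provider writes.

```javascript
// Read one cookie value out of a `document.cookie`-style string.
function readCookie(cookieString, name) {
  for (const part of cookieString.split(";")) {
    const [key, ...rest] = part.trim().split("=");
    if (key === name) return decodeURIComponent(rest.join("="));
  }
  return null;
}

// ASP.NET Core stores the culture cookie as "c=de-DE|uic=de-DE".
function parseCultureCookie(value) {
  const match = /(?:^|\|)c=([^|]+)/.exec(value ?? "");
  return match ? match[1] : null;
}

// Resolution order: localStorage, then the server's cookie, then the browser language.
function getCulture({ storage, cookieString, navigatorLanguage }) {
  const stored = storage.getItem("culture"); // hypothetical storage key
  if (stored) return { culture: stored, source: "localStorage" };

  const fromCookie = parseCultureCookie(readCookie(cookieString, ".AspNetCore.Culture"));
  if (fromCookie) return { culture: fromCookie, source: "cookie" };

  return { culture: navigatorLanguage, source: "navigator" };
}
```

Returning the `source` lets the caller skip the write-back whenever the value came from the cookie, which keeps the server authoritative for that branch.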
But that didn't explain everything. A second round of these tests failed sporadically, even on fresh culture cookies. The cause: CultureService.ApplyCulture had been setting CultureInfo.DefaultThreadCurrentCulture and DefaultThreadCurrentUICulture. Those are process-wide statics. A German circuit wrote de-DE into the thread-pool default; the next request (with an English cookie) got served on a recycled thread whose default was still de-DE — and Home.razor computed its code paths based on that default. Result: the home-page code jumped into an untested branch that threw a NullReferenceException, which came out as a 500.
The fix was a small diff — only set CurrentCulture / CurrentUICulture (async-local, per logical thread), never DefaultThreadCurrent*. But the diagnosis was more expensive than the fix: the bug disguised itself as "Home is sometimes 500" and only reproduced consistently once we combined two Playwright test classes.
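The difference between a process-wide default and an async-local value is easier to see in a runnable analogy. The sketch below is Node.js, not the .NET code: `processWideCulture` plays the role of the process-wide static, and Node's `AsyncLocalStorage` plays the role of the async-local culture. All function names are hypothetical.

```javascript
const { AsyncLocalStorage } = require("node:async_hooks");

// Buggy shape: one process-wide default that every request overwrites.
let processWideCulture = "en-US";
async function renderWithGlobal(cookieCulture) {
  processWideCulture = cookieCulture;
  await Promise.resolve();   // the request yields, another one runs in between
  return processWideCulture; // may now hold the other request's value
}

// Fixed shape: the culture travels with the async flow of one request.
const requestCulture = new AsyncLocalStorage();
function renderWithAsyncLocal(cookieCulture) {
  return requestCulture.run(cookieCulture, async () => {
    await Promise.resolve();
    return requestCulture.getStore();
  });
}

// Run a German and an English request concurrently through both variants.
async function demo() {
  const buggy = await Promise.all([renderWithGlobal("de-DE"), renderWithGlobal("en-US")]);
  const fixed = await Promise.all([renderWithAsyncLocal("de-DE"), renderWithAsyncLocal("en-US")]);
  return { buggy, fixed };
}
```

In the buggy variant the German request reads back the English value because the second request overwrote the shared variable while the first was suspended. In the fixed variant each request reads its own value. This also mirrors why the bug only showed up once two test classes ran together.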
ProfileCompletion validation let empty fields through
ProfileCompletionTests.Validation_EmptyFirstName_* went through the form with an empty first name and was afterward redirected to /home, without client-side validation kicking in. In the log: "waiting for [data-testid='account-complete-error-firstname'] → timeout, navigated to https://localhost/home."
Looks like a validation bug, but it was actually a test-infrastructure bug. The test helper ResetCompletionStateAsync fired a fetch('/api/profile/_e2e-reset') call from the browser — and hit the web host, not the API host. Because the web host has no reverse proxy for /api/*, the call silently 404'd. The profile stayed in "complete" state from a previous happy-path test, the form got pre-filled with "E2E" / "Tester," the "skip firstname" branch in the test never actually cleared the pre-filled value, the validation ran correctly against a filled-in value, and the form was correctly submitted.
Fix: ResetCompletionStateAsync now calls Fixture.ApiBaseUrl + /api/profile/_e2e-reset with the X-E2e-Test-User header directly via HttpClient, like every other signed-in Playwright test does. Plus four new bUnit tests that check the same "server pre-filled the values" case directly at the component level and fail in 50 ms instead of after a 15-second timeout. The kind of regression you go through once and never go through again, because the tests are fast enough.
The lesson about Firebase configs
Alongside the observability work, there was an absurd little zigzag path in the MAUI release pipeline. First version: commit google-services.json and GoogleService-Info.plist with REPLACE_ME sentinels into the repo, have local scripts overwrite them, have CI decode a base64 version from GitHub Secrets. The firm belief behind this: the real Firebase configs are sensitive and must not live in the repo.
That worked, until the first CI runs on Android and iOS promptly failed with "Bundle Resource not found" and "Failed to Read GoogleServicesJson," because my BeforeBuild targets were copying the template files into place too late — Xamarin's ProcessGoogleServicesJson had already hooked into the evaluate phase. The workaround with --skip-worktree and base64 CI decoding was... functional, but inelegant.
The moment I researched the matter once more, it flipped my original belief. Google itself says (https://firebase.google.com/docs/projects/api-keys):
"Firebase API keys are different from typical API keys. Unlike how API keys are typically used, API keys for Firebase services are not used to control access to backend resources; [...] API keys for Firebase services are ok to include in code or checked-in config files."
Access control on Firebase mobile services runs through the Android package name + SHA signing fingerprint and through the iOS bundle identifier — not through the API key. Your google-services.json already ships inside every shipped APK anyway. It's not a secret; it's a project identifier. Protecting it was a reflex, not a thought-through threat model.
So: template renaming reverted, decode steps removed from release.yml, --skip-worktree logic thrown out, the SEED_FIREBASE_*_B64 GitHub secrets sit there as orphans (deletable), and the Firebase sync script shrank from ~450 to ~250 lines. All committed exactly the way Google recommends.
What actually is secret — the Measurement Protocol API secret for backend-to-GA4 events — still lives separately in Key Vault and in GitHub Secrets. The difference isn't in file size, it's in the endpoint's access control.
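To make the distinction concrete, here is roughly what a server-side Measurement Protocol request for an app stream looks like. This is a sketch that only builds the request and does not send it. `buildPurchaseRequest` and its input shape are hypothetical; the endpoint, the `firebase_app_id` and `api_secret` query parameters, and the `app_instance_id` / `events` body fields follow Google's published protocol.

```javascript
const MP_ENDPOINT = "https://www.google-analytics.com/mp/collect";

// Build (but do not send) a Measurement Protocol request for a Firebase app stream.
function buildPurchaseRequest({ firebaseAppId, apiSecret, appInstanceId, purchase }) {
  const url = new URL(MP_ENDPOINT);
  url.searchParams.set("firebase_app_id", firebaseAppId); // not secret: ships in the app config
  url.searchParams.set("api_secret", apiSecret);          // secret: belongs in Key Vault

  const body = {
    app_instance_id: appInstanceId, // ties the event to the mobile user's app instance
    events: [
      {
        name: "purchase",
        params: {
          currency: purchase.currency,
          value: purchase.value,
          transaction_id: purchase.transactionId,
        },
      },
    ],
  };

  return {
    url: url.toString(),
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  };
}
```

The app ID in that URL is the same value that sits in the committed config files. The `api_secret` is the only part that lets a caller write events into the property, which is why it alone is treated as a secret.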
PowerShell provisioning, so it still holds up in two years
Everything above lives on top of a stack of secrets — fifteen Sentry DSNs, three Firebase values, two GA4 values, plus the already-existing Stripe/OneSignal/IAP/Auth0 credentials. Locally, those are "AppHost user secrets" (dotnet user-secrets set); in production they live in the outastory-secrets Key Vault, injected through Initialization.Secrets.
The pattern already existed, but the new integrations extended it:
- tools/sentry/Provision-SentryProjects.ps1: creates fifteen Sentry projects, fetches DSNs, pushes them to AppHost user secrets and GitHub Secrets, and provisions alert rules (Teams for high-priority issues, three GitHub-ticket rules per project, one per environment, with a [Development|Test|Production] prefix in the ticket title). Idempotent, state-aware, with a -ReplaceAlertRules switch for re-provisioning.
- tools/sentry/Provision-SentryUptimeMonitors.ps1: two uptime monitors (Api + Web) with Teams notification, parameterizable URLs (currently pointed at the Azure Container App FQDNs; switchable to api.outastory.com / www.outastory.com after the DNS cutover).
- tools/google-analytics/Sync-GoogleAnalyticsSecrets.ps1: pushes the measurement ID and the measurement-protocol API secret, with config state in config/state.json (gitignored).
- tools/firebase/Sync-FirebaseSecrets.ps1: extracts the Android and iOS app IDs from the committed config files, pushes them plus the optional measurement-protocol secret. The config files themselves are no longer copied (see above).
release.yml got a thorough cleanup, to carry the same env vars as the local flow: 44 Secrets__* lines per azd provision and per azd deploy (diff-identical between the two, to avoid drift), organized with comment sections per provider. Previously, OneSignal / IAP / SendGrid values only ended up in GitHub Secrets and not in Azure; now they flow all the way into the production Key Vault.
One real find in the process: the MS Teams integration in our Sentry org had two installs — one named "OutaStory Team" (the intended target) and one named "KN-CMT Team" (a separate install, a relic from other work). Get-Integration filtered by provider key and then blindly picked [0] — which happened to be KN-CMT. For two days, all the "Teams" alerts went into a stranger's channel. Sentry's browser UI didn't show any warning, the rules looked configured, the only thing missing was the Teams bot in the target channel. Fix: added -NameFilter to Get-Integration, exact matching against $MsTeamsTeamName, and a loud abort with a listing of the available names if no match is found. Much rather "script aborts, operator has to choose" than "script picks some install and stays quiet."
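The "abort loudly instead of picking [0]" rule translates to very little code. The real script is PowerShell; this JavaScript sketch shows the same selection logic, with `selectIntegration`, the `provider` / `name` field names, and the `"msteams"` provider key all being assumptions for illustration.

```javascript
// Pick exactly one integration by provider key and exact name; never fall back to [0].
function selectIntegration(integrations, providerKey, expectedName) {
  const candidates = integrations.filter((i) => i.provider === providerKey);
  const matches = candidates.filter((i) => i.name === expectedName);

  if (matches.length !== 1) {
    // List what is actually installed so the operator can choose deliberately.
    const available = candidates.map((i) => `"${i.name}"`).join(", ") || "(none)";
    throw new Error(
      `Expected exactly one ${providerKey} integration named "${expectedName}", ` +
        `found ${matches.length}. Available: ${available}`
    );
  }
  return matches[0];
}
```

Requiring exactly one match also covers the opposite failure: two installs with the same name abort just as loudly as zero.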
Android permissions, iOS ATT, and the privacy-label marathon
A side quest, but one where the fine-tuning for the store submission piled up:
- AndroidManifest.xml: added com.google.android.gms.permission.AD_ID (Firebase Analytics needs it on Android 13+). With a comment on how to strip it back out via tools:node="remove" if we ever want to stay advertising-ID-free.
- iOS/Info.plist: added NSUserTrackingUsageDescription with an explanatory text. Apple rejects binaries that link ATTrackingManager without the usage description present (ITMS-91051 / 91053). Our FirebaseAnalyticsService.GrantAsync calls ATT; without the key, the next TestFlight submission would bounce right back.
- App Store Connect Privacy Labels and Google Play Data Safety: both sides now have complete documentation in docs/OBSERVABILITY.md on which data types have to be declared under which category. "Not shared, not linked to user, analytics purpose" for Firebase + Sentry; "App Functionality" for Sentry crashes. Tracking: no. We don't track any reader across other apps or websites.
That was the part of the work that produces nothing spectacular, but if left undone, means the next release build gets stopped at the store's front door.
Numbers, for the record
Unit tests: 1,872 green, 0 red (before the observability work: about 1,660; 90 new tests came in: 53 options tests, 24 secret-seed tests, and 13 bUnit tests for the App.razor analytics bootstrap, plus 7 dispatcher tests and 7 backoff tests). Playwright suite: 261 passed / 27 failed on the first run; after two rounds of fix-and-rerun, 278 / 13; and finally, after the profile-completion infrastructure fix, expected green (I haven't run the last round yet).
Commits this work week: ~30 between the SignalR rollout, observability foundation, backend wave, web wave, MAUI wave, test wave, the three regression fixes, the Firebase simplification, and the Teams-integration drama. Most of the effort concentrated in the two "oh no"s and the one "aha," not in the ten "done"s.
What's next
A push is due now — the 32 local commits want to go to origin/main, so the release pipeline runs with the new env-var stack and the new Sentry DSNs. After that comes what was already sketched out in the alpha invitation: the alpha testers (web, Android, iOS soon) provide the next priority signal. Until then, the part that answers "what happens if something goes wrong?" is done.
