From flagship catalog to a real library — the rohtext.org import, per-pipeline Sentry observability, and a login path that cleans up after itself

rohtext-import, public-domain, observability, sentry, auth, alpha
Fiction

Full library shelves with bound classics, a symbol for the jump from 42 flagship stories to a real library with thousands of German-language classics. Photo: Inaki del Olmo on Unsplash

The last few weeks were spread wide: push notifications, test coverage, a quieter Sentry dashboard. This week has one clear center, and it's a big one: OutaStory has stopped being a 42-story flagship catalog and is on its way to becoming a real library. The source is a collection of German-language classics that we're bringing onto the platform legally and with the blessing of its operator. Each book gets its own mini AI pass (categorization, age-rating classification, short description, author biography), the build ran in seven phases from spec to admin dashboard, and an overnight wave has now put more than two thousand books into the search path and the story detail pages for good. Alongside that, smaller but no less important: a profile-completion path that now logs its failures instead of swallowing them, triggered by a single bug report from a test user, and a second wave of auth hardening for circuits that, after a long idle period, only looked logged in.

rohtext.org as a source — and what that actually means

Before I get to the pipeline, a quick why. Since the relaunch, OutaStory's flagship catalog has consisted of 42 carefully assembled original stories, 11 chapters each, hand-curated and with all the trimmings (AI cover, audio, German and English versions). That's a starting catalog that shows what a story on OutaStory should look like — but not a catalog you as a reader can browse for "something new for tonight."

rohtext.org — run by Austrian developer Stefan Gündhör — has spent years collecting public-domain German-language classics (texts from the DACH region whose 70-year copyright term has expired) and serves them via a well-formed JSON API. Current state: 9,235 books in the catalog, from Goethe and Schiller through Romanticism all the way to Karl May, Theodor Storm, Else Lasker-Schüler, and many, many lesser-known voices. The raw texts are explicitly cleared for commercial reuse with no attribution requirement.

The nice thing about rohtext: it's the raw-text sibling of a commercial platform called freitext.org. freitext.org adds its own AI-generated covers, descriptions, genre classifications, and author portraits — those are Stefan's own work and copyrighted. We only use rohtext (i.e., just the raw, public-domain texts) and build the companion content ourselves. Stefan gave the project a green light; I'm in touch with him and respect the CloudFront costs on his origin server by not firing off wget 9,235 times in one burst, but running a respectful polling cadence instead.

Seven phases, one pipeline, four AI tasks per book

Long, curving library with thousands of colorful book spines, a symbol for the scale of the 9,235-book catalog. Photo: Susan Q Yin on Unsplash

The actual importer is its own Azure Functions host (OutaStory.RohtextImporter), orchestrating three functions: a daily timer that polls the catalog and pushes new/updated books onto a Service Bus queue; a consumer that, per book, asks the rohtext API for the full text, runs a small AI pass, and materializes the result into the OutaStory database; and a retry timer that, in between, retries failed AI calls with exponential backoff. The pipeline was built in seven phases, each with its own pull request and its own review loop.

Phase 1 — schema + UI plumbing. The Story and Author tables got five new columns for the import trail (ImportSource, ImportSourceId, ImportSourceStoryUrl, ImportSourceJsonUrl, ImportSourceVersion). A two-component Razor block on the story detail pages now shows a clear attribution box: "This story comes from the rohtext.org catalog by Stefan Gündhör. Public domain (DACH region, 70-year copyright term expired)." Edit permission on imported stories is restricted to Developer + Admin — authors shouldn't be able to casually edit public-domain classics.

Phase 2 — Functions-host skeleton + ops plumbing. A dedicated Service Bus queue (rohtextimport), a dedicated RohtextSyncState table for the ETag-based incremental diff against the rohtext API, a container-app resource in the Aspire AppHost. A ServiceDefaults-compliant health probe and the usual OpenTelemetry bundle.

Phase 3 — client + materializer + admin endpoint. A typed HTTP client for the rohtext JSON API (with ETag caching against their origin), an AuthorResolver that maps German author names onto existing author rows (with an alias YAML for "Goethe" vs. "Johann Wolfgang von Goethe" vs. "J.W. v. Goethe") and a StoryMaterializer that turns the diff into EF Core operations. Plus a developer-only admin endpoint POST /api/admin/rohtext/import/{guid} for manual single-book imports, running the whole pipeline against one book.
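The alias lookup in the AuthorResolver boils down to a normalize-then-map step. Here is a minimal sketch of that idea, written in Python for brevity rather than in the importer's .NET code. The alias table contents, the normalization rule, and all function names are hypothetical; the post only states that an alias YAML exists.

```python
import re

# Hypothetical alias table: canonical name -> known spellings.
# In the importer this lives in a YAML file; a dict stands in for it here.
ALIASES = {
    "Johann Wolfgang von Goethe": ["Goethe", "J.W. v. Goethe"],
}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(cleaned.split())


def build_index(aliases: dict[str, list[str]]) -> dict[str, str]:
    """Map every normalized spelling (canonical included) to the canonical name."""
    index = {}
    for canonical, spellings in aliases.items():
        for spelling in [canonical, *spellings]:
            index[normalize(spelling)] = canonical
    return index


def resolve_author(name: str, index: dict[str, str]) -> str:
    """Return the canonical name, or the trimmed input if no alias matches."""
    return index.get(normalize(name), name.strip())
```

With this normalization, "J.W. v. Goethe" and "J. W. v. Goethe" land on the same key, so small punctuation differences in the source catalog do not create duplicate author rows.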

Phase 4 — AI subsystems. Four gpt-5 calls per book via Azure OpenAI with structured JSON schemas:

  • CategoryClassifier — assigns the book to one of the 302 categories, based on title, year of publication, and a 2 KB sample from the first chapter.
  • MinAgeClassifier — decides between 0, 6, 12, 16, 18 (in the spirit of the German USK rating), based on title, year, and a 4 KB sample plus 1 KB from the middle.
  • SummaryGenerator — produces a 2- to 3-sentence description in the OutaStory voice, based on a 6 KB sample plus 1 KB from the end.
  • AuthorBioGenerator — if the author row doesn't yet have a bio, generates a concise biography from the author's name and a sample of that person's titles.

Azure Content Safety runs over the full text before every AI call. Most classics pass without a flag; occasionally an explicit-content flag comes back, and in that case we apply a higher MinAge default.
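The sampling rules and the content-safety floor can be sketched in a few lines. This is an illustration in Python, not the importer's code, and it rests on several assumptions: "KB" is treated as a character count, every sample starts at the beginning of the text (the post says "first chapter" for the category sample), and the floor value of 16 is a placeholder because the post does not name the actual higher default.

```python
KB = 1024  # treated as characters here (assumption); the post only says "KB"
ALLOWED_AGES = (0, 6, 12, 16, 18)


def sample(text: str, head: int, extra: int = 0, extra_from: str = "middle") -> str:
    """Take `head` characters from the start plus `extra` from the middle or end."""
    parts = [text[:head]]
    if extra > 0:
        if extra_from == "middle":
            start = max(len(text) // 2 - extra // 2, 0)
            parts.append(text[start:start + extra])
        elif extra_from == "end":
            parts.append(text[-extra:])
        else:
            raise ValueError(f"unknown extra_from: {extra_from!r}")
    return "\n".join(parts)


def category_sample(text: str) -> str:
    return sample(text, 2 * KB)


def min_age_sample(text: str) -> str:
    return sample(text, 4 * KB, 1 * KB, "middle")


def summary_sample(text: str) -> str:
    return sample(text, 6 * KB, 1 * KB, "end")


def apply_safety_floor(classified_age: int, flagged_explicit: bool, floor: int = 16) -> int:
    """Raise the rating to `floor` when content safety flagged the text.

    The floor of 16 is a hypothetical value for illustration.
    """
    if classified_age not in ALLOWED_AGES:
        raise ValueError(f"unexpected age rating: {classified_age}")
    return max(classified_age, floor) if flagged_explicit else classified_age
```

Using `max` means a flag can only raise a rating, never lower one the classifier already set higher.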

Phase 5 — SeedDumper + multi-YAML loader + version chain. When the pipeline runs locally (e.g., on the developer's machine), every successfully materialized import is written out as a YAML file at SeedData/stories/rohtext/{author-slug}/{story-slug}-v{N}.yaml. On the next boot, the Init.Data worker loads these YAMLs into a freshly set-up database, which preserves the state of the library across Aspire restarts. When rohtext publishes a new version of a book (textVersion: 2), a second YAML file is created and the story version chain does the rest — V1 stays reachable at /stories/{slug}/v/1/, V2 becomes the new default.
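The path template and the version routes from this phase are easy to write down as functions. This is a small Python sketch; the function names and the example slugs are hypothetical, and only the path and URL patterns come from the description above.

```python
from pathlib import PurePosixPath

SEED_ROOT = PurePosixPath("SeedData/stories/rohtext")


def seed_path(author_slug: str, story_slug: str, version: int) -> str:
    """Build SeedData/stories/rohtext/{author-slug}/{story-slug}-v{N}.yaml."""
    if version < 1:
        raise ValueError("versions start at 1")
    return str(SEED_ROOT / author_slug / f"{story_slug}-v{version}.yaml")


def version_url(story_slug: str, version: int) -> str:
    """URL under which one specific version stays reachable."""
    return f"/stories/{story_slug}/v/{version}/"


def default_version(versions: list[int]) -> int:
    """The newest version becomes the default."""
    return max(versions)
```

For a book with two versions, both YAML files sit side by side in the author's folder, and `default_version([1, 2])` selects version 2 while version 1 keeps its own URL.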

Phase 6 — production timer + queue consumer. The daily timer fires at 03:00 UTC and pushes a diff onto the queue — new books and version bumps go in, deleted rohtext entries are deliberately ignored (as a non-goal: no OutaStory page should disappear because of a rohtext maintenance sweep). The consumer works through the queue book by book, with Service Bus retry-and-DLQ semantics on hard failures. Embedding the full-text content in the Service Bus message spares the rohtext origin a second round trip per processing step — across 9,235 books, we'd otherwise have sent 15 GB of CDN egress to Stefan's origin, which wouldn't have been friendly.
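The diff rule (new books and version bumps go in, deletions are ignored) fits in one short function. This is a Python sketch under simplifying assumptions: both sides are reduced to a mapping from book id to textVersion, and the operation names "new" and "bump" are made up for the example.

```python
def compute_diff(catalog: dict[str, int], synced: dict[str, int]) -> list[tuple[str, str, int]]:
    """Compare the rohtext catalog against what is already imported.

    catalog: book id -> textVersion as published by rohtext
    synced:  book id -> textVersion already materialized
    Returns (op, book_id, version) tuples. Books that exist only in
    `synced` (deleted upstream) are deliberately left alone.
    """
    ops = []
    for book_id, version in sorted(catalog.items()):
        known = synced.get(book_id)
        if known is None:
            ops.append(("new", book_id, version))
        elif version > known:
            ops.append(("bump", book_id, version))
    return ops
```

Because the loop only walks the catalog side, an upstream deletion can never produce a queue message, which is exactly the non-goal described above.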

Phase 7 — admin dashboard + DLQ UI. On /management/rohtext/hub, a compact overview shows the current state (number of imported books, last successful poll time, number of DLQ messages, AI failure table per call type). A second panel at /management/rohtext/failures lists every DLQ message for inspection — book GUID, operation, JSON payload, error message, and a "Retry Now" button that pushes the message back onto the queue. The same panel lists AI failures (categorization failed, description failed, etc.) with the same retry path.

One night, two thousand books, a few lessons

The first real backfill ran Thursday evening: a simple PowerShell script that walks the rohtext catalog list and calls the manual import endpoint per GUID, plus a /loop watchdog that checks status every 20 minutes, verifies AppHost liveness, and restarts the driver in an emergency. The driver ran all night and by the following morning had produced 2,256 stories + 586 authors out of the 9,235-book catalog. That is just under a quarter of the library, at an average processing time of 15 seconds per book (dominated by gpt-5 latency, not the database).
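The numbers of that night are easy to sanity-check. This is a quick back-of-the-envelope calculation in Python; the projection for the remaining books assumes strictly sequential processing at the same 15-second average, which is an assumption and not a measured figure.

```python
CATALOG_SIZE = 9_235
IMPORTED = 2_256
SECONDS_PER_BOOK = 15

overnight_hours = IMPORTED * SECONDS_PER_BOOK / 3600
share = IMPORTED / CATALOG_SIZE
remaining = CATALOG_SIZE - IMPORTED
remaining_hours = remaining * SECONDS_PER_BOOK / 3600

print(f"{overnight_hours:.1f} h for the overnight run")
print(f"{share:.1%} of the catalog imported")
print(f"{remaining} books left, about {remaining_hours:.0f} h at the same rate")
```

That works out to about 9.4 hours for the overnight run, 24.4% of the catalog, and 6,979 books still to go.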

The stories are now committed into the repo as YAMLs: no data-migration script, just seed data that every environment loads on the next boot via the Init.Data worker. The rest of the catalog (roughly 7,000 books) will be pulled in over the next few days via the production Azure Functions importer.

A few lessons from that night:

  • gpt-5 has an availability rate above 90%, but not 100%. Around 200 books hit transient 400/500 responses on the summary or MinAge call. The RohtextAiFailure table caught all of them; the nightly retry timer handles the retry with exponential backoff (1h → 4h → 16h → 24h cap).
  • Slug lengths can explode. A book called "Einfache Erzählung von dem schrecklichen Absturze des Schrofenberges und der dadurch erfolgten Verwüstung bei Brannenburg im August 1851 zum Bessten der Verunglückten" produces a slug that blows past the Windows 260-character path limit. Worked around locally with git config core.longpaths true; capping the slug to ~80 characters in SeedDumper is a follow-up task.
  • A column on the RohtextSyncState table originally had an explicit nvarchar(max) column type, which SQLite (the EF provider used for unit tests) doesn't accept. During a test run we got "near 'max': syntax error." Quick fix: drop the explicit HasColumnType and let EF Core choose on its own (NVARCHAR(MAX) on SQL Server, TEXT on SQLite).
  • PollSchedule = "" for local development was an anti-pattern. Function discovery at boot time can't resolve an empty CRON pattern and disables the whole function. That cost two hours of driver runtime before I figured it out. The fix was trivial: remove the empty override from appsettings.Development.json.
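Two of the lessons above reduce to small functions: the retry schedule (1h, 4h, 16h, then capped at 24h) and the planned slug cap. Both are Python sketches with hypothetical names. The slug cap in particular is only one possible way to do the follow-up task; cutting at a hyphen boundary is my assumption, not a decision recorded in the post.

```python
from datetime import timedelta


def retry_delay(attempt: int, base_hours: int = 1, factor: int = 4,
                cap_hours: int = 24) -> timedelta:
    """Exponential backoff: attempt 0 -> 1h, 1 -> 4h, 2 -> 16h, then capped at 24h."""
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    return timedelta(hours=min(base_hours * factor ** attempt, cap_hours))


def cap_slug(slug: str, max_len: int = 80) -> str:
    """Shorten a slug to at most `max_len` characters without cutting a word in half."""
    if len(slug) <= max_len:
        return slug
    cut = slug[:max_len]
    if slug[max_len] == "-":
        # The cut already falls exactly on a word boundary.
        return cut.rstrip("-")
    # Otherwise fall back to the last hyphen inside the cut, if there is one.
    head, sep, _ = cut.rpartition("-")
    return (head if sep else cut).rstrip("-")
```

One caveat with any cap: two long titles sharing the same first 80 characters would collide, so a real implementation would probably need a uniqueness check or a short suffix.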

Visibility, covers, and avatars — the UI-layer fix

Testing on Friday morning: you click a rohtext author, see their library cards, click a story, and get "Story not found." Three layers had gaps at the same time:

  1. A visibility switch, turned off. As a precaution, the runtime feature flag RohtextStoriesVisible was set to "off" for the first seed drop — StoryRepository.GetBySlugAsync was filtering rohtext imports out of the catalog before they reached the detail page. The default for fresh tenants is now on, and the live toggle lives at /developer/feature-flags. (Operator workflow: click once, done.)
  2. Search and detail were going through two different repositories, only one of which knew about the visibility filter. Search showed the rohtext book, clicking the search result fell into a 404 — a real asymmetry, fixed with a second repository update.
  3. Covers and author avatars were blank for rohtext imports — the materializer logic sets CoverUrl="", because the specified "title-on-color" compositor was never built, and public-domain classics naturally have no author avatars. A quick fallback layer in the projection layer maps empty values onto deterministic DiceBear SVGs: https://api.dicebear.com/9.x/thumbs/svg?seed=author-{slug} for avatars, …/shapes/svg?seed=story-{slug} for covers. Same SVG for the same slug, different shapes for different slugs — visually consistent, without us having to upload 2,000 covers as blobs.

A proper title-on-color cover compositor (the story title rendered as text on a background color chosen from a hash of the slug) is a follow-up task; the DiceBear variant is the clean immediate fix that works without a data migration.
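The compositor is not built yet, but the core idea (slug hash to background color, title as text) is small enough to sketch. Everything in this Python example is hypothetical: the hash function, the color formula, the dimensions, and the function names are illustration only, not a design decision.

```python
import hashlib
from html import escape


def slug_hue(slug: str) -> int:
    """Derive a stable hue (0..359) from the slug.

    hashlib is used instead of Python's built-in hash(), which is
    salted per process and would give different colors on each run.
    """
    digest = hashlib.sha256(slug.encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "big") % 360


def title_cover_svg(title: str, slug: str, width: int = 400, height: int = 600) -> str:
    """Render the title centered on a slug-derived background color."""
    hue = slug_hue(slug)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        f'<rect width="100%" height="100%" fill="hsl({hue}, 45%, 35%)"/>'
        f'<text x="50%" y="50%" fill="#fff" font-size="28" '
        f'text-anchor="middle">{escape(title)}</text>'
        "</svg>"
    )
```

Long titles would still need line wrapping, which this sketch leaves out.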

Sentry observability for the importer pipeline

Code on a dark monitor with line numbers, a symbol for the structured Sentry tagging that now makes every import step visible. Photo: Markus Spiske on Unsplash

A dedicated Sentry project collection (outastory-rohtext-importer) got its DSN pushed this week, had its alert rules configured for GitHub issue creation and MS Teams, and got its boot-beacon wiring built into the Functions host. Three small code layers make the importer the best-observed pipeline in the stack:

  • Per-function scope tags: rohtext.function=PollCatalogTimer / ProcessStoryQueueFunction / RetryAiFailuresTimer, set at the start of every invocation. Filtering the Sentry dashboard by pipeline step becomes a single tag query.
  • Per-message scope tags: rohtext.bookId, rohtext.op, rohtext.textVersion, set after successfully deserializing the Service Bus message. Operator triage down to "which book triggered this" becomes a filter query.
  • Breadcrumbs at AI-call seams — the StoryMaterializer drops a breadcrumb before each of the four gpt-5 calls (rohtext.ai-call: Category | MinAge | Summary | AuthorBio), so the Sentry event timeline panel shows which AI call failed — not just "something in the AI pass."
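The three layers can be illustrated without the Sentry SDK by building the tag dictionaries and the breadcrumb trail as plain data. This is a Python sketch, not the importer's code. The JSON field names of the queue message (bookId, op, textVersion) are assumed to mirror the tag names, and a plain list stands in for Sentry's breadcrumb buffer.

```python
import json

AI_CALLS = ("Category", "MinAge", "Summary", "AuthorBio")


def function_tags(function_name: str) -> dict[str, str]:
    """Tag set at the start of every invocation."""
    return {"rohtext.function": function_name}


def message_tags(raw_message: str) -> dict[str, str]:
    """Tags set after the Service Bus message deserialized successfully.

    The field names bookId, op and textVersion are assumptions.
    """
    msg = json.loads(raw_message)
    return {
        "rohtext.bookId": str(msg["bookId"]),
        "rohtext.op": str(msg["op"]),
        "rohtext.textVersion": str(msg["textVersion"]),
    }


def run_ai_pass(calls: dict, breadcrumbs: list[str]) -> dict:
    """Run the four AI calls in order, dropping a breadcrumb before each one."""
    results = {}
    for name in AI_CALLS:
        breadcrumbs.append(f"rohtext.ai-call: {name}")
        results[name] = calls[name]()  # an exception here leaves `name` as the last crumb
    return results
```

The point of writing the breadcrumb before the call is that when a call throws, the last entry in the trail names the call that failed.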

The Sentry project provisioning itself ran this week through an existing Provision-SentryProjects.ps1 tool, which picked up the new importer slug in the project matrix and idempotently sets up the project + the DSN + the alert rules on the next run. Bonus: a new -SkipKeyVault switch in the Set-SentryKeys.ps1 tool for environments where Key Vault isn't deployed yet — the GH secrets land on their own, and the next azd provision run seeds the vault.

Profile completion — a bug that never showed up in Sentry

A hand drawing a thread between app wireframes on a white wall, a symbol for the diagnostics that made an invisible bug measurable again. Photo: Alvaro Reyes on Unsplash

In the middle of the week, a bug report came in from Eiko Wachholz, a test user: he could sign in, landed on the profile-completion page (/account/complete), filled it out, clicked "Save," and saw "Your profile couldn't be saved right now. Please try again." Over and over. Sentry was completely silent — no trace, no event, no telemetry.

The cause was almost embarrassing: a bare catch { } block in the Razor page. Every exception was turned into the generic banner without logging; OutaStoryApiClient.CompleteProfileAsync returned null on every non-2xx status without logging anything anywhere. The entire error path was invisible to the Sentry dashboards.

Three layers now have structured logging at every failure path:

  • Razor page (Complete.razor.HandleSubmitAsync): every distinct error shape (exception, null response, IsComplete=false) gets its own structured LogWarning/LogError call with email context and field lengths. Sentry groups the three shapes as separate issues — three distinct causes.
  • API client (OutaStoryApiClient.CompleteProfileAsync): on non-2xx status, status code, a body excerpt (up to 512 characters), and the email are logged; network exceptions get their own log form.
  • API controller (ProfileController.Complete): Sentry scope tags (profile.action=complete, profile.sub=auth0|…), an explicit try/catch around the EF save path that logs DbUpdateException cases with the inner error message + input lengths and then rethrows, plus a catch-all clause with the same logging logic (but excluding OperationCanceledException, because client-disconnect noise doesn't deserve a Sentry event).
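The core of the page-level change is that each failure shape gets its own log call instead of disappearing in a bare catch. This is a Python sketch of that pattern using the standard logging module. The logger name, the message texts, and the string return values are hypothetical; the real code is a Razor page and an API client in .NET.

```python
import logging

log = logging.getLogger("profile.complete")  # hypothetical logger name
BODY_EXCERPT_LIMIT = 512


def body_excerpt(body: str) -> str:
    """Keep at most 512 characters of a response body for the log line."""
    return body[:BODY_EXCERPT_LIMIT]


def log_non_success(status: int, body: str, email: str) -> None:
    """Client-side log line for a non-2xx response."""
    log.warning("CompleteProfile returned %s for %s: %s", status, email, body_excerpt(body))


def handle_submit(call_api, email: str) -> str:
    """Call the API and log every distinct failure shape separately.

    Returns a short label for the shape so callers (and tests) can tell them apart.
    """
    try:
        result = call_api()
    except Exception:
        log.exception("profile completion threw for %s", email)
        return "exception"
    if result is None:
        log.warning("profile completion returned no response for %s", email)
        return "null-response"
    if not result.get("isComplete"):
        log.warning("profile completion returned isComplete=false for %s", email)
        return "incomplete"
    return "ok"
```

Distinct message templates matter here because an error tracker typically groups events by message, so three templates show up as three separate issues.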

Plus 14 new regression tests pinning each of the three layers to the same log format — if someone reintroduces the bare catch in the future, the build fails red instead of silently hiding a bug.

Eiko's actual bug still hasn't been diagnosed (the banner is still a generic "please try again" — we don't know what exactly is failing). But his next attempt after the deploy will produce a Sentry event that shows the root cause — and then the fix is a second, smaller pull request.

Auth hardening — the follow-up wave to the "logged-in-but-401" bug

Last week's PR #621 had built a proactive access-token refresh-and-retry path into WebBearerTokenHandler to fix the annoying "logged in but the API silently 401s after a long idle period" effect. Three separate test sessions this week confirmed: the refresh path works on new circuits, but on existing circuits (i.e., sessions that were already open before the deploy) it was still occasionally showing 401s.

The cause was in the cache-hydration logic. The AccessTokenCache entry gets populated from the HTTP request pipeline at sign-in time. When an existing circuit reopens its SignalR connection after a pod restart, there's no HTTP request — the circuit opens directly via the /_blazor/negotiate endpoint. Along that code path, the AccessTokenCache was simply never (re-)populated; the handler had nothing to read and immediately fell back to the 401 path without even attempting a refresh.

The fix is a small piece of middleware that hooks specifically into /_blazor/negotiate and rehydrates the cache from the cookie before the circuit starts. A few regression tests pin down the behavior. This gap was also the trigger behind five never-before-seen WEB-401 bug clusters on Sentry, all of which have dropped off the list with the current deploy wave.

Additionally: a subtler ObjectDisposedException race in ReadingProgressService. The service runs several background drains, each of which ends with a _drainLock.Release() call. If the Blazor circuit calls Dispose() while a drain is in progress, SemaphoreSlim.Release() throws an ObjectDisposedException, which surfaces as an UnobservedTaskException on the finalizer thread and lands in Sentry as an unexplained "TaskException" cluster. Fixed with two try / catch (ObjectDisposedException) blocks in the respective finally clauses; releasing the lock is a no-op anyway once the circuit is already gone.

Smaller refinements shaping the picture this week

  • Ad-free on every operational surface (PRs #646, #648, #651): the ad layer is now switched off on /coming-soon, /maintenance, /Error, /admin, /management, /moderation, and /developer — it made no sense there, visually or performance-wise. Plus: Monetag (the overlay ad network) is set to "off" by default, because it doesn't fit the impression we're building in these early test weeks.
  • MAUI sign-in hang fixed (PR #641): a cross-platform sign-in hang in the MAUI variant that looked different on Windows/macOS/iOS but shared a common root cause (a path synchronously waiting on a Task.Result). Plus extensive Sentry diagnostic tagging along the auth path.
  • NotificationProcessor + SignalR REST permission (PR #640): the NotificationProcessor function couldn't publish its in-app live pushes over the shared SignalR service, because it didn't have the correct service role. An az role assignment create step in the release pipeline resolves this.
  • Trusted Signing, finally green (PR #634): after a Microsoft ticket that had been open for three weeks, the cause turned out to be a wrong role GUID in one of our Bicep templates plus a dlib version regression in the Trusted Signing action package. Both are fixed, and the Windows MSIX installer out of CI signs correctly again. Plus a tolerate-403 fallback path (PR #628) that publishes an unsigned bundle if signing ever trips up again: not a perfect solution, but better than a blocked release.
  • Init.Data Sentry phase observability (PR #632): each of the ~15 seed phases (Categories, Authors, Stories, Subscriptions, Promotions, …) now has a Sentry scope tag and a phase heartbeat that warns on unusually long run times. Practically useful for the rohtext seed wave: the phase materializing rohtext stories is visible as its own row in the Init.Data dashboard.
  • CI workflow cleanup (PR #664): the four MAUI build workflows we no longer need since the platform pivot have been deleted. The central ci.yml workflow is now the only build and test path — faster, more predictable, ~$115/month saved in macOS runner minutes.
  • NuGet and Aspire updates (PRs #662, #663): the usual weekly routine. The Aspire update brings a few EF Core 10.0.x hotfixes along with it; the NuGet updates are all minor versions with no breaking changes.

Numbers as they stand

  • Pull requests: 52 across the 7 days since the last blog post (Sat–Fri).
  • Main themes: rohtext import (15+ PRs through the seven phases plus follow-up fixes plus Sentry provisioning plus visibility-flag tweaks), auth hardening (5 PRs), ad-free surface selection (3 PRs), Sentry quiet (3 PRs), profile completion (1 PR with 14 new tests).
  • Code status: around 3,420 passing unit tests (up from ~3,380 a week ago). Coverage gate still in report mode with 35 of 36 projects green.
  • Catalog status: 2,256 stories + 586 authors committed as seed YAMLs (in the repo), with the production importer standing by for the remaining ~7,000 books in the catalog.
  • Migrations: three new production migrations (AddRohtextImportColumns for the 5 new story columns + 2 new author columns, AddRohtextSyncState for the singleton state table, and AddRohtextAiFailure for the failure table).

What's next

  • Pull in the rest of the rohtext library. The production Azure Functions importer is deployed, but with the master kill switch set to "off" — we want to trigger the backfill under operator control once the seed curation is complete.
  • Title-on-color cover compositor. A small SVG endpoint that renders the story title on a background color hashed from the slug. Makes the DiceBear shapes replaceable with a visual style that fits OutaStory.
  • Track down Eiko's bug root cause once the next Sentry event lands, and fix it with a second, small PR.
  • App.Shared Razor-page tests — the last red coverage entry, still spread over several weekly waves.
  • Meta app review for Instagram — the instagram_content_publish permission is still open.

If you come across a rohtext story with a noticeable glitch while browsing (wrong category, an ill-fitting MinAge rating, an awkward short description), let me know. The AI classifiers are good, but not perfect, and a few concrete examples per weekly triage wave would sharpen the prompts considerably.

