AtlasDesk
Production ITSM platform — single-tenant on-prem and multi-tenant SaaS, one codebase.
AtlasDesk is an IT service-management platform — tickets, assets, reservations, notifications, and an AI resolution copilot — running today for an enterprise customer with 4,000+ users on a single-tenant on-premises deployment, and shipping next as a multi-tenant SaaS at atlasdesk.app.
I built it solo. The interesting part isn’t the feature list. It’s that one codebase has to satisfy two very different deployments — an on-prem SQL Server 2022 box behind a corporate firewall with SAML/Entra auth, and a multi-tenant cloud stack on PostgreSQL with password + OTP. Every architectural decision had to survive both worlds.
This page walks through the system. The diagrams animate on their own — that’s a stylistic choice, not a gimmick. I find static architecture diagrams lie about systems being static when they’re actually full of in-flight work.
The whole picture
Seven services, three lanes. The edge is a Next.js 15 frontend. The services layer is six NestJS microservices, each with its own database schema. The data layer forks: SaaS deployments use PostgreSQL, the on-prem deployment uses SQL Server 2022. The same Dockerfile builds both; DB_PROVIDER is a build arg.
A few things in that diagram are worth calling out, because they’re load-bearing decisions I learned the hard way.
Permissions are admin-editable at runtime.
No @Roles('operator') decorators. Authorization flows through a no-code matrix that lives in the database. When the customer wants to give a new role permission to merge tickets, an admin clicks a checkbox: no redeploy, no PR. Every permission check funnels into permission.service.hasPermission(...). It took longer to build than hardcoded decorators, but permission changes no longer cost a deployment.
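A minimal sketch of that single funnel, assuming a matrix of (role, action) pairs. The row shape, the action names, and the in-memory Map standing in for the database table are all hypothetical; only the hasPermission name comes from the text above.

```typescript
// Hypothetical row shape: the real schema is not shown in this write-up.
type PermissionRow = { roleId: string; action: string; allowed: boolean };

class PermissionService {
  // In the real system the matrix lives in the database; a Map stands in for it here.
  private matrix = new Map<string, boolean>();

  private key(roleId: string, action: string): string {
    return `${roleId}:${action}`;
  }

  // What an admin toggling a checkbox would end up calling.
  setPermission(row: PermissionRow): void {
    this.matrix.set(this.key(row.roleId, row.action), row.allowed);
  }

  // Every check goes through here. Unknown (role, action) pairs are denied.
  hasPermission(roleIds: string[], action: string): boolean {
    return roleIds.some((r) => this.matrix.get(this.key(r, action)) === true);
  }
}

const permissions = new PermissionService();
permissions.setPermission({ roleId: "operator", action: "ticket.merge", allowed: true });
```

Because the answer is looked up at call time rather than baked into a decorator, flipping a row changes behavior on the next request.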
asset-svc is the only writer of Asset.status.
Reservation-svc never touches the asset table directly — it makes an HTTP call to asset-svc, which serializes status transitions with SELECT … FOR UPDATE. Before this rule existed, a reservation flow and a manual-assignment flow could both flip an asset to reserved simultaneously. Two reservations for the same laptop, one annoyed user.
notification-svc is the only thing that touches email.
SMTP, IMAP, Microsoft Graph polling, SES outbound — all of it. Ticket-svc never reads a mailbox; it receives parsed payloads. One transport, one audit trail, one retry queue.
The pgvector branch is gated twice.
PostgreSQL-only features (the AI copilot’s vector store) need both a schema-level guard (the migration runner refuses to run on SQL Server) and a runtime feature flag (COPILOT_ENABLED). Schema-only guarding is a trap — the runtime hooks still fire on the on-prem deploy and try to query a table that doesn’t exist, spamming error logs forever and waking up on-call. So: defaults off, on-prem never sets it, SaaS sets it to true.
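The two gates can be sketched as two small functions. This is an illustration under assumptions: the DeployEnv shape, the function names, and the provider strings "postgresql" and "sqlserver" are mine, while DB_PROVIDER and COPILOT_ENABLED are the names used above.

```typescript
// Hypothetical view of the environment; only the two variable names come from the text.
interface DeployEnv {
  DB_PROVIDER?: string;
  COPILOT_ENABLED?: string;
}

// Gate 1 (schema): the migration runner refuses pgvector migrations off PostgreSQL.
function assertVectorMigrationAllowed(env: DeployEnv): void {
  if (env.DB_PROVIDER !== "postgresql") {
    throw new Error(`pgvector migration refused: DB_PROVIDER is ${env.DB_PROVIDER ?? "unset"}`);
  }
}

// Gate 2 (runtime): the flag defaults to off, so an unset variable means disabled.
function isCopilotEnabled(env: DeployEnv): boolean {
  return env.COPILOT_ENABLED === "true";
}

const onPrem: DeployEnv = { DB_PROVIDER: "sqlserver" };
const saas: DeployEnv = { DB_PROVIDER: "postgresql", COPILOT_ENABLED: "true" };
```

With only the first gate, the on-prem deploy would skip the table but still run the hooks that query it. The second gate is what keeps those hooks silent.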
Identity service · nightly Entra sync
The on-prem customer authenticates through Microsoft Entra ID via SAML. JIT provisioning handles first-time logins, but I also need a nightly reconciliation pass — for departures, role changes, group membership updates that happen out-of-band.
A cron fires at 02:00. identity-svc calls Microsoft Graph with a delta query, passing the token from the previous run so that only changes come back. The service diffs the result against the local User table and fans out into three buckets: rows to add, rows to update, rows to deactivate. Every change writes to an audit log.
Edge Case: Shadow Users
If an inbound email arrives from someone who isn’t in Entra (a vendor, a parent, a contractor), the inbound pipeline creates a shadow User with syncSource = EXTERNAL_EMAIL_SHADOW. The Entra sync must not overwrite shadow users — they have no Entra identity to reconcile against. So the upsert is conditional on syncSource = ENTRA. Admins can later link a shadow user to an Entra user manually via UserAlternateEmail, retiring the shadow.
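A minimal sketch of the three-bucket diff with the shadow-user guard, assuming simplified shapes. LocalUser, EntraChange, the removed flag, and planSync are hypothetical names; the real Graph delta payload and User table are richer than this.

```typescript
type SyncSource = "ENTRA" | "EXTERNAL_EMAIL_SHADOW";

// Hypothetical, trimmed-down shapes for illustration.
interface LocalUser {
  entraId: string | null;
  email: string;
  displayName: string;
  active: boolean;
  syncSource: SyncSource;
}
interface EntraChange {
  id: string;
  email: string;
  displayName: string;
  removed: boolean; // simplified stand-in for a deletion marker in the delta result
}
interface SyncPlan {
  toAdd: EntraChange[];
  toUpdate: EntraChange[];
  toDeactivate: string[];
}

function planSync(changes: EntraChange[], localUsers: LocalUser[]): SyncPlan {
  // Index only ENTRA-sourced rows, so a shadow user can never be matched or overwritten.
  const byEntraId = new Map<string, LocalUser>();
  for (const u of localUsers) {
    if (u.syncSource === "ENTRA" && u.entraId !== null) byEntraId.set(u.entraId, u);
  }

  const plan: SyncPlan = { toAdd: [], toUpdate: [], toDeactivate: [] };
  for (const c of changes) {
    const local = byEntraId.get(c.id);
    if (c.removed) {
      if (local && local.active) plan.toDeactivate.push(c.id);
    } else if (!local) {
      plan.toAdd.push(c);
    } else if (local.email !== c.email || local.displayName !== c.displayName || !local.active) {
      plan.toUpdate.push(c);
    }
  }
  return plan;
}
```

In this sketch a new Entra user whose email matches an existing shadow user lands in toAdd rather than merging automatically, which leaves the linking step to an admin.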
Ticket service · the lifecycle
A ticket starts as either an email or a web-form submit. If it’s email, notification-svc parses it and hands a structured payload to ticket-svc. From there it’s six stages: submit, inbound, create, assign, resolve, notify. The “owning service” row underneath shows which microservice handles each stage — and you can see that ticket-svc owns the middle four, with notification-svc bracketing the boundary on either side.
The branch you can see going up to PGVECTOR is the AI copilot’s embed pipeline. After every CREATE (and after most public tech comments), the ticket text gets re-embedded into pgvector. The arrow is highlighted because it’s the most architecturally interesting part of the lifecycle: it’s fire-and-forget, never blocks the write path, and is gated such that it’s a complete no-op on the on-prem (SQL Server) deploy. More on that below.
Escalation is role-based, not user-based. When a ticket needs to escalate from a tier-1 tech to a system admin, it doesn’t pick a specific person — it targets a role. Anyone in that role can pick it up. De-escalation, on the other hand, returns to the specific originating tech, not the general pool, so the original owner sees their work return. This required threading assignedToUserId and escalationType through every transition, and is one of those product details that didn’t fit neatly into either “stateful” or “stateless” — it’s both at different times.
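The asymmetry can be sketched as a small routing function. This is an assumption-heavy illustration: assignedToUserId and the idea of an escalation type come from the text, but originatingTechId, the two type values, and the fallback when nobody is recorded are hypothetical.

```typescript
// Hypothetical values; the real escalationType enum is not shown in this write-up.
type EscalationType = "ESCALATE" | "DEESCALATE";

type Target = { kind: "role"; roleId: string } | { kind: "user"; userId: string };

interface TicketRouting {
  assignedToUserId: string | null;
  originatingTechId: string | null; // hypothetical field: who to hand the ticket back to
}

function route(
  ticket: TicketRouting,
  type: EscalationType,
  roleId: string,
): { target: Target; next: TicketRouting } {
  if (type === "ESCALATE") {
    // Escalation targets a role pool and remembers who let go of the ticket.
    return {
      target: { kind: "role", roleId },
      next: {
        assignedToUserId: null,
        originatingTechId: ticket.assignedToUserId ?? ticket.originatingTechId,
      },
    };
  }
  const owner = ticket.originatingTechId;
  if (owner === null) {
    // Assumption: with nobody recorded, fall back to the role pool.
    return { target: { kind: "role", roleId }, next: { assignedToUserId: null, originatingTechId: null } };
  }
  // De-escalation returns to the specific originating tech.
  return { target: { kind: "user", userId: owner }, next: { assignedToUserId: owner, originatingTechId: null } };
}
```

The routing is stateless on the way up (any member of the role will do) and stateful on the way down (one specific person), which is why the originating tech has to travel with the ticket.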
Asset service · the single-writer rule
Reservations are the highest-concurrency operation in the system. A class of forty students all clicking “reserve a MacBook” at the same moment is a thing that actually happens.
The flow goes through reservation-svc — request, review, approve. The interesting boundary is the third arrow: HTTP TRANSITION. Reservation-svc never writes to the asset table directly. It makes an HTTP call to asset-svc with the asset id and the desired transition. Asset-svc takes a row-level lock with SELECT … FOR UPDATE, validates the transition is legal (no double-booking), and only then flips the status. If the lock contends, the second caller waits, then sees the asset is no longer available, and gets a clean error.
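A minimal in-memory sketch of the serialized transition. It is not the service code: a per-asset promise chain stands in for the database row lock so the example runs without a database, and the status names and the table of legal transitions are hypothetical.

```typescript
// Hypothetical statuses and transition table, for illustration only.
type AssetStatus = "available" | "reserved" | "assigned" | "retired";

const LEGAL: Record<AssetStatus, AssetStatus[]> = {
  available: ["reserved", "assigned", "retired"],
  reserved: ["assigned", "available"],
  assigned: ["available", "retired"],
  retired: [],
};

class AssetStore {
  private status = new Map<string, AssetStatus>();
  // One promise chain per asset: an in-memory stand-in for the row-level lock.
  private locks = new Map<string, Promise<unknown>>();

  constructor(seed: Record<string, AssetStatus>) {
    for (const [id, s] of Object.entries(seed)) this.status.set(id, s);
  }

  // Runs fn only after every earlier caller for the same asset has finished.
  private withLock<T>(assetId: string, fn: () => Promise<T>): Promise<T> {
    const prev = this.locks.get(assetId) ?? Promise.resolve();
    const run = prev.then(fn);
    // Store a never-rejecting tail so one failure does not poison the chain.
    this.locks.set(assetId, run.catch(() => undefined));
    return run;
  }

  transition(assetId: string, to: AssetStatus): Promise<AssetStatus> {
    return this.withLock(assetId, async () => {
      const from = this.status.get(assetId);
      if (from === undefined) throw new Error(`unknown asset ${assetId}`);
      // Yield once, to mimic the gap between read and write where a race would occur.
      await Promise.resolve();
      if (!LEGAL[from].includes(to)) throw new Error(`illegal transition ${from} -> ${to}`);
      this.status.set(assetId, to);
      return to;
    });
  }
}
```

Two concurrent calls to transition("mac-1", "reserved") resolve in order: the first succeeds, and the second waits, reads reserved, and is rejected with a clean error instead of double-booking.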
This pattern has a cost: every reservation flow is now an extra network hop and a serialization point. But it removes an entire class of race conditions, and we don’t have to convince every future contributor not to write to Asset.status from a new place. The schema can’t be the source of truth for “who’s allowed to write here” — only the service boundary can.
Notification service · single email transport
“One transport, one audit trail, one retry queue.”
Notification-svc isn’t a flow, it’s a guarantee: nothing else in the codebase opens a mail connection. Ticket-svc doesn’t poll email. Asset-svc doesn’t send confirmations directly. They publish events; notification-svc is the listener.
This single rule has saved more debugging time than any other architectural decision. When an email doesn’t go out, there’s exactly one place to look. When the customer’s Graph credentials rotate, there’s exactly one place to update them. When a retry queue is needed for outbound mail, it has one owner.
Resolution Copilot · the RAG pipeline
This is the part I most enjoyed building, and it’s also the part I’d most expect to be questioned about in an interview, so here’s the full picture.
When a tech opens a ticket, the system has already drafted a reply for them — citing past tickets and KB articles that solved similar issues. The retrieval is over the live ticket corpus, not a pre-built index, and the corpus stays current without a nightly batch job.
Embedding Model
Xenova/bge-small-en-v1.5
Running locally via transformers.js (ONNX, CPU). 384 dimensions. The first call is ~1–3s while the model loads; subsequent calls are sub-50ms. No external API for embeddings — important both for cost and for avoiding sending ticket content to a third party.
Vector Store
pgvector inside the ticket-svc Postgres schema. Two tables, TicketEmbedding and KbArticleEmbedding, both with ivfflat indexes. Cosine distance via <=>.
Retrieval
Top-5 tickets and top-3 KB articles fused via Reciprocal Rank Fusion (k=60), trimmed to top 3.
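Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over every ranked list it appears in, with rank starting at 1. A minimal sketch, assuming each retriever returns ids in best-first order; the Ranked shape and function name are hypothetical.

```typescript
// Hypothetical result shape: an id plus which corpus it came from.
interface Ranked {
  id: string;
  kind: "ticket" | "kb";
}

type Scored = Ranked & { score: number };

// score(d) = sum over lists of 1 / (k + rank), with rank 1 for the best hit.
function reciprocalRankFusion(lists: Ranked[][], k = 60, limit = 3): Scored[] {
  const scores = new Map<string, Scored>();
  for (const list of lists) {
    list.forEach((item, index) => {
      const key = `${item.kind}:${item.id}`;
      const entry = scores.get(key) ?? { ...item, score: 0 };
      entry.score += 1 / (k + index + 1);
      scores.set(key, entry);
    });
  }
  // Highest fused score first, then trim.
  return Array.from(scores.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```

If the ticket list and the KB list never share a document, each item scores from a single list, so the fusion effectively interleaves the two lists by rank. The large k flattens the gap between neighboring ranks.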
Synthesis
Gemini 2.5 Flash. The system prompt tells it to act as a senior tech, draft a concise reply using only context, and cite sources. Importantly: only titles and handles go to Gemini — never descriptions, bodies, or PII.
Caching
In-memory, keyed by ticketId + mutationKey, where mutationKey is a hash of createdAt | latestCommentAt | status | md5(title + description). This means out-of-band edits to a ticket bust the cache too, even when our own re-embed didn’t fire. TTL is 1 hour, max 500 entries.
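A minimal sketch of the key and the cache, under assumptions. The field list in mutationKey follows the description above, but the outer hash algorithm, the ISO date formatting, the class name, and the oldest-first eviction policy are choices made for this example.

```typescript
import { createHash } from "node:crypto";

// Hypothetical snapshot of the fields that feed the key.
interface TicketSnapshot {
  id: string;
  createdAt: Date;
  latestCommentAt: Date | null;
  status: string;
  title: string;
  description: string;
}

function mutationKey(t: TicketSnapshot): string {
  const content = createHash("md5").update(t.title + t.description).digest("hex");
  const raw = [
    t.createdAt.toISOString(),
    t.latestCommentAt ? t.latestCommentAt.toISOString() : "",
    t.status,
    content,
  ].join("|");
  // Assumption: md5 for the outer hash too; the text only says "a hash".
  return createHash("md5").update(raw).digest("hex");
}

class DraftCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly maxEntries: number;

  constructor(ttlMs = 60 * 60 * 1000, maxEntries = 500) {
    this.ttlMs = ttlMs;
    this.maxEntries = maxEntries;
  }

  private keyFor(t: TicketSnapshot): string {
    return `${t.id}:${mutationKey(t)}`;
  }

  get(t: TicketSnapshot, now = Date.now()): V | undefined {
    const key = this.keyFor(t);
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= now) {
      this.entries.delete(key);
      return undefined;
    }
    return hit.value;
  }

  set(t: TicketSnapshot, value: V, now = Date.now()): void {
    const key = this.keyFor(t);
    this.entries.delete(key); // re-insert so this key becomes the newest
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
    // Map iterates in insertion order, so the first key is the oldest.
    while (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest === undefined) break;
      this.entries.delete(oldest);
    }
  }
}
```

Nothing is ever invalidated explicitly: a ticket whose status, comments, or text changed produces a different key, so the stale draft is simply never looked up again and ages out.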
The part I’m proudest of is the self-update mechanism. Three triggers re-embed a ticket the moment its content meaningfully changes:
Ticket creation
Embedded immediately after the row commits.
Edited title or description
Re-embedded only if those specific fields changed. Status changes, assignment changes, and other edits don’t move retrieval quality, so we don’t pay for them.
Public tech comment
Re-embedded so resolutions land in the corpus. Internal notes are filtered out (they’re discussion, not knowledge), and so are the requester’s own replies (we want techs’ fixes, not users’ confusion).
Re-embeds are fire-and-forget. They never block the write path. They no-op completely on the on-prem (SQL Server) deploy because the feature flag is unset and the service constructor short-circuits. Whenever the embedding model changes, a backfill script walks every ticket whose stored modelVersion lags the current one and re-embeds in batches.
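The trigger rules and the fire-and-forget call can be sketched together. The event shapes, class name, and injected embed function are hypothetical; the rules themselves mirror the three triggers and the flag short-circuit described above.

```typescript
// Hypothetical event shapes for the three triggers.
type TicketEvent =
  | { type: "created" }
  | { type: "edited"; changedFields: string[] }
  | { type: "comment"; isInternal: boolean; authorIsRequester: boolean };

function shouldReembed(e: TicketEvent): boolean {
  switch (e.type) {
    case "created":
      return true;
    case "edited":
      // Only title or description edits change what retrieval sees.
      return e.changedFields.some((f) => f === "title" || f === "description");
    case "comment":
      // Public tech comments only: no internal notes, no requester replies.
      return !e.isInternal && !e.authorIsRequester;
  }
}

class ReembedService {
  private readonly enabled: boolean;
  private readonly embed: (ticketId: string) => Promise<void>;

  constructor(enabled: boolean, embed: (ticketId: string) => Promise<void>) {
    // The flag is read once; a disabled service never touches the vector store.
    this.enabled = enabled;
    this.embed = embed;
  }

  // Returns immediately. Failures are logged and never reach the write path.
  onTicketEvent(ticketId: string, e: TicketEvent): void {
    if (!this.enabled || !shouldReembed(e)) return;
    void Promise.resolve()
      .then(() => this.embed(ticketId))
      .catch((err) => console.error("re-embed failed", ticketId, err));
  }
}
```

The caller never awaits anything, so a slow or failing embed cannot delay the ticket write. The trade-off is that a failed embed is only visible in logs until the next qualifying change or a backfill.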
Engineering rules I committed to
A few discipline rules that don’t fit neatly into the per-service sections, but show up everywhere:
01. No Prisma migrations with data
Every schema change is a hand-rolled SQL file run through a TypeScript migration runner that asserts row counts before and after. I lost a database to prisma migrate dev exactly once. That was enough.
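A minimal sketch of the row-count assertion, assuming a tiny database interface. Db, Migration, expectedDelta, and runMigration are hypothetical names; the actual runner is not shown in this write-up.

```typescript
// Hypothetical minimal database interface for the sketch.
interface Db {
  countRows(table: string): Promise<number>;
  execute(sql: string): Promise<void>;
}

interface Migration {
  name: string;
  sql: string;
  // Table name -> expected change in row count (0 means rows must be preserved).
  expectedDelta: Record<string, number>;
}

async function runMigration(db: Db, m: Migration): Promise<void> {
  // Snapshot counts before touching anything.
  const before = new Map<string, number>();
  for (const table of Object.keys(m.expectedDelta)) {
    before.set(table, await db.countRows(table));
  }

  await db.execute(m.sql);

  // Compare after, and fail loudly on any mismatch.
  for (const [table, delta] of Object.entries(m.expectedDelta)) {
    const after = await db.countRows(table);
    const expected = (before.get(table) ?? 0) + delta;
    if (after !== expected) {
      throw new Error(`${m.name}: ${table} has ${after} rows, expected ${expected}`);
    }
  }
}
```

A real runner would presumably wrap the execute and the checks in one transaction so a failed assertion rolls the change back. That part is left out here.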
02. No SQL Server JSON / Prisma enums
Neither works cleanly on SQL Server through Prisma. JSON is stored as NVARCHAR(MAX) with manual parse/stringify. Enums are plain strings. The PostgreSQL deploy gets nothing extra; on-prem gets nothing missing.
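In application code this rule tends to look like a string union with a parser, plus a pair of JSON helpers. A minimal sketch: the status values and helper names are hypothetical, and falling back on malformed JSON is one possible policy, not necessarily the one used here.

```typescript
// Hypothetical status values; the column itself is a plain string in both databases.
const TICKET_STATUSES = ["open", "in_progress", "resolved", "closed"] as const;
type TicketStatus = (typeof TICKET_STATUSES)[number];

// Validate at the boundary, since the database no longer enforces the enum.
function parseTicketStatus(raw: string): TicketStatus {
  if ((TICKET_STATUSES as readonly string[]).includes(raw)) return raw as TicketStatus;
  throw new Error(`unknown ticket status: ${raw}`);
}

// JSON lives in a text column, so reads and writes go through explicit helpers.
function readJsonColumn<T>(raw: string | null, fallback: T): T {
  if (raw === null || raw === "") return fallback;
  try {
    return JSON.parse(raw) as T;
  } catch {
    // Assumption: malformed text yields the fallback instead of throwing.
    return fallback;
  }
}

function writeJsonColumn(value: unknown): string {
  return JSON.stringify(value);
}
```

The type safety that a database enum or JSON column would provide moves into these helpers, which behave identically on PostgreSQL and SQL Server.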
03. Tests derive from requirements
Each service has a requirements/<svc>.md file in Given/When/Then format. Integration tests cite a section. If the code disagrees with the requirement, the code is wrong.
04. One repo, two branches
main for SaaS, wooster-main for on-prem. Features merge to main first, then get cherry-picked to the on-prem branch. Beyond those cherry-picks, the on-prem branch carries only environment-specific config deltas.
What’s running where
The on-prem deploy is single-tenant, behind the customer’s firewall, talking to their existing Active Directory. SAML, no passwords. SQL Server 2022. 4,000+ users in production, ~14 active techs, ~7,000 assets in inventory.
The SaaS demo tier is in flight. Multi-tenant Postgres on Railway, Next.js frontend on Vercel, password + OTP auth, Microsoft Entra still as an SSO option for orgs that want it.
The codebase is the same.
The deployment is different.
That was the whole point.