5 ADRs that show how we think. Every significant decision - documented before implementation, with trade-offs, alternatives rejected, and consequences accepted.
Selected from 48 ADRs in the codebase
Architecture Decision Records are the backbone of how this platform was built. When AI is your primary development tool, tribal knowledge doesn't work - you can't say "ask Steve, he built that part." Every decision needs to be written down: the context, the constraints, the options considered, what we chose, and why.
48 ADRs in 6 months. That's more architectural documentation than most Series B companies have. These aren't templates filled in for compliance - they're working documents that the AI reads before touching the codebase. When a new session starts, the first thing that happens is reading the relevant ADRs. That's how context survives across sessions.
What you're reading below are 5 curated selections - chosen to show range: product architecture (0006), data modeling (0009), security design (0011), methodology (0014), and a production incident turned into principled design (0046). The other 43 cover everything from HIPAA compliance to UUID migration to AI kill switches.
This is the core product architecture. Orbiit delivers 8 daily touchpoints to patients via SMS - meditations, affirmations, reflections (text-only), and micro-courses (magic link to mobile web page with quiz). The question was: how do you deliver personalized content to someone without making them log in?
Patients in early recovery have high cognitive load and low tolerance for friction. Passwords are a barrier. App downloads are a barrier. Even email links are a barrier - SMS has a 98% open rate vs 20% for email. We needed a way to deliver one course to one user, collect quiz responses, and attribute the interaction - without standing up full login flows.
Scope-bound, short-TTL magic links. Each token binds to (user_id, course_id, run_id) with a 14-day TTL. No PII in the URL. The token can only read one course and submit one quiz response. Just-in-time generation reduces exposure by 96.7% compared to pre-generated tokens.
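
To make the scope binding concrete, here is a minimal sketch of issuing and checking such a token. It assumes a server-side record keyed by the random token; the `MagicToken` class and the `issue_token` and `can_read` helpers are hypothetical names for illustration, not the platform's actual code. Only the `(user_id, course_id, run_id)` binding, the 14-day TTL, and `secrets.token_urlsafe(32)` come from the ADR text.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

TOKEN_TTL = timedelta(days=14)  # TTL stated in the ADR


@dataclass(frozen=True)
class MagicToken:
    """Hypothetical server-side record. Only `token` appears in the URL, so no PII leaks."""
    token: str
    user_id: int
    course_id: str
    run_id: int
    expires_at: datetime


def issue_token(user_id: int, course_id: str, run_id: int,
                now: Optional[datetime] = None) -> MagicToken:
    """Create a random token bound to exactly one (user, course, run)."""
    now = now or datetime.now(timezone.utc)
    return MagicToken(
        token=secrets.token_urlsafe(32),  # 32 random bytes, URL-safe encoding
        user_id=user_id,
        course_id=course_id,
        run_id=run_id,
        expires_at=now + TOKEN_TTL,
    )


def can_read(record: MagicToken, course_id: str, now: datetime) -> bool:
    """A token may read only the course it was bound to, and only before expiry."""
    return record.course_id == course_id and now < record.expires_at
```

Because the scope lives in the server-side record rather than in the URL, a leaked link exposes one course for a bounded time and nothing else.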
| Time | Type | Delivery |
|---|---|---|
| 7:00 AM | Meditation | Text-only SMS |
| 9:00 AM | Course 1 | Magic link → mobile web + quiz |
| 11:30 AM | Affirmation | Text-only SMS |
| 1:00 PM | Course 2 | Magic link → mobile web + quiz |
| 3:30 PM | Affirmation | Text-only SMS |
| 5:00 PM | Course 3 | Magic link → mobile web + quiz |
| 7:30 PM | Affirmation | Text-only SMS |
| 9:00 PM | Reflection | Text-only SMS |
| Type | ID Pattern | Delivery | Example |
|---|---|---|---|
| Course | C##### | Magic link → mobile web page + quiz | C00101 |
| Meditation | M### | Text-only SMS body | M001 |
| Affirmation | A### | Text-only SMS body | A001 |
| Reflection | R### | Text-only SMS body | R001 |
Text-only items never require a web page. Courses always render a page + quiz.
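
The ID patterns in the table map naturally onto a small classifier. The sketch below assumes each `#` stands for one decimal digit, which matches the examples (`C00101`, `M001`); the function names `classify` and `needs_web_page` are hypothetical.

```python
import re

# One pattern per content type, assuming '#' means a single decimal digit.
PATTERNS = {
    "course": re.compile(r"C\d{5}"),
    "meditation": re.compile(r"M\d{3}"),
    "affirmation": re.compile(r"A\d{3}"),
    "reflection": re.compile(r"R\d{3}"),
}


def classify(content_id: str) -> str:
    """Return the content type for an ID, or raise ValueError if it matches none."""
    for kind, pattern in PATTERNS.items():
        if pattern.fullmatch(content_id):
            return kind
    raise ValueError(f"unrecognized content ID: {content_id!r}")


def needs_web_page(content_id: str) -> bool:
    """Only courses render a page + quiz; everything else is a text-only SMS body."""
    return classify(content_id) == "course"
```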
| Endpoint | Method | Purpose |
|---|---|---|
| /api/v1/courses/magic/{token} | GET | Read course via magic token. Marks assignment opened (idempotent). |
| /api/v1/submissions/token | POST | Submit quiz answers. Resolves (user, course) by token. Creates Submission once per (user, quiz, run_id). Returns 201 or 200 already_completed. |
| /api/v1/sms/status | POST | Twilio status callback. Updates OutboundMessage/DeliveryPlan. |
| /api/v1/sms/inbound | POST | Twilio inbound. Handles STOP/START/UNSUBSCRIBE/CANCEL/END/QUIT. |
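
The submission endpoint's "create once, then report already_completed" behavior can be sketched with an in-memory store. This is an illustration under assumptions: `SubmissionStore` is a hypothetical stand-in for the database, and the `"created"` response body is invented for the example. Only the `(user, quiz, run_id)` key and the 201 / 200 `already_completed` outcomes come from the table.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class SubmissionStore:
    """In-memory stand-in for the Submission table (hypothetical)."""
    _rows: Dict[Tuple[int, str, int], dict] = field(default_factory=dict)

    def submit(self, user_id: int, quiz_id: str, run_id: int,
               answers: dict) -> Tuple[int, dict]:
        """Create a submission once per (user, quiz, run_id); repeats are harmless."""
        key = (user_id, quiz_id, run_id)
        if key in self._rows:
            # A retry or double-tap: report success without a second record.
            return 200, {"status": "already_completed"}
        self._rows[key] = answers
        return 201, {"status": "created"}
```

In a real database the same guarantee would more likely come from a unique constraint on the three columns, so that concurrent requests cannot both insert.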
- send_at, create OutboundMessage, attach status callback URLs
- opted_out
- opted_out=true, cancels all future DeliveryPlan rows, and raises an AlertEvent for guide/admin follow-up
- run_id in UserCourseAssignment and new token. Prior submissions remain immutable
- secrets.token_urlsafe(32) - exceeds NIST SP 800-63B
- (user, quiz, run_id) can't create duplicate records

This is the product's DNA. The magic link architecture is what makes Orbiit work for a population that won't download apps, won't remember passwords, and won't tolerate friction. Every design choice - TTLs, scope binding, idempotent submission, JIT generation - flows from one principle: meet the patient where they are.
The platform serves multiple user types with very different access needs: patients viewing their own recovery data, clinicians managing caseloads, administrators overseeing clinics or entire organizations, billers processing claims, and platform superadmins. The question: how do you model this without a God Object User model?
"Clinician and Admin are separate profiles, not roles." A StaffMember can have a ClinicianProfile (clinical work), an AdministratorProfile (management), a BillingProfile (financial), or any combination. In small treatment centers, the same person often wears multiple hats - this architecture handles that without role flags or permission spaghetti.
Every query is scoped by organization. The hierarchy is Organization → Region → Clinic → Patient. Administrator scope is explicit:
| Action | Patient | Clinician | Region Admin | Org Admin | Super Admin |
|---|---|---|---|---|---|
| View own data | Yes | Yes | Yes | Yes | Yes |
| View assigned patients | - | Yes | Yes | Yes | Yes |
| View all org patients | - | - | - | Yes | Yes |
| Create staff | - | - | Region | Org-wide | Yes* |
| Impersonate users | - | - | - | - | Yes* |
* Requires specific toggle enabled (IAM-style granular permissions)
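
One way to read the matrix above is as a lookup table. The sketch below encodes each cell and answers "is this action allowed at all?"; it does not enforce the region or org boundary itself. The role and action identifiers and the `is_allowed` function are hypothetical names chosen for the example.

```python
ROLES = ("patient", "clinician", "region_admin", "org_admin", "super_admin")

# Cell values: None = not allowed, "yes" = allowed,
# "region" / "org" = allowed within that scope, "toggle" = needs a specific toggle (the * rows).
MATRIX = {
    "view_own_data":          ("yes", "yes", "yes", "yes", "yes"),
    "view_assigned_patients": (None, "yes", "yes", "yes", "yes"),
    "view_all_org_patients":  (None, None, None, "yes", "yes"),
    "create_staff":           (None, None, "region", "org", "toggle"),
    "impersonate_users":      (None, None, None, None, "toggle"),
}


def is_allowed(action: str, role: str, toggle_enabled: bool = False) -> bool:
    """Look up one cell of the matrix. Scope checks (region/org) are out of scope here."""
    cell = MATRIX[action][ROLES.index(role)]
    if cell is None:
        return False
    if cell == "toggle":
        return toggle_enabled
    return True
```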
`if staff.is_clinician:` reads better than `if staff.role in ['clinician', 'supervisor', 'clinical_director']:`

| Type | Detail |
|---|---|
| Positive | Clean separation of concerns. Flexible multi-hat capability (common in small orgs). Easy to add new profile types. Granular Super Admin permissions reduce platform risk. |
| Negative | More models than simple Staff approach. Slightly more complex queries for cross-profile operations. Multiple table joins for staff with multiple profiles. |
| Mitigations | select_related() and prefetch_related() for query performance. Admin interface makes adding/removing profiles intuitive. |
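
To make the profile-based model and its trade-offs concrete, here is a framework-free sketch using dataclasses rather than Django models. The profile class names and `is_clinician` come from the ADR text; every field on the profiles (`license_number`, `scope`, `can_submit_claims`) is a hypothetical placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClinicianProfile:
    license_number: str  # hypothetical field


@dataclass
class AdministratorProfile:
    scope: str  # hypothetical field, e.g. "region" or "org"


@dataclass
class BillingProfile:
    can_submit_claims: bool = True  # hypothetical field


@dataclass
class StaffMember:
    """One person, zero or more profiles: 'multi-hat' without role flags."""
    name: str
    clinician: Optional[ClinicianProfile] = None
    administrator: Optional[AdministratorProfile] = None
    billing: Optional[BillingProfile] = None

    @property
    def is_clinician(self) -> bool:
        return self.clinician is not None

    @property
    def is_administrator(self) -> bool:
        return self.administrator is not None

    @property
    def is_biller(self) -> bool:
        return self.billing is not None
```

A person who both treats patients and manages a clinic simply carries two profiles; adding a new capability means adding a new profile type rather than another role string.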
Data modeling decisions made at the foundation define everything built on top. This ADR shows deliberate domain modeling for healthcare - not forcing clinical workflows into a generic SaaS user/role pattern. The profile-based approach has scaled cleanly from 1 organization to the current multi-tenant deployment.
Two user populations with completely different needs. Patients in early recovery - high stress, low tolerance for friction, privacy concerns, shared devices. Staff at treatment organizations - enterprise security requirements, centralized access control, compliance obligations. One auth system can't serve both. So we didn't try.
No passwords. No app downloads. No accounts to create. The patient's daily touchpoints arrive via SMS with embedded magic links. Click the link, view the content. The link is the authentication.
| Property | Value |
|---|---|
| Token entropy | 256-bit (secrets.token_urlsafe(32)) |
| Expiration | 3 days (org-configurable: 24-168 hours) |
| Reusability | Reusable within expiration window |
| Session timeout | 30 minutes inactivity |
| Rate limiting | 5 attempts per identifier per 15 minutes |
Why reusable? Rehabilitative, not punitive. A patient who clicks a link twice shouldn't get an error. They're studying, reviewing, or retrying from a different device. The population we serve needs friction removed, not added.
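
The expiration row above (3 days by default, org-configurable between 24 and 168 hours) can be sketched as follows. Whether the real system clamps an out-of-range setting or rejects it is not stated; this sketch assumes clamping, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_HOURS = 72             # 3 days
MIN_HOURS, MAX_HOURS = 24, 168  # org-configurable range from the table


def expiry_hours(org_setting: Optional[int]) -> int:
    """Resolve an org's configured window, clamped to the allowed range (assumption)."""
    if org_setting is None:
        return DEFAULT_HOURS
    return max(MIN_HOURS, min(MAX_HOURS, org_setting))


def is_link_valid(issued_at: datetime, now: datetime,
                  org_setting: Optional[int] = None) -> bool:
    """Reusable link: validity depends only on time, never on prior clicks."""
    return now < issued_at + timedelta(hours=expiry_hours(org_setting))
```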
Staff authenticate through SSO: Microsoft Entra ID (primary), Google Workspace (optional). No password option exists for staff - all authentication is delegated to the corporate identity provider. MFA is enforced by the IdP, not by us. Revocation is immediate on termination.
For smaller organizations without IT resources, staff can also use email magic links with tighter security: 15-minute expiration, single-use only. Superadmins are excluded from this fallback - they must use SSO.
Staff magic links were added in December 2025 for organizations without IT resources for SSO setup. They are deliberately tighter than patient links:
| Property | Patient Magic Links | Staff Magic Links |
|---|---|---|
| Expiration | 3 days (org-configurable) | 15 minutes (fixed) |
| Reusability | Reusable within window | Single-use only |
| Token entropy | 256-bit | 256-bit |
| Delivery | SMS (primary), email (fallback) | Email only |
| Superadmins | N/A | Excluded - must use SSO |
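
The difference between the two link types comes down to two policy fields. The sketch below models them and shows how redemption might differ; `LinkPolicy`, `IssuedLink`, and `redeem` are hypothetical names, and the patient TTL uses the 3-day default rather than an org-specific value.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class LinkPolicy:
    ttl: timedelta
    single_use: bool


PATIENT = LinkPolicy(ttl=timedelta(days=3), single_use=False)
STAFF = LinkPolicy(ttl=timedelta(minutes=15), single_use=True)


@dataclass
class IssuedLink:
    policy: LinkPolicy
    issued_at: datetime
    used: bool = False


def redeem(link: IssuedLink, now: datetime) -> bool:
    """Return True if the click should be accepted under the link's policy."""
    if now >= link.issued_at + link.policy.ttl:
        return False  # expired
    if link.policy.single_use and link.used:
        return False  # staff links die after the first click
    link.used = True
    return True
```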
| Threat | Impact | Likelihood | Mitigation |
|---|---|---|---|
| SMS not delivered | High | Low | Email fallback, 3-day validity |
| Phone stolen | Medium | Low | Remote deactivation, session timeout |
| SSO provider outage | High | Very Low | Provider SLA (99.9%+), staff magic link fallback |
| Token enumeration | Medium | Low | Rate limiting (5/15min), audit logging, 256-bit entropy |
| Session hijacking | High | Very Low | HTTPS only, secure cookies, 30-min timeout |
| Email interception (staff) | Medium | Low | 15-minute expiration, single-use |
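
The "5 attempts per identifier per 15 minutes" mitigation for token enumeration could be implemented in several ways. The sketch below uses an in-memory sliding window purely for illustration; the source does not say which algorithm is used, and the class name is hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `max_attempts` per identifier within any `window_seconds` span."""

    def __init__(self, max_attempts: int = 5, window_seconds: float = 15 * 60):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._attempts: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, identifier: str, now: float) -> bool:
        """Record an attempt at time `now` (seconds) and say whether it is permitted."""
        attempts = self._attempts[identifier]
        # Forget attempts that have aged out of the window.
        while attempts and now - attempts[0] >= self.window_seconds:
            attempts.popleft()
        if len(attempts) >= self.max_attempts:
            return False
        attempts.append(now)
        return True
```

A per-process dictionary like this would not work across multiple app instances, which is why a shared cache is the usual backing store for this kind of limit.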
| Option | Rejected Because |
|---|---|
| Password + MFA for all | High cognitive burden for patients in recovery. Password resets = top support burden. Patients will choose weak passwords. |
| Password only (no MFA) | Fails HIPAA strong authentication requirements (§164.312(d)). Vulnerable to credential stuffing, password reuse across sites. |
| Biometric (Face ID / fingerprint) | Not universally available on older phones. Can't revoke biometric data. Privacy concerns with biometric storage. Complexity exceeds benefit for MVP. |
| Email magic links only (no SMS) | 20% open rate vs 98% for SMS. Doesn't align with daily touchpoint flow. Patients less likely to check email daily. |
| SAML for staff SSO | Deferred. OAuth2 covers 70%+ of market (Microsoft/Google). SAML more complex to implement. Enterprise customers can wait. |
| Metric | Target |
|---|---|
| Magic link click rate | > 95% |
| Session timeout complaints | < 1% |
| "Can't login" support tickets | < 2% |
| Average time to login | < 5 seconds (patients), < 10 seconds (staff) |
| Password reset tickets | 0 (no passwords exist) |
| Rate limit triggers per day | < 10 (indicates attack if higher) |
Post-production security audit (Oct 2025): The audit flagged magic link token entropy and cross-organization access as potential vulnerabilities. Investigation confirmed both were intentional design: 256-bit entropy exceeds NIST standards (audit had incorrectly calculated 188-bit), and cross-org access enables authorized family viewing of educational content (non-PHI). Course content is educational (CBT exercises, reflections, meditations), not PHI-containing - even if a token is accessed by the wrong party, no patient identity, treatment history, or sensitive health information is disclosed. Full security assessment in the repo: docs/audit/20251029/MAGIC_LINK_SECURITY_INVESTIGATION.md
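
The 256-bit figure follows directly from the call: `secrets.token_urlsafe(32)` draws 32 random bytes and then encodes them as URL-safe Base64 without padding. A short check of that arithmetic (the helper names are hypothetical):

```python
import math
import secrets


def entropy_bits(n_bytes: int) -> int:
    """Entropy comes from the random bytes, not from the encoded string."""
    return n_bytes * 8


def encoded_length(n_bytes: int) -> int:
    """Unpadded Base64 carries 6 bits per character."""
    return math.ceil(n_bytes * 8 / 6)


token = secrets.token_urlsafe(32)
print(entropy_bits(32), encoded_length(32), len(token))
```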
This is the ADR that defines how the platform gets built. When your primary development tool is an AI agent, you need operating rules - not guidelines, not best practices, but a contract. What the AI must do before writing code. What it must never do. How git, documentation, and context management work.
All project management, documentation, and planning live in the repository. No Jira. No Asana. No Confluence. The repo IS the source of truth. ADRs are immutable architectural decisions. The backlog is a JSON file that AI edits programmatically. Technical docs are in markdown next to the code.
Why: AI needs context accessible in the repo. External tools break the context chain. Git-based tracking means every change is versioned, diff-able, and auditable.
| Do | Don't |
|---|---|
| Read ADRs before architectural changes | Start coding without reading existing code |
| Commit every 30-60 minutes | Work 3+ hours without committing |
| Create management commands for admin tasks | Make breaking changes without confirmation |
| Update backlog when completing features | Leave todos in "in progress" state |
| Be concise, direct, and helpful | Redesign working systems mid-task |
What happened: During Staff Dashboard development, approximately 3 hours of work was lost - content type badges, filters, clickable rows, clinician notes, program warnings. All fully implemented and working. Never committed to git.
| Item | Cost |
|---|---|
| Initial token usage (12 hrs development) | ~$50 |
| Token usage during recovery attempt | ~$50 (hit limit) |
| Max plan upgrade | $100/month ongoing |
| Token usage for rebuild | ~$50 |
| Total one-time cost | ~$150 |
| Ongoing increased cost | $100/month |
git status, commit or stash all changes, document WIP in session notes

AI development broke the time-effort correlation. Architecture research, UX iterations, documentation, planning - all count as story points. Sessions 8 and 9 originally showed 0 SP despite 4-6 hours of work each, because the planning/research wasn't captured.
The consulting work principle: "If I need to document or plan, that is still development time for me." All development work is tracked - not just code. Architecture research, UX design, planning sessions, documentation. This corrected velocity from 89 SP to 97 SP across the first 11 sessions when the gap was found.
| For Features | For Bug Fixes | For ADRs |
|---|---|---|
| Code implemented and tested | Root cause identified | Context clearly stated |
| Tests pass | Fix implemented and tested | Decision with rationale |
| Documentation updated | No regressions | Consequences documented |
| Committed with clear message | Committed with reference | Alternatives considered |
| Pushed to GitHub | Deployed and verified | Implementation notes |
| Verified in production | | Related ADRs linked |
| Backlog updated | | |
This isn't a style guide. It's an operational contract between human and AI that evolved from real failures - data loss, context drift, untracked work. The methodology has produced 662 story points across 199 sessions. The rules exist because we learned what happens when they don't.
This ADR exists because of a production incident. On February 7, 2026, patients reported "Server Error" when clicking magic links. Redis Cloud Azure had a regional outage. Our rate limiting depended on Redis - when Redis went down, the throttle middleware raised exceptions that blocked all patient requests. For 30 minutes, no one could access their recovery content.
Root Cause Chain: Patient clicks magic link → Django view with throttle decorator → PatientRateLimitThrottle.allow_request() → cache.get() to Redis → Redis unavailable → Exception raised → 500 Server Error. The rate limiter designed to protect patients was the thing blocking them.
Fail open. When Redis is unavailable, rate limiting allows requests through rather than blocking them. Patient access to recovery content is more important than rate limiting during infrastructure failures.
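
A minimal sketch of the fail-open behavior, written without Django so it runs on its own. `FailOpenThrottle`, `DictCache`, and `DownCache` are hypothetical stand-ins; the real classes are the throttles named in this ADR, whose code is not shown here. The warning message is the one the ADR specifies.

```python
import logging

logger = logging.getLogger("throttle")


class DictCache:
    """Healthy cache stand-in."""
    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value


class DownCache:
    """Cache stand-in that behaves like an unreachable Redis."""
    def get(self, key, default=None):
        raise ConnectionError("cache unreachable")

    def set(self, key, value):
        raise ConnectionError("cache unreachable")


class FailOpenThrottle:
    def __init__(self, cache, limit: int = 5):
        self.cache = cache
        self.limit = limit

    def allow_request(self, key: str) -> bool:
        try:
            count = self.cache.get(key, 0)
            if count >= self.limit:
                return False
            self.cache.set(key, count + 1)  # not atomic; fine for a sketch
            return True
        except Exception:
            # Fail open: an infrastructure failure must never block the patient.
            logger.warning("Cache unavailable for rate limiting, allowing request")
            return True
```

The broad `except Exception` is deliberate in this sketch; a production version might narrow it to the cache client's connection and timeout errors.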
/health/ready/ uses 3-second ThreadPoolExecutor timeout for Redis ping - won't hang container orchestration

| Option | Rejected Because |
|---|---|
| Fail closed (block all requests) | Patients can't access recovery content during outage. Unacceptable for this population. |
| In-memory rate limiting fallback | Per-instance limits (not distributed), complex to implement correctly. Over-engineering for a rare edge case. |
| Multiple Redis providers | Significant complexity, cost, and maintenance. YAGNI. |
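
The readiness check this ADR describes, a Redis ping bounded by a 3-second `ThreadPoolExecutor` timeout, might look roughly like this. `check_redis` and the `"ok"` status string are assumptions; the ADR only says the endpoint returns a warning state when Redis is unavailable. The `ping` argument is any callable, so the sketch runs without a Redis client.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable


def check_redis(ping: Callable[[], object], timeout_seconds: float = 3.0) -> str:
    """Run `ping` in a worker thread and give up after `timeout_seconds`."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(ping)
    try:
        future.result(timeout=timeout_seconds)
        return "ok"
    except FutureTimeout:
        return "warning"  # ping hung: report degraded, do not block the probe
    except Exception:
        return "warning"  # ping failed outright
    finally:
        # Do not wait for a hung worker; a `with` block would wait on exit.
        executor.shutdown(wait=False)


def slow_ping() -> None:
    """Stand-in for a Redis ping that stalls during an outage."""
    time.sleep(0.3)
```

Returning a warning instead of failing the probe keeps the orchestrator from restarting healthy containers just because the cache is down.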
| Class | Endpoint | Fail-Open Behavior |
|---|---|---|
| PatientRateLimitThrottle | Magic link views | Allow all requests |
| PatientMagicLinkThrottle | Magic link generation | Allow all requests |
| DemoSMSThrottle | Demo SMS sending | Allow all requests |

- "Cache unavailable for rate limiting, allowing request" at WARNING level
- /health/ready/ returns warning state when Redis unavailable (does not block container orchestration)
- cache.get() failures > 5 minutes, health check warning state

This is a 137-line document that came from a real production incident. It shows the team's values: patient access is sacred. It shows engineering maturity: the incident was diagnosed, the fix was designed with trade-off analysis, alternatives were rejected with reasoning, and the monitoring plan was documented. Not a hotfix thrown at the wall - a principled architectural decision born from operational pain.
These 5 ADRs represent different aspects of the engineering discipline at Orbiit:
| ADR | Shows |
|---|---|
| 0006 - Micro-Courses | Product thinking embedded in architecture. Domain-specific design for a specific population. |
| 0009 - User Hierarchy | Deliberate data modeling. Healthcare domain expertise baked into the schema. |
| 0011 - Passwordless Auth | Security design that serves the user, not the other way around. HIPAA mapped to implementation. |
| 0014 - AI Collaboration | Operational methodology. How one engineer + AI ships a platform. |
| 0046 - Fail-Open | Production incident response. Values-driven engineering under pressure. |
The full set of 48 ADRs is in the repo at docs/architecture-decisions/. When you onboard, you read them. That's day one. By day three, you understand not just what the system does, but why it was built that way.
48 ADRs, 199 session logs, 11 runbooks, root cause analyses - all in the repo. Let's walk through the codebase together.
bert@myorbiit.com