# Multi-Tenant Architecture: Isolating Healthcare Data at Scale

## The Multi-Tenancy Challenge

Imagine you're running a SaaS application. Now imagine that application handles the most sensitive data possible: medical records. Now imagine that a single bug could expose Patient A's cancer diagnosis to Hospital B.

Those are the stakes in a multi-tenant healthcare system.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1767622362440/485ecc9a-6055-4fa9-8f34-c00ae7da3ec2.png align="center")

Most SaaS apps solve multi-tenancy at the application layer: one shared database, with a `tenant_id` column separating the rows. That works for Slack or Notion. It doesn't work for healthcare. Here's why:

* **Regulatory compliance:** HIPAA requires strong data isolation
* **Different schemas:** Hospital A stores patient names as `patient_name`, Hospital B uses `full_name`
* **Legacy systems:** Each tenant runs their own ancient MySQL database on-premises
* **Zero trust:** One tenant's data must be *physically impossible* to access from another tenant's queries

## The Architecture: Database-Level Isolation

Instead of logical isolation (same DB, different rows), I went with **physical isolation**: each tenant gets their own connection to their own database.

```rust
// Each tenant has their own encrypted database URL
pub struct TenantDatabaseConfig {
    pub id: i32,
    pub client_id: String,
    pub database_url: String, // Encrypted
    pub database_type: DatabaseType,
}
```

### The Connection Pool Dance

Here's the tricky part: you can't create a new database connection for every request. Connections are expensive (100ms+ handshake). But you also can't keep 1000 connection pools open if you have 1000 tenants.

I built a `TenantPoolManager` that:

1. **Lazily creates** connection pools (only when needed)
2. **Caches them** in memory for fast access
3. **Evicts idle pools** after 30 minutes of inactivity
4. 
   **Validates connections** before returning them

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// `DbPool` is a placeholder for the driver's cloneable pool type;
// `AdapterError` is the error type used in the encryption code further down.
pub struct TenantPoolManager {
    pools: Arc<Mutex<HashMap<String, DbPool>>>,
    encryption_key: Vec<u8>,
}

impl TenantPoolManager {
    pub async fn get_pool(&self, client_id: &str) -> Result<DbPool, AdapterError> {
        // Check cache first; the lock is dropped at the end of this block,
        // so it is never held across an `.await`
        {
            let pools = self.pools.lock().unwrap();
            if let Some(pool) = pools.get(client_id) {
                return Ok(pool.clone());
            }
        }

        // Not cached - fetch config, decrypt URL, create pool
        let config = self.fetch_tenant_config(client_id).await?;
        let decrypted_url = self.decrypt_database_url(&config.database_url)?;
        let pool = self.create_pool(&decrypted_url).await?;

        // Cache it. Two concurrent misses may both build a pool;
        // the later insert simply replaces the earlier one.
        let mut pools = self.pools.lock().unwrap();
        pools.insert(client_id.to_string(), pool.clone());
        Ok(pool)
    }
}
```

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1767622392052/ffc64773-6887-44d8-9177-dc2225f6e9f8.png align="center")

## Encryption

Database URLs contain credentials. Storing them in plaintext would be... unwise.

Every tenant's database URL is encrypted using AES-256-GCM before it is stored:

```rust
use aes_gcm::{
    aead::{Aead, AeadCore, KeyInit, OsRng},
    Aes256Gcm, Key,
};

fn encrypt_database_url(&self, url: &str) -> Result<String, AdapterError> {
    let cipher = Aes256Gcm::new(Key::<Aes256Gcm>::from_slice(&self.encryption_key));
    // Fresh random nonce for every encryption; a nonce must never be reused with the same key
    let nonce = Aes256Gcm::generate_nonce(&mut OsRng);
    let ciphertext = cipher
        .encrypt(&nonce, url.as_bytes())
        .map_err(|e| AdapterError::Encryption(e.to_string()))?;

    // Prepend the nonce so decryption can recover it, then encode as base64 for storage
    let mut combined = nonce.to_vec();
    combined.extend_from_slice(&ciphertext);
    Ok(base64::encode(&combined))
}
```

The encryption key lives in an environment variable, never in the database.

## Configuration Caching: Speed vs Consistency

Each tenant has a configuration:

* Which FHIR resources they support
* How to map FHIR paths to database columns
* Custom transformations (e.g., "1"/"0" → true/false)
* Nested array handling strategies

Loading this from MySQL on every request would be slow (5-10ms per query). But caching it forever means configuration changes would never take effect.
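
To make the configuration items listed above concrete, here is a minimal sketch of what one tenant's mapping entries might look like, including the "1"/"0" to boolean transformation. Every name in it (`Transform`, `MappedValue`, `FieldMapping`, `apply_transform`, the `is_active` column, the FHIR paths chosen) is hypothetical and picked for illustration; the adapter's real configuration types are not shown in this post.

```rust
use std::collections::HashMap;

/// Hypothetical per-column transformation (illustrative names only).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Transform {
    /// Pass the raw column value through as text.
    Identity,
    /// Interpret "1" as true and "0" as false.
    FlagToBool,
}

/// A value ready to be placed into a FHIR resource.
#[derive(Debug, Clone, PartialEq)]
pub enum MappedValue {
    Text(String),
    Bool(bool),
}

/// Which column a FHIR path reads from, and how the raw value is converted.
#[derive(Debug, Clone, PartialEq)]
pub struct FieldMapping {
    pub column: String,
    pub transform: Transform,
}

/// Applies a transformation; returns `None` for values it cannot convert.
pub fn apply_transform(transform: Transform, raw: &str) -> Option<MappedValue> {
    match transform {
        Transform::Identity => Some(MappedValue::Text(raw.to_string())),
        Transform::FlagToBool => match raw {
            "1" => Some(MappedValue::Bool(true)),
            "0" => Some(MappedValue::Bool(false)),
            _ => None,
        },
    }
}

/// Example configuration for one tenant: FHIR path -> column mapping.
/// The paths and the `is_active` column are assumptions for this sketch.
pub fn example_mappings() -> HashMap<String, FieldMapping> {
    let mut mappings = HashMap::new();
    mappings.insert(
        "Patient.name".to_string(),
        FieldMapping { column: "patient_name".to_string(), transform: Transform::Identity },
    );
    mappings.insert(
        "Patient.active".to_string(),
        FieldMapping { column: "is_active".to_string(), transform: Transform::FlagToBool },
    );
    mappings
}
```

Lookups in a structure like this are cheap once it sits in memory, which is why the question of where it is loaded from, and how often, matters.
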
I built a `ConfigResolver` with a TTL-based cache:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

pub struct ConfigResolver {
    cache: Arc<Mutex<HashMap<String, CachedConfig>>>,
    cache_ttl: Duration,
}

struct CachedConfig {
    config: TenantConfig,
    loaded_at: Instant,
}

impl ConfigResolver {
    // The error type is assumed to be the adapter's `AdapterError`
    pub async fn get_config(&self, client_id: &str) -> Result<TenantConfig, AdapterError> {
        // Check cache; an entry only counts as a hit while it is younger than the TTL
        {
            let cache = self.cache.lock().unwrap();
            if let Some(cached) = cache.get(client_id) {
                if cached.loaded_at.elapsed() < self.cache_ttl {
                    return Ok(cached.config.clone());
                }
            }
        }

        // Cache miss or expired - reload
        let config = self.load_from_database(client_id).await?;
        let mut cache = self.cache.lock().unwrap();
        cache.insert(client_id.to_string(), CachedConfig {
            config: config.clone(),
            loaded_at: Instant::now(),
        });
        Ok(config)
    }

    // Admin panel calls this after configuration changes
    pub async fn invalidate(&self, client_id: &str) {
        let mut cache = self.cache.lock().unwrap();
        cache.remove(client_id);
    }
}
```

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1767622413017/4a629de8-82e7-4a84-ab63-0468b349d41b.png align="center")

## Authentication: Keycloak Integration

Multi-tenancy isn't just about data. It's about *who can access* that data. I integrated Keycloak for enterprise SSO:

* **Client ID embedded in JWT:** Every request includes a `client_id` claim
* **Role-based access:** `fhir-read`, `fhir-write`, `fhir-admin`

The authentication flow:

1. User logs in via Keycloak, which issues a JWT
2. Every request to the adapter includes this JWT
3. Middleware validates the JWT and extracts `client_id`
4. All database queries are scoped to that `client_id`

Keycloak issues the JWT with claims such as:

```json
{
  "sub": "user-123",
  "client_id": "tenant-42",
  "realm_access": {
    "roles": ["fhir-read", "fhir-write"]
  }
}
```

**No way to query across tenants.** The `client_id` is the security boundary.
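
The roles carried in `realm_access.roles` could gate operations along these lines. This is a sketch under assumptions: the `Operation` enum and the `is_allowed` function are invented for illustration, and treating `fhir-admin` as implying every permission is a guess, not something this post confirms about the adapter.

```rust
/// Operation classes the adapter might distinguish (assumed for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Operation {
    Read,
    Write,
    Admin,
}

/// Returns true when the token's roles permit the operation.
/// Assumption: `fhir-admin` grants everything; `fhir-read` and `fhir-write`
/// grant only their own operation class.
pub fn is_allowed(roles: &[&str], op: Operation) -> bool {
    let has = |role: &str| roles.iter().any(|r| *r == role);

    if has("fhir-admin") {
        return true;
    }
    match op {
        Operation::Read => has("fhir-read"),
        Operation::Write => has("fhir-write"),
        // Only `fhir-admin` reaches admin operations, and that case returned above
        Operation::Admin => false,
    }
}
```

With the example token above, `is_allowed(&["fhir-read", "fhir-write"], Operation::Write)` would pass while `Operation::Admin` would be refused. A check like this only decides *what* a caller may do; *whose* data they touch is still decided by `client_id`, which the middleware extracts:
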
```rust
// Imports assume axum 0.7, where `Request` and `Next` take no type parameters
use axum::{
    extract::{Request, State},
    middleware::Next,
    response::Response,
};

// Middleware that extracts and validates tenant context.
// `AppState` stands in for the application's shared state type.
pub async fn auth_middleware(
    State(state): State<AppState>,
    mut req: Request, // `mut` is required for `extensions_mut()` below
    next: Next,
) -> Result<Response, AppError> {
    let token = extract_bearer_token(&req)?;

    // Validate JWT signature and expiration
    let claims = state.keycloak.validate_token(&token).await?;

    // Extract client_id - this determines data isolation
    let client_id = claims.client_id
        .ok_or_else(|| AppError::Unauthorized("Missing client_id"))?;

    // Inject into request extensions for downstream handlers
    req.extensions_mut().insert(AuthContext {
        user_id: claims.sub,
        client_id,
        roles: claims.realm_access.roles,
    });

    Ok(next.run(req).await)
}
```

## Lessons Learned

### ✅ What Worked

1. **Database-level isolation:** No worrying about a forgotten `WHERE tenant_id = ...` clause leaking rows
2. **Lazy pool creation:** Most tenants are idle most of the time
3. **TTL-based config cache:** 5-minute TTL is a sweet spot

### ❌ What Didn't

1. **Initial design had no pool eviction:** Memory grew unbounded
2. **Encrypted URLs in logs:** Accidentally logged encrypted URLs (look like gibberish, but still bad)
3. **No circuit breakers:** One tenant's broken DB took down the whole adapter

### 🔧 Improvements Made

1. **Added pool eviction** after 30 min idle
2. **Sanitized all logging** (URLs, credentials, PHI)
3. **Per-tenant circuit breakers** (now one tenant can fail without affecting others)

## Up Next: The Mapping Engine

Physical isolation solves *security*. But how do we handle the fact that Hospital A stores patient names as `patient_name` while Hospital B uses `full_name`?

That's where the **dynamic mapping engine** comes in.

---

*Part 3 will dive deep into how the adapter translates arbitrary database schemas into standard FHIR resources at runtime, without code generation.*

---

**Discussion:** How do you handle multi-tenancy in your systems? Physical vs logical isolation? Let me know your thoughts.