Observability & Reliability
Rust API Logging
Framework: tracing + tracing-subscriber
Initialized in observability/logging.rs:

```rust
tracing_subscriber::registry()
    .with(EnvFilter::new(
        std::env::var("RUST_LOG").unwrap_or_else(|_| "api=debug".into())
    ))
    .with(tracing_subscriber::fmt::layer())
    .init();
```

Default filter: api=debug — logs all messages from the api crate at DEBUG level, other crates at WARN.
Request tracing: TraceLayer from tower-http automatically emits structured span events for every request:
- On request: method, path, version
- On response: status code, duration
- On failure: error description
Example output:

```text
2026-03-01T10:00:00Z DEBUG api::chat::handlers: Creating conversation user_id=abc123
2026-03-01T10:00:00Z  INFO tower_http::trace: 200 POST /chat/conversations in 12ms
```

Python Intelligence Logging
Framework: Python logging module, configured in core/lifecycle.py:

```python
logging.basicConfig(
    level=getattr(logging, config.log_level),  # "INFO" default
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s"
)
```

The x-correlation-id from gRPC metadata is included in log lines for distributed tracing between Rust and Python.
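One common way to thread a correlation id into every log line is a stdlib `logging.Filter` that stamps each record. This is a hedged sketch, not the service's actual implementation: the `CorrelationIdFilter` class, the `%(correlation_id)s` format field, and the way the id is set are all illustrative (in the real service the value would come from the x-correlation-id gRPC metadata key).

```python
import logging

class CorrelationIdFilter(logging.Filter):
    """Inject the current request's correlation id into every log record.

    Hypothetical helper: in the real service the id would be read from
    gRPC metadata (x-correlation-id); here it is set explicitly.
    """
    def __init__(self):
        super().__init__()
        self.correlation_id = "-"

    def filter(self, record):
        record.correlation_id = self.correlation_id
        return True  # never drop records, only annotate them

corr_filter = CorrelationIdFilter()

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s [%(levelname)s] %(name)s [%(correlation_id)s]: %(message)s"
))
handler.addFilter(corr_filter)

logger = logging.getLogger("intelligence")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

# Simulate receiving a gRPC request carrying x-correlation-id metadata:
corr_filter.correlation_id = "abc-123"
logger.info("processing query")  # log line now carries [abc-123]
```

Because the filter annotates records rather than filtering them, the same formatter can be shared by console and file handlers without duplicating the id logic.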
Health & Readiness Probes
| Endpoint | Type | Checks |
|---|---|---|
| GET /health/api | Liveness | Axum process alive, returns uptime |
| GET /health/intelligence | Liveness | gRPC Health.Check (5s timeout) |
| gRPC Health.Ready | Readiness | DB health (SELECT 1) + embedding model loaded |
Startup implication: If the Intelligence service fails its readiness probe (embedding model not loaded), the API’s /health/intelligence will return an error. Load balancers can use this to delay routing chat requests until the service is fully ready.
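The readiness condition in the last table row (DB answers SELECT 1 AND embedding model loaded) can be sketched as a small predicate. Everything here is illustrative: `check_ready` and the `EMBEDDING_MODEL_LOADED` flag are hypothetical names, and sqlite3 stands in for PostgreSQL to keep the sketch self-contained.

```python
import sqlite3

# Hypothetical module-level flag flipped once the embedding model
# finishes loading; the real service may track this differently.
EMBEDDING_MODEL_LOADED = False

def check_ready(conn) -> bool:
    """Readiness = DB answers SELECT 1 AND the embedding model is loaded."""
    try:
        ok = conn.execute("SELECT 1").fetchone() == (1,)
    except Exception:
        return False  # any DB error means not ready
    return ok and EMBEDDING_MODEL_LOADED

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL pool
print(check_ready(conn))   # model not loaded yet -> False
EMBEDDING_MODEL_LOADED = True
print(check_ready(conn))   # both checks pass -> True
```

Keeping the two checks in one predicate means a load balancer polling the readiness endpoint sees a single boolean, regardless of which dependency is lagging at startup.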
Background Tasks
Session Cleanup (auth/background.rs)
```rust
pub fn start_session_cleanup_task(db: PgPool) -> JoinHandle<()> {
    tokio::spawn(async move {
        loop {
            sqlx::query!("DELETE FROM sessions WHERE expires_at < NOW()")
                .execute(&db)
                .await
                .log_error();
            tokio::time::sleep(CLEANUP_INTERVAL).await;
        }
    })
}
```

Runs as a detached Tokio task. If it panics, it is not restarted automatically — the sessions table will accumulate expired rows until the API restarts.
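The restart gap noted above can be closed by supervising the periodic loop so a single failure does not kill it. Below is a minimal asyncio sketch of the pattern (names and the simulated failure are illustrative, not the service's code; the Rust equivalent would catch and log the error inside the loop, or respawn the task on JoinError):

```python
import asyncio

async def cleanup_once(state: dict) -> None:
    """Hypothetical stand-in for the DELETE FROM sessions query."""
    if state["fail_next"]:
        state["fail_next"] = False
        raise RuntimeError("transient DB error")
    state["runs"] += 1

async def supervised_cleanup(state: dict, interval: float, max_runs: int) -> None:
    """Periodic cleanup that survives exceptions instead of dying silently.

    An unsupervised version (like the detached task above) would stop
    forever after the first unhandled panic/exception.
    """
    while state["runs"] < max_runs:  # bounded here only so the demo terminates
        try:
            await cleanup_once(state)
        except Exception as exc:
            # log and keep going; a real service would also add backoff
            print(f"cleanup failed, will retry: {exc}")
        await asyncio.sleep(interval)

state = {"runs": 0, "fail_next": True}
asyncio.run(supervised_cleanup(state, interval=0.001, max_runs=3))
print(state["runs"])  # the loop kept running despite the first failure
```

The key design point is that the try/except sits *inside* the loop: the failure of one iteration is logged and retried on the next tick, rather than unwinding the whole task.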
Migration Safety
Rust — sqlx Offline Mode
The .sqlx/ directory contains a precomputed JSON cache of all query type information. This enables the Rust service to compile without a live database connection:
```shell
cargo sqlx prepare   # regenerate .sqlx/ from live DB
cargo build          # uses .sqlx/ cache if SQLX_OFFLINE=true
```

If a developer changes a query without running cargo sqlx prepare, the CI build fails — a compile-time guarantee that all SQL is valid against the current schema.
Python — Manual Migration Scripts
Python migrations in server/intelligence/migrations/ are plain SQL files run manually or via a migration script. There is no Alembic versioning table — migration state is implicit (check if table exists). This creates risk of:
- Forgetting to run migrations in production
- Running migrations twice (creating duplicate indexes, etc.)
Recommendation: Add Alembic (or equivalent) version tracking. The current migrations already use IF NOT EXISTS guards on DDL (CREATE TABLE IF NOT EXISTS, CREATE EXTENSION IF NOT EXISTS), which makes accidental double-runs safe, but guards alone do nothing for the forgotten-migration risk.
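Explicit version tracking can be added without Alembic by recording applied migration names in a table. This is a hedged sketch of the idea, not the project's script: `apply_migrations`, the `schema_migrations` table, and the migration names are hypothetical, and sqlite3 stands in for PostgreSQL so the example is self-contained.

```python
import sqlite3

def apply_migrations(conn, migrations):
    """Apply each (name, sql) pair at most once, recording applied names.

    This replaces implicit state ("check if table exists") with an
    explicit, queryable schema_migrations table, making re-runs of the
    migration script idempotent.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
    for name, sql in migrations:
        if name in applied:
            continue  # already ran -> skip, no duplicate DDL
        conn.executescript(sql)
        conn.execute("INSERT INTO schema_migrations (name) VALUES (?)", (name,))
    conn.commit()

# Hypothetical migration list; real ones would be read from migrations/*.sql
migrations = [
    ("001_create_documents", "CREATE TABLE documents (id INTEGER PRIMARY KEY)"),
    ("002_add_title", "ALTER TABLE documents ADD COLUMN title TEXT"),
]

conn = sqlite3.connect(":memory:")
apply_migrations(conn, migrations)
apply_migrations(conn, migrations)  # second run is a no-op, not an error
```

In production the INSERT into schema_migrations should run in the same transaction as the DDL where the database supports transactional DDL (PostgreSQL does), so a half-applied migration is never recorded as done.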
Testing Strategy
Python Intelligence Tests (test/)
- Framework: pytest + pytest-asyncio
- pytest.ini: configures asyncio_mode = auto (all async tests run automatically)
- Test scope: unit tests for TextChunker, EmbeddingModel, DocumentProcessor; integration tests for gRPC servicers against a test DB
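As a sketch of that test style: with asyncio_mode = auto, an async test function needs no decorator. TextChunker's real API isn't shown in this document, so `chunk_text` below is a hypothetical stand-in used only to make the example runnable.

```python
import asyncio

# Hypothetical stand-in for TextChunker; the real class's API may differ.
def chunk_text(text: str, size: int) -> list[str]:
    """Split text into fixed-size chunks (last chunk may be shorter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Under pytest-asyncio's asyncio_mode = auto, this async def test is
# collected and run with no @pytest.mark.asyncio decorator needed.
async def test_chunk_text_splits_into_fixed_sizes():
    chunks = chunk_text("abcdefg", 3)
    assert chunks == ["abc", "def", "g"]

# Outside pytest we can still exercise it directly:
asyncio.run(test_chunk_text_splits_into_fixed_sizes())
```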
Rust API Tests
- Framework: Rust’s built-in test harness with #[tokio::test] for async tests
- Strategy: integration tests against a test PostgreSQL instance; the sqlx::test macro provides per-test transaction rollback
Scaling Considerations
| Bottleneck | Current State | Scaling Path |
|---|---|---|
| Session DB lookup (per request) | 1 query per authenticated request | Add Redis session cache; token → role/user_id |
| Embedding model | Single process, CPU/GPU | Run multiple Python workers behind a gRPC proxy |
| Connection pool | Rust: 10 connections, 20 max overflow | Increase pool size; add PgBouncer |
| LLM provider | External API, rate-limited | Add request queue + backpressure in QueryPipeline |
| HNSW index | Monolithic in PostgreSQL | Migrate to dedicated vector DB (Qdrant, Weaviate) at scale |
| Single PostgreSQL | Both services share one DB | Read replicas for search queries; separate cluster for vector data |
Failure Isolation
The most critical single point of failure is the shared PostgreSQL database. A database outage affects all three processes simultaneously. Redis-based session caching would allow the Rust API to serve cached sessions during short DB outages.
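The Redis session cache proposed in the scaling table and above can be prototyped in-process to validate the token → (user_id, role) shape before adding infrastructure. This is an illustrative sketch, not the planned implementation: `SessionCache` is a hypothetical class, and the dict would be replaced by Redis (with its native key TTLs) in production.

```python
import time

class SessionCache:
    """Token -> (user_id, role) cache with TTL expiry.

    Hypothetical stand-in for the proposed Redis layer: on a cache hit
    the API skips the per-request session query; entries expire after
    ttl_seconds so revoked sessions are eventually dropped.
    """
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._entries: dict[str, tuple[float, tuple[str, str]]] = {}

    def put(self, token: str, user_id: str, role: str) -> None:
        self._entries[token] = (time.monotonic() + self.ttl, (user_id, role))

    def get(self, token: str):
        entry = self._entries.get(token)
        if entry is None:
            return None
        expires_at, session = entry
        if time.monotonic() >= expires_at:
            del self._entries[token]  # expired: fall back to the DB lookup
            return None
        return session

cache = SessionCache(ttl_seconds=60.0)
cache.put("tok-abc", "user-1", "admin")
print(cache.get("tok-abc"))   # cache hit: ('user-1', 'admin')
print(cache.get("tok-xyz"))   # miss -> None, caller queries PostgreSQL
```

The TTL is the trade-off knob: a longer TTL rides out longer DB outages but also lengthens the window in which a revoked session is still honored from cache.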