
Observability & Reliability

Rust API Logging

Framework: tracing + tracing-subscriber

Initialized in observability/logging.rs:

```rust
tracing_subscriber::registry()
    .with(EnvFilter::new(
        std::env::var("RUST_LOG").unwrap_or_else(|_| "api=debug".into()),
    ))
    .with(tracing_subscriber::fmt::layer())
    .init();
```

Default filter: `api=debug` logs all messages from the `api` crate at DEBUG level, and other crates at WARN.

Request tracing: TraceLayer from tower-http automatically emits structured span events for every request:

  • On request: method, path, version
  • On response: status code, duration
  • On failure: error description

Example output:

```
2026-03-01T10:00:00Z DEBUG api::chat::handlers: Creating conversation user_id=abc123
2026-03-01T10:00:00Z  INFO tower_http::trace: 200 POST /chat/conversations in 12ms
```
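The per-request fields TraceLayer records (method and path on entry, status and duration on exit) can be illustrated with a small decorator. This is a hedged Python sketch of the pattern only; the real service does this in Rust middleware, and the handler and logger names here are invented for the example:

```python
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(name)s: %(message)s")
log = logging.getLogger("trace")

def traced(handler):
    """Log method/path on entry and status/duration on exit, mirroring the
    fields tower-http's TraceLayer emits. Illustration only, not the real
    middleware."""
    @wraps(handler)
    def wrapper(method, path, *args, **kwargs):
        log.debug("started %s %s", method, path)
        start = time.perf_counter()
        try:
            status = handler(method, path, *args, **kwargs)
        except Exception:
            # TraceLayer's on-failure hook logs the error description here.
            log.error("request failed: %s %s", method, path)
            raise
        elapsed_ms = (time.perf_counter() - start) * 1000
        log.info("%d %s %s in %.0fms", status, method, path, elapsed_ms)
        return status
    return wrapper

@traced
def create_conversation(method, path):  # hypothetical handler
    return 200

create_conversation("POST", "/chat/conversations")
```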

Python Intelligence Logging

Framework: Python logging module, configured in core/lifecycle.py:

```python
logging.basicConfig(
    level=getattr(logging, config.log_level),  # "INFO" by default
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
)
```

The x-correlation-id value from gRPC metadata is included in log lines, enabling distributed tracing between the Rust and Python services.
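One common way to attach such an id to every log line is a logging.Filter fed from a context variable that the gRPC servicer sets when it reads the metadata. A minimal sketch, assuming that approach; the class, variable, and format string below are illustrative, not taken from the actual codebase:

```python
import contextvars
import logging

# Set by the gRPC servicer after reading the x-correlation-id metadata key.
# The variable and filter names are assumptions for this example.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

class CorrelationIdFilter(logging.Filter):
    """Inject the current correlation id into every LogRecord."""
    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s [cid=%(correlation_id)s]: %(message)s",
)
# Attach to the handler so records from every logger pick up the field.
logging.getLogger().handlers[0].addFilter(CorrelationIdFilter())

# Simulate a request: the servicer would call set() before handling the RPC.
correlation_id.set("abc123")
logging.getLogger("intelligence.query").info("embedding lookup started")
```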

Health & Readiness Probes

| Endpoint | Type | Checks |
| --- | --- | --- |
| GET /health/api | Liveness | Axum process alive; returns uptime |
| GET /health/intelligence | Liveness | gRPC Health.Check (5 s timeout) |
| gRPC Health.Ready | Readiness | DB health (SELECT 1) + embedding model loaded |

Startup implication: If the Intelligence service fails its readiness probe (embedding model not loaded), the API’s /health/intelligence will return an error. Load balancers can use this to delay routing chat requests until the service is fully ready.
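The readiness rule in the table (DB answers SELECT 1 and the embedding model is loaded) can be sketched as a single predicate. Function and status names below are illustrative, not the actual servicer code:

```python
# Hedged sketch of the readiness check described above: report SERVING only
# when the database answers a trivial query AND the model has loaded.
SERVING = "SERVING"
NOT_SERVING = "NOT_SERVING"

def check_db(run_query) -> bool:
    """Run SELECT 1; any exception or unexpected result means unhealthy."""
    try:
        return run_query("SELECT 1") == 1
    except Exception:
        return False

def readiness(run_query, model_loaded: bool) -> str:
    """Aggregate both conditions into the gRPC health status string."""
    return SERVING if check_db(run_query) and model_loaded else NOT_SERVING
```

For example, `readiness(lambda q: 1, model_loaded=False)` stays NOT_SERVING while the model is still loading, which is exactly what lets a load balancer delay routing.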

Background Tasks

Session Cleanup (auth/background.rs)

```rust
pub fn start_session_cleanup_task(db: PgPool) -> JoinHandle<()> {
    tokio::spawn(async move {
        loop {
            sqlx::query!("DELETE FROM sessions WHERE expires_at < NOW()")
                .execute(&db)
                .await
                .log_error();
            tokio::time::sleep(CLEANUP_INTERVAL).await;
        }
    })
}
```

Runs as a detached Tokio task. If it panics, it is not restarted automatically — the sessions table will accumulate expired rows until the API restarts.
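The "a panic kills the task permanently" gap is usually closed with a supervisor loop that respawns the task when it dies. A language-agnostic sketch of that pattern, written here in Python/asyncio for illustration (the real task is Tokio/Rust, and every name below is an assumption):

```python
import asyncio

async def supervise(task_factory, restart_delay=1.0, max_restarts=None):
    """Re-spawn a background coroutine whenever it exits with an exception.
    A Tokio equivalent would wrap the loop body and respawn on JoinError.
    Names and signature are illustrative, not from the codebase."""
    restarts = 0
    while max_restarts is None or restarts <= max_restarts:
        try:
            await task_factory()
            return  # clean exit: do not restart
        except Exception as exc:
            restarts += 1
            print(f"cleanup task died ({exc!r}); restarting in {restart_delay}s")
            await asyncio.sleep(restart_delay)
```

With this wrapper, a transient failure in the cleanup query delays the next sweep instead of silently stopping expiry forever.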

Migration Safety

Rust — sqlx Offline Mode

The .sqlx/ directory contains a precomputed JSON cache of all query type information. This enables the Rust service to compile without a live database connection:

```shell
cargo sqlx prepare   # regenerate .sqlx/ from the live DB
cargo build          # uses the .sqlx/ cache when SQLX_OFFLINE=true
```

If a developer changes a query without rerunning cargo sqlx prepare, the CI build fails at compile time, guaranteeing that all SQL is valid against the current schema before it can ship.

Python — Manual Migration Scripts

Python migrations in server/intelligence/migrations/ are plain SQL files, run manually or via a migration script. There is no Alembic versioning table; migration state is implicit (checking whether a table exists). This creates the risk of:

  • Forgetting to run migrations in production
  • Running migrations twice (creating duplicate indexes, etc.)

Recommendation: add Alembic version tracking, or at minimum keep explicit IF NOT EXISTS guards on all DDL statements (the current migrations already do the latter: CREATE TABLE IF NOT EXISTS, CREATE EXTENSION IF NOT EXISTS).
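The version-tracking recommendation can be sketched as a small runner that records applied versions in its own table, making a second run a no-op. This is a hedged illustration, not the project's migration script; it uses sqlite3 only so the example is self-contained (the real migrations target PostgreSQL):

```python
import sqlite3

def run_migrations(conn, migrations):
    """Apply each (version, sql) pair exactly once, recording applied
    versions in a schema_migrations table. Sketch of the recommendation
    above; all names are illustrative."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (version TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}
    for version, sql in migrations:
        if version in applied:
            continue  # rerunning is now a no-op, not a duplicate index
        conn.executescript(sql)
        conn.execute("INSERT INTO schema_migrations (version) VALUES (?)", (version,))
    conn.commit()

conn = sqlite3.connect(":memory:")
migs = [("001", "CREATE TABLE docs (id INTEGER PRIMARY KEY);")]
run_migrations(conn, migs)
run_migrations(conn, migs)  # safe: version 001 is already recorded
```

Unlike bare IF NOT EXISTS guards, the version table also answers "which migrations have run here?" in production.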

Testing Strategy

Python Intelligence Tests (test/)

  • Framework: pytest + pytest-asyncio
  • pytest.ini: sets asyncio_mode = auto, so async test functions run without an explicit @pytest.mark.asyncio marker
  • Test scope: unit tests for TextChunker, EmbeddingModel, DocumentProcessor; integration tests for gRPC servicers against a test DB
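Under asyncio_mode = auto, an async test is just an async function whose name starts with test_. A hypothetical sketch of that style; TextChunker's real API is not shown in this document, so stand-in chunk/embed helpers are stubbed here to keep the example self-contained:

```python
import asyncio

def chunk(text, size):
    """Stand-in for the real TextChunker (its actual API is not documented
    here); splits text into fixed-size pieces."""
    return [text[i:i + size] for i in range(0, len(text), size)]

async def embed(chunks):
    """Stand-in async embedding call; returns one 1-d vector per chunk."""
    await asyncio.sleep(0)
    return [[float(len(c))] for c in chunks]

async def test_chunks_are_embedded():
    # With asyncio_mode = auto, pytest awaits this function directly,
    # no @pytest.mark.asyncio decorator needed.
    vectors = await embed(chunk("abcdef", 2))
    assert vectors == [[2.0], [2.0], [2.0]]
```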

Rust API Tests

  • Framework: Rust’s built-in test harness, with tokio’s #[tokio::test] attribute for async tests
  • Strategy: integration tests against a test PostgreSQL instance; the #[sqlx::test] macro provides per-test isolation via transaction rollback

Scaling Considerations

| Bottleneck | Current State | Scaling Path |
| --- | --- | --- |
| Session DB lookup (per request) | 1 query per authenticated request | Add Redis session cache; token → role/user_id |
| Embedding model | Single process, CPU/GPU | Run multiple Python workers behind a gRPC proxy |
| Connection pool | Rust: 10 connections, 20 max overflow | Increase pool size; add PgBouncer |
| LLM provider | External API, rate-limited | Add request queue + backpressure in QueryPipeline |
| HNSW index | Monolithic in PostgreSQL | Migrate to a dedicated vector DB (Qdrant, Weaviate) at scale |
| Single PostgreSQL | Both services share one DB | Read replicas for search queries; separate cluster for vector data |

Failure Isolation

The most critical single point of failure is the shared PostgreSQL database. A database outage affects all three processes simultaneously. Redis-based session caching would allow the Rust API to serve cached sessions during short DB outages.
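The Redis fallback idea amounts to a read-through cache with a TTL: serve from the database when it answers, and fall back to a recently cached copy when it does not. A minimal sketch under those assumptions; an in-process dict stands in for Redis, and every name here is illustrative:

```python
import time

class SessionCache:
    """Read-through session cache sketch: serve recently seen sessions
    when the database is briefly unavailable. Not the real implementation."""
    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # token -> (expires_at, session)

    def get(self, token, load_from_db):
        now = self.clock()
        try:
            session = load_from_db(token)            # authoritative path
            self._store[token] = (now + self.ttl, session)
            return session
        except ConnectionError:
            entry = self._store.get(token)           # DB down: try the cache
            if entry and entry[0] > now:
                return entry[1]
            raise  # no fresh cached copy: surface the outage
```

The TTL bounds how stale a served session can be, so a revoked session still dies within ttl_seconds of the database coming back.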
