
LLM Client & Memory

LLMClient (engine/query/llm/)

The LLM client is an abstraction layer over the LLM provider API. Currently configured for OpenAI-compatible APIs with a google-genai dependency also present.

```python
class LLMClient:
    def __init__(self):
        self.provider = config.llm.provider        # "openai"
        self.model = config.llm.model              # "gpt-4o"
        self.temperature = config.llm.temperature  # 0.7
        self.max_tokens = config.llm.max_tokens    # 1000
        self.base_url = config.llm.base_url        # "https://api.openai.com/v1"
        self.api_key = config.llm.api_key

    async def generate(self, query, context, memory, config=None) -> str:
        # Build system prompt (memory + context chunks)
        # Build user message (query)
        # POST to LLM API
        # Return response text
        ...

    async def stream(self, query, context, memory, config=None) -> AsyncGenerator[str, None]:
        # Same as generate() but yields tokens as they arrive
        ...
```
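As a rough sketch of what `generate()` sends, the request body posted to an OpenAI-compatible `/chat/completions` endpoint might be assembled like this (the helper name and exact payload shape are illustrative assumptions, not the actual implementation):

```python
# Hypothetical sketch of the JSON body generate() POSTs to an
# OpenAI-compatible /chat/completions endpoint.
def build_chat_request(model: str, system_prompt: str, query: str,
                       temperature: float = 0.7, max_tokens: int = 1000) -> dict:
    """Assemble the request body for an OpenAI-compatible chat completion."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

payload = build_chat_request("gpt-4o", "You are a helpful assistant.", "Hello!")
```

`stream()` would send the same body with `"stream": true` added, consuming the response as server-sent events.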

Prompt Construction

The prompt is assembled from up to 4 components:

  1. System memory: content from user_memories.memory (persistent cross-conversation context)
  2. Retrieved context: top-k chunks from hybrid search, formatted as citations
  3. Conversation history: prior messages in the current conversation (for multi-turn coherence)
  4. User query: the current message
```
System: You are a helpful assistant.

User memory context: {memory}

Relevant knowledge base context:
[1] {chunk_1_content} [source: {document_title}, score: {score}]
[2] {chunk_2_content} ...
...

User: {query}
```
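The system-prompt half of this template could be assembled with a helper along these lines (the function name and the chunk dict keys are assumptions for illustration; the user query is sent separately as the user message):

```python
# Illustrative assembly of the system prompt shown above; not the real API.
def build_system_prompt(memory: str, chunks: list[dict]) -> str:
    lines = [
        "You are a helpful assistant.",
        f"User memory context: {memory}",
        "Relevant knowledge base context:",
    ]
    # Each retrieved chunk becomes a numbered citation with source and score.
    for i, chunk in enumerate(chunks, start=1):
        lines.append(
            f"[{i}] {chunk['content']} "
            f"[source: {chunk['title']}, score: {chunk['score']:.2f}]"
        )
    return "\n".join(lines)

prompt = build_system_prompt(
    "Prefers short answers.",
    [{"content": "Widgets ship weekly.", "title": "Release FAQ", "score": 0.91}],
)
```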

config.use_rag (from ChatConfig) controls whether the retrieval step is executed. If use_rag=false, context is empty and the LLM operates in pure conversation mode.
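The gate itself reduces to a one-line branch; a minimal sketch (names are illustrative, not from the codebase):

```python
# When use_rag is false, the retrieval step is skipped entirely and the
# prompt carries no knowledge-base context.
def gather_context(query: str, use_rag: bool, search) -> list[str]:
    """Run hybrid search only when RAG is enabled; otherwise return no context."""
    return search(query) if use_rag else []

no_rag = gather_context("what shipped?", False, lambda q: ["chunk about releases"])
```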

Per-Request Config Override

ChatConfig (from Protobuf) allows per-request overrides:

```protobuf
message ChatConfig {
  optional float temperature = 1;
  optional int32 max_tokens = 2;
  optional bool use_rag = 3;
  optional string model = 4;         // override model per-request
  optional int32 context_limit = 5;  // max context chars
}
```

The Python extract_chat_config() function in interfaces/chat.py reads these optionals and falls back to service-level defaults when not set.
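The fallback pattern relies on protobuf presence tracking: `optional` scalars expose `HasField()`, so unset fields can be distinguished from zero values. A sketch of the idea, with a stub standing in for the generated `ChatConfig` class (defaults and helper shape are assumptions):

```python
# Service-level defaults used when a ChatConfig field was not set.
DEFAULTS = {"temperature": 0.7, "max_tokens": 1000, "use_rag": True, "model": "gpt-4o"}

def extract_chat_config(msg) -> dict:
    """Resolve each optional field: explicit value if present, else default."""
    return {
        field: getattr(msg, field) if msg.HasField(field) else default
        for field, default in DEFAULTS.items()
    }

class FakeChatConfig:
    """Stand-in for the generated protobuf message, mimicking HasField()."""
    def __init__(self, **set_fields):
        self._set = set_fields
    def HasField(self, name):
        return name in self._set
    def __getattr__(self, name):
        return self._set[name]

cfg = extract_chat_config(FakeChatConfig(temperature=0.2))
# cfg["temperature"] == 0.2; everything else falls back to the defaults
```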

User Memory (MemoryStorage)

OpenTier maintains a persistent per-user memory across all conversations:

```sql
CREATE TABLE user_memories (
    user_id    VARCHAR(255) PRIMARY KEY,
    memory     TEXT NOT NULL DEFAULT '',
    metadata   JSONB NOT NULL DEFAULT '{}',
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```

Pattern: one row per user; the memory column is a free-text field. The Intelligence service reads this before every chat generation and includes it in the prompt as additional context.
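With `user_id` as the primary key, writes are naturally upserts. One plausible shape for the storage-layer write, demonstrated here against SQLite for portability (the real table is Postgres; `JSONB`/`TIMESTAMPTZ` are approximated, and the function is a sketch, not the actual `MemoryStorage` code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_memories (
        user_id  TEXT PRIMARY KEY,
        memory   TEXT NOT NULL DEFAULT '',
        metadata TEXT NOT NULL DEFAULT '{}'
    )
""")

def update_user_memory(conn, user_id: str, memory: str) -> None:
    """Insert-or-replace the single memory row for this user."""
    conn.execute(
        """
        INSERT INTO user_memories (user_id, memory) VALUES (?, ?)
        ON CONFLICT(user_id) DO UPDATE SET memory = excluded.memory
        """,
        (user_id, memory),
    )

update_user_memory(conn, "alice", "Prefers concise answers.")
update_user_memory(conn, "alice", "Prefers concise answers; works in Rust.")
row = conn.execute(
    "SELECT memory FROM user_memories WHERE user_id = ?", ("alice",)
).fetchone()
```

The second call overwrites rather than duplicates: the table still holds exactly one row per user.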

Update mechanism: the memory is updated by the Intelligence service during or after chat turns (exact update trigger is in MemoryStorage.update_user_memory()). This could be:

  • Explicit summaries generated by the LLM
  • Key facts extracted from the conversation
  • User preferences detected from messages
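Whatever the trigger, the update amounts to merging new material into the existing free-text memory. A deliberately naive append-if-new merge illustrates the shape (purely illustrative; the real logic lives in `MemoryStorage.update_user_memory()` and may well be LLM-driven summarization instead):

```python
# Naive merge: keep existing lines, append only facts not already present.
def merge_memory(existing: str, new_facts: list[str]) -> str:
    lines = [line for line in existing.splitlines() if line]
    for fact in new_facts:
        if fact not in lines:
            lines.append(fact)
    return "\n".join(lines)

merged = merge_memory("Likes Python", ["Likes Python", "Works remotely"])
```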

The memory subsystem enables personalization across sessions without re-sending conversation history.

LLM Provider Notes

  • Default: LLM_PROVIDER=openai with LLM_BASE_URL=https://api.openai.com/v1 — compatible with any OpenAI-compatible API (OpenAI, Azure OpenAI, Ollama, vLLM, etc.)
  • google-genai: the google-genai package is listed as a dependency (≥1.63.0), suggesting Google’s Gemini models are a supported provider path, likely behind a provider=google config switch
  • Model override: any request can specify a different model via ChatConfig.model, enabling per-user model routing (e.g., premium users get gpt-4o, free users get gpt-4o-mini)

Retry Logic (engine/ingestion/retry.py)

Uses tenacity for LLM call retries:

```python
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    retry=retry_if_exception_type((httpx.TimeoutException, RateLimitError)),
    reraise=True,
)
async def _call_llm_with_retry(self, ...):
    ...
```

Both timeout and rate-limit errors are retried with the same exponential backoff (2s, then 4s, capped at 10s); with stop_after_attempt(3) there are at most two waits before the original exception is re-raised (reraise=True).
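The wait schedule follows tenacity's documented formula, wait = clamp(multiplier · 2^attempt, min, max), which can be evaluated directly (a standalone re-derivation of the schedule, not tenacity itself):

```python
# Reproduce the schedule of wait_exponential(multiplier=1, min=2, max=10).
def backoff(attempt: int, multiplier: float = 1.0,
            min_wait: float = 2.0, max_wait: float = 10.0) -> float:
    """Wait before the retry following failed attempt number `attempt`."""
    return max(min_wait, min(multiplier * 2 ** attempt, max_wait))

schedule = [backoff(n) for n in (1, 2, 3)]  # [2.0, 4.0, 8.0]
```

With stop_after_attempt(3), only the first two waits ever run; the cap would kick in from the fourth failure onward.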
