
LLM Client & Memory

LLMClient (engine/query/llm/)

The LLM client is an abstraction layer over the LLM provider API. Currently configured for OpenAI-compatible APIs with a google-genai dependency also present.

```python
class LLMClient:
    def __init__(self):
        self.provider = config.llm.provider        # "openai"
        self.model = config.llm.model              # "gpt-4o"
        self.temperature = config.llm.temperature  # 0.7
        self.max_tokens = config.llm.max_tokens    # 1000
        self.base_url = config.llm.base_url        # "https://api.openai.com/v1"
        self.api_key = config.llm.api_key

    async def generate(self, query, context, memory, config=None) -> str:
        # Build system prompt (memory + context chunks)
        # Build user message (query)
        # POST to LLM API
        # Return response text
        ...

    async def stream(self, query, context, memory, config=None) -> AsyncGenerator[str, None]:
        # Same as generate() but yields tokens as they arrive
        ...
```
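As a rough sketch of what `generate()` sends, the request body posted to an OpenAI-compatible `/chat/completions` endpoint might be assembled like this (the helper name and exact payload shape are illustrative assumptions, not the actual implementation):

```python
# Hypothetical sketch of the JSON body generate() POSTs to an
# OpenAI-compatible /chat/completions endpoint.
def build_chat_request(model: str, system_prompt: str, query: str,
                       temperature: float = 0.7, max_tokens: int = 1000) -> dict:
    """Assemble the request body for an OpenAI-compatible chat completion."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

payload = build_chat_request("gpt-4o", "You are a helpful assistant.", "Hello!")
```

`stream()` would send the same body with `"stream": true` added, consuming the response as server-sent events.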

Prompt Construction

The prompt is assembled from up to 4 components:

  1. System memory: content from user_memories.memory (persistent cross-conversation context)
  2. Retrieved context: top-k chunks from hybrid search, formatted as citations
  3. Conversation history: prior messages in the current conversation (for multi-turn coherence)
  4. User query: the current message
```
System: You are a helpful assistant.

User memory context: {memory}

Relevant knowledge base context:
[1] {chunk_1_content} [source: {document_title}, score: {score}]
[2] {chunk_2_content} ...
...

User: {query}
```
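The system-prompt half of this template could be assembled with a helper along these lines (the function name and the chunk dict keys are assumptions for illustration; the user query is sent separately as the user message):

```python
# Illustrative assembly of the system prompt shown above; not the real API.
def build_system_prompt(memory: str, chunks: list[dict]) -> str:
    lines = [
        "You are a helpful assistant.",
        f"User memory context: {memory}",
        "Relevant knowledge base context:",
    ]
    # Each retrieved chunk becomes a numbered citation with source and score.
    for i, chunk in enumerate(chunks, start=1):
        lines.append(
            f"[{i}] {chunk['content']} "
            f"[source: {chunk['title']}, score: {chunk['score']:.2f}]"
        )
    return "\n".join(lines)

prompt = build_system_prompt(
    "Prefers short answers.",
    [{"content": "Widgets ship weekly.", "title": "Release FAQ", "score": 0.91}],
)
```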

config.use_rag (from ChatConfig) controls whether the retrieval step is executed. If use_rag=false, context is empty and the LLM operates in pure conversation mode.
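The gate itself reduces to a one-line branch; a minimal sketch (names are illustrative, not from the codebase):

```python
# When use_rag is false, the retrieval step is skipped entirely and the
# prompt carries no knowledge-base context.
def gather_context(query: str, use_rag: bool, search) -> list[str]:
    """Run hybrid search only when RAG is enabled; otherwise return no context."""
    return search(query) if use_rag else []

no_rag = gather_context("what shipped?", False, lambda q: ["chunk about releases"])
```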

Per-Request Config Override

ChatConfig (from Protobuf) allows per-request overrides:

```protobuf
message ChatConfig {
  optional float temperature = 1;
  optional int32 max_tokens = 2;
  optional bool use_rag = 3;
  optional string model = 4;         // override model per-request
  optional int32 context_limit = 5;  // max context chars
}
```

The Python extract_chat_config() function in interfaces/chat.py reads these optionals and falls back to service-level defaults when not set.
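The fallback pattern relies on protobuf presence tracking: `optional` scalars expose `HasField()`, so unset fields can be distinguished from zero values. A sketch of the idea, with a stub standing in for the generated `ChatConfig` class (defaults and helper shape are assumptions):

```python
# Service-level defaults used when a ChatConfig field was not set.
DEFAULTS = {"temperature": 0.7, "max_tokens": 1000, "use_rag": True, "model": "gpt-4o"}

def extract_chat_config(msg) -> dict:
    """Resolve each optional field: explicit value if present, else default."""
    return {
        field: getattr(msg, field) if msg.HasField(field) else default
        for field, default in DEFAULTS.items()
    }

class FakeChatConfig:
    """Stand-in for the generated protobuf message, mimicking HasField()."""
    def __init__(self, **set_fields):
        self._set = set_fields
    def HasField(self, name):
        return name in self._set
    def __getattr__(self, name):
        return self._set[name]

cfg = extract_chat_config(FakeChatConfig(temperature=0.2))
# cfg["temperature"] == 0.2; everything else falls back to the defaults
```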

User Memory (MemoryStorage)

OpenTier maintains a persistent per-user memory across all conversations:

```sql
CREATE TABLE user_memories (
    user_id    VARCHAR(255) PRIMARY KEY,
    memory     TEXT NOT NULL DEFAULT '',
    metadata   JSONB NOT NULL DEFAULT '{}',
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```

Pattern: one row per user; the memory column is a free-text field. The Intelligence service reads this before every chat generation and includes it in the prompt as additional context.
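With `user_id` as the primary key, writes are naturally upserts. One plausible shape for the storage-layer write, demonstrated here against SQLite for portability (the real table is Postgres; `JSONB`/`TIMESTAMPTZ` are approximated, and the function is a sketch, not the actual `MemoryStorage` code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_memories (
        user_id  TEXT PRIMARY KEY,
        memory   TEXT NOT NULL DEFAULT '',
        metadata TEXT NOT NULL DEFAULT '{}'
    )
""")

def update_user_memory(conn, user_id: str, memory: str) -> None:
    """Insert-or-replace the single memory row for this user."""
    conn.execute(
        """
        INSERT INTO user_memories (user_id, memory) VALUES (?, ?)
        ON CONFLICT(user_id) DO UPDATE SET memory = excluded.memory
        """,
        (user_id, memory),
    )

update_user_memory(conn, "alice", "Prefers concise answers.")
update_user_memory(conn, "alice", "Prefers concise answers; works in Rust.")
row = conn.execute(
    "SELECT memory FROM user_memories WHERE user_id = ?", ("alice",)
).fetchone()
```

The second call overwrites rather than duplicates: the table still holds exactly one row per user.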

Update mechanism: the memory is updated by the Intelligence service during or after chat turns (exact update trigger is in MemoryStorage.update_user_memory()). This could be:

  • Explicit summaries generated by the LLM
  • Key facts extracted from the conversation
  • User preferences detected from messages
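Whatever the trigger, the update amounts to merging new material into the existing free-text memory. A deliberately naive append-if-new merge illustrates the shape (purely illustrative; the real logic lives in `MemoryStorage.update_user_memory()` and may well be LLM-driven summarization instead):

```python
# Naive merge: keep existing lines, append only facts not already present.
def merge_memory(existing: str, new_facts: list[str]) -> str:
    lines = [line for line in existing.splitlines() if line]
    for fact in new_facts:
        if fact not in lines:
            lines.append(fact)
    return "\n".join(lines)

merged = merge_memory("Likes Python", ["Likes Python", "Works remotely"])
```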

The memory subsystem enables personalization across sessions without re-sending conversation history.

LLM Provider Notes

  • Default: LLM_PROVIDER=openai with LLM_BASE_URL=https://api.openai.com/v1 — compatible with any OpenAI-compatible API (OpenAI, Azure OpenAI, Ollama, vLLM, etc.)
  • google-genai: the google-genai package is listed as a dependency (≥1.63.0), suggesting Google’s Gemini models are a supported provider path, likely behind a provider=google config switch
  • Model override: any request can specify a different model via ChatConfig.model, enabling per-user model routing (e.g., premium users get gpt-4o, free users get gpt-4o-mini)

Retry Logic (engine/ingestion/retry.py)

Uses tenacity for LLM call retries:

```python
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    retry=retry_if_exception_type((httpx.TimeoutException, RateLimitError)),
    reraise=True,
)
async def _call_llm_with_retry(self, ...):
    ...
```

Both timeout and rate-limit errors are retried with the same exponential backoff (2s, then 4s, capped at 10s); with stop_after_attempt(3) there are at most two waits before the original exception is re-raised (reraise=True).
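The wait schedule follows tenacity's documented formula, wait = clamp(multiplier · 2^attempt, min, max), which can be evaluated directly (a standalone re-derivation of the schedule, not tenacity itself):

```python
# Reproduce the schedule of wait_exponential(multiplier=1, min=2, max=10).
def backoff(attempt: int, multiplier: float = 1.0,
            min_wait: float = 2.0, max_wait: float = 10.0) -> float:
    """Wait before the retry following failed attempt number `attempt`."""
    return max(min_wait, min(multiplier * 2 ** attempt, max_wait))

schedule = [backoff(n) for n in (1, 2, 3)]  # [2.0, 4.0, 8.0]
```

With stop_after_attempt(3), only the first two waits ever run; the cap would kick in from the fourth failure onward.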
