LLM Client & Memory
LLMClient (engine/query/llm/)
The LLM client is an abstraction layer over the LLM provider API. It is currently configured for OpenAI-compatible APIs; a google-genai dependency is also present (see LLM Provider Notes below).
```python
class LLMClient:
    def __init__(self):
        self.provider = config.llm.provider        # "openai"
        self.model = config.llm.model              # "gpt-4o"
        self.temperature = config.llm.temperature  # 0.7
        self.max_tokens = config.llm.max_tokens    # 1000
        self.base_url = config.llm.base_url        # "https://api.openai.com/v1"
        self.api_key = config.llm.api_key

    async def generate(self, query, context, memory, config=None) -> str:
        # Build system prompt (memory + context chunks)
        # Build user message (query)
        # POST to LLM API
        # Return response text
        ...

    async def stream(self, query, context, memory, config=None) -> AsyncGenerator[str, None]:
        # Same as generate() but yields tokens as they arrive
        ...
```

Prompt Construction
The prompt is assembled from up to four components:
- System memory: content from `user_memories.memory` (persistent cross-conversation context)
- Retrieved context: top-k chunks from hybrid search, formatted as citations
- Conversation history: prior messages in the current conversation (for multi-turn coherence)
- User query: the current message
```
System: You are a helpful assistant.

User memory context:
{memory}

Relevant knowledge base context:
[1] {chunk_1_content} [source: {document_title}, score: {score}]
[2] {chunk_2_content} ...
...

User: {query}
```

`config.use_rag` (from `ChatConfig`) controls whether the retrieval step is executed. If `use_rag=false`, the context is empty and the LLM operates in pure conversation mode.
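A minimal sketch of how these components might be assembled into the system prompt. The function name `build_system_prompt` and the chunk dict keys (`content`, `title`, `score`) are assumptions for illustration, not the actual OpenTier implementation:

```python
# Hypothetical prompt assembly mirroring the template above; names are assumed.
def build_system_prompt(memory: str, chunks: list[dict]) -> str:
    parts = ["You are a helpful assistant."]
    if memory:
        parts.append(f"User memory context:\n{memory}")
    if chunks:
        cited = "\n".join(
            f"[{i}] {c['content']} [source: {c['title']}, score: {c['score']}]"
            for i, c in enumerate(chunks, start=1)
        )
        parts.append(f"Relevant knowledge base context:\n{cited}")
    return "\n\n".join(parts)
```

With `use_rag=false`, `chunks` is simply empty and only the memory block (if any) is appended.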
Per-Request Config Override
ChatConfig (from Protobuf) allows per-request overrides:
```protobuf
message ChatConfig {
    optional float temperature = 1;
    optional int32 max_tokens = 2;
    optional bool use_rag = 3;
    optional string model = 4;          // override model per-request
    optional int32 context_limit = 5;   // max context chars
}
```

The Python `extract_chat_config()` function in `interfaces/chat.py` reads these optionals and falls back to service-level defaults when they are not set.
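The optional-with-fallback pattern can be sketched as follows. `ChatConfig` here is a plain dataclass stand-in for the protobuf message (real code would test field presence with `msg.HasField(...)`), and the default values for `context_limit` are assumed:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChatConfig:
    # Stand-in for the protobuf message; None models an unset optional field.
    temperature: Optional[float] = None
    max_tokens: Optional[int] = None
    use_rag: Optional[bool] = None
    model: Optional[str] = None
    context_limit: Optional[int] = None

# Service-level defaults; context_limit's default is an assumption.
DEFAULTS = {"temperature": 0.7, "max_tokens": 1000, "use_rag": True,
            "model": "gpt-4o", "context_limit": 8000}

def extract_chat_config(msg: ChatConfig) -> dict:
    out = {}
    for key, default in DEFAULTS.items():
        value = getattr(msg, key)
        # A per-request value wins; None means "unset", so fall back.
        out[key] = default if value is None else value
    return out
```

Note that falsy-but-set values (e.g. `use_rag=False`) must survive the fallback, which is why the check is `is None` rather than truthiness.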
User Memory (MemoryStorage)
OpenTier maintains a persistent per-user memory across all conversations:
```sql
CREATE TABLE user_memories (
    user_id VARCHAR(255) PRIMARY KEY,
    memory TEXT NOT NULL DEFAULT '',
    metadata JSONB NOT NULL DEFAULT '{}',
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```

Pattern: one row per user; the `memory` column is a free-text field. The Intelligence service reads this before every chat generation and includes it in the prompt as additional context.
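The read path reduces to a single-row lookup keyed by `user_id`. An illustrative sketch using `sqlite3` in place of Postgres (the real `MemoryStorage` presumably issues the equivalent query through an async driver):

```python
import sqlite3

def get_user_memory(conn: sqlite3.Connection, user_id: str) -> str:
    """Fetch the persistent memory for one user before chat generation."""
    row = conn.execute(
        "SELECT memory FROM user_memories WHERE user_id = ?", (user_id,)
    ).fetchone()
    # A missing row behaves like the schema default: empty memory.
    return row[0] if row else ""
```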
Update mechanism: the memory is updated by the Intelligence service during or after chat turns (the exact trigger lives in `MemoryStorage.update_user_memory()`). The stored content could be:
- Explicit summaries generated by the LLM
- Key facts extracted from the conversation
- User preferences detected from messages
The memory subsystem enables personalization across sessions without re-sending conversation history.
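Whatever the trigger, the one-row-per-user schema implies an upsert on write. A hypothetical query for `MemoryStorage.update_user_memory()`, using the standard Postgres `ON CONFLICT` pattern (the actual query in the Intelligence service may differ):

```python
# Assumed upsert; $1/$2 are asyncpg-style positional parameters.
UPSERT_MEMORY = """
INSERT INTO user_memories (user_id, memory, updated_at)
VALUES ($1, $2, NOW())
ON CONFLICT (user_id)
DO UPDATE SET memory = EXCLUDED.memory, updated_at = NOW();
"""
```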
LLM Provider Notes
- Default: `LLM_PROVIDER=openai` with `LLM_BASE_URL=https://api.openai.com/v1`, compatible with any OpenAI-compatible API (OpenAI, Azure OpenAI, Ollama, vLLM, etc.)
- google-genai: the `google-genai` package is listed as a dependency (≥1.63.0), suggesting Google's Gemini models are a supported provider path, likely behind a `provider=google` config switch
- Model override: any request can specify a different model via `ChatConfig.model`, enabling per-user model routing (e.g., premium users get `gpt-4o`, free users get `gpt-4o-mini`)
Retry Logic (engine/ingestion/retry.py)
Uses tenacity for LLM call retries:
```python
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    retry=retry_if_exception_type((httpx.TimeoutException, RateLimitError)),
    reraise=True
)
async def _call_llm_with_retry(self, ...):
    ...
```

Both rate-limit errors and timeouts are retried with exponential backoff: with three attempts there are at most two waits (2 s, then 4 s), and each wait is clamped to the 2–10 s range. Note that the `wait` policy applies to every retried exception type, so timeouts are not retried immediately. `reraise=True` re-raises the original exception once the attempt limit is exhausted.
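The wait schedule can be reproduced with a small stdlib-only helper, roughly matching tenacity's `wait_exponential` semantics (wait n is `multiplier * 2**n`, clamped to `[min, max]`); this is a sketch for readers without tenacity, not the production code:

```python
def backoff_schedule(attempts: int = 3, multiplier: float = 1.0,
                     lo: float = 2.0, hi: float = 10.0) -> list[float]:
    # One wait between each pair of consecutive attempts:
    # multiplier * 2**n clamped to [lo, hi], for n = 1, 2, ...
    return [min(hi, max(lo, multiplier * 2 ** n)) for n in range(1, attempts)]
```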