Core concepts

The semantic layer

AgentData sits between your raw databases and whoever (or whatever) asks questions. Instead of exposing tables and columns, it exposes a business model: entities, measures and dimensions with human-readable names. Questions are answered against that model, so the same query works regardless of how the underlying tables are shaped or which engine runs it.

Entities

An entity is a business object discovered from your schema. Each has a role:

  • Fact — something measurable (orders, transactions). Usually large tables.
  • Dimension — something descriptive (customer, product). Usually small tables.

An entity can have multiple bindings to physical tables across sources. When the same entity (say, Customer) exists in several systems, AgentData collapses them into one entity with merged bindings.

Entities move through a lifecycle: pending_review → confirmed (queryable), with rejected and archived as optional states. You confirm the model once; queries only ever hit confirmed entities.
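The merged bindings and the lifecycle described above can be pictured with a small sketch. This is not AgentData's internal code: the class names, the source names and the exact set of allowed transitions are assumptions made for illustration. The text only states that pending_review leads to confirmed and that rejected and archived are optional states.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING_REVIEW = "pending_review"
    CONFIRMED = "confirmed"
    REJECTED = "rejected"
    ARCHIVED = "archived"


# Assumed transition table (hypothetical): which states may follow which.
ALLOWED = {
    Status.PENDING_REVIEW: {Status.CONFIRMED, Status.REJECTED},
    Status.CONFIRMED: {Status.ARCHIVED},
    Status.REJECTED: set(),
    Status.ARCHIVED: set(),
}


@dataclass
class Binding:
    source: str  # hypothetical source name, e.g. "crm"
    table: str   # physical table in that source


@dataclass
class Entity:
    name: str
    role: str  # "fact" or "dimension"
    bindings: list[Binding] = field(default_factory=list)
    status: Status = Status.PENDING_REVIEW

    def transition(self, new: Status) -> None:
        """Move to a new state, rejecting transitions not in ALLOWED."""
        if new not in ALLOWED[self.status]:
            raise ValueError(f"{self.status.value} -> {new.value} is not allowed")
        self.status = new

    @property
    def queryable(self) -> bool:
        # Only confirmed entities can be queried.
        return self.status is Status.CONFIRMED


def merge_entities(a: Entity, b: Entity) -> Entity:
    """Collapse two same-named entities into one with merged bindings."""
    if a.name != b.name:
        raise ValueError("only entities with the same name can be merged")
    return Entity(a.name, a.role, a.bindings + b.bindings)


customer = merge_entities(
    Entity("Customer", "dimension", [Binding("crm", "customers")]),
    Entity("Customer", "dimension", [Binding("billing", "cust_accounts")]),
)
customer.transition(Status.CONFIRMED)
```

After the merge, customer carries two bindings and becomes queryable only once it has been confirmed.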

Attributes and measures

  • Attributes are dimensions and descriptors. Each carries a canonical name, a physical name, a type, and a PII flag.
  • Measures are aggregatable metrics: sum, avg, count, or a custom SQL expression such as unit_price * qty * (1 - discount).

A column map ties each canonical attribute to a physical column or SQL expression. Because it's engine-agnostic, the same mapping works whether queries run through the built-in shim or through Cube + Trino.
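As a rough illustration of how a column map keeps canonical names separate from physical ones, the sketch below renders a SELECT statement from canonical attribute and measure names. The map contents, the table name and the render_select helper are all hypothetical; only the expression unit_price * qty * (1 - discount) comes from the text above.

```python
# Canonical name -> physical column or SQL expression (hypothetical contents).
COLUMN_MAP = {
    "customer_name": "cust_nm",
    "order_date": "ord_dt",
    "net_revenue": "unit_price * qty * (1 - discount)",
}

# Built-in aggregates named in the text, mapped to SQL functions.
AGGREGATES = {"sum": "SUM", "avg": "AVG", "count": "COUNT"}


def render_select(table, dimensions, measures, column_map=COLUMN_MAP):
    """Render SQL from canonical names.

    dimensions: list of canonical attribute names
    measures:   list of (aggregate, canonical_name) pairs
    Unknown names raise KeyError, so nothing unmapped reaches the SQL.
    """
    parts = [f"{column_map[dim]} AS {dim}" for dim in dimensions]
    for agg, name in measures:
        parts.append(f"{AGGREGATES[agg]}({column_map[name]}) AS {agg}_{name}")
    sql = f"SELECT {', '.join(parts)} FROM {table}"
    if dimensions and measures:
        sql += " GROUP BY " + ", ".join(column_map[d] for d in dimensions)
    return sql


print(render_select("orders", ["order_date"], [("sum", "net_revenue")]))
```

Because the caller only ever names order_date and net_revenue, swapping the physical columns or the engine means changing the map, not the query.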

Each entity has a detail view showing its source bindings, column mapping, measures and attributes (with PII flags):

[Figure: Entity model detail]

The Model view shows the whole semantic model — facts, dimensions and the inferred PK/FK join relationships between them:

[Figure: Semantic model graph]

Query execution

Question ──▶ Planner (LLM) ──▶ Structured query ──▶ Validator ──▶ Executor ──▶ Result
  1. Planner — the LLM turns your question into a structured query against the model (never raw SQL from the LLM).
  2. Validator — checks the query is valid against confirmed entities and measures.
  3. Executor — runs it:
    • Shim (default) — pushes down per-source and merges in Python. Exact for sum/count; approximate for avg/distinct count across sources.
    • Cube + Trino (federation) — exact cross-source joins and measures via a federated engine.
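The difference between exact and approximate merging can be seen with a little arithmetic. The sketch below is not the shim's actual merge logic, which is not described here; it only shows, on made-up per-source partial results, why sums and counts combine exactly while averages and distinct counts do not when each source reports only its own figure.

```python
from statistics import mean

# Hypothetical partial results returned by two sources for the same measure.
partials = [
    {"source": "warehouse_a", "sum": 100.0, "count": 4, "customers": {1, 2, 3}},
    {"source": "warehouse_b", "sum": 10.0, "count": 1, "customers": {3, 4}},
]


def merged_sum(parts):
    # Sums add up exactly across sources.
    return sum(p["sum"] for p in parts)


def merged_count(parts):
    # Row counts add up exactly across sources.
    return sum(p["count"] for p in parts)


def average_of_averages(parts):
    # Approximate: ignores how many rows each source contributed.
    return mean(p["sum"] / p["count"] for p in parts)


def exact_average(parts):
    # Exact: needs both the sum and the count from every source.
    return merged_sum(parts) / merged_count(parts)


def summed_distinct(parts):
    # Approximate: a customer present in two sources is counted twice.
    return sum(len(p["customers"]) for p in parts)


def exact_distinct(parts):
    # Exact: needs the underlying keys, not just per-source counts.
    return len(set().union(*(p["customers"] for p in parts)))


print(merged_sum(partials), merged_count(partials))
print(average_of_averages(partials), exact_average(partials))
print(summed_distinct(partials), exact_distinct(partials))
```

With these numbers the average of averages is 17.5 while the exact average is 22.0, and the summed distinct count is 5 while the true distinct count is 4. A federated engine avoids the gap by computing the measure over the combined rows.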

Privacy model

What the LLM sees and doesn't see:

Reaches the LLM                       | Never reaches the LLM
Entity / attribute / measure names    | Row data
Your question                         | Query results
                                      | Embeddings

The generated query runs locally against your sources; only the result is returned to you. Small samples captured during profiling are used purely for discovery and classification — not sent to the LLM.
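One way to picture this boundary is a function that builds the planner's input from the model while dropping everything except names. The sketch below is an assumption about shape, not AgentData's real payload: the model layout, the samples field and build_planner_context are hypothetical.

```python
import json

# Hypothetical in-memory model; "samples" stands in for profiling samples.
MODEL = [
    {
        "name": "Customer",
        "attributes": [
            {"name": "customer_name", "pii": True, "samples": ["Ada Lovelace"]},
            {"name": "country", "pii": False, "samples": ["UK"]},
        ],
        "measures": [{"name": "customer_count", "agg": "count"}],
    }
]


def build_planner_context(model, question):
    """Return names and the question only: no samples, rows or results."""
    return {
        "question": question,
        "entities": [
            {
                "name": entity["name"],
                "attributes": [a["name"] for a in entity["attributes"]],
                "measures": [m["name"] for m in entity["measures"]],
            }
            for entity in model
        ],
    }


context = build_planner_context(MODEL, "How many customers per country?")
print(json.dumps(context, indent=2))
```

Building the context by explicit selection, rather than by deleting sensitive fields from a copy, means a newly added field stays out of the prompt unless someone deliberately includes it.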

Self-hosting & air-gapped

AgentData can run with no external egress at all. Point it at a local OpenAI-compatible model server and nothing leaves your network:

config.yaml
ai:
  provider: openai
  base_url: "http://localhost:11434/v1" # Ollama / vLLM / TGI

Other options:

  • AWS Bedrock with VPC PrivateLink — LLM traffic stays on AWS's private network; auth via instance IAM role; Bedrock doesn't retain prompts.
  • Anthropic cloud — simplest; external egress to api.anthropic.com. Best for non-sensitive or evaluation use.

Source credentials are encrypted at rest, and adapters enforce read-only access. An optional per-tenant policy (allow_llm_egress, allow_prospecting) lets you lock egress down further. For the full deployment posture — fail-closed MCP auth, the data-egress policy, audit logging and the recommended setup for regulated clients — see MCP & data-egress security.
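A fail-closed egress check might look roughly like the sketch below. The flag names allow_llm_egress and allow_prospecting come from the text; everything else is an assumption, including the default of False, the TenantPolicy class, the check_llm_call helper and the rule that loopback or private addresses count as local. The allow_prospecting flag is carried but not interpreted, since its semantics are not described here.

```python
import ipaddress
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class TenantPolicy:
    allow_llm_egress: bool = False   # assumed default: fail closed
    allow_prospecting: bool = False  # carried only; not interpreted in this sketch


def is_local_endpoint(base_url: str) -> bool:
    """Treat localhost, loopback and private IPs as local (an assumption)."""
    host = urlparse(base_url).hostname or ""
    if host == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a DNS name that cannot be classified counts as external
    return ip.is_loopback or ip.is_private


def check_llm_call(policy: TenantPolicy, base_url: str) -> None:
    """Raise PermissionError when an external call is not permitted."""
    if not is_local_endpoint(base_url) and not policy.allow_llm_egress:
        raise PermissionError(f"LLM egress to {base_url} is blocked by tenant policy")


locked_down = TenantPolicy()
check_llm_call(locked_down, "http://localhost:11434/v1")  # local, so allowed
```

Under this sketch a locked-down tenant can still reach the local model server from the configuration above, while a call to an external endpoint raises unless allow_llm_egress is set.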