01

ComplianceRAG

Hybrid RAG + Analytical Agent for Regulatory Compliance

github.com/qsanchez/compliancerag · Live demo
3 regulations: GDPR · NIS2 · DORA
110+ normative articles indexed
3,142 real GDPR fines in the dataset
0.98 RAGAS Faithfulness (Golden Dataset: 30 questions)
8 delivery phases complete
What it is

A production-grade AI assistant for GDPR, NIS2, and DORA regulatory compliance. Ask it a question and it returns a grounded answer with exact article citations from 110+ normative articles. Ask it about enforcement trends and it queries a live 3,142-row GDPR fines dataset to surface penalty breakdowns, country comparisons, and fine distributions — routing each query automatically to the right pipeline.

Why I built it

Most GenAI projects live in notebooks or rely on pre-built cloud AI services at a high level. This one covers the complete engineering stack built hands-on on AWS — Bedrock for LLM and embeddings, pgvector on RDS for hybrid retrieval, LangGraph for agentic orchestration, Lambda + API Gateway for serving, and CloudWatch for observability. No pre-assembled AI pipelines: every component is implemented from the ground up with explicit architectural trade-offs documented in ADRs.

LLMOps discipline is built in from day one: prompt templates are version-controlled in Git, RAGAS evaluation against a fixed golden dataset is the quality gate for every retrieval change before it merges, and an online LLM-as-judge samples 10% of live traffic to detect distribution drift continuously — not just at evaluation time.

How I built it

Every significant technical decision is recorded in an ADR that documents the alternatives considered and their trade-offs. Implementation was AI-first: I directed Claude Code through each phase, acting as architect and reviewer. The discipline shows in what didn't get built: no speculative features, no premature abstractions, and retrieval quality checked against a RAGAS golden dataset on every change.

02

Architecture

[Figure: Context diagram]
[Figure: Data flow diagram]
[Figure: Infrastructure diagram]
[Figure: Module dependency diagram]
03

Stack

Layer | Technology | Rationale
LLM | AWS Bedrock (Claude Haiku 4.5) | Selected for latency/cost balance on compliance Q&A; Claude Sonnet available via one LiteLLM config change for higher-complexity tasks
Embeddings | Titan Embeddings v2 | 1,536-dim vectors; AWS-native data residency; Cohere Embed v3 (Bedrock) identified as next upgrade candidate, measurable via RAGAS A/B before merging
LLM abstraction | LiteLLM | Model-agnostic: swap Claude for GPT-4o, Llama 3, or Azure OpenAI with a single config change; no LLM provider lock-in
Orchestration | LangGraph | Explicit, auditable routing graph; typed state; fixed depth; each node unit-testable
Vector store | pgvector on Amazon RDS | No vendor lock-in; hybrid retrieval in one DB; cost-effective at PoC scale
Hybrid retrieval | pgvector + pg_trgm + RRF | Semantic + keyword + metadata, rank-fused without tuned interpolation weights
Analytics | Amazon S3 + Athena | Serverless SQL over Parquet; no running cost between queries
API | FastAPI + Mangum | Async, typed, OpenAPI docs; Mangum adapts ASGI to Lambda events
Serving | Lambda + API Gateway | Zero idle cost; Cognito JWT authorizer on /chat; OAC-secured S3 frontend
Auth | Amazon Cognito | Hosted UI + User Pool; JWT tokens validated at API Gateway layer
Frontend | S3 + CloudFront | SPA routing, OAC, HTTPS; vanilla JS, no build step
IaC | Terraform | Cloud-agnostic; 9 modules; no manual AWS console operations
CI/CD | GitHub Actions | Lint, typecheck, unit tests on every PR; tf-validate on infra changes
Evaluation | RAGAS | Faithfulness, answer relevancy, context precision/recall; offline + online
Observability | AWS CloudWatch | Structured JSON logs, 10 custom metric filters, per-span latency, token cost
04

Technical Features

Hybrid RAG Pipeline

Three retrieval legs fused via RRF (k=60): pgvector cosine similarity, pg_trgm trigram keyword match, and article metadata scoring. Each leg fetches top_k × 3 candidates; fusion merges by rank position alone — no interpolation weights to tune.

rag/retriever.py — _fetch_legs(), _fuse(), _RRF_K = 60
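
The rank-only fusion described above can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion with k=60, not the project's _fuse() implementation; the function name rrf_fuse, the chunk IDs, and the tie-breaking rule are assumptions made for the example.

```python
from collections import defaultdict

RRF_K = 60  # same constant the text names (k=60)


def rrf_fuse(legs: list[list[str]], top_k: int, k: int = RRF_K) -> list[str]:
    """Fuse several ranked ID lists using Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for leg in legs:
        # Only the rank position matters; raw similarity scores are ignored.
        for rank, chunk_id in enumerate(leg, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID (an assumption, for determinism).
    ranked = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ranked[:top_k]


# Hypothetical chunk IDs for the three legs.
vector_leg = ["gdpr-32", "gdpr-33", "nis2-21"]
trigram_leg = ["gdpr-33", "gdpr-32", "dora-9"]
metadata_leg = ["gdpr-32", "dora-9"]
fused = rrf_fuse([vector_leg, trigram_leg, metadata_leg], top_k=3)
```

Because each leg contributes 1 / (k + rank), a chunk that appears in several legs outranks one that is first in a single leg, which is why no per-leg weights are needed.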

LLM Reranker

All candidates ranked in a single Claude Haiku prompt; each passage gets a regulation · Art.N · title metadata header. Returns a JSON array of ranked indices. Falls back to vector-similarity order on any exception — never hard-fails.

rag/reranker.py — rerank() → json.loads(ranked_indices)
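
A hedged sketch of the parse-and-fall-back behaviour described above. The LLM call is injected as a plain callable named complete (hypothetical), so the example runs without Bedrock; the prompt wording and the handling of omitted or invalid indices are assumptions, not the project's rerank() code.

```python
import json
from typing import Callable


def rerank(query: str, passages: list[str], complete: Callable[[str], str]) -> list[str]:
    """Reorder passages using a JSON index array returned by an LLM."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages))
    prompt = (
        f"Query: {query}\n"
        "Rank the passages below from most to least relevant.\n"
        "Reply with a JSON array of passage indices only.\n"
        f"{numbered}"
    )
    try:
        indices = json.loads(complete(prompt))
        if not isinstance(indices, list):
            raise ValueError("expected a JSON array")
        order: list[int] = []
        for i in indices:
            # Keep only in-range integer indices, dropping duplicates.
            valid = isinstance(i, int) and not isinstance(i, bool) and 0 <= i < len(passages)
            if valid and i not in order:
                order.append(i)
        # Passages the model omitted keep their original relative order.
        order += [i for i in range(len(passages)) if i not in order]
        return [passages[i] for i in order]
    except Exception:
        # Any failure (bad JSON, API error) falls back to the incoming order.
        return list(passages)
```

Passing the completion function in as an argument keeps the fallback path testable with a stub that raises or returns malformed output.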

Per-Regulation Diversity

Cross-regulation queries trigger separate RRF searches per regulation with a WHERE metadata->>'regulation' filter. Post-rerank quota enforcement guarantees each detected regulation occupies at least ⌊top_k/n⌋ context slots.

rag/pipeline.py — _enforce_balance() · retriever.py — _detect_regulations()
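
One way to read the quota rule is sketched below: reserve ⌊top_k/n⌋ slots for each detected regulation, then fill what is left by reranker order. This is an assumption about how such a step could work, not the project's _enforce_balance(); the dict shape with a "regulation" key is hypothetical.

```python
def enforce_balance(ranked: list[dict], regulations: list[str], top_k: int) -> list[dict]:
    """Pick top_k chunks while guaranteeing each regulation a minimum share."""
    if not regulations:
        return ranked[:top_k]
    quota = top_k // len(regulations)  # floor(top_k / n) slots per regulation
    keep: set[int] = set()
    # Pass 1: reserve the best-ranked chunks of each regulation, up to the quota.
    for reg in regulations:
        positions = [i for i, chunk in enumerate(ranked) if chunk["regulation"] == reg]
        keep.update(positions[:quota])
    # Pass 2: fill any remaining slots strictly by reranker order.
    for i in range(len(ranked)):
        if len(keep) >= top_k:
            break
        keep.add(i)
    # Emit in the original reranker order.
    return [ranked[i] for i in sorted(keep)]
```

If a regulation has fewer candidates than its quota, the unused slots go to the next best-ranked chunks from any regulation.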

LangGraph Agentic Router

Three-node StateGraph (router → rag | analytics) with a typed AgentState. Hard conditional edge — not an LLM loop. Fixed execution depth of 2. Each node is a plain Python function, unit-testable without the full graph.

agent/graph.py — 53 lines · agent/router.py — classify() max_tokens=5
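
The control flow can be illustrated without importing LangGraph at all. The sketch below is plain Python that mirrors the shape described above (typed state, one router node, one hard branch, depth of 2); it does not use the StateGraph API, and the node bodies and the classify stub are placeholders invented for the example.

```python
from typing import Callable, Literal, TypedDict


class AgentState(TypedDict, total=False):
    question: str
    route: Literal["rag", "analytics"]
    answer: str


def router_node(state: AgentState, classify: Callable[[str], str]) -> AgentState:
    # The classifier returns a short label; anything unexpected defaults to "rag".
    label = classify(state["question"]).strip().lower()
    route = "analytics" if label == "analytics" else "rag"
    return {**state, "route": route}


def rag_node(state: AgentState) -> AgentState:
    return {**state, "answer": f"[rag] {state['question']}"}  # placeholder body


def analytics_node(state: AgentState) -> AgentState:
    return {**state, "answer": f"[analytics] {state['question']}"}  # placeholder body


def run(question: str, classify: Callable[[str], str]) -> AgentState:
    """Router, then exactly one branch: a fixed execution depth of 2."""
    state = router_node({"question": question}, classify)
    branch = {"rag": rag_node, "analytics": analytics_node}[state["route"]]
    return branch(state)
```

Because every node is an ordinary function over the state dict, each one can be called directly in a unit test with a stubbed classifier.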

Quantitative Analytics

LLM generates Athena SQL from natural language. SQL validator rejects any non-SELECT statement before execution. Results summarised in 2-3 sentences; charts returned as base64 matplotlib PNG. Dataset: 3,142 real GDPR enforcement records.

analytics_query/query_metrics.py — _validate_sql(), _generate_sql()
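
A minimal sketch of a SELECT-only gate, assuming a conservative keyword check rather than a full SQL parser. It is not the project's _validate_sql(); the forbidden-keyword list is an assumption, and the table and column names in the usage are hypothetical.

```python
import re

# Illustrative keyword list; a real validator may differ or parse the SQL instead.
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|merge|grant|revoke)\b",
    re.IGNORECASE,
)


def validate_sql(sql: str) -> str:
    """Return the cleaned statement, or raise ValueError if it is not a plain SELECT."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        raise ValueError("multiple statements are not allowed")
    if not re.match(r"select\b", statement, re.IGNORECASE):
        raise ValueError("only SELECT statements are allowed")
    if _FORBIDDEN.search(statement):
        raise ValueError("statement contains a forbidden keyword")
    return statement


# Hypothetical table and column names.
safe = validate_sql("SELECT country, COUNT(*) FROM gdpr_fines GROUP BY country;")
```

The keyword check is deliberately strict: a SELECT whose string literal contains a word such as "delete" would also be rejected, which trades a few false positives for a simpler rule.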

Per-Span Observability

Every request emits a structured JSON log with retrieve_ms, rerank_ms, generate_ms, input_tokens, output_tokens, and cost_usd. Ten CloudWatch metric filters extract these fields; dashboard shows per-span latency and token cost.

rag/pipeline.py:112 — logger.info("pipeline_spans", ...)
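
The per-span fields listed above could be collected with a small timing helper, as sketched here. The span() context manager, the span_record() function, and the per-1k-token prices in the usage are hypothetical; only the field names come from the text.

```python
import json
import time
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def span(spans: dict[str, float], name: str) -> Iterator[None]:
    """Record the wall-clock duration of a block as <name>_ms."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans[f"{name}_ms"] = round((time.perf_counter() - start) * 1000, 2)


def span_record(
    spans: dict[str, float],
    input_tokens: int,
    output_tokens: int,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
) -> str:
    """Build one JSON log line carrying latency, token, and cost fields."""
    cost = input_tokens / 1000 * usd_per_1k_input + output_tokens / 1000 * usd_per_1k_output
    return json.dumps(
        {
            "event": "pipeline_spans",
            **spans,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost_usd": round(cost, 6),
        }
    )


spans: dict[str, float] = {}
with span(spans, "retrieve"):
    pass  # the retrieval call would go here
# Prices below are placeholders, not real Bedrock pricing.
line = span_record(spans, input_tokens=1000, output_tokens=500,
                   usd_per_1k_input=0.001, usd_per_1k_output=0.005)
```

Emitting flat top-level keys keeps each field addressable by a simple JSON metric filter pattern.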

Online LLM-as-Judge

10% of production queries sampled asynchronously via FastAPI BackgroundTask. RAGAS faithfulness + answer_relevancy run on live traffic; scores emitted to CloudWatch. Alarm fires if 1-hour rolling faithfulness drops below 0.80.

ADR-010 — ONLINE_EVAL_SAMPLE_RATE=0.1
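
The sampling decision and the rolling-window check can be sketched locally as follows. In the project the alarm is a CloudWatch alarm; the RollingFaithfulness class here only simulates that logic in memory and, like should_sample(), is a hypothetical name introduced for illustration.

```python
import random
from collections import deque
from typing import Optional

ONLINE_EVAL_SAMPLE_RATE = 0.1  # 10% of live queries
FAITHFULNESS_FLOOR = 0.80      # alarm threshold from the text
WINDOW_SECONDS = 3600          # 1-hour rolling window


def should_sample(rng: random.Random, rate: float = ONLINE_EVAL_SAMPLE_RATE) -> bool:
    """Decide whether this request is sent to the background evaluator."""
    return rng.random() < rate


class RollingFaithfulness:
    """In-memory stand-in for a rolling-mean alarm over faithfulness scores."""

    def __init__(self, window_s: int = WINDOW_SECONDS) -> None:
        self._window_s = window_s
        self._points: deque = deque()  # (timestamp, score) pairs, oldest first

    def add(self, timestamp: float, score: float) -> None:
        self._points.append((timestamp, score))
        # Evict scores that fell out of the window.
        while self._points and self._points[0][0] < timestamp - self._window_s:
            self._points.popleft()

    def mean(self) -> Optional[float]:
        if not self._points:
            return None
        return sum(score for _, score in self._points) / len(self._points)

    def breached(self, floor: float = FAITHFULNESS_FLOOR) -> bool:
        current = self.mean()
        return current is not None and current < floor
```

Injecting the random generator makes the sampling rate testable with a fixed seed.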

Audit Trail

Every query written to a Postgres audit_log table: question, route, answer, citations (JSONB), model version, latency_ms, and injection_blocked flag. Schema created idempotently on first write; failure never propagates to the user response.

audit/logger.py — log_query(AuditRecord)
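
The idempotent-schema and never-propagate behaviour might look like the sketch below. It uses the standard-library sqlite3 module as a stand-in for Postgres so the example is runnable, stores citations as JSON text instead of JSONB, and returns a bool for testability; the column names follow the text, while everything else is an assumption.

```python
import json
import sqlite3
from dataclasses import dataclass, field


@dataclass
class AuditRecord:
    question: str
    route: str
    answer: str
    citations: list[dict] = field(default_factory=list)
    model_version: str = ""
    latency_ms: float = 0.0
    injection_blocked: bool = False


_DDL = """
CREATE TABLE IF NOT EXISTS audit_log (
    question TEXT, route TEXT, answer TEXT, citations TEXT,
    model_version TEXT, latency_ms REAL, injection_blocked INTEGER
)
"""


def log_query(conn: sqlite3.Connection, record: AuditRecord) -> bool:
    """Write one audit row; return False instead of raising on any failure."""
    try:
        conn.execute(_DDL)  # idempotent: a no-op once the table exists
        conn.execute(
            "INSERT INTO audit_log VALUES (?, ?, ?, ?, ?, ?, ?)",
            (
                record.question, record.route, record.answer,
                json.dumps(record.citations), record.model_version,
                record.latency_ms, int(record.injection_blocked),
            ),
        )
        conn.commit()
        return True
    except Exception:
        # Audit failures must never reach the user-facing response.
        return False
```

Swallowing the exception is a deliberate trade-off: a lost audit row is preferred over a failed answer, so a real deployment would likely also log or count these failures.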

Prompt Injection Defense

11 compiled regex patterns covering common injection templates. Unicode NFC normalization and control-character stripping applied before pattern matching. Blocked queries flagged in the audit log with injection_blocked=True.

agent/sanitizer.py — 11 _COMPILED patterns, unicodedata.normalize("NFC")
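
A sketch of the normalise, strip, then match order described above. The three patterns are illustrative examples only and are not the project's 11 patterns; the sanitize() name and its return shape are assumptions.

```python
import re
import unicodedata

# Illustrative patterns only; the real list is not reproduced here.
_PATTERNS = [
    r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions",
    r"disregard\s+(the\s+)?system\s+prompt",
    r"reveal\s+(your\s+)?system\s+prompt",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in _PATTERNS]


def sanitize(query: str) -> tuple[str, bool]:
    """Return (cleaned_query, injection_blocked)."""
    # NFC first, so composed and decomposed forms of a character compare equal.
    text = unicodedata.normalize("NFC", query)
    # Strip control characters (category Cc) but keep ordinary whitespace.
    text = "".join(
        ch for ch in text if ch in "\n\t " or unicodedata.category(ch) != "Cc"
    )
    blocked = any(pattern.search(text) for pattern in _COMPILED)
    return text, blocked
```

Stripping control characters before matching matters because an attacker can otherwise split a trigger phrase with invisible bytes that the regex would not span.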

Production Deployment

FastAPI on Lambda (Mangum ASGI adapter) behind HTTP API Gateway. Cognito User Pool + Hosted UI with JWT authorizer on /chat. Frontend on S3 + CloudFront with OAC and SPA routing. One-command deploy: task app:deploy.

infra/modules/ — lambda, api_gateway, cognito, frontend

IaC — 9 Terraform Modules

Fully reproducible infrastructure: networking, rds (pgvector), s3, athena, lambda, api_gateway, cognito, frontend, cloudwatch. Local and prod environments via environments/*.tfvars. No manual AWS console operations.

LLMOps & CI/CD

Prompt templates are version-controlled in Git as plain text files, loaded by path at runtime — never inline strings. GitHub Actions runs ruff lint, format check, mypy, and unit tests on every PR. RAGAS evaluation and integration tests require live AWS infrastructure (Bedrock + private RDS) and run manually — the golden dataset is the quality gate before any retrieval change merges (see ADR-007). A separate workflow validates terraform fmt + validate on infra changes.

.github/workflows/ci.yml — lint + typecheck + unit tests  ·  rag/prompts/ — versioned prompt files  ·  ADR-007

Context Engineering

Retrieved chunks assembled into a structured context window with exact article citations by context_builder.py. System and user prompt templates versioned in Git under rag/prompts/ — loaded by path, never inline strings. Pydantic v2 models enforce structured JSON outputs at the API boundary. Injection-aware: sanitised queries only; blocked input never reaches context assembly.

rag/context_builder.py  ·  rag/prompts/ — rag_system.txt + rag_user.txt
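
Context assembly with citation headers could look roughly like this. The header layout reuses the regulation · Art.N · title form mentioned for the reranker; the Chunk dataclass, the numbering scheme, and the sample text are hypothetical and not taken from context_builder.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    regulation: str
    article: str
    title: str
    text: str


def build_context(chunks: list[Chunk]) -> str:
    """Join retrieved chunks into one context string with citable headers."""
    blocks = []
    for n, chunk in enumerate(chunks, start=1):
        header = f"[{n}] {chunk.regulation} · Art.{chunk.article} · {chunk.title}"
        blocks.append(f"{header}\n{chunk.text.strip()}")
    return "\n\n".join(blocks)


# Placeholder chunk text, not quoted from the regulations.
context = build_context(
    [
        Chunk("GDPR", "32", "Security of processing", "  Placeholder text A.  "),
        Chunk("DORA", "9", "Protection and prevention", "Placeholder text B."),
    ]
)
```

Giving every block a stable numbered header lets the answer cite an exact article by pointing at the block it drew from.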
05

Evaluation — RAGAS

Final scores on 30 questions across GDPR, NIS2, and DORA, with the LLM reranker and per-regulation diversity enabled. Reports: evaluation/reports/

Faithfulness: 0.98
Answer Relevancy: 0.91
Context Precision: 0.83
Context Recall: 0.85

The Phase 2 regression is a diagnostic signal, not a failure. Phase 1 evaluated 10 GDPR questions with semantic-only retrieval — scores were strong for a single-regulation corpus. Phase 2 expanded to 30 questions across three regulations with the cross-encoder reranker still on Chroma; without hybrid retrieval or per-regulation diversity the retriever failed to cover NIS2 and DORA adequately, causing answer relevancy to drop from 0.88 to 0.29 and context precision from 0.73 to 0.18. Phase 5 resolved this with pgvector hybrid retrieval, per-regulation diversity quotas, and the LLM reranker — recovering all four metrics to or above Phase 1 levels on 3× as many questions across 3× as many regulations.

06

Architecture Decisions

Four of ten ADRs — selected for trade-off depth. Full record in docs/adr/.

ADR-009
Reranker Infrastructure — three pivots to LLM reranking
Accepted — Amended

The reranker started as a local PyTorch cross-encoder and required three pivots before a working production solution landed.

  1. Cross-encoder: local PyTorch (ms-marco-MiniLM-L-6-v2)
    Lambda cold starts trigger PyTorch JIT warmup: 30–60 s on Graviton2 ARM64 (no AVX-512). At fetch_k=15 candidates, that is 15 sequential forward passes on a cold CPU, consistently hitting the 60 s function timeout. Fundamentally incompatible with stateless serverless compute.
  2. Cohere Rerank v3.5 via Amazon Bedrock
    Same cross-encoder algorithm, inference offloaded to Cohere GPU infrastructure. No PyTorch, no cold-start penalty, IAM-controlled, ~100–200 ms round-trip. Required migration eu-west-1 → us-east-1 (the only region where it was available). The migration completed, but bedrock-agent-runtime.rerank() then returned HTTP 403: an AWS Marketplace subscription wall with no self-service activation path. The Marketplace listing prices Cohere Rerank as a dedicated SageMaker endpoint at $3.50/host/hour, a different product from the serverless Bedrock API.
  3. LLM reranking: current approach
    All candidates ranked in a single Claude Haiku completion. Each passage is annotated with [regulation · Art.N · title]; model returns a JSON array of ranked indices. Graceful fallback to vector-similarity order on any failure. Zero new AWS resources or IAM permissions beyond what generation already requires.
Lesson: Lambda handles request routing, LLM API calls, and SQL queries. Managed APIs (Bedrock) handle GPU-backed inference. Persistent compute (ECS/EC2) handles models that must stay resident between requests. Recognising this boundary early is an architectural decision, not a workaround.
ADR-003 + Amendment
Hybrid Retrieval — three-leg RRF + cross-regulation diversity
Accepted — Amended

Regulatory text requires two distinct retrieval signals. Semantic similarity retrieves articles conceptually related to the query even without surface vocabulary overlap. Keyword matching retrieves articles with exact legal references — "Article 32", "Recital 83" — that pure semantic search consistently missed in early RAGAS runs. Metadata scoring rewards chunks whose article_number and title fields match query terms. Three legs run in parallel; RRF (k=60) merges by rank position alone — no interpolation weight to tune.

Amendment (Phase 7): For cross-regulation queries, standard RRF returned candidates from one regulation only — the query embedding gravitates toward whichever regulation dominates the corpus. Fix: detect regulation names in the query, run a separate three-leg RRF search per regulation with a WHERE metadata->>'regulation' = %s filter, allocate retrieval slots evenly. Post-rerank quota enforcement (_enforce_balance()) guarantees each regulation occupies at least ⌊top_k/n⌋ context slots even if the reranker would otherwise favour one side.
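
The detection and even-allocation steps of the amendment can be sketched as below. The regex patterns, the allocate_slots() helper, and the rule for distributing a remainder are assumptions for illustration; they are not the project's _detect_regulations() code.

```python
import re

# Hypothetical name patterns; the real detector may match differently.
_REGULATION_PATTERNS = {
    "GDPR": re.compile(r"\bgdpr\b|general data protection regulation", re.IGNORECASE),
    "NIS2": re.compile(r"\bnis\s?-?2\b", re.IGNORECASE),
    "DORA": re.compile(r"\bdora\b|digital operational resilience act", re.IGNORECASE),
}


def detect_regulations(query: str) -> list[str]:
    """Return the regulations named in the query, in a fixed order."""
    return [name for name, pattern in _REGULATION_PATTERNS.items() if pattern.search(query)]


def allocate_slots(regulations: list[str], fetch_k: int) -> dict[str, int]:
    """Split fetch_k retrieval slots as evenly as possible across regulations."""
    if not regulations:
        return {}
    base, extra = divmod(fetch_k, len(regulations))
    # Earlier regulations absorb the remainder, one slot each.
    return {reg: base + (1 if i < extra else 0) for i, reg in enumerate(regulations)}


regs = detect_regulations("Compare GDPR and DORA incident reporting deadlines")
slots = allocate_slots(regs, fetch_k=15)
```

Each regulation's slot count would then bound its own filtered three-leg search, so a dominant regulation cannot crowd out the others before reranking.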
ADR-010
Observability: CloudWatch-first over LangSmith
Accepted

The Lambda function runs in a private subnet with no NAT gateway. It cannot reach api.smith.langchain.com; all LangSmith traces are silently dropped. Adding a NAT gateway costs ~$32/month plus data transfer — unjustified for a PoC solely to recover a tracing UI.

LangSmith's two value propositions required separate replacements: per-span operational metrics (retrieve/rerank/generate latency, token counts, cost per query) become enriched structured log fields + CloudWatch metric filters; quality monitoring on real traffic becomes an online LLM-as-judge evaluator running asynchronously on 10% of production queries — detecting query distribution shift and regulation data staleness that the fixed offline golden dataset cannot surface.

Trade-off: LangSmith's waterfall trace UI (full call tree with inputs and outputs at each node) is not replicable in CloudWatch without a structured trace viewer. Mitigated by per-span timing fields in every log line, queryable via Log Insights to reconstruct the latency breakdown per request.
ADR-001
Vector Store: pgvector over managed vector databases
Accepted

pgvector on Amazon RDS PostgreSQL over Pinecone, Weaviate, and Qdrant for three reasons: hybrid retrieval in one store (pg_trgm lives in the same DB as the vectors, enabling semantic + keyword RRF without a separate BM25 service); no vendor lock-in (standard PostgreSQL — switching cloud providers requires no application code changes); cost ($0 marginal cost beyond the existing RDS instance vs. managed vector DB minimum-tier pricing at PoC scale).

Upgrade path: If corpus exceeds ~10M chunks or query concurrency demands sub-10ms p99 latency, add HNSW indexing to ingestion/indexer.py:_ensure_pgvector_schema without changing the retrieval interface, or evaluate Cohere Embed v3 (on Bedrock) as a drop-in embedding upgrade — measure via RAGAS comparison before merging.