AVG(text): The Quest for Quantitative Qualitativeness
We can now run AVG(text) in PostgreSQL. Semantic SQL brings LLMs directly into the database, so analysts can answer the "Why" as easily as the "How much"—using the SQL they already know.
This post explains why that matters, how it works, and how to use it.
1. The gap: We analyze what we can count
Situation. Imagine Sarah, a data analyst. A product team asks: "Why are customers returning our new premium line?" The return rate is up 15%. She has the numbers—return_count, revenue_lost—but the reason is buried in 10,000 unstructured reviews and support tickets.
Complication. You can calculate AVG(sales), but you can't calculate AVG(feedback). So Sarah does what every analyst does: she manually reads 50 reviews, squints at the patterns, and writes "I think they don't like the battery life." It's slow, biased, and unscalable.
Insight. We default to quantitative data not because it's more truthful, but because it's aggregatable. We analyze what we can count.
2. The solution: Semantic primitives in SQL
pg_llm is a PostgreSQL extension that brings Large Language Models into the database. It enables Semantic Computation—analysis by meaning, not exact values—using standard SQL.
Simple case: "Is this review bad?"
Instead of guessing, Sarah writes:
```sql
SELECT
    product_category,
    COUNT(*) AS reviews,
    AVG(
        CASE
            WHEN llm(
                'Is this positive? 1=yes, 0=no: {1}',
                review_text
            ) = '1' THEN 1 ELSE 0
        END
    ) AS sentiment_score
FROM reviews
GROUP BY product_category;
```
Zero data movement. No Python scripts. Just SQL.
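To see the logic of that CASE/AVG pair in isolation, here is a minimal plain-Python sketch (purely illustrative; in production the work stays in SQL). The `classify` function is a hypothetical stand-in for the `llm()` call, using a keyword match instead of a model:

```python
from statistics import mean
from collections import defaultdict

def classify(text):
    # Stand-in for llm('Is this positive? 1=yes, 0=no: {1}', text):
    # returns 1 for positive, 0 for negative.
    return 1 if "great" in text.lower() else 0

reviews = [("audio", "Great sound"), ("audio", "Broke fast"), ("video", "Great screen")]

# GROUP BY product_category, then AVG the 0/1 classifications.
by_cat = defaultdict(list)
for cat, text in reviews:
    by_cat[cat].append(classify(text))

scores = {cat: mean(vals) for cat, vals in by_cat.items()}
print(scores)   # -> {'audio': 0.5, 'video': 1}
```

The average of a 0/1 classification per group is exactly the share of positive reviews, which is why the SQL reads like an ordinary aggregate.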
Advanced case: "What are the themes?"
Two primitives unlock full analytical power:
| Primitive | Replaces | Purpose |
|---|---|---|
| `llm_cluster` | `GROUP BY` | Groups by meaning, discovers themes automatically |
| `llm_agg` | `SUM`/`AVG` | Synthesizes a group into a summary |
The Holy Grail query—theme discovery with full Map-Reduce-Finalize control:
```sql
WITH semantic_groups AS (
    SELECT review_text, star_rating,
           llm_cluster_id(review_text, num_clusters => 10) OVER () AS cluster_id
    FROM reviews
)
SELECT cluster_id,
       COUNT(*) AS num_reviews,
       ROUND(AVG(star_rating), 2) AS avg_stars,
       llm_agg(review_text,
           'TYPE (summary text) Summarize the key theme: {1}',
           'TYPE (summary text) Combine summaries: {1} + {2}',
           'TYPE (name text) Name this group: {1}'
       ) AS cluster_name
FROM semantic_groups
GROUP BY cluster_id
ORDER BY num_reviews DESC;
```
The first prompt extracts a summary per row; the second merges partial results (associative); the third produces the final name. Chaos becomes a structured report—"Battery Life," volume, sentiment—in seconds.
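The control flow of those three prompts can be simulated without a database or an API key. In this sketch the "LLM calls" are plain string operations (`extract`, `merge`, `finalize` are illustrative names, not pg_llm API), so only the Map-Reduce-Finalize shape is real:

```python
from functools import reduce

def extract(row_text):
    # Map phase (prompt 1): one partial state per row.
    # Here: the first word, lowercased, as a toy "theme".
    return {row_text.split()[0].lower()}

def merge(a, b):
    # Reduce phase (prompt 2): must be associative and commutative;
    # set union satisfies both.
    return a | b

def finalize(acc):
    # Finalize phase (prompt 3): format the merged state once.
    return ", ".join(sorted(acc))

reviews = ["Battery dies fast", "battery swells", "Screen cracked"]
partials = [extract(r) for r in reviews]     # rows processed independently
merged = reduce(merge, partials, set())      # order-insensitive combine
print(finalize(merged))                      # -> battery, screen
```

Because `merge` is associative, PostgreSQL is free to split the rows across workers and combine the partial states in any order.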
3. Why it scales: Database-native architecture
pg_llm is not a thin wrapper. It maps the LLM lifecycle to PostgreSQL's query planner:
| Phase | PostgreSQL | Role |
|---|---|---|
| Map | `SFUNC` | Parallel workers extract insights from rows concurrently |
| Reduce | `COMBINEFUNC` | Partial results merged associatively |
| Finalize | `FINALFUNC` | Final summary generated |
The bottleneck is I/O (API latency), not CPU, so each extra worker adds nearly linear API throughput. Sequential: 100 rows × 1 s = 100 seconds. With 4 workers: ~25 seconds.
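That speedup is easy to reproduce outside the database. The sketch below is a plain-Python simulation, not pg_llm code: `fake_llm_call` is a hypothetical stand-in for an API request, with `time.sleep` playing the part of network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY = 0.05  # pretend each API call takes 50 ms of network wait

def fake_llm_call(row):
    time.sleep(LATENCY)   # I/O-bound: the thread just waits
    return row.upper()

rows = [f"review {i}" for i in range(20)]

start = time.perf_counter()
for r in rows:
    fake_llm_call(r)
sequential = time.perf_counter() - start      # ~20 x 50 ms = ~1.0 s

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_llm_call, rows))
parallel = time.perf_counter() - start        # ~5 rounds of 4 = ~0.25 s

print(f"sequential {sequential:.2f}s, 4 workers {parallel:.2f}s")
```

Threads are enough here precisely because the work is waiting, not computing; the same reasoning is why PostgreSQL's parallel workers pay off for LLM aggregates.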
4. How to write prompts: The type algebra
Aggregation prompts define a Map-Reduce pipeline. Three concepts matter:
- I (Input): Raw text from the table column
- A (Accumulator): Intermediate structure between parallel workers
- O (Output): Final result
Rules:

1. **Always start with `TYPE`.** The `TYPE (field type, ...)` directive makes the model return structured JSON. Without it, parsing fails.
2. **Merge must be associative and commutative.** `Merge(A, B)` must equal `Merge(B, A)`. Ask for "merge sets" or "sum counts," not "append to list": worker order can vary.
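The commutativity rule is concrete enough to demonstrate in a few lines. This sketch (illustrative names, not pg_llm API) contrasts a safe merge, summing counts, with an unsafe one, appending lists:

```python
from collections import Counter

def merge_counts(a: Counter, b: Counter) -> Counter:
    # "Sum counts": Counter addition is commutative and associative.
    return a + b

def merge_append(a: list, b: list) -> list:
    # "Append to list": concatenation depends on argument order.
    return a + b

c1, c2 = Counter(battery=3), Counter(screen=2)
print(merge_counts(c1, c2) == merge_counts(c2, c1))   # -> True

l1, l2 = ["battery"], ["screen"]
print(merge_append(l1, l2) == merge_append(l2, l1))   # -> False
```

With the list version, two runs of the same query can return differently ordered results depending on which worker finishes first, which is exactly the nondeterminism the rule forbids.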
Common mistakes
| Mistake | Consequence | Fix |
|---|---|---|
| Forgetting `TYPE` | Raw text breaks JSON parsing | Start every prompt with `TYPE (...)`. |
| Non-commutative merge | Results change with worker order | Ask for "merge sets," not "append lists." |
| Heavy logic in Final | Hallucination on large data | Do math in Extract/Merge; use Final for formatting only. |
| Ignoring NULL state | Crash on first row | Add "If state is null/empty, initialize it." |
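The NULL-state pitfall has a direct analogue in any transition function. This sketch (illustrative, not pg_llm API) shows an `SFUNC`-style function that receives `None` on the first row of each worker and initializes its accumulator instead of crashing:

```python
def transition(state, row_text):
    # SFUNC analogue: state is None (NULL) on the very first call.
    if state is None:
        state = set()   # "If state is null/empty, initialize it."
    state.add(row_text.split()[0].lower())
    return state

state = None
for row in ["Battery dies fast", "Screen cracked"]:
    state = transition(state, row)
print(sorted(state))   # -> ['battery', 'screen']
```

The same defensive line, written into the merge prompt, is what keeps the aggregate from failing on its first row.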
5. The future is fuzzy
We are moving from Exact Compute (WHERE id=1) to Semantic Compute (WHERE meaning IS similar). pg_llm is the bridge. Analysts like Sarah can answer the Why as easily as the How much.