AVG(text): The Quest for Quantitative Qualitativeness

Data / AI March 16, 2024 12 min read Arjun Srivastava

AVG(text): The Quest for Quantitative Qualitativeness

We can now run AVG(text) in PostgreSQL. Semantic SQL brings LLMs directly into the database, so analysts can answer the "Why" as easily as the "How much"—using the SQL they already know.

This post explains why that matters, how it works, and how to use it.


1. The gap: We analyze what we can count

Situation. Imagine Sarah, a data analyst. A product team asks: "Why are customers returning our new premium line?" The return rate is up 15%. She has the numbers—return_count, revenue_lost—but the reason is buried in 10,000 unstructured reviews and support tickets.

Complication. You can calculate AVG(sales), but you can't calculate AVG(feedback). So Sarah does what every analyst does: she manually reads 50 reviews, squints at the patterns, and writes "I think they don't like the battery life." It's slow, biased, and unscalable.

Insight. We default to quantitative data not because it's more truthful, but because it's aggregatable. We analyze what we can count.


2. The solution: Semantic primitives in SQL

pg_llm is a PostgreSQL extension that brings Large Language Models into the database. It enables Semantic Computation—analysis by meaning, not exact values—using standard SQL.

Simple case: "Is this review bad?"

Instead of guessing, Sarah writes:

SELECT
    product_category,
    COUNT(*) as reviews,
    AVG(
        CASE
            WHEN llm(
                'Is this positive? 1=yes, 0=no: {1}',
                review_text
            ) = '1' THEN 1 ELSE 0
        END
    ) as sentiment_score
FROM reviews
GROUP BY product_category;

Zero data movement. No Python scripts. Just SQL.

Advanced case: "What are the themes?"

Two primitives unlock full analytical power:

Primitive Replaces Purpose
llm_cluster GROUP BY Groups by meaning, discovers themes automatically
llm_agg SUM/AVG Synthesizes a group into a summary

The Holy Grail query—theme discovery with full Map-Reduce-Finalize control:

WITH semantic_groups AS (
    SELECT review_text, star_rating,
        llm_cluster_id(review_text, num_clusters => 10) OVER () AS cluster_id
    FROM reviews
)
SELECT cluster_id, COUNT(*) AS num_reviews, ROUND(AVG(star_rating), 2) AS avg_stars,
    llm_agg(review_text,
        'TYPE (summary text) Summarize the key theme: {1}',
        'TYPE (summary text) Combine summaries: {1} + {2}',
        'TYPE (name text) Name this group: {1}'
    ) AS cluster_name
FROM semantic_groups
GROUP BY cluster_id ORDER BY num_reviews DESC;

The first prompt extracts a summary per row; the second merges partial results (associative); the third produces the final name. Chaos becomes a structured report—"Battery Life," volume, sentiment—in seconds.


3. Why it scales: Database-native architecture

pg_llm is not a thin wrapper. It maps the LLM lifecycle to PostgreSQL's query planner:

Phase PostgreSQL Role
Map SFUNC Parallel workers extract insights from rows concurrently
Reduce COMBINEFUNC Partial results merged associatively
Finalize FINALFUNC Final summary generated

The bottleneck is I/O (API latency), not CPU—so parallelism delivers concurrent API throughput. Sequential: 100 rows × 1s = 100 seconds. With 4 workers: ~25 seconds.


4. How to write prompts: The type algebra

Aggregation prompts define a Map-Reduce pipeline. Three concepts matter:

  • I (Input): Raw text from the table column
  • A (Accumulator): Intermediate structure between parallel workers
  • O (Output): Final result

Rules:

  1. Always start with TYPE. The TYPE (field type, ...) directive returns structured JSON. Without it, parsing fails.

  2. Merge must be associative and commutative. Merge(A, B) must equal Merge(B, A). Use "merge sets" or "sum counts," not "append to list"—order can vary across workers.

Common mistakes

Mistake Consequence Fix
Forgetting TYPE Raw text breaks JSON parsing Start every prompt with TYPE (...).
Non-commutative merge Results change with worker order Ask for "merge sets," not "append lists".
Heavy logic in Final Hallucination on large data Do math in Extract/Merge. Use Final for formatting only.
Ignoring NULL state Crash on first row "If state is null/empty, initialize it."

5. The future is fuzzy

We are moving from Exact Compute (WHERE id=1) to Semantic Compute (WHERE meaning IS similar). pg_llm is the bridge. Analysts like Sarah can answer the Why as easily as the How much.

Explore Semantic SQL →