Part II: Retrieving Data Chapter 5: Aggregation and Grouping

Part II: Retrieving Data

Chapter 5: Aggregation and Grouping

Summarizing data is essential when you need high-level insights from large tables. Aggregate functions let you collapse detailed rows into single metrics—like totals, averages, or counts. Grouping then partitions those rows into buckets for segmented analysis. In this chapter, we’ll explore:

Core aggregate functions: COUNT, SUM, AVG, MIN, MAX
Using GROUP BY to create logical buckets
Filtering groups with HAVING
Handling NULL values within aggregations
Practical examples for generating charts and reports

1. Aggregation Functions Overview

Aggregate functions process multiple rows to produce a single summary value. They ignore row-level granularity and calculate metrics across a set:

COUNT(expr) returns the number of non-NULL values or * for all rows
SUM(expr) adds numeric values across rows
AVG(expr) computes the average of numeric values
MIN(expr) finds the smallest value
MAX(expr) finds the largest value

Each function can be applied to a column, an expression, or * in the case of COUNT. You cannot mix aggregates and non-aggregates without grouping.

sql

SELECT
  COUNT(*)     AS total_orders,
  SUM(amount)  AS total_revenue,
  AVG(amount)  AS average_order,
  MIN(amount)  AS smallest_order,
  MAX(amount)  AS largest_order
FROM orders;

2. Counting Rows and Calculating Sums

2.1 COUNT

COUNT(*) counts all rows, including those with NULL values in other columns
COUNT(column) counts only non-NULL values in that column

sql

-- Total number of orders
SELECT COUNT(*) AS order_count
FROM orders;

2.2 SUM

SUM(column) returns the total of numeric values, ignoring NULL
Can be wrapped in COALESCE to treat NULL as zero

sql

-- Total revenue from completed orders
SELECT SUM(amount) AS revenue
FROM orders
WHERE status = 'completed';

3. Averages, Minimums, and Maximums

3.1 AVG

Calculates the mean of non-NULL numeric values
Subject to rounding—use ROUND() for formatting

sql

SELECT AVG(amount) AS avg_order_value
FROM orders
WHERE order_date >= '2025-01-01';

3.2 MIN and MAX

MIN(column) finds the smallest value
MAX(column) finds the largest value

sql

SELECT
  MIN(order_date) AS first_order,
  MAX(order_date) AS last_order
FROM orders
WHERE customer_id = 123;

4. Grouping Data into Buckets with GROUP BY

The GROUP BY clause partitions rows by one or more columns. Each unique combination becomes a bucket for aggregation.

sql

SELECT
  customer_id,
  COUNT(*)      AS orders_placed,
  SUM(amount)   AS total_spent
FROM orders
GROUP BY customer_id;

– You can group by multiple columns:

sql

GROUP BY year, month, product_category

– All non-aggregate columns in SELECT must appear in GROUP BY.

5. Filtering Aggregated Groups with HAVING

WHERE filters individual rows before aggregation; HAVING filters groups after aggregation.

sql

-- Top customers who spent over $1,000
SELECT
  customer_id,
  SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 1000;

Use HAVING sparingly—pushing filters into WHERE (when possible) usually performs better.

6. Dealing with NULLs in Grouped Data

NULL values can skew counts and averages:

COUNT(column) ignores NULL; COUNT(*) includes them
SUM(NULL) and AVG(NULL) skip NULL rows
Use COALESCE(column, default) to replace NULL before aggregation

sql

SELECT
  region,
  COUNT(*)                     AS total_customers,
  COUNT(email)                 AS customers_with_email,
  SUM(COALESCE(sales, 0))      AS total_sales
FROM customers
GROUP BY region;

This ensures that missing sales values are treated as zero rather than excluded.

7. Practical Examples

7.1 Monthly Sales Summary

sql

SELECT
  DATE_TRUNC('month', order_date) AS month,
  SUM(amount)                      AS revenue,
  COUNT(*)                         AS orders_count,
  AVG(amount)                      AS avg_order
FROM orders
GROUP BY month
ORDER BY month;

7.2 Product Category Performance

sql

SELECT
  p.category,
  SUM(oi.quantity)    AS units_sold,
  SUM(oi.quantity * oi.unit_price) AS revenue
FROM order_items oi
JOIN products p ON oi.product_id = p.id
GROUP BY p.category
HAVING SUM(oi.quantity) > 100;

8. Generating Charts and Reports

Aggregated query results are the basis for dashboards and visualizations:

Export results to BI tools (Tableau, Power BI) via CSV or direct connections
Build time-series charts from monthly or quarterly buckets
Compare categories side by side using bar or pie charts
Highlight outliers with conditional formatting

Example CSV export:

sql

COPY (
  SELECT month, revenue
  FROM (
    SELECT
      DATE_TRUNC('month', order_date) AS month,
      SUM(amount)                    AS revenue
    FROM orders
    GROUP BY month
  ) sub
  ORDER BY month
) TO '/tmp/monthly_revenue.csv' CSV HEADER;

9. Performance Considerations and Best Practices

Index grouping columns: Speeds up grouping on large datasets
Pre-aggregate: Use materialized views for expensive, frequently queried summaries
Avoid unnecessary columns: Selecting only aggregates and group keys reduces I/O
Use sampling: For interactive analysis on massive tables, sample a subset

Conclusion

Aggregation and grouping transform row-level details into actionable insights at a glance. With COUNT, SUM, AVG, MIN, MAX, plus GROUP BY and HAVING, you can summarize millions of transactions in seconds. Handling NULLs and applying best practices ensures accuracy and performance. In the next chapter, we’ll explore subqueries and derived tables to layer complex analyses seamlessly.

Big Data and AI

Search This Blog