You've probably heard many times that the database is the bottleneck of many web applications. This isn't necessarily true. Often, heavy queries can be substantially optimized, making them really efficient and fast. As time passes, however, the data can grow remarkably in size, especially over several years, and that can indeed make the database a bottleneck - the tables are huge and no longer fit into memory, the indexes are even bigger, making queries much slower, and there's not much left to optimize in the queries themselves. In many cases deleting old records that are no longer used is not an option - they can still have some value, e.g. for data analysis or statistical purposes. Fortunately, it's not a dead end. You can make your database performance great again with a new secret weapon: table partitioning.

Table partitioning and inheritance 101

To put it in simple words, table partitioning is about splitting one big table into several smaller units. In PostgreSQL it can be done by creating a master table that serves as a template for the child tables that will inherit from it. The master table contains no data, and you shouldn't add any indexes or unique constraints to it. However, if you need some CHECK CONSTRAINTS in all child tables, it is a good idea to add them to the master table, as they will be inherited. Thanks to the inheritance mechanism, there is no point in adding any columns to the child tables either. Creating tables that inherit from other tables is pretty straightforward - you just need to use the INHERITS clause:

CREATE TABLE orders(
  id serial NOT NULL,
  client_id integer NOT NULL,
  created_at timestamp without time zone NOT NULL,
  updated_at timestamp without time zone NOT NULL
);

CREATE TABLE other_orders() INHERITS (orders);

and that's it - all columns and extra constraints will be defined on other_orders thanks to the inheritance from the orders table.
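
It's easy to verify the inheritance semantics: a row inserted into the child table is visible when querying the parent, while the ONLY keyword restricts the query to the parent table itself. A quick sketch, assuming the two tables above:

INSERT INTO other_orders (client_id, created_at, updated_at)
VALUES (1, now(), now());

SELECT count(*) FROM orders;       -- 1: reads from orders and all its children
SELECT count(*) FROM ONLY orders;  -- 0: ONLY skips the child tables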

Defining constraints for partitioning criteria

As we are going to split one big table into smaller ones, we need some criterion to decide which table the data should be put in. We can do it either by range (e.g. created_at between 01.01.2016 and 31.12.2016) or by value (client_id equal to 100). To ensure we only have valid data that always satisfies our partitioning criterion, we should add proper CHECK CONSTRAINTS as guard statements. To make sure that the orders in a particular table were created in 2016, for example, we could add the following constraint:

CHECK (created_at >= DATE '2016-01-01' AND created_at <= DATE '2016-12-31')

If we were to create tables for orders for the upcoming few years (assuming that we want to cover an entire year in each of them), we could do the following (one caveat: since created_at is a timestamp, an inclusive upper bound like DATE '2016-12-31' means midnight at the start of that day, so rows created later on December 31 would fall outside the range - an exclusive bound such as created_at < DATE '2017-01-01' is safer, but I'll stick to the inclusive version here for consistency):

CREATE TABLE orders_2016 (
    CHECK (created_at >= DATE '2016-01-01' AND created_at <= DATE '2016-12-31')
) INHERITS (orders);
CREATE TABLE orders_2017 (
    CHECK (created_at >= DATE '2017-01-01' AND created_at <= DATE '2017-12-31')
) INHERITS (orders);
CREATE TABLE orders_2018 (
    CHECK (created_at >= DATE '2018-01-01' AND created_at <= DATE '2018-12-31')
) INHERITS (orders);
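
Thanks to these guard constraints, a row that doesn't match a table's range is rejected outright. For example (the exact constraint name in the error message is auto-generated by PostgreSQL):

INSERT INTO orders_2016 (client_id, created_at, updated_at)
VALUES (1, TIMESTAMP '2017-06-15 10:00:00', TIMESTAMP '2017-06-15 10:00:00');
-- ERROR:  new row for relation "orders_2016" violates check constraint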

Beware of a potential gotcha when defining CHECK CONSTRAINTS. Take a look at the following example:

CHECK (client_id >= 1 AND client_id <= 1000);
CHECK (client_id >= 1000 AND client_id <= 2000);

In this case orders with client_id equal to 1000 could be inserted into either of these tables. Make sure the constraints are not inclusive on the same value when using ranges to avoid such problems.
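
One way to fix it is to make the upper bound of each range exclusive:

CHECK (client_id >= 1 AND client_id < 1000);
CHECK (client_id >= 1000 AND client_id < 2000);

Now client_id equal to 1000 satisfies exactly one of the constraints.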

Performance optimization

To get decent performance, it is also important to enable constraint_exclusion in postgresql.conf. You can set it either to on or to partition:

constraint_exclusion = on # on, off, or partition

or

constraint_exclusion = partition  # on, off, or partition

That way we can avoid scanning all the child tables when the CHECK CONSTRAINTS exclude them based on the query conditions. Compare the query plans between queries with constraint_exclusion disabled and enabled:

# SET constraint_exclusion = off;
SET

# EXPLAIN (ANALYZE ON, BUFFERS ON) SELECT * FROM orders WHERE created_at >= DATE '2018-01-01' AND created_at <= DATE '2018-12-31';
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..97.95 rows=25 width=24) (actual time=0.001..0.001 rows=0 loops=1)
   ->  Seq Scan on orders  (cost=0.00..0.00 rows=1 width=24) (actual time=0.001..0.001 rows=0 loops=1)
         Filter: ((created_at >= '2018-01-01'::date) AND (created_at <= '2018-12-31'::date))
   ->  Seq Scan on orders_2016  (cost=0.00..32.65 rows=8 width=24) (actual time=0.000..0.000 rows=0 loops=1)
         Filter: ((created_at >= '2018-01-01'::date) AND (created_at <= '2018-12-31'::date))
   ->  Seq Scan on orders_2017  (cost=0.00..32.65 rows=8 width=24) (actual time=0.000..0.000 rows=0 loops=1)
         Filter: ((created_at >= '2018-01-01'::date) AND (created_at <= '2018-12-31'::date))
   ->  Seq Scan on orders_2018  (cost=0.00..32.65 rows=8 width=24) (actual time=0.000..0.000 rows=0 loops=1)
         Filter: ((created_at >= '2018-01-01'::date) AND (created_at <= '2018-12-31'::date))
# SET constraint_exclusion = partition;
SET

# EXPLAIN SELECT * FROM orders WHERE created_at >= DATE '2018-01-01' AND created_at <= DATE '2018-12-31';
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Append  (cost=0.00..32.65 rows=9 width=24)
   ->  Seq Scan on orders  (cost=0.00..0.00 rows=1 width=24)
         Filter: ((created_at >= '2018-01-01'::date) AND (created_at <= '2018-12-31'::date))
   ->  Seq Scan on orders_2018  (cost=0.00..32.65 rows=8 width=24)
         Filter: ((created_at >= '2018-01-01'::date) AND (created_at <= '2018-12-31'::date))
(5 rows)

We wanted to select only the orders created in 2018, so there is no need to scan the other child tables, which don't contain such data.

The difference between the on and partition settings is that the former checks constraints for all queries, while the latter does so only for child tables inheriting from a parent table (and also for UNION ALL subqueries).
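
For example, with the partition setting the planner can still exclude branches of a UNION ALL whose CHECK constraints contradict the query conditions. A sketch reusing the tables above:

SELECT *
FROM (
  SELECT * FROM orders_2016
  UNION ALL
  SELECT * FROM orders_2018
) AS combined_orders
WHERE created_at >= DATE '2018-01-01' AND created_at <= DATE '2018-12-31';
-- the CHECK constraint on orders_2016 contradicts the WHERE clause,
-- so that branch should be excluded from the plan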

Create new records easily with triggers

Having multiple tables certainly makes creating new records harder to manage. Always remembering that they should be inserted into a specific table can be cumbersome. Fortunately, there's an excellent solution to this problem: PostgreSQL triggers! If you are not familiar with them, you can read my previous blog post about database triggers.

We can automate the insertion process by checking the value of created_at and deciding which table the row should be put in. Here's an example of how we could approach it:

CREATE OR REPLACE FUNCTION orders_create_function()
RETURNS TRIGGER AS $$
BEGIN
    IF (NEW.created_at >= DATE '2016-01-01' AND
        NEW.created_at <= DATE '2016-12-31') THEN
        INSERT INTO orders_2016 VALUES (NEW.*);
    ELSIF (NEW.created_at >= DATE '2017-01-01' AND
           NEW.created_at <= DATE '2017-12-31') THEN
        INSERT INTO orders_2017 VALUES (NEW.*);
    ELSIF (NEW.created_at >= DATE '2018-01-01' AND
           NEW.created_at <= DATE '2018-12-31') THEN
        INSERT INTO orders_2018 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'Date out of range, probably child table is missing';
    END IF;
    RETURN NULL;
END;
$$
LANGUAGE plpgsql;

CREATE TRIGGER orders_create_trigger
BEFORE INSERT ON orders
FOR EACH ROW EXECUTE PROCEDURE orders_create_function();
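
With the trigger in place we can insert through the master table and let PostgreSQL route the row to the proper partition. A quick sanity check using the tables defined above:

INSERT INTO orders (client_id, created_at, updated_at)
VALUES (100, TIMESTAMP '2017-06-01 12:00:00', TIMESTAMP '2017-06-01 12:00:00');

SELECT count(*) FROM orders_2017;  -- 1: the trigger redirected the row here
SELECT count(*) FROM ONLY orders;  -- 0: the master table stays empty

Note that because the trigger function returns NULL, the original INSERT into orders is suppressed - which also means INSERT ... RETURNING won't return the new row.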

There's a potential gotcha though: as we are dealing with multiple tables, the ids of the orders might not necessarily be unique across all of them. When creating new records we can ensure that the next id is based on a "global" sequence shared by the partitioned tables, but that still leaves the possibility of duplicated ids in some tables, e.g. through an accidental update of the id. Probably the best way to make sure there are no duplicates across all partitioned tables is to use UUIDs, which are virtually guaranteed to be unique.
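
A minimal sketch of the UUID approach (this assumes the pgcrypto extension for gen_random_uuid(); on PostgreSQL 13 and newer the function is built in):

CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE TABLE orders(
  id uuid NOT NULL DEFAULT gen_random_uuid(),
  client_id integer NOT NULL,
  created_at timestamp without time zone NOT NULL,
  updated_at timestamp without time zone NOT NULL
);
-- child tables inherit the column default, so every partition gets globally unique ids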

Wrapping up

Table partitioning might be a great way to improve the performance of your application when the amount of data in the tables is huge (especially when they no longer fit into memory). In this first part, we learned some basics of table partitioning from a raw SQL perspective. In the next one we will apply this concept to a real-world Rails application.
