Alembic migrations involving PostgreSQL column default values
#TIL
#Python
#PostgreSQL
Today I Learnt how to create Alembic migrations when adding a new non-nullable column to an existing table
I wanted to create an Alembic migration for a new column in a Postgres table. The column should be non-nullable and needs a default value since there is already data in this table.
There are 2 ways to do this:
- Create the column, allowing null values. Then update all rows to set a default. Then make the column non-nullable
- Create the column with a default value
In Alembic, option 1 would look something like:
def upgrade():
# Add column, then set default value, then make non-nullable
op.add_column("file", sa.Column("file_format", sa.String(), nullable=True))
op.execute("UPDATE file SET file_format = 'hex' WHERE file_format is NULL")
op.alter_column("file", "file_format", nullable=False)
whereas option 2 would look something like:
def upgrade():
op.add_column(
"file",
sa.Column("file_format", sa.String(), server_default="hex", nullable=False),
)
Initially I decided to go for option 1 since I would like to update the default value in the future and I could do this without requiring another migration (I could just set default
to another value).
However, as soon as I started running the migration I could see it was taking a long time to run, specifically the second UPDATE
command. The table was not particularly large but had around 7 million rows and this command was taking over 5 minutes to run. Since this migration was running inside of a single transaction, it was also taking a lock on the table and thus impacting live traffic - thankfully I was just running this in the staging environment!
I did a bit of searching online and found (via StackExchange) the following blog post, which implies option 2 above should be a lot faster since PostgreSQL (as of version 11) does not need to rewrite the whole table when adding a new column with a default value. This is backed up by the official PostgreSQL docs which under the ‘Notes’ section says:
When a column is added with
ADD COLUMN
and a non-volatileDEFAULT
is specified, the default is evaluated at the time of the statement and the result stored in the table's metadata. That value will be used for the column for all existing rows. If noDEFAULT
is specified, NULL is used. In neither case is a rewrite of the table required.
I changed the migration to use server_default
instead and reran the migration. It now took just 5 seconds!
(Note that versions of PostgreSQL prior to version 11 would lock the table even when using a default value due to it needing to rewrite the all data for the table on disk, so option 1 in this case would be better, although updating the rows in small batches is recommended to reduce downtime)
In summary, whenever adding a new non-nullable column to an existing table, always use server_default
rather than populating it manually since PostgreSQL knows how to optimise for this.