Google Summer of Code: Toolkit for StreamField data migrations in Wagtail
During Google Summer of Code, I created a toolkit that makes writing data migrations for StreamField data easier.
Google Summer of Code is a program that is intended to introduce new contributors from all around the world to open source software development projects. Wagtail took part in GSoC 2022 with several projects, including this one. The aim of this particular project was to develop a toolkit with utilities to help write data migrations for changes made to StreamField block definitions in Wagtail.
Like any other GSoC project, our team included a contributor (me) and mentors. To introduce myself, my name is Sandil Ranasinghe and I am currently a CS engineering undergraduate at the University of Moratuwa, Sri Lanka and a full stack developer in my free time. My mentors, Jacob Topp-Mugglestone, Joshua Munn and Karl Hobley are all very experienced developers from Torchbox.
What is a StreamField?
Instead of a RichTextField, where you edit all the content and its formatting (layout, styles, structure etc.) in one place, StreamField lets you use content ‘blocks’ with a predefined format that can simply be created, inserted where you want, and filled in without worrying about the formatting. For example, you could add a quote saying "The sky is blue" in a RichTextField with a bold font, followed by the name of the person saying it in italics. Alternatively, you could use a quote block defined in a StreamField and add "The sky is blue" and the person's name as content. In the latter case, the content editor only has to worry about the words and name (the content) used in the block, because the formatting (font weight, style, structure etc.) for that block type (a quote block in this example) has already been defined by the developer. You can think of StreamField blocks as ready-made templates for adding types of content, whether they are headings, images, or complex forms. This article goes over the advantages and uses of StreamField blocks.
What's tricky about writing data migrations for StreamFields?
A StreamField is stored as a single column of JSON data in the database, with all of its blocks contained within that JSON representation. These blocks can also be nested to form complex block structures. While the structure of the StreamField is defined in the code, it is not reflected in the database schema. Therefore, as far as Django is concerned when making schema migrations, everything inside this column is just JSON data, and there are no changes to be made to the database schema even when the StreamField definition is changed. As a result, whenever changes are made to the StreamField's definition, any existing data must be converted into the required structure by a manually written data migration.
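To make this concrete, here is a rough sketch of what such a column's contents might look like (the block names here are hypothetical, matching the examples used later in this post):

```python
import json

# To the database (and to Django's schema migrations) the whole
# StreamField is just one text value holding JSON.
raw_value = json.dumps([
    {"type": "somestreamblock", "value": [
        {"type": "title", "value": "The sky is blue"},
    ]},
    {"type": "paragraph", "value": "Some body text."},
])

# Renaming the 'title' child block in the Python block definition
# changes nothing here: the stored data still says "title" until a
# data migration rewrites it.
data = json.loads(raw_value)
print(data[0]["value"][0]["type"])  # -> title
```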
While it is fairly straightforward for a very simple change, like renaming a top level CharBlock, it can easily become very complicated when nested blocks and multiple blocks are involved.
This is what a migration that is written manually looks like:
```python
from django.db import migrations
from wagtail.blocks import StreamValue


def forward(apps, schema_editor):
    BlogPage = apps.get_model("blog", "BlogPage")
    for bp in BlogPage.objects.all():
        stream_data = []
        mapped = False
        for block in bp.content.raw_data:
            if block["type"] == "somestreamblock":
                for child_block in block["value"]:
                    if child_block["type"] == "title":
                        mapped = True
                        child_block["type"] = "heading"
                stream_data.append(block)
            elif block["type"] == "anotherstreamblock":
                for child_block in block["value"]:
                    if child_block["type"] == "title":
                        mapped = True
                        child_block["type"] = "heading"
                stream_data.append(block)
            else:
                stream_data.append(block)
        if mapped:
            stream_block = bp.content.stream_block
            bp.content = StreamValue(stream_block, stream_data, is_lazy=True)
            bp.save()


class Migration(migrations.Migration):
    dependencies = [
        ("blog", "0004_alter_blogpage_content"),
    ]
    operations = [
        migrations.RunPython(forward),
    ]
```
As you can see, this approach requires writing a lot of code, and you can imagine that it would get even more complicated if we were dealing with another level of nesting. That's why it would be great if there was a way to do it without having to write so much code or deal with too much complexity.
What we wanted to do
The end goal was to create a package that provides reusable utilities for easily updating StreamFields within data migrations. This project would include utilities for recursing through different types of StreamField structures, recognising blocks, and making changes. In addition, there would be a set of functions to make appropriate changes for the most common data migration use cases (such as adding a new block with required value, changing the type of a block, or moving a block to within a StructBlock). The utilities would also help make the same changes to revision data as well.
In addition, we wanted to explore the possibility of automatically detecting basic changes (like renaming or removing blocks) made to StreamField structures and generating corresponding data migrations.
How we decided to approach this
There were several things we had to do before starting work on writing code for our toolkit. Before the beginning of the coding period, our first step was to find out what common changes were made by the community that required writing data migrations. We were able to recognize and list several common use cases, such as renaming and removing blocks, moving a block inside a new StructBlock, etc.
Then we moved on to an exploratory phase where we created some sample data with StreamFields in a new Wagtail project and experimented with writing manual data migrations for changes made to it, like renaming blocks, moving blocks inside StructBlocks, etc. During this experiment, we were able to discuss and decide on the approach we would take when writing our package, answering questions that we had. For example, one question that came up here was how we were going to deal with our data: were we going to load it as blocks, as semi-validated data with `.raw_data`, or as raw data straight from the database column? We also discussed how recursing through different types of block structures would be done.
By comparing our findings to what is involved in writing a manual migration, we came up with the following three-part approach. First, we would need to get the model containing the StreamField from the current project state, query all of its instances, alter the data for each instance as needed, and then save it again; a similar process would be needed if revisions are being updated too. For this we needed a utility that handles all the querying, applying of changes, and updating of the database. Second, once we have the data for each instance, we need to alter it as needed. Unlike other fields, where we directly have the specific data we need, a StreamField will also contain unrelated blocks, parent blocks, etc. So we needed a utility to recurse through different types of StreamField structures (blocks nested in StructBlock, StreamBlock, ListBlock), obtain the specific blocks containing the data we want to alter, and map all the old blocks to new blocks. Finally, once we have the blocks we want to alter, we would need to write the logic for the actual change (for example, renaming a block). For this we needed a set of common operations that can be applied, as well as a way to write custom operations.
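The second part, recursing through raw data to find the blocks at a given path, can be sketched in plain Python. This is a simplified illustration, not the toolkit's actual implementation: it assumes every level of nesting is a list of `{'type', 'value'}` dicts (the real toolkit also handles StructBlock and ListBlock values, which are shaped differently), and `map_blocks` and `rename_to_heading` are hypothetical names.

```python
def map_blocks(raw_data, block_path, operation):
    """Apply `operation` to every block matching `block_path` in raw
    stream data, returning a new list of blocks (the input is not
    mutated)."""
    name, rest = block_path[0], block_path[1:]
    mapped = []
    for block in raw_data:
        if block["type"] == name:
            if rest:
                # Not at the target depth yet: recurse into the children.
                block = {**block, "value": map_blocks(block["value"], rest, operation)}
            else:
                block = operation(block)
        mapped.append(block)
    return mapped


# A hypothetical operation: rename a block type.
def rename_to_heading(block):
    return {**block, "type": "heading"}


raw = [{"type": "somestreamblock", "value": [{"type": "title", "value": "Hi"}]}]
result = map_blocks(raw, ["somestreamblock", "title"], rename_to_heading)
```

The caller only supplies a path of block names; the recursion takes care of walking past intermediate parents, which is essentially what the block path mechanism described later does.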
A look at progress during the project
Writing the actual code started with a test-driven approach for creating the logic for recursing through types of StreamField structures. This involved writing tests with expected changes for raw (JSON-like) data to form different structures with basic rename and remove operations. After creating tests, the recursion logic as well as the data operations (rename, remove) were developed.
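A test in that style could look roughly like the following. This is a hypothetical, simplified example: `rename_top_level` is a stand-in for the toolkit's actual rename operation, and the test only covers flat (un-nested) raw data.

```python
import unittest


def rename_top_level(raw_data, old_name, new_name):
    """Minimal stand-in for a rename operation on raw stream data."""
    return [
        {**block, "type": new_name} if block["type"] == old_name else block
        for block in raw_data
    ]


class RenameTestCase(unittest.TestCase):
    def test_rename_top_level_block(self):
        raw = [
            {"type": "title", "value": "The sky is blue"},
            {"type": "paragraph", "value": "Body"},
        ]
        expected = [
            {"type": "heading", "value": "The sky is blue"},
            {"type": "paragraph", "value": "Body"},
        ]
        self.assertEqual(rename_top_level(raw, "title", "heading"), expected)
```

Writing the expected raw data by hand first, then implementing the recursion and operations until the tests pass, is what made the test-driven approach workable here.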
Some challenges we faced involved working with the representation of models in the ProjectState at the time of migration as well as supporting both Wagtail 3 and 4, which had significant differences when it came to the Revision models and reusing code. Working on the autodetect feature was also a challenge because we had to do some digging into how Django's migrations/autodetection etc. worked.
In addition to writing code, I tracked my progress through weekly meetings with my mentors and through code reviews. Of course, writing comments and documenting the code was also a big part of our work.
What we came up with
We created a `MigrateStreamData` class that handles all the querying, applying changes, and saving. Functions for recursing through the different kinds of structures and obtaining blocks corresponding to a given block path can also be called from within `MigrateStreamData` when the relevant block path/s are given. A set of sub-operations also covers common use cases like renaming blocks, removing blocks, moving a StreamBlock child inside a StructBlock, altering a value, etc. It is possible to define custom sub-operations for other use cases. Using these sub-operations is as simple as passing the required sub-operation with its parameters and the corresponding block path to `MigrateStreamData`.
What it looks like for users
We accomplished the goals mentioned earlier through a package that greatly reduces the amount of code that needs to be written. A data migration written using our package looks like this:
```python
from django.db import migrations
from wagtail_streamfield_migration_toolkit.migrate_operation import MigrateStreamData
from wagtail_streamfield_migration_toolkit.operations import RenameStreamChildrenOperation


class Migration(migrations.Migration):
    dependencies = [
        ("wagtailcore", "0069_log_entry_jsonfield"),
        ("blog", "0004_alter_blogpage_content"),
    ]
    operations = [
        MigrateStreamData(
            app_name="blog",
            model_name="BlogPage",
            field_name="content",
            operations_and_block_paths=[
                (RenameStreamChildrenOperation(old_name="title", new_name="heading"), "somestreamblock"),
                (RenameStreamChildrenOperation(old_name="title", new_name="heading"), "anotherstreamblock"),
            ],
        ),
    ]
```
As you can see, there is far less code involved, and it is simpler to use. It also makes it easier to recurse through various types of block structures without worrying about what type of blocks you are accessing; you can just give a 'block path' to a nested block that you want to change without worrying about whether there are StructBlocks, ListBlocks or StreamBlocks as its parents.
Finally, the autodetect feature (which is still being worked on) is a step towards providing a tool similar to Django's `makemigrations`: a utility that will go through the changes you have made to your StreamField blocks and automatically generate the corresponding data migrations. Currently this works only for basic rename and remove operations, though we look forward to expanding its functionality in the future.
Usage of the autodetect feature looks like this:
```
python manage.py streamchangedetect
CHANGES FOR MODEL blogpage
Was 'somestreamblock.title' renamed to 'heading' ? [y/N] y
Was 'anotherstreamblock.title' renamed to 'heading' ? [y/N] y
RENAME somestreamblock.title TO heading
RENAME anotherstreamblock.title TO heading
```
After the changes are confirmed, it generates a migration file with the same content as the migration written with our package shown above.
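For illustration, the core idea behind rename detection — pairing block names that disappeared from a definition with names that appeared — can be sketched in plain Python. This is only a rough illustration with a hypothetical `detect_renames` function; the actual feature inspects block definitions through Django's project state and confirms each candidate interactively, as shown above.

```python
def detect_renames(old_names, new_names):
    """Return candidate (old, new) rename pairs between two versions
    of a block definition's child names. Any name that disappeared,
    paired with any name that appeared, is a *candidate* rename to be
    confirmed by the user."""
    removed = sorted(set(old_names) - set(new_names))
    added = sorted(set(new_names) - set(old_names))
    if len(removed) == 1 and len(added) == 1:
        # Only one plausible pairing, so offer it directly.
        return [(removed[0], added[0])]
    return [(r, a) for r in removed for a in added]


candidates = detect_renames(["title", "body"], ["heading", "body"])
# candidates -> [("title", "heading")]
```

Because a removal plus an addition is indistinguishable from a rename at the data level, candidates like these always need user confirmation before a migration is generated.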
Call to action
The package is available on PyPI (https://pypi.org/project/wagtail-streamfield-migration-toolkit/), and we plan on moving it into Wagtail at some point in the future. If you have StreamFields in your Wagtail project and need to work on data migrations, we'd love it if you tried out the package. You can find it here: Wagtail StreamField Migration Toolkit
This project was a great learning opportunity for me. I learned a lot about writing better and more reusable code, coding conventions in Python, and writing good comments/descriptions/docstrings, which was something I barely did before. In addition, getting to work with more experienced developers was very enlightening when it came to applying concepts like single responsibility. Also, thinking "What would my mentors have said about this?" whenever I write code has helped me stick to writing better code.
I also learned a lot about Wagtail itself, as well as how Django's migration process works. Beyond the technical details, I got to see how an open source organisation works and what the community is like, which was a great experience. I would like to give special thanks to my mentors for their wonderful guidance and extensive code reviews.