9 Jan 2024

Recovering deleted Wagtail pages and Django models

Accidentally deleted a Wagtail page or Django model? Start here!

Jake Howard

Senior Systems Engineer at Torchbox

Wagtail is a great content-management system, suitable for a variety of uses. Most people think of Wagtail as a way of creating websites and blogs, but it also works really well as an intranet. At Torchbox, we use Wagtail as our intranet: the source for internal documentation, processes and any other company information we may want to keep track of and make available to all co-owners.

Our intranet has been around for a while, with content being added wherever anyone feels it fits best. Back in 2022, we started the slow journey of restructuring the content into a cohesive, easy-to-navigate hierarchy. This required some planning, followed by a lot of moving and restructuring of pages to meet the new requirements. Fortunately, Wagtail makes it very simple to move a page or group of pages around the site, and carry any child pages with it. In a few places, content was duplicated or simply needed to be rewritten, so the old content could be deleted.

Sadly, this didn't go quite according to plan. One afternoon, as I was trying to reference one of our process pages, I couldn't find it. In fact, I soon discovered I couldn't find the entire "Sysadmin" section. It had completely vanished.

Wagtail's site history report

Wagtail, in its infinite wisdom, has a feature to help understand what happened, and more importantly who happened (not to assign blame, but to better understand the circumstances): the site history report. The site history report can be used to filter and search through all actions performed through Wagtail and to associate them with a user and object (page, snippet etc) in a nice list. My plan was to review this, look for all the pages which had gone missing, and contact the person responsible to understand what happened.

Fortunately for me, Wagtail showed almost exactly what I had expected. One of our staff members had deleted the "Sysadmin" section page a few days before, which consequently deleted all pages underneath it, all 105 of them. Armed with this information, I messaged the person on Slack to get a better idea of what may have happened, working on the assumption they didn't mean to delete those pages (Hanlon’s Razor).

It seems the person in question had created a "Sysadmin" page in a new location a while ago, but had instead switched strategy to shuffle around a few pages in place and then move the existing page. The wrong "Sysadmin" page was then deleted, deleting everything under it.

When you delete a page through the Wagtail admin, it shows a confirmation page, to guard against a misclick. On this confirmation page, Wagtail confirms the page you’re deleting, shows how many child pages will also be removed, and prompts for an explicit confirmation when removing too many pages (more than 10 by default).

When Wagtail deletes a page, it isn't a "soft delete" - the pages, revisions, and anything else related to them are completely removed from the database (removing the related rows along with the page is known as a “cascading” delete). Therefore, I had no choice but to restore from a backup.

Restoring from backups

At Torchbox, we take data integrity seriously. All our databases are backed up nightly, monitored, and stored off-site. We have backups from both the day of the delete and the days around it. However, in this case, simply restoring from a backup wasn't an option.

Our intranet is a living document that is constantly updated by any and all members of staff at any time. Rolling the entire page tree back almost 2 days would have meant losing potentially critical changes, not to mention the time other people had spent making them. Personally, I use the sysadmin pages more than any other part of the intranet, so they're very important to me. But for our digital marketing, HR or business development teams, less so. Those teams had kept working, which meant there had been lots of content changes to the intranet since the "Sysadmin" pages were deleted.

Partial restores from backup

Ideally, what I needed was to only restore the sysadmin pages, leaving all others completely untouched. Whilst in this case, we might have been able to roll back the entire database, in other situations that may not be possible.

By diving through a few Django internals, it's possible to discover exactly what was deleted, save it, and restore it in production, all without any downtime.

1. Set up the backup database locally

Because Wagtail fully deleted the relevant database rows which contained our page data, the only way to obtain the deleted pages is through a database backup. Downloading the backup from just before the pages were deleted, importing it locally, and starting up an instance of our intranet to point to it was incredibly simple (all of our projects run in Docker containers and require a single command to get up and running). Once loaded, I confirmed the pages were where we left them, and continued.

2. Locate the page models

Wagtail's page models lean heavily on django-treebeard for their tree-like structure. When a page is deleted, it's treebeard which runs through all of the page's descendants and deletes those too. The actual implementation is far more complex, but all we need to do is find the descendant models, which is just a method call away:

from wagtail.models import Page

sysadmin_page = Page.objects.get(id=91)

child_pages = sysadmin_page.get_descendants()

Notably, get_descendants returns all descendants (recursively), as opposed to get_children, which only returns the immediate child pages.

"91" is the page id for the deleted "Sysadmin" page. With the backup running locally, I could open the "Sysadmin" page in the Wagtail admin and read the id from the edit URL (e.g. https://example.com/admin/pages/91/edit/).

3. Locate what was deleted

This is where most of the magic happens. When you call delete on a model or queryset, it doesn't directly translate to a DELETE FROM database query as you'd expect. Django does a lot itself for features like on_delete to determine exactly which models need deleting, and in what order, so as not to blindly rely on database cascading and to implement other niceties. Worse still, if there are foreign key relationships, those are also handled by Django.

Naively using just the child pages would miss out on any models related to the given page like inline panels, and that's before we even think about how page models are really 2 different models stuck together.

If you’ve ever used the Django admin, you’ve already seen that Django knows which models are going to be deleted before actually deleting them. That's handled by a simple class, NestedObjects, which wraps Django's deletion collector to determine what's going to be deleted.

From there, we can go from a list of pages, to a list of exactly what Django would delete, without actually deleting anything.

from django.contrib.admin.utils import NestedObjects

# The collector needs a database alias; "default" is assumed here.
collector = NestedObjects(using="default")
collector.collect(list(child_pages) + [sysadmin_page])

This is subtly different from just blindly calling Django's built-in delete method because it doesn't need to actually issue any DELETE queries. It's possible to achieve this result with .delete() in a transaction, but it’s not as clean and is much more intensive.

The collector now contains (in collector.data) all the deleted pages, models, revisions and anything else we’d need to restore the pages completely.
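Before going further, it can be reassuring to see what was actually collected. The helper below is a minimal sketch, not part of the original process: summarise_collected is a hypothetical name, and it only assumes that its argument looks like collector.data, a mapping of model class to a collection of instances.

```python
from collections import Counter


def summarise_collected(data):
    """Count collected instances per model.

    `data` is assumed to look like `collector.data`:
    a mapping of model class -> collection of instances.
    """
    counts = Counter()
    for model, instances in data.items():
        # Django models expose "app_label.ModelName" as _meta.label;
        # fall back to the plain class name for anything else.
        meta = getattr(model, "_meta", None)
        label = getattr(meta, "label", model.__name__)
        counts[label] += len(instances)
    return counts


# Intended usage in the Django shell from the previous step:
#     for label, count in summarise_collected(collector.data).most_common():
#         print(label, count)
```

If the totals look far larger or smaller than the number of pages you expected to lose, that is a hint to re-check which page was passed to the collector.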

4. Serialize

Now, in memory, I have all the models which were deleted when the sysadmin page was deleted. Better yet, I only have those models - nothing else has been caught in the crossfire. But that's all in memory, and like any good sysadmin, my laptop isn't what's running production. The data now needs to get onto production, so I need some way of serializing models into an intermediary format which can be loaded there. If you're thinking "Isn't this fixtures?", then you're right.

Fixtures create a JSON representation of a model / models, such that they can be saved in 1 location and loaded into another. This is commonly used for tests which need more complex model set ups, but can absolutely be reused here.

By taking all the deleted models, and serializing them into a rather large fixture file, we now have a portable file format containing all the deleted pages, which can be transferred off of my laptop.

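from django.core.serializers.json import Serializer

class NoM2MSerializer(Serializer):
    # Skip inlined many-to-many values; the through-table rows are
    # collected separately and serialized on their own.
    def handle_m2m_field(self, obj, field):
        pass

def get_model_instances():
    # collector.data maps each model class to its collected instances.
    for instances in collector.data.values():
        yield from instances

with open("deleted-models.json", "w") as f:
    NoM2MSerializer().serialize(get_model_instances(), stream=f)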

You'll notice something different compared to normally serializing models - a custom serializer. When Django serializes a model with a many-to-many field which doesn't use a custom through table, it inlines them on the parent model. That's helpful when serializing just a set of models of one type, but not helpful in our case. Because the NestedObjects collector discovers these through tables, we don't need the serializer to create those links for us, as it results in referential integrity issues. Instead, we can tell the serializer to not inline these fields, safe in the knowledge they'll still be included in the fixture and thus loaded back into the database.

4a. Deserialize

With a fixture file, loading the fixtures is a well-known and documented process. Running manage.py loaddata deleted-models.json, and then waiting a little, will load all of our pages back into Wagtail, ready to be used.
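Since a JSON fixture is just a list of records with "model", "pk" and "fields" keys, it can be inspected before loading it anywhere. The following is a small sketch using only the standard library; summarise_fixture is a hypothetical helper name, not something Django provides.

```python
import json
from collections import Counter


def summarise_fixture(path):
    """Return how many records a JSON fixture holds for each model."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    # Every fixture record names its model as "app_label.modelname".
    return Counter(record["model"] for record in records)


# Intended usage, with the file written in step 4:
#     for model, count in summarise_fixture("deleted-models.json").most_common():
#         print(model, count)
```

Comparing these counts with what the collector reported is a cheap way to confirm nothing was dropped during serialization.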

5. Test test test

For what I hope are obvious reasons, this process absolutely needed to be tested. Once I had the fixtures saved, I performed the same delete operation through the Wagtail Admin to delete the sysadmin pages, confirmed they were gone, and then promptly restored them, and confirmed they were back.

I'm glad I tested, because it uncovered an interesting issue. When pages are deleted, Wagtail helpfully removes them from the search and reference indexes. Unhelpfully, the collector also picked up those index entries, and the import then tried to restore them, which Wagtail didn’t like. The indexes are incredibly easy to rebuild, so rather than spend too much time satisfying Wagtail, I ignored the index models. The search index is only relevant if you’re using a database-backed search engine (PostgreSQL in our case).
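One way to ignore the index models is to filter them out before serializing. This is a sketch rather than the exact code used: without_index_models is a hypothetical helper, and the two model labels are assumptions about where Wagtail keeps its reference and database search indexes, so check them against your Wagtail version.

```python
# Assumed labels for Wagtail's index models; verify them for your version.
EXCLUDED_MODELS = frozenset(
    {
        "wagtailcore.referenceindex",
        "wagtailsearch.indexentry",
    }
)


def without_index_models(instances, excluded=EXCLUDED_MODELS):
    """Yield only the instances whose model label is not in `excluded`."""
    for instance in instances:
        # _meta.label_lower is "app_label.modelname" on Django model instances.
        if instance._meta.label_lower not in excluded:
            yield instance


# Intended usage with the serializer from step 4:
#     NoM2MSerializer().serialize(
#         without_index_models(get_model_instances()), stream=f
#     )
```

Filtering at this point keeps the fixture file itself free of index rows, so the same file can be loaded repeatedly while testing.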

Wagtail helpfully includes a fixtree management command to confirm that the page tree looks correct and there are no orphan pages, and fixes any obvious issues when it finds them. Both before and after the import, this command passed.

6. Showtime!

Once tested, and I was happy, it was time to run against production.

Our intranet, like all our other applications (including this website), runs on Heroku. To get the fixture file onto the application dyno, I used a mixture of Python’s HTTP server and ngrok (not the best solution, but it worked). Once there, the loaddata command runs exactly the same. Of course, just before running, I took yet another backup, just in case this time I did need to do a proper rollback.

I copied the file to the live intranet application, ran the import command, and crossed absolutely everything. I needn't have worried: the import worked successfully. Pages popped right back up in the admin as if they had never left, and began showing up in the frontend. Re-running the fixtree command still reported the page tree looked exactly as it should.

Because the search and reference indexes had been removed from the fixtures, the indexes had to be rebuilt. With Wagtail, this is as simple as a management command (update_index and rebuild_references_index respectively). And with that, it was like nothing happened.

Conclusion

With a few hours' work, our precious intranet pages were back, with 0 loss of data. Using a partial restore like this meant there was no need for a content freeze or to take the site down. A regular restore would require rolling everything back to a specific point in time, potentially losing other critical work. With this, there was exactly 0 downtime, and most people didn't know the pages had gone, or that anything had happened.

This trick is one I've had to use only a few times in my career, but knowing about it makes for a great tool in the toolbox. Whilst this is in the context of a Wagtail site, there’s nothing Wagtail-specific about it - deleting any kind of model should be recoverable in the exact same way.

Hopefully now, if you ever delete a large number of models and have to restore, you can restore without impacting normal operation.