A tale of digging into Wagtail’s page tree internals
Wagtail is a popular Django-based CMS that provides a nice API to interact with pages. But sometimes the nice API just doesn't cut it. In this post, we'll explore how Wagtail stores the page tree internally and how we can use that knowledge to solve a problem that the API doesn't provide a solution for.
I've always taken features like Wagtail's page tree for granted. It just works, and Wagtail provides a nice API for querying pages. I never really thought about how it all works under the hood. That is, until I ran into a problem that required me to understand how the page tree works.
In this post I'll share my journey of digging into Wagtail's page tree internals and how I used that knowledge to find a solution to my specific problem.
Adding context to our search results
Recently, I worked on a project where I needed to improve the search feature to make it easier for users to find the page they were looking for. As it stood, the search results were not very helpful because they only showed the title of each page. To illustrate this problem, let’s say we are dealing with a fictional tax website. An excerpt of the page tree might look like this:
- Home
  - Questions answered in plain English
    - My partner has died, what do I need to do?
    - What is inheritance tax?
    - Do I need to pay inheritance tax?
    - How do I file an inheritance tax return?
  - Knowledge Base
    - What is inheritance tax?
    - Calculating inheritance tax
    - Filing an inheritance tax return
  - Services
    - Get help with inheritance tax from our experts
  - News
    - Supreme Court ruling on inheritance tax
  - Events
    - Tax seminar in April
If a user searches for “inheritance tax”, the results might look something like this:
- What is inheritance tax?
- What is inheritance tax?
- Calculating inheritance tax
- Do I need to pay inheritance tax?
- How do I file an inheritance tax return?
- Filing an inheritance tax return
- Supreme Court ruling on inheritance tax
- Get help with inheritance tax from our experts
- My partner has died, what do I need to do?
- Tax seminar in April
In this example, the search results contain a lot of pages with similar titles (and even duplicate titles!). To help users determine which result is more relevant for them, it would be helpful if we could give them a bit more context. We decided to show the parent page title of each search result, as that gives users a better idea of which section of the website the page is in.
With this in mind, the search results should look more like this:
- What is inheritance tax? (Questions answered in plain English)
- What is inheritance tax? (Knowledge Base)
- Calculating inheritance tax (Knowledge Base)
- Do I need to pay inheritance tax? (Questions answered in plain English)
- How do I file an inheritance tax return? (Questions answered in plain English)
- Filing an inheritance tax return (Knowledge Base)
- Supreme Court ruling on inheritance tax (News)
- Get help with inheritance tax from our experts (Services)
- My partner has died, what do I need to do? (Questions answered in plain English)
- Tax seminar in April (Events)
Getting stuck querying the page tree
While Wagtail provides a way to get the parent of a single page instance (`page.get_parent()`), it doesn’t provide a way to get the parents of multiple pages at once. This is a problem because every search result needs its own extra query to look up its parent, so the number of database queries grows with the number of results. In technical terms this is called the N+1 query problem: one query for the results, plus one more per result. It makes your website slow.
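To make the N+1 pattern concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module rather than Wagtail. The `page` table, its two columns, and the `run` helper are hypothetical stand-ins (not Wagtail's real schema); the only goal is to count how many queries the naive approach issues.

```python
import sqlite3

# Hypothetical stand-in for the page table: just a path and a title.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (path TEXT PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO page VALUES (?, ?)",
    [
        ("0001", "Home"),
        ("00010001", "Questions answered in plain English"),
        ("000100010002", "What is inheritance tax?"),
        ("00010002", "Knowledge Base"),
        ("000100020001", "What is inheritance tax?"),
        ("000100020002", "Calculating inheritance tax"),
    ],
)

query_count = 0


def run(sql, params=()):
    """Execute a query and count it, so the number of round trips is visible."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


# 1 query for the search results ...
results = run(
    "SELECT path, title FROM page WHERE title LIKE ? ORDER BY path",
    ("%inheritance tax%",),
)

# ... plus N queries, one per result, to look up each parent by its path.
labelled = []
for path, title in results:
    parent = run("SELECT title FROM page WHERE path = ?", (path[:-4],))
    labelled.append(f"{title} ({parent[0][0]})")

print(query_count)  # 1 + len(results)
```

With three matching pages this already takes four queries, and the count keeps growing with every extra search result.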
This is where I initially got stuck - how do I make this query efficiently? I couldn’t find a way to do this with the QuerySet API Wagtail provides, nor did the internet give me usable pointers. At this point, I asked ChatGPT for pointers.
First attempt: ask ChatGPT
(I'm sure you've heard this warning before, but here it is once more: never blindly trust ChatGPT!)
I asked ChatGPT how I could get the parent page given a queryset of pages. ChatGPT suggested that I could use `select_related` on a field `parent` to fetch the parent page. Here's an example (not the actual code it gave me, but similar):
    # Find all pages with the title containing "inheritance tax"
    # and fetch them along with their parent page
    results = Page.objects.filter(title__icontains="inheritance tax").select_related('parent')
    for result in results:
        print(result.title, result.parent.title)
But that didn’t work. Django threw an error at me:
FieldError: Invalid field name(s) given in select_related: 'parent'.
Tartar sauce! There is no field called `parent` on the Page model. ChatGPT made that up!
I could have kept going back and forth with the LLM, but the answer was not forthcoming after a few attempts. Instead, I decided to dig into Wagtail's source code, which was also a great opportunity to expand my own knowledge.
Second attempt: do the legwork and read the source code
I started by looking at the Page model in Wagtail’s source code. I figured out that Wagtail uses a library called django-treebeard to implement the page tree internally. This library provides the means to store hierarchical data in a database and to query it efficiently.
Treebeard supports multiple tree implementations, but Wagtail uses the Materialized Path tree implementation. This implementation adds a few fields to the Page model that are used to store the tree structure. The most important one is the `path` field.
Understanding the Materialized Path tree
The path field contains a piece of text that represents the path from the root page to the current page and gets longer with each level of nesting. Here's an example of what the paths for our page tree might look like:
- 0001 - Home
  - 00010001 - Questions answered in plain English
    - 000100010001 - My partner has died, what do I need to do?
    - 000100010002 - What is inheritance tax?
    - 000100010003 - Do I need to pay inheritance tax?
    - 000100010004 - How do I file an inheritance tax return?
  - 00010002 - Knowledge Base
    - 000100020001 - What is inheritance tax?
    - 000100020002 - Calculating inheritance tax
    - 000100020003 - Filing an inheritance tax return
  - 00010003 - Services
    - 000100030001 - Get help with inheritance tax from our experts
  - 00010004 - News
    - 000100040001 - Supreme Court ruling on inheritance tax
  - 00010005 - Events
    - 000100050001 - Tax seminar in April
Are you seeing the pattern? Each path is the parent page’s path followed by a four-character step that encodes the child page’s position among its siblings. Every step is padded with zeros to the same fixed width, so the length of a path is predictable from its level in the tree, which makes the tree easier to query.
If we are looking for the parent page of “What is inheritance tax?” (path 000100020001), we can find it by removing the last four characters from the path. This gives us the path of the parent page, which is 00010002 (Knowledge Base). We can then query the database for the page with exactly that path.
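The path arithmetic described above can be sketched in a few lines of plain Python. This is an illustration only, not treebeard's own code: the helper names `encode_step`, `child_path` and `parent_path` are made up for this example, and the step length of 4 and the 36-character alphabet are assumed to be treebeard's defaults. There is no overflow check for positions that no longer fit in one step.

```python
STEPLEN = 4  # characters per tree level (assumed default)
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # assumed default alphabet


def encode_step(position: int) -> str:
    """Encode a 1-based sibling position as a zero-padded base-36 step."""
    digits = ""
    while position:
        position, remainder = divmod(position, len(ALPHABET))
        digits = ALPHABET[remainder] + digits
    return digits.rjust(STEPLEN, ALPHABET[0])


def child_path(parent: str, position: int) -> str:
    """Path of the child at the given position: parent path plus one step."""
    return parent + encode_step(position)


def parent_path(path: str) -> str:
    """Path of the parent: everything except the last step."""
    return path[:-STEPLEN]


knowledge_base = child_path("0001", 2)
first_article = child_path(knowledge_base, 1)

print(knowledge_base)              # 00010002
print(first_article)               # 000100020001
print(parent_path(first_article))  # 00010002
```

Because every step has the same width, finding the parent never requires parsing the path; a simple slice is enough.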
Fun trivia
According to the django-treebeard documentation, the default configuration allows for a maximum nesting of 63 levels. And each node in the tree can have up to 1,679,616 siblings (including itself). That's a lot of siblings!
Anyone up for the challenge of creating a page tree with that many pages in it?
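Both numbers can be reproduced with a little arithmetic. The sketch below assumes what I understand to be the defaults: a 36-character alphabet (digits plus uppercase letters), four characters per step, and a path column of at most 255 characters. The constant names are my own.

```python
ALPHABET_SIZE = 36     # digits 0-9 plus letters A-Z (assumed default)
STEPLEN = 4            # characters per tree level (assumed default)
PATH_MAX_LENGTH = 255  # maximum length of the path column (assumed default)

# Each step is a 4-character string over 36 symbols.
siblings_per_node = ALPHABET_SIZE ** STEPLEN

# Each level consumes 4 characters of the 255 available.
max_depth = PATH_MAX_LENGTH // STEPLEN

print(siblings_per_node)  # 1679616
print(max_depth)          # 63
```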
Putting our newfound knowledge to use
Now that we understand how the page tree is stored, we can use this information to get the parent page of multiple pages at once by doing some manual querying. This involves a subquery to get the parent page for each page in the queryset.
    from django.db.models.functions import Length, Substr
    from django.db.models import OuterRef, Subquery
    from wagtail.models import Page

    # Filter for pages where the path matches the exact value we calculate below.
    # This subquery is called for each page in the queryset but because we use a
    # subquery, it's still only one query to the database and not an N+1 query.
    # That makes it quite efficient.
    subquery_parent_title = Page.objects.filter(
        path=Substr(
            # OuterRef refers to the path of a page from the outer query
            expression=OuterRef("path"),
            pos=1,
            # How many characters to keep in the result - in this case,
            # the number of characters in the path minus 4
            # Remember: 00010002 should become 0001
            # Or replace 4 with Page.steplen as that is the constant for this value
            length=Length(OuterRef("path")) - 4,
        )
    # We only need the title from the parent page, so we select only that field
    ).values('title')

    # Add the parent title to the results as an extra field: parent_title
    results = Page.objects.filter(title__icontains="inheritance tax").annotate(
        parent_title=Subquery(subquery_parent_title)
    )
The output looks like this, with the parent titles in parentheses:
- What is inheritance tax? (Knowledge Base)
- What is inheritance tax? (Questions answered in plain English)
- Calculating inheritance tax (Knowledge Base)
- Do I need to pay inheritance tax? (Questions answered in plain English)
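If you think in SQL, the annotated queryset boils down to a correlated subquery. The sketch below shows that general shape with Python's built-in `sqlite3` module. It is not the SQL Django actually generates, and the `page` table with its two columns is a hypothetical stand-in for Wagtail's real schema.

```python
import sqlite3

# Hypothetical stand-in for the page table: just a path and a title.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (path TEXT PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO page VALUES (?, ?)",
    [
        ("0001", "Home"),
        ("00010001", "Questions answered in plain English"),
        ("000100010002", "What is inheritance tax?"),
        ("00010002", "Knowledge Base"),
        ("000100020001", "What is inheritance tax?"),
        ("000100020002", "Calculating inheritance tax"),
    ],
)

# One statement: the inner SELECT is evaluated per outer row by the database,
# so the application only makes a single round trip.
rows = conn.execute(
    """
    SELECT child.title,
           (SELECT parent.title
              FROM page AS parent
             WHERE parent.path = substr(child.path, 1, length(child.path) - 4)
           ) AS parent_title
      FROM page AS child
     WHERE child.title LIKE ?
     ORDER BY child.path
    """,
    ("%inheritance tax%",),
).fetchall()

for title, parent_title in rows:
    print(f"{title} ({parent_title})")
```

The `substr(...)` expression plays the same role as `Substr` and `Length` in the ORM version: it trims the last four characters off the child's path to obtain the parent's path.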
One more thing - filtering out parents at the top of the tree
This works almost perfectly, but there is a small issue. The code will return the parent page title for all pages, even if the page is at the top of the tree. It will return 'Home' as the parent of 'Questions answered in plain English' and 'Knowledge Base' if those pages appear in the results, which is not desirable. Conveniently, django-treebeard also adds a `depth` field to the Page model that we can use to filter out pages that are at the top of the tree.
The fix is fairly straightforward. We need to add a filter to the subquery so it only returns parent pages that live below the home page in the tree. Specifically, we need to filter for parents with depth >= 3.
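We chose `depth >= 3` because the page with `depth = 1` is Wagtail’s hidden root page and `depth = 2` is the home page. Everything at `depth = 3` is directly below the home page. (The simplified path listing earlier leaves out the hidden root page; on a typical Wagtail site, every path carries one extra `0001` step at the front for the root.)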
    from django.db.models.functions import Length, Substr
    from django.db.models import OuterRef, Subquery
    from wagtail.models import Page

    subquery_parent_title = Page.objects.filter(
        # NEW: Filter out parents that are at the top of the tree
        # Only a parent below the home page (depth >= 3) will be returned
        # Otherwise, the parent title will be None
        depth__gte=3,
        path=Substr(
            expression=OuterRef("path"),
            pos=1,
            length=Length(OuterRef("path")) - 4,
        )
    ).values('title')

    # Let's instead search for pages with the title "questions" to demonstrate the filtering
    results = Page.objects.filter(title__icontains="questions").annotate(
        parent_title=Subquery(subquery_parent_title)
    )
This does exactly what we wanted - “Questions answered in plain English” lives under Home but no parent title is shown!
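The effect of the depth filter can also be mimicked in plain Python. The sketch below is an illustration and not the ORM query itself: the `pages` dictionary and the helper names are hypothetical, the step length of 4 is assumed to be the default, and the paths include the hidden root page so that the depths match a typical Wagtail site (root at depth 1, home at depth 2).

```python
STEPLEN = 4  # characters per tree level (assumed default)

# Hypothetical pages keyed by path, including Wagtail's hidden root page.
pages = {
    "0001": "Root",
    "00010001": "Home",
    "000100010001": "Questions answered in plain English",
    "0001000100010002": "What is inheritance tax?",
}


def depth_of(path: str) -> int:
    """Depth of a node: the number of fixed-width steps in its path."""
    return len(path) // STEPLEN


def parent_title(path: str, min_parent_depth: int = 3):
    """Title of the parent, or None when the parent is the home page or above."""
    parent = path[:-STEPLEN]
    if depth_of(parent) < min_parent_depth:
        return None
    return pages.get(parent)


print(parent_title("000100010001"))      # None, because the parent is Home (depth 2)
print(parent_title("0001000100010002"))  # Questions answered in plain English
```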
Conclusion
We explored how Wagtail stores the page tree internally and how we can exploit that knowledge to solve a problem that the API doesn’t provide a solution for. By understanding the Materialized Path tree implementation, we were able to efficiently get the parent page of multiple pages at once.
Of course, this is a solution to a specific problem; maybe it'll be helpful for you someday. And even if not, this post has hopefully given you some insight into how you can use the Materialized Path tree implementation in Wagtail to query the page tree.
And remember, sometimes the answer is not in the documentation or on the internet. That doesn't mean you're facing an unsolvable problem! Don't be afraid to dig into the source code of whatever system you are using; maybe you'll gain some insights that will help you solve your problem.
PS: while writing this blog post I decided to try again with ChatGPT (free version). If you give it the context that ‘Wagtail uses django-treebeard Materialized Paths’, you get significantly better results. You still have to nudge it in the right direction, but it’ll reach the same solution eventually.