A typical modern website relies on interactive elements to provide functionality.
For complex or detailed pages this can result in a webpage's filesize becoming very large, pushing the content that you want Googlebot to read a long way down the page.
So how big is too big for Google?
To answer that question we are going to have to get a bit creative. I generated a list of 202 keywords that currently have no indexed pages; they are essentially gibberish strings with a number attached to the end. I then recorded which ones Google picked up over a period of around 10 days.
Each keyword was separated by 100 KB of commented-out text provided by Project Gutenberg, and the page was then submitted to Google via the “Fetch as Google” tool within Search Console.
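For anyone who wants to reproduce a similar setup, here is a minimal sketch (not the exact script used for this test) of how such a page could be generated. It assumes a local plain-text copy of a Project Gutenberg book saved as gutenberg.txt and placeholder keywords of the form zxqvbln1, zxqvbln2 and so on – both of those names are purely illustrative.

```python
# Minimal sketch: build a test page of gibberish keywords, each followed
# by ~100 KB of commented-out filler text.
FILLER_BLOCK_SIZE = 100 * 1024  # 100 KB of padding between keywords
KEYWORD_COUNT = 202

# Load a plain-text filler file (e.g. a Project Gutenberg book saved locally).
with open("gutenberg.txt", encoding="utf-8") as f:
    filler = f.read()

# "--" is not allowed inside an HTML comment, so neutralise any double hyphens.
filler = filler.replace("--", "- -")

# Repeat the filler until a single block is at least 100 KB, then trim to size.
while len(filler.encode("utf-8")) < FILLER_BLOCK_SIZE:
    filler += filler
filler_block = filler.encode("utf-8")[:FILLER_BLOCK_SIZE].decode("utf-8", "ignore")

parts = ["<!DOCTYPE html><html><head><title>Page size test</title></head><body>"]
for i in range(1, KEYWORD_COUNT + 1):
    parts.append(f"<p>zxqvbln{i}</p>")        # visible gibberish keyword
    parts.append(f"<!-- {filler_block} -->")  # ~100 KB of commented-out padding
parts.append("</body></html>")

with open("test-page.html", "w", encoding="utf-8") as f:
    f.write("".join(parts))
```

The resulting file weighs in at a little over 20 MB, comfortably past the point where any indexing limit should show up.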
You can see from the screenshot above that Google only rendered up to the 158th keyword, somewhere between 15,700 KB (15.7 MB) and 15,800 KB (15.8 MB) into the page. However, when clicking on the Fetching tab and viewing the HTTP response, I was only shown the first 250 KB of data.
I didn’t expect such a massive discrepancy between what Google can actually render and what it shows in the fetch response: Google only showed three of the 202 keywords in the HTTP response section.
Having submitted the fetched page to Google, I had to wait for it to be picked up, and after a few hours the results began to trickle in. I allowed Google just over a week to index as many keywords as it could.
As you can see from the screenshot above, Google had picked up and indexed the keywords and included some of them in the search result description. Rather than repeat this process another 202 times, I ran a rank tracker to pull the first page of Google results, and the outcome was surprisingly similar to the Fetch as Google tool.
Google had indexed up to the 158th keyword, at roughly 15,700 KB (15.7 MB), which is exactly where the visual portion of the fetch and render stopped.
This seems to indicate that the “Fetch as Google” tool is very much how Google will see your page; just keep in mind that if you are looking for a specific block of code, you won’t see anything beyond the first 250 KB in the HTTP response.
Whilst doing this research I also came across a rather strange situation with Google cache: a user had a very heavy JavaScript page that was indexing fine, but when they viewed the page in the cache it was entirely blank. It appears that Google cache was only holding onto the first 1 MB of data, which truncated the JavaScript and left the page with no visible content.
So to sum up:
Fetch and render will get between 15.7 MB and 15.8 MB of data but will only show you 250 KB in the HTTP response. This can make it a little difficult to debug issues where your page size goes over 250 KB (see the size-check sketch below).
The actual Google index seems to stop somewhere between 15.7 MB and 15.8 MB – this is in line with the visual fetch and render, which means you can trust the visual portion of the fetch and render to show you what Googlebot will actually see.
Google cache is capped at 1 MB – it will truncate content after this, which can lead to some very strange pages in the cache. Watch out for this, as the truncation may break page functionality, leading you to believe there is a problem with the page!
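If you want a quick way to see how your own pages measure up against these figures, here is a rough sketch that fetches a URL’s raw HTML with Python’s standard library and compares its size against the thresholds above. The limits are measurements from this one test, not documented Google constants, and the example URL is a placeholder.

```python
# Rough sketch: compare a page's raw HTML size against the limits observed in this test.
from urllib.request import urlopen

FETCH_RESPONSE_LIMIT = 250 * 1024       # ~250 KB shown in the Fetch as Google HTTP response
CACHE_LIMIT = 1 * 1024 * 1024           # ~1 MB apparently kept by Google cache
INDEX_LIMIT = int(15.7 * 1024 * 1024)   # ~15.7 MB where rendering/indexing stopped

def check_page_size(url: str) -> None:
    """Fetch the raw HTML of a URL and flag which observed limits it exceeds."""
    size = len(urlopen(url).read())     # size of the raw page source in bytes
    print(f"{url}: {size / 1024:.1f} KB")
    if size > FETCH_RESPONSE_LIMIT:
        print("  - larger than the ~250 KB shown in the Fetch as Google HTTP response")
    if size > CACHE_LIMIT:
        print("  - larger than the ~1 MB Google cache appears to keep")
    if size > INDEX_LIMIT:
        print("  - larger than the ~15.7 MB point where indexing stopped in this test")

check_page_size("https://example.com/")
```

Note this only measures the raw HTML source; content injected by JavaScript after load is not included.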
Why does this matter?
John Mueller recently spoke about how SEOs can help devs and share knowledge:
The web has moved from plain HTML – as an SEO you can embrace that. Learn from JS devs & share SEO knowledge with them. JS’s not going away.
— John ☆.o(≧▽≦)o.☆ (@JohnMu) August 8, 2017
There is an assumption that Google can now crawl so much content that it will be able to index anything; however, there is still a hard line which Google will not cross. You are unlikely to hit this limit with on-page text alone, but with the increasing use of JavaScript to create interactive elements and dynamic content, it’s useful to know exactly where Google draws its line.