Nearly 2,600 internal documents covering Google’s Search API have been leaked to Rand Fishkin, co-founder of market and audience research software firm SparkToro, offering a rare look at what may go on behind the tech giant’s closely guarded search operations.
Fishkin recently joined us on the DesignRush Podcast to discuss this and other marketing and SEO topics.
According to Fishkin, an anonymous source contacted him via email on May 5, and after several email exchanges, he learned through a video call that the person was indeed an industry insider with whom he shares professional acquaintances.
Here's my post breaking down the leak's source, my efforts to authenticate it, and early findings from the document trove: https://t.co/nmOD0fd5mN pic.twitter.com/yMxMrSeeLa
— Rand Fishkin (follow @randderuiter on Threads) (@randfish) May 28, 2024
The source chose Fishkin, who co-founded Moz, because of his expertise and his long record of publicly calling on Google to be transparent about its operations.
Fishkin found the source to be “credible, thoughtful, and deeply knowledgeable,” but he continued to consult experts to gauge the authenticity of the documents.
The SparkToro founder then consulted three former Google employees who all concluded that the documents “look legit” based on their firsthand knowledge of the notation style and format of the company’s internal documents.
Fishkin also went to iPullRank Founder Mike King for his technical SEO expertise, and although King could not vouch for 100% authenticity, he believes that the leaked files “appear to be a legitimate set of documents from inside Google’s Search division.”
After Fishkin published his breakdown on Monday, the source revealed himself to be Erfan Azimi, founder and SEO director of digital marketing agency EA Eagle Digital.
According to his statement, the main reason for all of this is that “the truth needs to come out.”
Azimi also shared that the documents were given to him by a former member of Google’s search team who “respectfully declined” to reveal his identity due to the “sensitivity of the situation.”
If you want to step up your SEO game, consider partnering with one of the best SEO agencies listed on DesignRush.
What the Leaked Search API Docs Contain
The leaked files appear to stem from an incident in which API documentation from Google’s internal corporate sites was accidentally made public in a repository on code-hosting platform GitHub.
The commit history Fishkin reviewed shows that the code was uploaded to GitHub on March 27 and only removed on May 7.
The search API documents are full of lines of code that only experts in technical SEO can make sense of.
“Think of this as instructions for members of Google’s search engine team. It’s like an inventory of books in a library, a card catalog of sorts, telling those employees who need to know what’s available and how they can get it,” Fishkin simplified.
![One of the Leaked Google API Modules One of the leaked Google API modules about Navboost.](https://media.designrush.com/tinymce_images/616524/conversions/google-api-content-warehouse-quality-navboost-content.jpg)
According to King’s initial analysis, the 2,596 leaked modules that contain 14,014 API features or attributes give out information on Google Search’s core systems and functionality, such as the following:
- Web crawling system Trawler shows Google’s crawl queue, how it maintains crawl rates, and how it analyzes how often pages change
- Alexandria as the core indexing system, SegIndexer for tier indexing, and TeraGoogle for indexing documents that live on disk long term
- HtmlrenderWebkitHeadless as the rendering system for JavaScript pages (Chromium is also mentioned in the docs, so it’s likely that Google originally used WebKit and made the switch once Headless Chrome arrived)
- LinkExtractor to extract links from pages and WebMirror for managing canonicalization and duplication
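The crawl-render-index flow described above can be pictured with a toy sketch. This is purely illustrative: the real Trawler, Alexandria, and HtmlrenderWebkitHeadless systems are internal to Google, and every function name and detail below is a hypothetical stand-in, not anything from the leaked docs.

```python
# Hypothetical sketch of a crawl -> render -> index pipeline.
# None of this is Google's code; names are illustrative stand-ins.

def crawl(url: str) -> str:
    """Stand-in for a crawler like Trawler: fetch a page's raw HTML."""
    return f"<html>content of {url}</html>"

def render(html: str) -> str:
    """Stand-in for a headless rendering step (e.g., executing JavaScript)."""
    return html.replace("<html>", "").replace("</html>", "")

def index(doc_id: str, text: str, index_store: dict) -> None:
    """Stand-in for an indexing system like Alexandria: map terms to doc IDs."""
    for term in text.lower().split():
        index_store.setdefault(term, set()).add(doc_id)

index_store: dict = {}
page = crawl("https://example.com")
index("https://example.com", render(page), index_store)
```

The point is only the shape of the pipeline: fetched pages are rendered, then broken into terms that point back to the documents containing them, which is what makes later retrieval fast.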
Although there are no specific details as to how exactly Google scores its search results and how it decides which ones go on its first search engine result page (SERP), the leaked files give an idea as to its ranking system.
Google uses Mustang as its primary scoring, ranking, and serving system; Ascorer as its primary ranking algorithm, which ranks pages before any re-ranking adjustments; and WebChooserScorer to define feature names in snippet scoring.
It then utilizes Navboost to rerank based on click logs of user behavior and FreshnessTwiddler to rerank documents based on freshness.
The Twiddler framework is the search engine’s overlay system that controls re-ranking after the core-level algorithm has run; it includes Navboost, QualityBoost, RealTimeBoost, and WebImageBoost.
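The two-stage flow described here, a core score followed by re-ranking passes, can be sketched in miniature. To be clear, this is not Google's code: the scoring functions, weights, and field names below are all hypothetical, made up only to illustrate the idea of "twiddlers" adjusting a base ranking.

```python
# Illustrative sketch of core scoring followed by twiddler-style re-ranking.
# All weights and fields are invented; nothing here comes from the leak.

def base_score(doc: dict) -> float:
    """Stand-in for a core scorer (Mustang/Ascorer role)."""
    return doc["relevance"]

def navboost_like(doc: dict, score: float) -> float:
    """Hypothetical click-based boost, in the spirit of Navboost."""
    return score * (1 + doc["click_rate"])

def freshness_like(doc: dict, score: float) -> float:
    """Hypothetical freshness boost, in the spirit of FreshnessTwiddler."""
    return score * (1.2 if doc["days_since_update"] < 30 else 1.0)

def rank(docs: list, twiddlers: list) -> list:
    scored = []
    for doc in docs:
        score = base_score(doc)
        for twiddler in twiddlers:  # re-ranking pass after core scoring
            score = twiddler(doc, score)
        scored.append((score, doc["url"]))
    return [url for _, url in sorted(scored, reverse=True)]

docs = [
    {"url": "a.com", "relevance": 0.9, "click_rate": 0.1, "days_since_update": 400},
    {"url": "b.com", "relevance": 0.8, "click_rate": 0.5, "days_since_update": 5},
]
print(rank(docs, [navboost_like, freshness_like]))  # → ['b.com', 'a.com']
```

Note how the page with the lower base relevance ends up first: under these invented weights, strong click engagement and freshness outweigh the initial score, which is exactly the kind of effect an overlay re-ranking layer can produce.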
Before SERPs are served to users on the front end, Google has the following systems in place:
- Google Web Server – the server that the front of Google interacts with, receiving the payloads of data to display to the user
- SuperRoot – the brain of Google Search that sends messages to Google’s servers and manages the post-processing system for re-ranking and presentation of results
- SnippetBrain – the system that generates snippets for results
- Glue – the system for pulling together universal results using user behavior
- Cookbook – the system for generating signals, indicating that values are created at runtime
While the documents don’t reveal exact ranking factors or their weights, they provide a glimpse into Google’s ranking system.
As Mark Williams-Cook, co-owner of search agency Candour, pointed out, “Just because something is referenced in the API leak doesn't mean it's a ranking factor.”
What the Leaked Files Imply
It became clear that Google has made false statements over the years, gaslighting the industry about how its search engine operates.
First off, Google has stated several times in the past that its search engine doesn’t use Domain Authority, a metric that estimates the likelihood of a website's domain ranking in SERPs compared to other similar domains.
However, upon closer inspection of the leaked documents, King found that “as part of the Compressed Quality Signals that are stored on a per document basis, Google has a feature they compute called ‘siteAuthority.’”
Second, Google engineer Paul Haahr spoke in detail about live experiments with clicks at SMX West 2016, saying it would be a mistake to use clicks as a ranking signal because they are heavily affected by biases.
But, as seen in the contents of the leaked files, Navboost, an algorithm that optimizes search results by analyzing user click patterns, is used as one of the metrics in Google’s ranking system.
Critics conclude that the tech giant tried to hide its use of click-through rate (CTR), the ratio of clicks a result receives to the number of times it is shown, as a ranking signal.
![A Screenshot of Paul Haahr's Statement About Clicks A screenshot of Paul Haahr's statement about Google not using CTR as a ranking metric.](https://media.designrush.com/tinymce_images/616525/conversions/paul-haahr-statement-about-clicks-content.jpg)
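For readers unfamiliar with the metric, CTR is a simple ratio. A minimal sketch (the function name and sample numbers are invented for illustration):

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """CTR = clicks / impressions: the share of views that become clicks."""
    if impressions == 0:
        return 0.0  # avoid division by zero when a result was never shown
    return clicks / impressions

# A result shown 2,000 times and clicked 150 times:
print(f"{click_through_rate(150, 2000):.1%}")  # → 7.5%
```

The controversy is not over the arithmetic, which is trivial, but over whether Google feeds signals like this back into ranking after saying it does not.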
Third, Google Senior Search Analyst John Mueller replied “there is no sandbox” to a since-deleted tweet by Vijay Kumar asking how long it takes for Google to take new websites out of the sandbox.
It’s not a secret that it takes a while for new websites to rank and come out high on SERPs, and the main theory is that Google places them in a sandbox for an unspecified period.
Again, the leaked files contradict this: the PerDocData module contains a hostAge attribute used “to sandbox fresh spam in serving time.”
![John Mueller's Deleted Tweet About the Google Sandbox John Mueller's deleted tweet quotes him saying that "there is no sandbox."](https://media.designrush.com/tinymce_images/616528/conversions/john-mueller-twitter-post-about-google-sandbox-content.jpg)
And last, former Google engineer Matt Cutts reportedly said that the #1 search engine doesn’t use Chrome data for its organic ranking algorithm.
Yet one of the leaked modules linked to page quality scores and another connected to sitelink generation both contain Chrome-related attributes.
!['Matt Cutts: Organic Algo Does Not Use Any Chrome Data' Bill Hartzer's post says Matt Cutts told him that Google Search doesn't use Chrome data in ranking.](https://media.designrush.com/tinymce_images/616529/conversions/matt-cutts-on-chrome-data-use-in-ranking-content.jpg)
After reviewing the leaked modules, both Fishkin and King concluded that Google lied about its search operations at least four times.
As both SEO experts are quick to point out, analyzing all 2,596 API documents will take time, and they will publish more findings as they uncover further insights.
How This Affects Marketers
Valued at $68.27 billion in 2022 and projected by Emergen Research to reach $157.41 billion by 2032, the global search engine market, which Google dominates, is booming.
This underscores how heavily marketers rely on SEO and ranking metrics to increase website traffic and brand visibility, as well as to measure the success of their campaigns.
Knowing exactly how Google Search works would help marketers greatly in developing effective initiatives. But Google’s apparent reluctance to be transparent about it suggests there may be something to hide.
I don't think years of personal experience with seeing Google's algorithm respond completely opposite to what all the talking heads were saying is preconceived bias. They have been lying through their teeth since day one, and anyone with even basic SEO experience who was around…
— Greg Boser (@GregBoser) May 28, 2024
Maybe it’s about Google boosting its image as a fair company, prioritizing the quality of its results over profit.
Or maybe it’s something more sinister like protecting its monopoly while appearing to promote competition.
Whatever the case may be, SEO practitioners are going crazy over these leaked documents, with some having an “I knew it” reaction and others preferring to be cautious.
Google Releases Statement on Leaked Modules
On Thursday, Google finally released a statement to The Verge via email, confirming the authenticity of the leaked documentation.
While admitting that the search API modules are genuine, Google questions their accuracy and relevance, with spokesperson Davis Thompson saying that they contain “out-of-context, outdated, or incomplete information.”
“We’ve shared extensive information about how Search works and the types of factors that our systems weigh, while also working to protect the integrity of our results from manipulation,” Thompson added.
I do not like this, but it does support what the leaked API docs suggest Google's doing with rankings, so I guess.. yay?
— Rand Fishkin (follow @randderuiter on Threads) (@randfish) May 29, 2024
¯\_(ツ)_/¯ https://t.co/lzTbDmZjFm
It seems that Google is maintaining its stance that it didn't lie to marketers in the past to protect its search operations, clinging to the fact (or excuse?) that its algorithms are constantly being updated and changed.
Google could also claim that these were merely live experiments and were never used as ranking signals.
But now that the authenticity of the leaked modules has been confirmed, marketers may finally gain the insight newer companies need to stand a chance against big brands in the ranking battle.
Expect updates in the coming days as more experts dissect and interpret the modules to reveal more of what goes on behind Google Search.