Note: This post is a case study. I haven’t posted a case study on the site before, so let me know if you feel it’s valuable. If you do, I will try to post this sort of content regularly.
A few weeks ago, I was contacted by a site owner with a very serious problem: he was seeing what he believed to be a large number of false positive soft 404 errors in his Google Webmaster Tools account. Alarmed, he started checking his pages in Google and noticed that much of his content had been completely de-indexed.
Curious, I took a look at his site, and my initial impression was positive. It was well architected, and home to a lot of original, high-quality content that demonstrated a solid understanding of its audience. A quick check of his backlinks removed any suspicion that he was engaged in anything he shouldn't have been. Everything appeared to be in top shape.
Stranger still, the pages Google was reporting as soft 404s appeared to be unique and valuable. Aside from the header and footer, they had nothing at all in common with his real 404 pages (which returned a proper 404 response code).
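For context: a soft 404 is a page that returns HTTP 200 while displaying error-page content, so Google treats it as an error even though the server says otherwise. If you want to verify what your own error pages actually return, a minimal PHP sketch might look like this (the URL is a placeholder, not the affected site):

```php
<?php
// Probe a deliberately bogus URL and see what status code comes back.
// A hard 404 returns "404 Not Found"; a soft 404 returns "200 OK"
// even though the page content says "not found".
$headers = get_headers('http://example.com/this-page-should-not-exist/');
echo $headers[0], "\n"; // e.g. "HTTP/1.1 404 Not Found"
```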
Why was Google throwing the soft 404 flag and removing his content from their index?
The problem intrigued me, and I agreed to help.
Where things got nasty
With the initial research out of the way, I fired up Firefox and used a user agent switcher plugin to see the site as Google sees it.
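If you'd rather run this check from a script than a browser plugin, you can request a page with Googlebot's user-agent string and compare it to a normal request. Here's a minimal PHP sketch, assuming the cURL extension is available (the URL is a placeholder):

```php
<?php
// Fetch the same page twice: once as Googlebot, once as a normal
// browser, then compare the responses.
function fetch($url, $userAgent) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}

$url      = 'http://example.com/';
$asGoogle = fetch($url, 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)');
$asHuman  = fetch($url, 'Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0');

// Substantially different responses are a strong hint of cloaking.
echo ($asGoogle === $asHuman)
    ? "Responses match.\n"
    : "Responses differ -- possible cloaking.\n";
```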
The news wasn't good. I'm pretty sure I reacted the same way Samuel L. Jackson did in Jurassic Park when the electricity in the raptor fences went down: mumbling obscenities to myself and warning everyone to hold onto their butts.
In short, I saw that the site had been hacked to show pharmacy spam links to Google, and the regular site to everyone else. With the Googlebot user agent activated, every page on the site looked like this:
Suddenly, the problem made sense. In Google's eyes, the content of every page on his site was identical to his 404 pages (which were also hacked). Hence the soft 404 errors.
And Google was de-indexing the site because every page was a duplicate of every other page (not to mention that the content was nothing but spam links).
Can we fix it? Yes we can!
I requested login credentials for his site's admin panel and cPanel.
The site used Joomla as its CMS, which I'm not terribly familiar with, so it took some time to get acquainted. As I was starting to get a handle on how all the pieces fit together, I stumbled across a PHP file buried in a very deep folder.
In it, I found what must be the simplest malicious code in the history of malicious code: just a couple of lines of PHP that looked like this:
Note: I've removed the actual URLs from the screenshot because I promised anonymity. I don't want you super sleuths digging through backlinks and identifying the hacked site.
In case you can't read the code in the screenshot: it checks whether the visitor's user agent is Googlebot. If it is, it serves the pharmacy spam links; if it isn't, it serves the site as normal.
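Since the screenshot is redacted, here is a reconstruction of the general pattern rather than the actual code from his site (the spam URLs are invented placeholders):

```php
<?php
// Reconstruction of the cloaking conditional, not the code from the
// hacked site. The link URLs below are invented placeholders.
if (strpos($_SERVER['HTTP_USER_AGENT'], 'Googlebot') !== false) {
    // Googlebot gets nothing but spam anchors...
    echo '<a href="http://spam.example/cheap-pills">cheap pills</a> ';
    echo '<a href="http://spam.example/more-pills">more pills</a>';
    exit; // ...and never sees the real page content.
}
// Everyone else falls through and the page renders normally.
```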
Fixing the site was as simple as removing that conditional statement.
Within one week, many of the de-indexed pages started making their way back into the index.
Within two and a half weeks, the site made a full recovery. All of its pages were back in the index, and traffic had returned to normal.
In this case, the site benefited greatly from having someone in charge who monitored Google Webmaster Tools, saw the soft 404 errors, and noticed that pages were dropping from the index. I suspect that if the site had continued to show those spam links for an extended period, recovery could have been much slower.
In the end, the relatively quick recovery was great news for them. Unfortunately, not everyone is so lucky.
Check your Webmaster Tools each day, ladies and gentlemen.