Protect your content from scrapers, aggregators, and other scary creatures

[Image: scanned illustration from Where the Wild Things Are]

Creating great content is hard work. It requires hours of researching, rewriting, and editing. And if you’re like me, you spend just as much time agonizing over the perfect picture to include in your post. Before you hit publish, you’re certain that you’ve done everything you could to create a massively successful piece of content that will drive tons of traffic.

But what you might not know is that there are people waiting to take (and get credit for) your content the moment you hit the publish button. It happens all the time: scrapers and content aggregators take other people’s content, publish it, and outrank the truly original source in Google.

One of the most common reasons this can happen is that aggregators take your newly published content and get their copy indexed before you do. Sometimes, Google will assume that the aggregator is the original author, and that you republished their content. This can lead to a loss of traffic, or worse, being caught in the Panda filter.

Now, there’s nothing we can do to change the way that Google works. But the good news is that there are some things we can do on our own sites to prevent this type of thing from happening.

Below, I’m going to show you a few tactics that you can use on any WordPress site to help Google index your content first. Follow these steps, and you give yourself the best chance of getting authorship credit for the content that you create.

1. Delay your RSS feed

RSS feeds can be great traffic drivers if you have a lot of subscribers. But what you might not know is that RSS feeds make it really easy for people to take and distribute your content however they see fit.

I don’t have a problem with people distributing my content in general, but I do have a problem with someone else getting authorship credit (and outranking me) for something that I wrote. To make sure that my content gets indexed before anybody else can take it from my feed, I like to delay the RSS build.

On this particular site, my new content is normally in Google’s index within a minute or two after I publish it. Sometimes it can take a little longer, so to be safe, I delay my RSS build so it doesn’t show new content for 10 minutes after it’s published. This way, Google has plenty of time to index the post before anybody else is able to consume it through RSS.

Changing the RSS wait time is simple. Just copy and paste the following code into functions.php. To change the amount of time you want to wait before building the feed, change the value of the $wait variable.

function publish_later($where) {

	global $wpdb;

	if ( is_feed() ) {
		// get the current time in GMT, in the same format WordPress stores in post_date_gmt
		$now = gmdate('Y-m-d H:i:s');

		// set the amount of time that you want to wait before building the feed
		$wait = 10; // integer

		// now define the unit of time that you would like to wait
		$unit = 'MINUTE'; // MINUTE, HOUR, DAY, WEEK, MONTH, YEAR

		// only include posts whose GMT publish time is more than $wait units in the past
		$where .= " AND TIMESTAMPDIFF($unit, {$wpdb->posts}.post_date_gmt, '$now') > $wait ";
	}
	return $where;
}

add_filter('posts_where', 'publish_later');
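If you plan to tweak the wait time or the unit often, it may be worth checking those values before they end up inside the SQL. The following is only a sketch of that idea, not part of the snippet above or of WordPress: `feed_delay_clause()` and its parameters are hypothetical names, and the table name is assumed to come from `$wpdb->posts`. It runs on its own, without WordPress loaded.

```php
<?php
// Hypothetical helper (not a WordPress function): builds the extra WHERE
// fragment that hides posts younger than $wait units from the feed query.
function feed_delay_clause(string $posts_table, string $now_gmt, int $wait, string $unit = 'MINUTE'): string
{
    // Only allow the unit keywords listed in the original snippet, so a typo
    // cannot end up inside the SQL.
    $allowed = ['MINUTE', 'HOUR', 'DAY', 'WEEK', 'MONTH', 'YEAR'];
    $unit = strtoupper($unit);
    if (!in_array($unit, $allowed, true)) {
        throw new InvalidArgumentException("Unsupported unit: $unit");
    }
    if ($wait < 0) {
        throw new InvalidArgumentException('Wait must be zero or greater.');
    }
    // The timestamp is expected in the 'Y-m-d H:i:s' format produced by gmdate().
    if (!preg_match('/^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$/', $now_gmt)) {
        throw new InvalidArgumentException('Timestamp must look like 2013-01-01 12:00:00');
    }
    return " AND TIMESTAMPDIFF($unit, {$posts_table}.post_date_gmt, '$now_gmt') > $wait ";
}

// Example with a fixed timestamp so the result is easy to read.
echo feed_delay_clause('wp_posts', '2013-01-01 12:00:00', 10, 'minute'), PHP_EOL;
```

Inside the filter, you would pass `$wpdb->posts` and `gmdate('Y-m-d H:i:s')` instead of the fixed values used here.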

2. Turn off auto cache clearing

If you’re not using a caching plugin on your site, you really should be. These plugins take a lot of strain off your database, and generally result in a website that loads more quickly than one without a caching plugin (WP Super Cache and W3 Total Cache are good options).

Most (if not all) of the available WordPress caching plugins have a setting that clears the cache each time you publish a new post. This is a great idea in theory (and it’s just fine for authority sites), but it also gives scrapers a chance to grab your content before Google indexes it.

To combat this, turn off the function that automatically clears the cache each time you publish a new post (found within the settings of your caching plugin). Once your post is in the index, manually clear the cache through your plugin’s admin panel. This way, any scrapers who hit your site before your new content is indexed will not see the new content.

3. Give your sitemap a funky name

One of the easiest ways to scrape the entirety of someone’s site is by using the sitemap as a starting point and then following all the links. Since most sitemaps include every piece of content on a particular site, this is by far the easiest method of content discovery for scrapers.

We don’t want to make it any easier for scrapers than it has to be, but we also need our sitemap to be updated right away so Google can quickly find and index the new post.

Most people name their sitemap sitemap.xml, or something very similar. Almost every WordPress site uses this convention, as it’s the default setting for just about every sitemap plugin. I would encourage you to name your sitemap something else. Maybe your dog’s name, or even a random string of characters. This will make it very difficult to randomly guess the location of the sitemap.
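If you would rather not invent a random string yourself, PHP can generate one for you. This is a minimal sketch, assuming PHP 7 or later; `random_sitemap_name()` is a hypothetical helper name, and you would still need to enter the resulting file name in your sitemap plugin’s settings by hand.

```php
<?php
// Hypothetical helper: returns a hard-to-guess sitemap file name.
function random_sitemap_name(int $bytes = 8): string
{
    // random_bytes() draws from the operating system's cryptographic random
    // source; bin2hex() turns each byte into two lowercase hex characters.
    return bin2hex(random_bytes($bytes)) . '.xml';
}

// Prints a different 16-character hex name ending in .xml on each run.
echo random_sitemap_name(), PHP_EOL;
```

Run it once from the command line, copy the name, and keep a note of it so you can submit the same URL to the search engines.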

Additionally, you will want to turn off the setting that shows your sitemap’s location in robots.txt. If you’re using the popular Google XML Sitemap Generator plugin, you can do that like this:

[Screenshot: the plugin setting that removes the sitemap location from robots.txt]

Removing the location from robots.txt will not hurt you at all, as long as you submit your sitemap in both Google Webmaster Tools and Bing Webmaster Tools. Aside from those guys, nobody needs to see it anyway.
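After changing the setting, it is worth confirming that your robots.txt really has stopped advertising the sitemap. Here is a rough sketch of such a check; `sitemap_urls_in_robots()` is a hypothetical helper name, and the example.com content is made up for illustration.

```php
<?php
// Hypothetical helper: returns every URL that a robots.txt body advertises
// through a "Sitemap:" line. An empty array means the location is not exposed.
function sitemap_urls_in_robots(string $robots_txt): array
{
    $urls = [];
    foreach (preg_split('/\R/', $robots_txt) as $line) {
        // Match the directive case-insensitively; lines that start with "#"
        // are comments and do not match.
        if (preg_match('/^\s*sitemap\s*:\s*(\S+)/i', $line, $m)) {
            $urls[] = $m[1];
        }
    }
    return $urls;
}

// Made-up robots.txt content that still exposes the sitemap.
$body = "User-agent: *\nDisallow: /wp-admin/\nSitemap: https://example.com/sitemap.xml\n";
print_r(sitemap_urls_in_robots($body));
```

To check a live site, you could pass in the text of your own robots.txt file (for example, by pasting it into `$body`). If the function returns an empty array, the sitemap location is no longer listed.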

4. Submit your new post through Google Webmaster Tools

This is one of my favorite features of Google Webmaster Tools, and a lot of people don’t know that it exists. If you have a new website, it can take some time for Google to index your new content after you publish a post. This tip will speed up the indexing process significantly.

As soon as you publish a new post, log in to Webmaster Tools and go to Health->Fetch as Googlebot. Enter the URL you want to submit into the field:

[Screenshot: submitting a URL to Google]

After a moment, a ‘submit to index’ button will show like this:

[Screenshot: the ‘submit to index’ button]

Click it, and your content will be submitted to the index. In my experience, it normally takes between 12 and 14 hours for Google to index a URL submitted this way. While that sounds like a long time, it’s often much faster than Google would otherwise index a page posted on a new site.

Closing

If you follow the above steps, it becomes just about impossible for anyone else to consume and publish your content before Google is able to discover and index your version.

If you have any alternate methods of protecting your content, please leave a note in the comments and I’ll be sure to add it to the post.

By: Kevin Spence

Kevin Spence built his first website in 1999. These days, he builds all of his sites on WordPress using the Genesis Framework, and manages them using these tools. Follow him on Twitter.
