Sunday, December 18, 2011

Fight Back Against Content Theft

Many bloggers, myself included, have been plagued by content theft. Plagiarism is an age-old problem and will never be fully eradicated. But there are a few ways you can reduce your chances of being ripped off.

I'm not an intellectual property expert, but my understanding is that your work is implicitly copyrighted UNLESS you specifically say that it is not. Just to make it abundantly clear, you can put a copyright notice on your blog to let people know that it is not OK to take any content from it. On Blogger, add a footer that says something like this:
Copyright © 2011 YourBlogName/BusinessName All Rights Reserved.

[In dashboard, click on your blog, then on the design tab at the top. Then click on "Add a gadget" at the bottom of the template. Type your copyright text statement into the box.]

That cuts out the naive but honest actors. What about the dishonest ones?

Many websites had reposted part of What Do Automobiles and Spacecraft Have in Common?, a guest post I wrote for James Fallows' blog at the Atlantic Monthly website. Only one had the chutzpah to repost it verbatim in its entirety. Grrr. Very bad karma.

The upside is that I did a little research and learned some interesting lessons for this new post.

The overwhelming majority of blogs are what the industry calls "spam blogs": they exist solely to serve up ads. To generate a large quantity of content as cheaply as possible, they resort to stealing it.

There are many companies that sell software to automate the process of "content scraping". (This cuts out the hard work of generating content and stealing it by hand!) There are a few legitimate reasons to scrape a website, e.g. to back it up. But I am just going to talk about the search and theft issue here.

This problem became particularly acute in Pakistan after news stories were published about a few spam bloggers (including a school boy) earning real $$$$ from the Google AdSense program. See this Express Tribune article from Pakistan that estimates that ~90% of Pakistani blogs were serving exclusively stolen content. (I've heard that is an underestimate.)

Note that the bloggers were not blocked from blogging and that their blogs continued to be included in search and referral. They were merely banned from participating in the Google AdSense advertising (for $) program. They could still earn money from other sources. Some articles in the media, including Global Voices, don't make that distinction clear. It's not censorship. It's just preventing people from profiting off theft on your ad network.

If you find your content someplace it doesn't belong, take these steps.

[This is not meant to be a complete list because I am not an expert. If you know more things to try, please teach us by leaving a comment.]

If you find a theft, don't link to it. You'll only drive up their importance ranking for search.

Instead, inform the search engines so that they can devalue the ranking for that webpage, and possibly for the entire site.

Google provides a Report Scraper Pages online document. Fill it out with the URL addresses of the original content that you wrote and the offending webpage and submit. (If you know a similar link for other search engines, please leave the link in a comment.)

It may take a while, but the scraper website will slowly disappear from the Google search engine. After several days, I noticed that the entire site that stole my content was excluded from the Google AdSense program and had slipped in the keyword search ranking.

Get the stolen content removed. You may want to give the search engine companies a few days to look at the offending website before you do this.

Do a domain trace on the URL address to find out who owns it and where it is hosted. You can use any one of the many whois lookups on the web.

For instance, I did a trace at both Network Solutions and a second whois service and found out that the address was bought through a domain reseller rather than registered directly (a common obfuscation trick) and that the domain registration contact is hidden rather than public (there are sound security reasons to do this, sometimes).
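If you would rather script the lookup than use a web form, here is a minimal sketch. It assumes a .com or .net domain (the registry WHOIS server named below only covers those), and the domain, registrar, and name servers in the sample text are made up for illustration. Real WHOIS replies vary in layout, so treat the field names as assumptions too.

```python
import socket


def whois_query(domain, server="whois.verisign-grs.com", port=43, timeout=10):
    """Send a plain WHOIS query over TCP port 43 and return the reply as text.

    The default server is the .com/.net registry; other TLDs need another server.
    """
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall((domain + "\r\n").encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def extract_fields(whois_text, wanted=("registrar", "name server")):
    """Collect 'Key: value' lines whose key is one of the wanted field names."""
    found = {}
    for line in whois_text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        if sep and value.strip() and key in wanted:
            found.setdefault(key, []).append(value.strip())
    return found


# Hypothetical reply, so the parsing can be tried without touching the network.
SAMPLE = """\
   Domain Name: EXAMPLE-SCRAPER.COM
   Registrar: Hypothetical Reseller, Inc.
   Name Server: NS1.HOSTINGCOMPANY.EXAMPLE
   Name Server: NS2.HOSTINGCOMPANY.EXAMPLE
"""

fields = extract_fields(SAMPLE)
print(fields["registrar"])
print(fields["name server"])
```

To run it against a real domain, pass the output of whois_query("some-domain.com") to extract_fields instead of SAMPLE.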

Scroll down to see the name servers. That tells you the name of the ISP that is hosting the website. In my example, I saw:
Name Servers:
Write to the ISP at abuse@hostingcompany.whatever. Include in your message the URL of the page with the stolen content that they are hosting and the URL of the original content, and request that they take down the stolen content.
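If you have to send more than one of these, a small template helps you include the same facts every time. This is only a sketch: the addresses and URLs are placeholders, the wording is mine rather than any legally vetted notice, and it builds the message without sending it.

```python
from email.message import EmailMessage


def draft_takedown_request(abuse_address, sender, original_url, stolen_url):
    """Build (but do not send) a takedown request for a hosting company."""
    msg = EmailMessage()
    msg["To"] = abuse_address
    msg["From"] = sender
    msg["Subject"] = "Copyright complaint: copied content hosted on your servers"
    msg.set_content(
        "Hello,\n\n"
        "A site hosted on your servers has republished my work without permission.\n\n"
        f"Original content (mine): {original_url}\n"
        f"Unauthorized copy: {stolen_url}\n\n"
        "Please remove the copy from your servers.\n\n"
        "Thank you.\n"
    )
    return msg


# All addresses and URLs below are hypothetical placeholders.
msg = draft_takedown_request(
    abuse_address="abuse@hostingcompany.example",
    sender="me@myblog.example",
    original_url="https://myblog.example/2011/12/my-post.html",
    stolen_url="https://scraper.example/copied-post.html",
)
print(msg)
```

Printing the message lets you paste it into your mail program and adjust the wording before you send it.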

I lucked out because gothost is an American company based in Florida and needs to comply with US laws. They took the content off their servers immediately. Abroad, you may be out of luck.

Robots are your friend.
Do you know what your robots.txt file says? I didn't even know what it was until a friend told me about it a month ago. Wikipedia has a synopsis.

Basically, it tells web crawler robots which parts of your site they can and cannot access. This is purely advisory and nonbinding, but the major search engines all follow the instructions you set.

On the Blogger dashboard, select your blog, then click on the Settings tab. Select yes for "Let search engines find your blog?"

If you do that, then your robots.txt file will look something like
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /

Note: robots.txt must always reside at the top-level root directory. If you put it anywhere else, it won't work.
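To see what rules like these actually permit, you can feed them to the robots.txt parser in Python's standard library. This is a sketch under a few assumptions: the rules are typed in by hand rather than fetched from a live blog, the Mediapartners-Google group is assumed to carry an empty Disallow line (which means "nothing is off limits"), and the crawler name and paths are invented.

```python
from urllib.robotparser import RobotFileParser

# Hand-typed rules in the style of a Blogger robots.txt (an assumption, not a live copy).
BLOGGER_STYLE_ROBOTS = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /
"""

parser = RobotFileParser()
parser.parse(BLOGGER_STYLE_ROBOTS.splitlines())


def allowed(agent, path):
    """Return True if the named crawler may fetch the given path."""
    return parser.can_fetch(agent, path)


# "ExampleBot" and the paths are hypothetical.
checks = {
    "post": allowed("ExampleBot", "/2011/12/some-post.html"),
    "search": allowed("ExampleBot", "/search/label/sewing"),
    "adsense_search": allowed("Mediapartners-Google", "/search/label/sewing"),
}
print(checks)
```

Ordinary crawlers can read the posts but are kept out of the /search pages, while the AdSense crawler is allowed everywhere.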

Why do you want robots to "crawl" your content? Because that's how they index what your website is about and refer readers to you. It's also how they find high quality and original content.

If you don't allow search engines to "crawl" your data, the first time they see your content is AFTER it has been stolen and reposted elsewhere.

I know you are a good writer and you have tons of original content and creative ideas. Take credit for them. Make sure the search engines know where to find you.

How I became a victim.

I discovered What Do Automobiles and Spacecraft Have in Common? on a spam blog. On perusal, the blog looked like it was set up to serve car ads for Google AdSense. 100% of its content was stolen. The "About Us" page said:
All content in this Site is gather from all over the internet. From All of the best resource of internet. That can be digitized – books, newspapers, magazines, newsletters, journals, research, music and film – is being digitized and distributed via the Internet. This Site aim to provide the world’s easiest-to-use, most flexible and most affordable solution for getting all this content online.
We always strive to bring you the best information available, in the internet, Our vision is to give readers information and provided, as fastest and fresh as it can. Finally we want to create a platform that is completely open so publishers and developers can shape it anyway they want to meet their own special needs.
So I am not sure if I should be offended or flattered to be considered among the "best resource of internet". ;-)

It looked like it was one of a family of websites--each organized around an advertising theme--that all link to each other. There was not one iota of original content, except possibly for the "About Us" info above. (Are they non-native English speakers? Or just pretending to be?)

I wrote some guest posts for the Atlantic. They have a very restrictive robots.txt file.
User-agent: *
Disallow: /james-fallows/*/*
Disallow: /james-fallows/*/?cid=*
Disallow: /megan-mcardle/*/*
Disallow: /derek-thompson/*/*
Disallow: /marc-ambinder/*/*
Disallow: /ta-nehisi-coates/*/*
Disallow: /chris-good/*/*
Disallow: /alyssa-rosenberg/*/*
Disallow: /william-powers/*/*
Disallow: /jeffrey-goldberg/*/*
Disallow: /matthew-yglesias/*/*
Disallow: /bruce-falconer/*/*
Disallow: /daniel-indiviglio/*/*
Disallow: /derek-lowe/*/*
Disallow: /culture/category/*
Disallow: /special-report/archive/*
Disallow: /*/print/20*
Disallow: /*/this-week/*
Disallow: /*/last-week/*
Disallow: /*/thisweek/*
Disallow: /*/lastweek/*
Disallow: /*/magazinearticles/*
Disallow: /*/magazine-articles/*
Disallow: /*/blogarticles/*
Disallow: /*/blog-articles/*
Crawl-delay: 5
Allow: /
This is very unusual. Take a look at The Daily Beast's robots.txt:
User-agent: *
Disallow: /search.html

The Daily Beast makes everything visible to the search engines.
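The long file above leans on the * wildcard, which stands for any run of characters. Here is a rough sketch of how a crawler that supports wildcards might test a path against one of those rules. It is my own simplification, not any search engine's actual code, and the article paths are invented.

```python
import re


def rule_to_regex(rule_path):
    """Translate a robots.txt path pattern into a compiled regular expression.

    '*' matches any run of characters; a trailing '$' pins the end of the path.
    """
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    body = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def rule_matches(rule_path, url_path):
    """Return True if the rule pattern matches the start of the URL path."""
    return rule_to_regex(rule_path).match(url_path) is not None


# Hypothetical paths tested against rules copied from the two files above.
examples = [
    ("/james-fallows/*/*", "/james-fallows/archive/2011/12/some-post/"),
    ("/james-fallows/*/*", "/james-fallows/"),
    ("/*/print/20*", "/technology/print/2011/12/some-story/"),
    ("/search.html", "/politics/some-story.html"),
]
for rule, path in examples:
    print(rule, path, rule_matches(rule, path))
```

Under this reading, the blog's front page stays visible to crawlers, but anything two or more levels below it, which is where individual posts would live, is off limits.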

Now go peek at the robots.txt file of all your favorite news sites. I'll wait.
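If typing the address by hand gets tedious, this little sketch works out where a site's robots.txt lives from any page on that site. The example URL is a placeholder, and fetch_robots is defined but not called here, since it needs a network connection.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_url(page_url):
    """Return the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # robots.txt always sits at the root, so drop the path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def fetch_robots(page_url, timeout=10):
    """Download the site's robots.txt and return it as text (needs network access)."""
    with urlopen(robots_url(page_url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical article URL.
print(robots_url("https://www.example.com/news/2011/12/story.html?ref=home"))
```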

Inside Search, the official Google Search Blog, recently wrote about some search tweaks. If you were a web crawler, you'd know how much badly-written dreck there is out there on the internets. They program their web crawler robots to search for original content that can't be found elsewhere. I put significant effort into What Do Automobiles and Spacecraft Have in Common? and it's not a connection that I've seen pointed out in the popular press. Unfortunately, the first time the web search robots encountered it was on a spam blog. Don't let that happen to you.

There are sound reasons to exclude some portions of your website from robots. I've seen this at the beginning of several robots.txt files. It looks like a template was passed around and I don't know the original source.
# $Id: robots.txt,v 2008/12/10 20:12:19 goba Exp $
# robots.txt
# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.
# This file will be ignored unless it is at the root of your host:
# Used:
# Ignored:
# For more information about the robots.txt standard, see:
# For syntax checking, see:
If you are a scientific data center, serving terabytes of data to the public, you need to be careful about what you allow the robots to crawl. Otherwise, the robots can tie up your servers and science users can't get through to download the data they actually need. I asked around at AGU for best practices. Several people suggested blocking off most of the data areas but leaving a few selected examples of downloadable data open to the robots.
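Here is one way that advice might look in practice. It is a sketch, not a recommendation from any particular data center: the directory names and file names are invented. The Allow lines are written before the Disallow lines because some parsers, including the one in Python's standard library, stop at the first rule that matches.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(open_samples, blocked_areas):
    """Write a robots.txt that opens a few sample paths and blocks the bulk data."""
    lines = ["User-agent: *"]
    lines += [f"Allow: {path}" for path in open_samples]
    lines += [f"Disallow: {path}" for path in blocked_areas]
    return "\n".join(lines) + "\n"


# Hypothetical layout: a small sample directory inside a large data tree.
robots_txt = build_robots_txt(
    open_samples=["/data/samples/"],
    blocked_areas=["/data/"],
)
print(robots_txt)

# Check the result with the standard library parser ("ExampleBot" is made up).
checker = RobotFileParser()
checker.parse(robots_txt.splitlines())
sample_ok = checker.can_fetch("ExampleBot", "/data/samples/granule_001.nc")
archive_ok = checker.can_fetch("ExampleBot", "/data/archive/2011/granule_999.nc")
docs_ok = checker.can_fetch("ExampleBot", "/docs/overview.html")
print(sample_ok, archive_ok, docs_ok)
```

The sample files and the documentation pages stay crawlable, so the search engines can still learn what the site offers, while the bulk archive is left alone.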
