Sunday, December 18, 2011

Fight Back Against Content Theft

Many bloggers, myself included, have been plagued by content theft. Plagiarism is an age-old problem and will never be fully eradicated. But there are a few ways you can reduce your chances of being ripped off.

I'm not an intellectual property expert, but my understanding is that your work is implicitly understood to be copyrighted UNLESS you specifically say that it is not. Just to make it abundantly clear, you can put a copyright notice on your blog to let people know that it is not OK to take any content from your blog. On Blogger, put a footer that says something like this:
Copyright © 2011 YourBlogName/BusinessName All Rights Reserved.

[In dashboard, click on your blog, then on the design tab at the top. Then click on "Add a gadget" at the bottom of the template. Type your copyright text statement into the box.]

That cuts out the naive but honest actors. What about the dishonest ones?

Many websites had reposted part of What Do Automobiles and Spacecraft Have in Common?, a guest post I wrote for James Fallows' blog at the Atlantic Monthly website. Only one had the chutzpah to repost it verbatim in its entirety. Grrr. Very bad karma.

The upside is that I did a little research and learned some interesting lessons for this new post.

The overwhelming majority of blogs are known in the industry as "spam blogs" because they exist solely to serve up ads. To generate a large quantity of content as cheaply as possible, they resort to stealing it.

There are many companies that sell software to automate the process of "content scraping". (This cuts out the hard work of generating content and stealing it by hand!) There are a few legitimate reasons to scrape a website, e.g. to back it up. But I am just going to talk about the search and theft issue here.

This problem became particularly acute in Pakistan after news stories were published about a few spam bloggers (including a school boy) earning real $$$$ from the Google AdSense program. See this Express Tribune article from Pakistan that estimates that ~90% of Pakistani blogs were serving exclusively stolen content. (I've heard that is an underestimate.)

Note that the bloggers were not blocked from blogging and that their blogs continued to be included in search and referral. They were merely banned from participating in the Google AdSense advertising (for $) program. They could still earn money from other sources. Some articles in the media, including Global Voices, don't make that distinction clear. It's not censorship. It's just preventing people from profiting off theft on your ad network.

If you find your content someplace it doesn't belong, take these steps.

[This is not meant to be a complete list because I am not an expert. If you know more things to try, please teach us by leaving a comment.]

If you find a theft, don't link to it. You'll only drive up their importance ranking for search.

Instead, inform the search engines so that they can devalue the ranking for that webpage, and possibly for the entire site.

Google provides a Report Scraper Pages online document. Fill it out with the URL addresses of the original content that you wrote and the offending webpage and submit. (If you know a similar link for other search engines, please leave the link in a comment.)

It may take a while, but the scraper website will slowly disappear from the Google search engine. After several days, I noticed that the entire site that stole my content was excluded from the Google AdSense program and had slipped in the keyword search ranking.

Get the stolen content removed. You may want to give the search engine companies a few days to look at the offending website before you do this.

Do a domain trace on the URL address to find out who owns it and where it is hosted. You can use any one of the many whois lookups on the web.

For instance, I did a trace at both Network Solutions and Whois.net and found out that the address was bought through a domain reseller rather than registered directly (a common obfuscation trick) and that the domain registration contact is hidden rather than public (there are sound security reasons to do this, sometimes).

Scroll down to see the name servers. That tells you the name of the ISP that is hosting the website. In my example, I saw:
Name Servers:
ns1.gothost.org
ns2.gothost.org
Write to the ISP at abuse@hostingcompany.whatever. In my case, that would be abuse@gothost.org. In your message, include the URL of the stolen content that they are hosting and the URL of the original content, and request that they take down the stolen content.
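If you have several stolen pages to chase, the name-server step can be scripted. This is only a rough sketch: it assumes you have pasted the whois output into a string, that the name servers appear under a "Name Servers:" heading as in my example, and that the hosting company's domain is the last two labels of the name server. The function names are made up for illustration.

```python
import re

def name_servers(whois_text):
    """Return the host names listed under a 'Name Servers:' heading."""
    servers = []
    in_block = False
    for line in whois_text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("name server"):
            in_block = True
            # Some whois services put the host on the same line as the label.
            _, _, rest = stripped.partition(":")
            if rest.strip():
                servers.append(rest.strip().lower())
            continue
        if in_block:
            if re.fullmatch(r"[A-Za-z0-9.-]+\.[A-Za-z]{2,}", stripped):
                servers.append(stripped.lower())
            else:
                in_block = False
    return servers

def guess_abuse_address(server):
    """Guess abuse@<domain> from a name server such as ns1.gothost.org.

    Assumption: the host's domain is the last two labels. That is wrong for
    suffixes like .co.uk, so check the result by eye before sending mail.
    """
    domain = ".".join(server.split(".")[-2:])
    return "abuse@" + domain

sample = """Name Servers:
ns1.gothost.org
ns2.gothost.org
"""

servers = name_servers(sample)
print(servers)
print(guess_abuse_address(servers[0]))
```

The guess is just a starting point. Confirm the abuse contact on the hosting company's own website before you write.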

I lucked out because gothost is an American company based in Florida and needs to comply with US laws. They took the content off their servers immediately. Abroad, you may be out of luck.

Robots are your friend.
Do you know what your robots.txt file says? I didn't even know what it was until a friend told me about it a month ago. Wikipedia has a synopsis.

Basically, it tells web crawler robots which parts of your site they can and cannot access. This is purely advisory and nonbinding, but the major search engines all follow the instructions you set.

On the Blogger dashboard, select your blog, then click on the Settings tab. Select yes for "Let search engines find your blog?"

If you do that, then your robots.txt file will look something like http://badmomgoodmom.blogspot.com/robots.txt
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /

Sitemap: http://badmomgoodmom.blogspot.com/feeds/posts/default?orderby=updated
Note: robots.txt must always reside at the top-level root directory. If you put it anywhere else, it won't work.
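If you want to see how a crawler would read a file like the one above, Python's standard library has a robots.txt parser. The sketch below feeds it the same text and asks three questions. The post and label URLs are made-up examples, and real crawlers may interpret some rules a little differently from this parser.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /

Sitemap: http://badmomgoodmom.blogspot.com/feeds/posts/default?orderby=updated
"""

def build_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

blog = build_parser(ROBOTS_TXT)
base = "http://badmomgoodmom.blogspot.com"

# Hypothetical URLs, used only to exercise the rules.
post_ok = blog.can_fetch("*", base + "/2011/12/example-post.html")
search_ok = blog.can_fetch("*", base + "/search/label/example")
adsense_ok = blog.can_fetch("Mediapartners-Google", base + "/search/label/example")

print(post_ok, search_ok, adsense_ok)  # True False True
```

Ordinary crawlers may read the posts but not the /search pages. The empty Disallow line for Mediapartners-Google (the AdSense crawler) means nothing is off limits to it.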

Why do you want robots to "crawl" your content? Because that's how they index what your website is about and refer readers to you. It's also how they find high quality and original content.

If you don't allow search engines to "crawl" your data, the first time they see your content is AFTER it has been stolen and reposted elsewhere.

I know you are a good writer and you have tons of original content and creative ideas. Take credit for them. Make sure the search engines know where to find you.

How I became a victim.

I discovered What Do Automobiles and Spacecraft Have in Common? on a spam blog. When I perused it, the blog looked like it was set up to serve car ads for Google AdSense. 100% of their content was stolen. The "About Us" page said:
All content in this Site is gather from all over the internet. From All of the best resource of internet. That can be digitized – books, newspapers, magazines, newsletters, journals, research, music and film – is being digitized and distributed via the Internet. This Site aim to provide the world’s easiest-to-use, most flexible and most affordable solution for getting all this content online.
We always strive to bring you the best information available, in the internet, Our vision is to give readers information and provided, as fastest and fresh as it can. Finally we want to create a platform that is completely open so publishers and developers can shape it anyway they want to meet their own special needs.
Admin
So I am not sure if I should be offended or flattered to be considered among the "best resource of internet". ;-)

It looked like it was one of a family of websites--each organized around an advertising theme--that all link to each other. There was not one iota of original content, except possibly for the "About Us" info above. (Are they non-native English speakers? Or just pretending to be?)

I wrote some guest posts for theatlantic.com. They have a very restrictive http://www.theatlantic.com/robots.txt file.
User-agent: *
Disallow: /james-fallows/*/*
Disallow: /james-fallows/*/?cid=*
Disallow: /megan-mcardle/*/*
Disallow: /derek-thompson/*/*
Disallow: /marc-ambinder/*/*
Disallow: /ta-nehisi-coates/*/*
Disallow: /chris-good/*/*
Disallow: /alyssa-rosenberg/*/*
Disallow: /william-powers/*/*
Disallow: /jeffrey-goldberg/*/*
Disallow: /matthew-yglesias/*/*
Disallow: /bruce-falconer/*/*
Disallow: /daniel-indiviglio/*/*
Disallow: /derek-lowe/*/*
Disallow: /culture/category/*
Disallow: /special-report/archive/*
Disallow: /*/print/20*
Disallow: /*/this-week/*
Disallow: /*/last-week/*
Disallow: /*/thisweek/*
Disallow: /*/lastweek/*
Disallow: /*/magazinearticles/*
Disallow: /*/magazine-articles/*
Disallow: /*/blogarticles/*
Disallow: /*/blog-articles/*
Crawl-delay: 5
Allow: /
This is very unusual. Take a look at http://www.thedailybeast.com/robots.txt.
User-agent: *
Disallow: /search.html

Sitemap: http://www.thedailybeast.com/sitemap.xml
The Daily Beast makes everything visible to the search engines.
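To compare the two files, it helps to see how the * wildcards behave. Python's built-in robots.txt parser does not understand wildcards, so here is a small hand-rolled matcher. It assumes the convention the major search engines document: * matches any run of characters, and a trailing $ anchors the end of the path. It looks only at Disallow lines and ignores Allow lines and rule precedence. The guest-post path is hypothetical.

```python
import re

def rule_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumption: '*' matches any run of characters and a trailing '$'
    anchors the end of the URL path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(path, disallow_rules):
    """True if any Disallow pattern matches from the start of the path."""
    return any(rule_to_regex(rule).match(path) for rule in disallow_rules)

# A few of the Disallow lines from each file above.
ATLANTIC = ["/james-fallows/*/*", "/*/print/20*", "/culture/category/*"]
DAILY_BEAST = ["/search.html"]

guest_post = "/james-fallows/archive/example-guest-post/"  # hypothetical path

print(is_blocked(guest_post, ATLANTIC))
print(is_blocked(guest_post, DAILY_BEAST))
```

Under these assumptions, a path two levels below /james-fallows/ is off limits at The Atlantic, while the same path would be open to crawlers under the Daily Beast rules.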

Now go peek at the robots.txt file of all your favorite news sites. I'll wait.

Inside Search, the official Google Search Blog, recently wrote about some search tweaks. If you were a web crawler, you'd know how much badly-written dreck there is out there on the internets. They program their web crawler robots to search for original content that can't be found elsewhere. I put significant effort into What Do Automobiles and Spacecraft Have in Common? and it's not a connection that I've seen pointed out in the popular press. Unfortunately, the first time the web search robots encountered it was on a spam blog. Don't let that happen to you.

Aside:
There are sound reasons to exclude some portions of your website from robots. I've seen this at the beginning of several robots.txt files. It looks like a template was passed around and I don't know the original source.
# $Id: robots.txt,v 1.9.2.1 2008/12/10 20:12:19 goba Exp $
#
# robots.txt
#
# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.
#
# This file will be ignored unless it is at the root of your host:
# Used:    http://example.com/robots.txt
# Ignored: http://example.com/site/robots.txt
#
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/wc/robots.html
#
# For syntax checking, see:
# http://www.sxw.org.uk/computing/robots/check.html
If you are a scientific data center, serving terabytes of data to the public, you need to be careful about what you allow the robots to crawl. Otherwise, no science users can get through to download data to actually use. I asked around at AGU for best practices. Several people suggested blocking off most of the data areas but leaving a few selected examples of data that can be downloaded from the site open to the robots.
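Here is one way that advice might look in practice. The directory layout (/data/ for the bulk archive, /data/samples/ for the examples) and the example.org URLs are invented for illustration. The Allow line comes first because some parsers, including the one in Python's standard library, stop at the first rule that matches.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical layout: bulk archives under /data/, a few examples under /data/samples/.
DATA_CENTER_ROBOTS = """\
User-agent: *
Allow: /data/samples/
Disallow: /data/
"""

dc = RobotFileParser()
dc.parse(DATA_CENTER_ROBOTS.splitlines())

site = "http://example.org"
sample_ok = dc.can_fetch("*", site + "/data/samples/readme.html")
bulk_ok = dc.can_fetch("*", site + "/data/2011/granule-0001.nc")
docs_ok = dc.can_fetch("*", site + "/about.html")

print(sample_ok, bulk_ok, docs_ok)  # True False True
```

The samples and the ordinary web pages stay visible to search engines, while the terabytes under /data/ are left for the people who actually need to download them.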

Friday, December 16, 2011

AI Finis!

Remember when Iris and I signed up for the online AI (Artificial Intelligence) class as a team?


I carried on alone.

But she wandered by while I was working with the (logic) Truth Tables for a final exam question and asked me to explain the notation.

Then she proceeded to rattle off the correct answers by inspection.

It's not cheating because I would have gotten the correct answers in the end. Truly.

Besides, we signed up as a team.

And, is it fair that a guy who competed in all of the DARPA Grand Challenges (and placed quite highly on some of them) and already has a PhD in robotics is also taking the class?

Yes, the percentage of people in the on-line version of the class who have gotten perfect marks on all the homework assignments and the midterm is double the percentage among the students actually taking the class at Stanford, but what if we disqualify all of the ringers like the one above? From the discussion forums, it appears that quite a few professors are taking the class to learn about e-learning rather than robotics.

Or is it sour grapes because I can do complicated Bayesian statistics calculations but can't subtract 14 from 100 correctly? (I was a math major. We don't use actual numbers after the first year.)

This class was a blast. It took the full 10 hours a week that the teachers estimated we needed to put into it. But, looking back over the homework and exams, I can see how much makes sense to me now. Before the class, I had no clue even how to read the gobbledygook notation.

Serendipitously, I embarked on a new project at work where I can apply some of what I learned. What fun! And I get paid for it.

A big thank-you to instructors Peter Norvig and Sebastian Thrun and all the IT staff that kept the servers (mostly) running during this record-breaking huge class.

Thursday, December 15, 2011

And the feathers flew

In 20 consecutive days, I was away for all but a 10-hour stretch and a 36-hour stretch. Within those 46 hours, I unpacked, did laundry, repacked, went to work (and unpacked from my office move in absentia), and slept a little bit.

It's been hectic. No wonder I came back a little under the weather.

What did I find at home? Not toilet paper. We were running low before I left, but I didn't have time to get any. Do you think the other toilet users in my household would have bought some on their way home from work or school? We don't lack for stores en route in our high-density urban infill neighborhood.

OTOH, Bad Dad helped out on a coworker's field experiment in addition to his regular work and he was a single dad two weeks in a row. Plus, they claim that we haven't truly run out unless ALL bathrooms are out of TP and we have no facial tissues in the house.

When I got home, Bad Dad said that he had to spend Sunday at the office so I had to supervise homework and take care of the home front on my own. I have to admit that the field campaign and single dad trump cards are compelling.

I also discovered that my daughter had only one weekend day left before her end of semester video project was due. She told me that her partner dropped out the prior Wednesday, she wrote her script on Thursday, and it was due on Friday of the following week.

The script had more than half a dozen characters so I asked her who she had lined up as actors and her filming schedule.

You guessed it.

I told her to work those phones double time.

Then she showed me the script with her notes on costumes and wanted me to make them right now.

In the end, she rounded up two friends who brought their own hats. That pile of cut up brown paper bags on the dining room table turned out to be headdresses (who would have guessed?) and we found all the feathers, fabric, belts, jewelry and pins she needed in my sewing room and closet.

There were feathers flying everywhere downstairs during and after headdress construction, but I have to admit the headdress was very creative. It's pretty cool to come home to a child who sees a grocery sack as an Aztec headdress (with paper curls) and a husband game enough to put it on and read dialogue like he means it.

I was always a method actress anyway. If you play a happy family, you can almost believe it yourself.

Thursday, December 08, 2011

Meeting Shams

Yesterday, I took a break from work meetings to lunch with Shams. We did NOT plan to coordinate, but we did. Note that she made her coat, cowl and pants. The only thing I made in the picture is my sweater.

We didn't natter on as much about sewing as I expected. We were both taking a quick break from work and the conversation flowed between Berkeley (which both of us attended), working in tech, raising our daughters and fitting exceptional figures.

What is an exceptional figure? Chances are very good that you have one.

Suppose something is engineered to work for 90% of circumstances, that is, for everyone between the 5th and 95th percentiles in some metric. That seems reasonable and your product will work for 90% of the market, right?

Not so fast. Think multi-dimensionally.
(0.9)^6 ≈ 0.53
In six independent dimensions, nearly half of the potential market will fall outside your engineering specs in at least one dimension.
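For anyone who wants to play with the numbers, here is the same arithmetic as a few lines of Python. It assumes, as above, that each dimension fits 90% of people and that the dimensions are independent. The function name is just for illustration.

```python
def fraction_fitting(per_dimension, dimensions):
    """Fraction of people inside the spec in every one of the dimensions,
    assuming the dimensions are independent of each other."""
    return per_dimension ** dimensions

for n in (1, 3, 6, 10):
    fits = fraction_fitting(0.9, n)
    print(f"{n:2d} dimensions: {fits:.2f} fit, {1 - fits:.2f} fall outside")
```

Real body measurements are correlated, so independence is the pessimistic case. The point stands either way: the more dimensions you add, the fewer people the spec covers.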

Ready-to-wear (RTW) clothing manufacturers can't fit everyone. They have to make decisions about who their market is. Consumers have to take RTW items to tailors. Even people who can't sew can become virtual dressmakers by hiring custom clothiers. (I haven't sewn for others for money since my undergrad days, but I do swap favors with friends. Eric's wife gets clothes and I get a place to stay when in their town.)

Look back at the picture at the top of this post. Who is easier to fit?

That's a trick question. Shams' hips are an inch smaller than mine. We are both 5'5" tall, but our torso to leg proportions vary. She has to add inches to bust darts and I have to take them out.

In college, I had a 14" drop between my hip and waist measurements, 39" to 25". In mid-life, that has decreased to 9-10"; my hips are still 39" but my waist is ~29-30". I can sometimes find RTW pants that fit, but it is still not easy. It was nearly impossible 20 years ago.

We both agreed that we are not fitting experts. But we are experts in fitting our own unique bodies. And we like to think that we fit our own unique personalities, too. ;-)

We had such a good time in our short meeting, I invited Shams to visit me in LA. She's never shopped the LA garment district. That needs to be rectified.

Shams posted her account of our meeting. Go there to see another photo.