Preventing Spam at the 'Source'?

One way to reduce spam is to stop it being generated in the first place (rather than detecting it after it arrives)!

 

Now, I'm sure that a lot of (most?) spam is generated because one's personal email address appears in a web page (of one sort or another).  Well, here are the bare bones of an idea that could prevent email addresses being harvested by spambots in the first place (maybe)!  Plus, it could be automated to a very large degree (100% even).

 

Any robot/crawler will typically spend very little time processing a webpage – after all, it's not interested in 'reading it'!  Additionally, for each page it has scanned, it'll then very rapidly follow any links found on that page to other pages on the same website - and scan them too (and, of course, most pages do have links to other pages on the same site).

 

So, here's the idea - use Machine Learning techniques to prevent this harvesting!

 

This is just a very rough idea – but I think it could work reasonably well?  Ok, enough with the defensive posture - here we go …

1. Make each page active – for example, by using SSI in conjunction with PHP.  In other words, put some server-side code behind each page, so that it will always run whenever that page is requested.

 

2. As any page is requested, log the requester's IP-address, and the time they requested the page (as accurately as possible).

 

3. When any other page is requested, do the same with that too.

 

4. Now, as each page is requested, compare its request time with the times of the other requests from the same IP address.

 

5. If the times are 'very close' (I know, parameterised/subjective), the probability of the visitor being a robot (instead of a human) is increased.

 

6. After just a few pages have been requested, one should be able to say with a high degree of confidence whether the visitor is a human or a robot.

 

7. Now you need to do a reverse IP-lookup (DNS) on the IP-address that's requesting pages.  Of course, in reality, you'd probably do this earlier on.  The idea here is that you might not want to tag (say) Google as an unwanted-robot!  I believe that all 'good' robots have names that they don't mind you seeing (e.g., googlebot).

 

8. If the robot isn't a recognized search-engine robot (perhaps the reverse-IP/DNS lookup doesn't even yield a name!), you now return it a 'get lost' page on all subsequent page-requests.  Of course, you record its IP-address so that the next time it visits, you don't have to do any of the earlier stuff - you just tell it to get lost straight away!

 

9. You publish the unwanted robot's IP-address in a webpage (or, better still, in an RSS feed) so that others can reference it too – these others can then use this source-document to prime their own known-list of unwanted-robots (these lists will need aggregating of course).

 

10. Of course, if an unwanted-robot does scan your pages, you'll NOT want those pages to contain mailto tags.  So, by default, and until you decide that the visitor isn't an unwanted-robot, all your pages' mailto-tags should be obscured in some way.

 

11. If the visitor is a human, the code behind the page returns normal pages, complete with proper mailto tags/data.
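
To make steps 2 to 6 (and 10 and 11) a bit more concrete, here's a very rough sketch in Python.  It's only an illustration of the timing idea, not a finished implementation: the names (VisitorLog, render_email) and the three thresholds are assumptions I've made up for the example, and the real values would need tuning against actual traffic.  It also uses a simple fixed rule rather than anything learned from data.

```python
import time

# All three thresholds are assumed values chosen for illustration only.
FAST_GAP_SECONDS = 2.0   # gaps shorter than this look automated
MIN_REQUESTS = 4         # pages to see before making any decision
ROBOT_FRACTION = 0.75    # share of fast gaps that marks a visitor as a robot


class VisitorLog:
    """Records request times per IP address and classifies each visitor."""

    def __init__(self):
        self._times = {}

    def record(self, ip, when=None):
        """Log one page request and its time (steps 2 and 3)."""
        stamp = time.time() if when is None else when
        self._times.setdefault(ip, []).append(stamp)

    def classify(self, ip):
        """Return 'robot', 'human' or 'unknown' (steps 4 to 6)."""
        times = sorted(self._times.get(ip, []))
        if len(times) < MIN_REQUESTS:
            return "unknown"
        # Compare each request time with the previous one from the same IP.
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        fast = sum(1 for gap in gaps if gap < FAST_GAP_SECONDS)
        return "robot" if fast / len(gaps) >= ROBOT_FRACTION else "human"


def render_email(address, verdict):
    """Return a real mailto link for humans only (steps 10 and 11).

    Anything else ('unknown' or 'robot') gets an obscured address.
    """
    if verdict == "human":
        return '<a href="mailto:{0}">{0}</a>'.format(address)
    return address.replace("@", " at ").replace(".", " dot ")
```

Each page's server-side code would call record() with the requester's IP-address, then pass the result of classify() to render_email().  Note that the address stays obscured while the verdict is still 'unknown', which is the 'obscured by default' behaviour from step 10.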

Anyway, it's just an idea (anything to promote more ML!)
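
One more rough sketch, this time for the reverse-lookup check in steps 7 and 8.  Again, treat it as an illustration under assumptions: the function name and the allow-list of hostname suffixes are hypothetical (the only good robot named above is googlebot), and the forward lookup that confirms the name maps back to the same IP-address is an extra I've added, since a reverse name on its own is controlled by whoever owns the IP-address.  The two lookup functions can be swapped out, which makes it possible to try the logic without touching the network.

```python
import socket

# Assumed allow-list of hostname suffixes for 'good' robots (illustrative).
GOOD_ROBOT_SUFFIXES = (".googlebot.com",)


def is_known_good_robot(ip, reverse=None, forward=None):
    """Return True if ip reverse-resolves to an allowed name that maps back to ip.

    reverse(ip) -> hostname and forward(hostname) -> list of IPs default to
    real DNS lookups via the socket module.
    """
    if reverse is None:
        reverse = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward is None:
        forward = lambda name: socket.gethostbyname_ex(name)[2]
    try:
        hostname = reverse(ip)
    except OSError:
        return False  # the lookup yields no name at all (step 8)
    if not hostname.lower().endswith(GOOD_ROBOT_SUFFIXES):
        return False  # it has a name, but not one we recognize
    try:
        # Added check: the claimed name must resolve back to the same IP.
        return ip in forward(hostname)
    except OSError:
        return False
```

A visitor that the timing check flags as a robot, and for which this function returns False, would be the one that gets the 'get lost' page and has its IP-address recorded.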

 

A couple of P.S. comments:

 

1. Of course, if your email address is already being spammed, this won't really help you (for now).  However, it'll help people who haven't been spammed yet, and it'll help anyone who gets a new email address in the future (in time, that's all of us I guess).

 

2. It does require cooperation to work best - fingers crossed.  And anyway, in my opinion it'll take cooperation (of one sort or another) to ultimately stop spam!

 



Posted on Sunday, June 20, 2004 8:54 AM