Thursday, June 30, 2005

bad bad bad hack

Have you ever written a piece of code that you know is bad, but had to push out anyway because the feature had to be implemented?

I ran across one of those scenarios the other day. While using Interop Services, I discovered that a wchar pointer returned from VC++ 6.0 code would not come back with any value when marshalled as a string (with UnmanagedType.LPWStr). The only way I could get it to work was to marshal it as an UnmanagedType.LPArray of char[]. This got me the string back, except it looked like this: "h e l l o w o r l d". The spaces were there because the C++ wide-character representation uses double the space of a plain char for each character. So, in true bad hack fashion, I built another string by picking alternate characters. The new string now read "hello world". If anyone knows a better way, please let me know.
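
For anyone curious why picking alternate characters works, here is a rough sketch in Python (not the original C# code). It assumes the native side handed back plain ASCII text as UTF-16LE wide characters, which is my guess at the cause and is not confirmed above. Under that assumption every second byte is zero, so reading one byte per character gives the "spaced out" string, and decoding the whole buffer as UTF-16 is the less hacky equivalent of skipping every other character.

```python
# Assumption: the buffer holds UTF-16LE wide characters, as wchar_t does on Windows.
raw = "hello world".encode("utf-16-le")

# Reading one byte per character reproduces the "h e l l o ..." effect:
# every second character is a NUL (0x00), which often displays as a gap.
byte_per_char = raw.decode("latin-1")

# The hack from the post: keep every other character.
hack = byte_per_char[::2]

# Decoding with the matching encoding gives the same text without the hack.
decoded = raw.decode("utf-16-le")

print(repr(byte_per_char))
print(hack, "|", decoded)

# The hack only holds for characters whose high byte is zero.
# For a character such as the Greek letter pi it returns the wrong text.
pi_raw = "\u03c0".encode("utf-16-le")
hack_pi = pi_raw.decode("latin-1")[::2]
decoded_pi = pi_raw.decode("utf-16-le")
print(hack_pi == decoded_pi)
```

The last three lines are the reason the hack is fragile: it silently drops the high byte of every character, so it is only safe for plain ASCII (or Latin-1) text.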

Anti-phishing project - and a bad implementation attempt

So, it finally happened yesterday afternoon. I had to give up on my skunkworks project. Below is an article on what the attempt was, and how and why I had to abandon it. Hopefully, it will help someone heading down the same path (if someone is really dumb enough to head down that path).

Phishing project (a.k.a. what I did in the spring of 2005) Rishi Pande

Goal: The main goal of this project was to stop phishing.

Background:
Phishing, also referred to as brand spoofing or carding, is a variation on “fishing,” the idea being that bait is thrown out with the hope that while most will ignore the bait, some will be tempted into biting. Most current anti-phishing measures work in client-side browsers. Ideas have ranged from simply not displaying a site matching certain criteria to having the browser skin change. However, all these measures are client-side changes, and a dumb user can still click right past them. Also, the company that gets phished carries no liability but faces great damage (lost customers, public embarrassment, etc.). The main ‘trick’ in phishing is the oldest in the book, namely, making a user believe that something fake is in fact genuine.

Approach:
The core idea behind the project was to verify the authenticity of the website. The fake website has to look similar to (if not the same as) the actual website, as shown in the Anti-phishing archives.

To determine if a site has been phished, we need the contents and location of the real site and the contents and location of the fake site. Since the text and images on the fake webpage are a near replica of the real webpage, the idea was to use information retrieval algorithms to determine the similarity between the two pages. If the two pages match but live at different locations, then the site is being phished. This would be valuable information to the developers/maintainers of the website.

Algorithm:
The two kinds of user-visible information that a webpage has are text and images. Thus a combination of a textual document matcher and an image matching algorithm should cover both when checking for similarity.
Text matching
Information retrieval research has produced several document matching algorithms. Chief among these is the vector-based search. The vector-based approach works by stripping stop words (and, it, or, but, etc.) from a document and then taking the stem of each remaining word. Each stem is then weighted by how often it occurs in the document, scaled by its Inverse Document Frequency so that words common to every document count for less. The resulting list of weights is called the vector of the document. Vectors of all documents to be matched are calculated before any actual matching is done.
When a new document to be matched is supplied, its vector is calculated. Then the cosine of the angle between the new vector and each stored vector is calculated. If the cosine is high (close to 1), then the documents are said to be similar. High, of course, is relative in this scheme. Sample matches should be made to set the value of the bar that determines the cut-off point.
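
The text matching steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the code from the project: the stop word list, the crude_stem helper, the smoothed IDF formula, and the sample pages are all assumptions made up for the example, and a real system would use a proper stemmer and a much larger corpus.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "but", "in", "is", "it", "of", "or", "the", "to", "your"}


def crude_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def terms(text):
    """Lowercase, split into words, drop stop words, stem what is left."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def build_idf(corpus_terms):
    """Smoothed inverse document frequency for every term in the known pages."""
    n = len(corpus_terms)
    doc_freq = Counter(t for doc in corpus_terms for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    unseen = math.log(1 + n) + 1.0  # weight for terms no known page contains
    return idf, unseen


def vectorize(doc_terms, idf, unseen):
    """Term frequency times IDF, stored as a sparse dict."""
    return {t: count * idf.get(t, unseen) for t, count in Counter(doc_terms).items()}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Vectors for the known (real) pages are computed up front.
known_pages = {
    "bank": "Welcome to Example Bank. Enter your username and password to sign in to online banking.",
    "news": "Local team wins the championship after a dramatic final game.",
}
known_terms = {name: terms(text) for name, text in known_pages.items()}
idf, unseen = build_idf(list(known_terms.values()))
known_vectors = {name: vectorize(t, idf, unseen) for name, t in known_terms.items()}

# A suspect page is vectorized the same way and compared against each known page.
suspect = "Welcome to Example Bank online banking. Please enter your username and password."
candidate = vectorize(terms(suspect), idf, unseen)
scores = {name: cosine(candidate, vec) for name, vec in known_vectors.items()}
print(scores)
```

The suspect page scores high against the bank page and zero against the news page. Where to put the cut-off between "similar" and "not similar" still has to be tuned on sample matches, as noted above.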

Image matching
Image matching can get highly complex. This is because the information in the images is visual but represented as binary. Several techniques have been developed that match images based on edges in the image, actual images, and geo-spatial methods. A new technique that has come about in the past few years is called Content Based Image Retrieval (CBIR). CBIR has accurate image matching properties. However, implementations of most algorithms are still proprietary. Also, most image matching algorithms face a sizing problem, i.e. if the image is resized, the matching characteristics change.
An interesting implementation is one done by Steve Corbett (http://www.scorbett.ca), called ImageCompare. The implementation basically takes any image and converts it to a fixed size (100 x 100) with a grayscale palette. Any minor differences are not seen by the algorithm. We used this image comparison method and found the algorithm to be exactly what we needed. This was because if the phisher makes too many changes to the image, the user may not be convinced that the website is genuine.
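
To make the idea concrete, here is a rough sketch in Python of the "shrink to a fixed grayscale size, then compare" approach. It is not ImageCompare's actual code. The nested-list image format, the nearest-neighbour resize, the mean-absolute-difference score, and the synthetic test images are all assumptions chosen so the example runs without an imaging library.

```python
def to_gray(pixels):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for r, g, b in row] for row in pixels]


def resize_nearest(gray, size):
    """Nearest-neighbour resize of a grayscale grid to size x size."""
    h, w = len(gray), len(gray[0])
    return [[gray[y * h // size][x * w // size] for x in range(size)] for y in range(size)]


def signature(pixels, size=100):
    """Fixed-size grayscale version of an image, so the original size stops mattering."""
    return resize_nearest(to_gray(pixels), size)


def difference(sig_a, sig_b):
    """Mean absolute difference, from 0.0 (identical) to 1.0 (black vs white)."""
    total = sum(abs(a - b) for row_a, row_b in zip(sig_a, sig_b) for a, b in zip(row_a, row_b))
    return total / (255 * len(sig_a) * len(sig_a[0]))


def make_logo(width, height):
    """Synthetic test image: a dark blue box centred on a white background."""
    def pixel(x, y):
        inside = width // 4 <= x < 3 * width // 4 and height // 4 <= y < 3 * height // 4
        return (0, 0, 128) if inside else (255, 255, 255)
    return [[pixel(x, y) for x in range(width)] for y in range(height)]


def make_stripes(width, height):
    """Synthetic test image: ten vertical bands alternating black and white."""
    def pixel(x):
        return (0, 0, 0) if (x * 10 // width) % 2 == 0 else (255, 255, 255)
    return [[pixel(x) for x in range(width)] for _ in range(height)]


logo_big = signature(make_logo(300, 200))
logo_small = signature(make_logo(75, 50))   # same picture at a quarter of the size
stripes = signature(make_stripes(300, 200))

same_score = difference(logo_big, logo_small)
other_score = difference(logo_big, stripes)
print(same_score, other_score)
```

The resized copy of the logo scores close to zero against the original, while the unrelated striped image scores much higher. As with the text matcher, the threshold for calling two images a match would have to be set by trying it on sample images.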

Problems
This section documents the problems we ran into during development.
The first sign of trouble came when the document matching algorithm started generating false positives. An issue that I had overlooked during the earlier stages was that the pages that get phished are the login pages. Unfortunately, the content of most login pages is the same: enter username/password. Therefore, all login pages would match each other and generate false positives.
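
A quick way to see the false-positive problem is to compare two login pages from unrelated sites. The sketch below uses plain word counts and cosine similarity in Python; the two page texts are invented for illustration and are not from the project's data.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Bag of lowercase words with their counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity of two Counters (missing words count as zero)."""
    dot = sum(n * v[t] for t, n in u.items())
    norm = math.sqrt(sum(n * n for n in u.values())) * math.sqrt(sum(n * n for n in v.values()))
    return dot / norm if norm else 0.0


# Two made-up login pages for completely unrelated sites.
bank_login = word_counts("Sign in to Example Bank. Enter username and password. Forgot password?")
shop_login = word_counts("Sign in to Sample Shop. Enter username and password. Forgot password?")

score = cosine(bank_login, shop_login)
print(round(score, 3))
```

Only the brand names differ between the two pages, so the score comes out high even though neither site is phishing the other.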
So it was determined that we might have to alter the algorithm to not check the text at all and instead just check all the images on a webpage. This means that all the images on a particular page would have to be indexed with the page.
Such a simple method is extremely inefficient because you really can’t index images this way. No vector can be formed for a document based on its images, because images do not have a ‘dictionary’ like words do. Every lookup therefore means comparing against every stored image, which makes searching impractical with any reasonably sized database of web pages.

Conclusion
This implementation was an attempt at stopping phishing attacks at least from a proof-of-concept point of view. Unfortunately, it did not pan out. But then again, this is what I think research is all about :)
NOTE: Six months after I gave up on this project, CNET published an article this morning describing something that seems remarkably similar to my idea detailed here. They call it "phishing print". It would be nice if they got beyond the marketing and into the technical nitty-gritty.

highs and lows of being on TV (or not)

WARNING: RANT ahead

I work on a product that was supposed to be on T.V. this morning (The Today Show). So I wake up and tune in to the WRONG CHANNEL!!!

I got a recording of the show, and all they spoke about was our competitor. Then, finally, they started to speak about the product. I got excited and felt a little tingly inside, and then I heard the words of death: 'That's all that we have time for. We are going into a commercial break now!' Oh, damn you NBC!!!
 

 

Copyright © Rishi Pande