Recently I was discussing SQL Server encryption with some friends who have been using it to encrypt short strings such as Social Security numbers at their shop. I commented, "Just try searching those Social Security Numbers," they shared my lamentation, and we moved on to other subjects. Later that evening, though, I thought there must be a way to search those wretched encrypted blocks--somehow--and worked out the solution you are about to read.
The Problem
The difficulty lies in the fact that you cannot compare varbinaries in a predicate. Combine this with the fact that encrypted text is stored as a varbinary, and the result is that you cannot search the varbinary encrypted text, even if you run the search term through the encryption algorithm before performing the comparison.
To illustrate, suppose you have a Person table that stores a name and a Social Security Number (SSN) encrypted with a certificate, defined like this:
CREATE TABLE [dbo].[Person](
[Id] [int] IDENTITY(1,1) NOT NULL,
[SSN] [varbinary](128) NULL,
[Name] [varchar](50) NULL,
CONSTRAINT [PK_Person] PRIMARY KEY CLUSTERED
(
[Id] ASC
)ON [PRIMARY]
)ON [PRIMARY]
It would be nice if you could just encrypt the search text and build a predicate that compares the term and the encrypted column, like this:
declare @encryptedSSN varbinary(128);
set @encryptedSSN = ENCRYPTBYCERT(CERT_ID('certKey'), @ssn);
select name, CONVERT(char(9),DECRYPTBYCERT(CERT_ID('certKey'), SSN) as SSN
from dbo.Person
where ssn = @encryptedSSN;
As I mentioned previously, though, this approach simply does not work.
Of course, you could make the predicate succeed by setting the datatype of the encrypted text column to char(128), and then converting the encryption products (stored SSNs plus the search term) from varbinary to char(128) before using them. However, indexing a column that large, and performing comparisons with it, would consume a lot of disk and processing resources.
The Solution
Instead of converting the large encrypted block into char(128) data, you can work with the hash of the SSN instead. If you calculate the hash using the SHA1 algorithm, the hash will require only 20 bytes of storage, so the burden on disk and processing resources will drop significantly. And since the SHA1 algorithm is considered to be very secure, you will not add any significant security risk to your solution. From start to finish, here's how the approach will work:
1. Include a char(20) column in your table to hold the hash of the SSN. The new table definition will look like this:
CREATE TABLE [dbo].[Person](
[Id] [int] IDENTITY(1,1) NOT NULL,
[SSN] [varbinary](128) NULL,
[Name] [varchar](50) NULL,
[Hash] [char](20) NULL,
CONSTRAINT [PK_Person] PRIMARY KEY CLUSTERED
(
[Id] ASC
)ON [PRIMARY]
)ON [PRIMARY]
You should also index the Hash column to improve the performance of searches by SSN, of course.
2. Whenever you insert a record, store the hash of the SSN along with the encrypted bytes. Your insert stored procedure should use the built-in HASHBYTES function, like this:
CREATE PROCEDURE [dbo].[usp_Insert_Person]
@ssn char(9),
@name varchar(50)
AS
BEGIN
insert into dbo.Person(SSN, Name,Hash)
values(ENCRYPTBYCERT(CERT_ID('certKey'), @ssn), @name, convert(char(20), HASHBYTES('SHA1', @ssn)))
END
3. When you search by SSN, compare the search term hash to the stored hash in your predicate:
CREATE PROCEDURE usp_Search_Person_BySsn
@ssn char(9)
AS
BEGIN
--SET NOCOUNT ON added to prevent extra result sets from
--interfering with SELECT statements.
SET NOCOUNT ON;
declare @hash char(20);
set @hash = convert(char(20), HASHBYTES('SHA1', @ssn));
select convert(char(9), DECRYPTBYCERT(cert_id('certkey'), ssn)) as SSN, name
from dbo.Person
where Hash= @hash;
END
Conclusion
This approach has worked well for me in my experimentation, and I recommend it to you if you need to search on an encrypted field. You should note, however, that the approach does have a limitation: you cannot search on an arbitrary substring of the encrypted text. In other words, a "begins with" or "contains" search is not possible. As long as you can live with this limitation, you should be able to employ this approach for your own encrypted data.
As always, I invite you, my reader, to leave a comment if you have found this useful or if you can think of an improvement.
EDIT: Per the comment from Tom, please hash the encrypted SSN, rather than the SSN itself (my original idea) to obtain a searchable char field. Hashing the SSN does not by itself provide enough security.