Use Case
I've just started working with Amazon's S3 buckets to hold a centralised filesystem that
supports a distributed workflow system. The tasks in the workflow run on different
physical machines in a variety of locations, so we need an efficient way of synchronising
just small sub-sections of the local files with a bucket.
The Plan
Amazon's API allows listing objects by a key prefix, i.e. searching for all the
files in a particular folder or its sub-folders. This is a great way of synchronising
folders that might contain sub-folders; however, we also need to list the same
files from the local file system.
The second task is then comparing the files. In our system the synchronisation is only
performed in one direction at a time (pull or push), and therefore we can calculate
which files have been:
- created (if it doesn't exist on the destination)
- deleted (if it doesn't exist on the source)
- modified (if the MD5 of the local file doesn't match the ETag on Amazon)
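The three cases above can be sketched as a comparison over two dictionaries mapping each key to its hash. This is only a minimal sketch, not the code from our system: the SyncPlan class and its Compare method are hypothetical names, and both sides are assumed to already hold plain hex MD5 strings in the same case.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: the names are illustrative, not part of any Amazon API.
public static class SyncPlan
{
    // Compares source and destination listings (key -> hex MD5) for a one-way sync.
    public static void Compare(
        IDictionary<string, string> source,
        IDictionary<string, string> destination,
        out List<string> created,
        out List<string> deleted,
        out List<string> modified)
    {
        // Created: present on the source but missing from the destination.
        created = source.Keys
            .Where(key => !destination.ContainsKey(key))
            .OrderBy(key => key, StringComparer.Ordinal)
            .ToList();

        // Deleted: present on the destination but no longer on the source.
        deleted = destination.Keys
            .Where(key => !source.ContainsKey(key))
            .OrderBy(key => key, StringComparer.Ordinal)
            .ToList();

        // Modified: present on both sides but with different hashes.
        modified = source.Keys
            .Where(key => destination.ContainsKey(key) && destination[key] != source[key])
            .OrderBy(key => key, StringComparer.Ordinal)
            .ToList();
    }
}
```

For a push the local listing is the source and the bucket is the destination; for a pull the two are swapped.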
Implementation
Get the current amazon file list
I'm using Amazon's own .NET API for this example. The first task is to request all
the objects within a particular folder. First we create the S3 client:
AmazonS3Client client = new AmazonS3Client("awsAccessKeyId", "awsSecretAccessKey");
Then we get the files (S3 objects) under the desired folder using a ListObjectsRequest
(a single call returns at most 1,000 objects, so larger folders need further requests),
pulling the keys and their corresponding ETags out into a dictionary for later:
ListObjectsResponse folderObjects = client.ListObjects(new ListObjectsRequest()
{
    BucketName = "dbradley-test-bucket",
    Prefix = "test/folder"
});
Dictionary<string, string> remoteObjects =
    folderObjects.S3Objects.ToDictionary(obj => obj.Key, obj => obj.ETag);
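One thing to watch before comparing these values: S3 ETags may come back wrapped in double quotes, and for objects uploaded in multiple parts the ETag is not a plain MD5 of the content (it usually carries a "-N" suffix). The sketch below shows one way to tidy the value up first. ETagHelper, Normalize and LooksLikeMd5 are hypothetical names of my own, not part of the Amazon SDK.

```csharp
using System;
using System.Linq;

// Hypothetical helper for tidying ETag strings before comparing them to local MD5s.
public static class ETagHelper
{
    // Strips surrounding whitespace and double quotes, then lower-cases the value.
    public static string Normalize(string etag)
    {
        if (etag == null) throw new ArgumentNullException("etag");
        return etag.Trim().Trim('"').ToLowerInvariant();
    }

    // True only when the normalised value is 32 hex characters, i.e. could be an MD5.
    // Multipart ETags such as "abc...-2" fail this check and need different handling.
    public static bool LooksLikeMd5(string normalized)
    {
        return normalized.Length == 32 && normalized.All(Uri.IsHexDigit);
    }
}
```

Keys whose ETag fails the LooksLikeMd5 check cannot be compared by MD5 alone, so it is probably safest to treat them as modified.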
Get the current local file list
Getting the local files in a similar format takes a little more work, since we need
the path of every file in every sub-folder in the same form as an S3 key.
The approach is therefore going to be to implement a
recursive function to dig down into all the sub-directories.
The output of this function needs to be something that's comparable with the previous
result from the Amazon bucket - a dictionary mapping the file path to its MD5 hash.
The first step is to be able to generate an "Amazon compatible" checksum
of a file. We can use the ComputeHash function of the MD5CryptoServiceProvider
class. This can simply be passed a stream and will return the hash as a byte array.
To turn this byte array into a hex-encoded string we use the BitConverter
ToString method, then strip the dashes and lower the case so that it will
match the ETag returned by Amazon.
Note: There's probably a more efficient method of doing the conversion from byte
array to hex, but this will do for now!
Therefore the hashing function looks something like:
string hash = BitConverter.ToString(crypto.ComputeHash(fileStream)).Replace("-", string.Empty).ToLower();
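As a self-contained sketch of the same conversion, the helper below hashes any stream and returns lower-case hex. Md5Hex, FromStream and FromText are hypothetical names, and I've swapped in MD5.Create() for MD5CryptoServiceProvider, which newer .NET versions flag as obsolete. The hashing and string handling are otherwise the same as in the line above.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper wrapping the hash-to-hex conversion described above.
public static class Md5Hex
{
    // Hashes everything from the stream's current position to its end.
    public static string FromStream(Stream stream)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(stream);
            // BitConverter gives "5D-41-40-..."; strip the dashes and lower the case.
            return BitConverter.ToString(hash).Replace("-", string.Empty).ToLowerInvariant();
        }
    }

    // Convenience overload for quick checks: hashes the UTF-8 bytes of a string.
    public static string FromText(string text)
    {
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(text)))
        {
            return FromStream(stream);
        }
    }
}
```

For example, Md5Hex.FromText("hello") gives 5d41402abc4b2a76b9719d911017c592, the well-known MD5 of that string.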
The next consideration is the time it takes to calculate these hashes. Even the
most efficient MD5 implementations take a significant time to run,
especially with big files. Therefore, rather than returning a dictionary of file
paths mapping to the actual MD5 hash strings, we will return the paths mapping
to a function which, only when run, will return the MD5 hash of the given file.
We can define this using an anonymous delegate which takes no input:
delegate
{
    using (var stream = file.OpenRead())
    {
        return BitConverter.ToString(crypto.ComputeHash(stream)).Replace("-", string.Empty).ToLower();
    }
}
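To illustrate the deferred behaviour, here is a small sketch in which the "hash" is a stand-in that simply counts how often it runs. LazyHashDemo, Build and ExpensiveHash are hypothetical names. Wrapping the call in Lazy<string> is an addition of mine so that a hash requested twice is only computed once; the delegate above recomputes it on every call.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo of mapping keys to functions that compute a hash on demand.
public static class LazyHashDemo
{
    // Counts how many times the expensive work actually ran.
    public static int HashCalls = 0;

    // Stand-in for the real MD5 computation, which would read the whole file.
    private static string ExpensiveHash(string key)
    {
        HashCalls++;
        return "hash-of-" + key;
    }

    public static Dictionary<string, Func<string>> Build(IEnumerable<string> keys)
    {
        var result = new Dictionary<string, Func<string>>();
        foreach (string key in keys)
        {
            string captured = key;
            // Lazy<T> runs the factory on first access and caches the result.
            var lazy = new Lazy<string>(() => ExpensiveHash(captured));
            result[captured] = () => lazy.Value;
        }
        return result;
    }
}
```

Building the dictionary does no hashing at all. The cost is only paid for keys that exist on both sides and therefore need their hashes compared.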
Going back to the recursive function, we need to make sure that the file keys match
those on Amazon. Amazon paths look something like "test/folder/file.txt",
and therefore we need to make all of our local paths relative to a specific folder.
We will define two root functions for simplicity:
- Get all the files within a directory (and assume that the given directory is the
root directory on Amazon).
- Get all the files within a directory and specify the current directory's path
on Amazon.
Each of these functions will then call the internal recursive method. This internal
method simply returns the keys and hash functions of each file in its
current directory combined with the keys and hash functions of each of its
sub-directories.
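Leaving the hashing aside, the recursion itself can be sketched in a few lines. KeyWalker and GetKeys are hypothetical names, and the sketch assumes the prefix passed in is either empty or already ends with "/".

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper that lists S3-style keys for every file under a directory.
public static class KeyWalker
{
    // prefix must be empty or end with "/", e.g. "test/folder/".
    public static IEnumerable<string> GetKeys(DirectoryInfo directory, string prefix)
    {
        // Files directly inside this directory become "<prefix><file name>".
        IEnumerable<string> files = directory.EnumerateFiles().Select(f => prefix + f.Name);

        // Each sub-directory is walked with its own name appended to the prefix.
        IEnumerable<string> nested = directory.EnumerateDirectories()
            .SelectMany(d => GetKeys(d, prefix + d.Name + "/"));

        return files.Concat(nested);
    }
}
```

Forward slashes are used regardless of the local path separator, because the keys have to match those stored on Amazon.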
Bringing it all together
So, finally, here's the code to get a local directory as a set of Amazon-compatible
paths, each mapping to a function that returns the Amazon-compatible MD5 hash.
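using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static Dictionary<string, Func<string>> GetLocalFileKeys(DirectoryInfo directory)
{
    // The given directory is treated as the root on Amazon.
    return GetLocalFileKeys(directory, string.Empty);
}

public static Dictionary<string, Func<string>> GetLocalFileKeys(DirectoryInfo directory, string rootPath)
{
    // Make sure a non-empty root path ends with exactly one "/", so keys come out
    // as "test/folder/file.txt" rather than "test/folderfile.txt".
    string prefix = string.IsNullOrEmpty(rootPath) ? string.Empty : rootPath.TrimEnd('/') + "/";
    return GetLocalFileKeys(directory, prefix, new MD5CryptoServiceProvider())
        .ToDictionary(kvp => kvp.Key, kvp => kvp.Value);
}

private static IEnumerable<KeyValuePair<string, Func<string>>> GetLocalFileKeys(
    DirectoryInfo directory, string currentPath, MD5CryptoServiceProvider crypto)
{
    if (directory == null)
        throw new ArgumentNullException("directory", "directory is null.");

    // currentPath is either empty or ends with "/", so keys never start with a
    // slash or contain a double slash.
    return directory.EnumerateFiles().Select(
            file => new KeyValuePair<string, Func<string>>(
                currentPath + file.Name,
                delegate
                {
                    // The file is only read and hashed when this function is invoked.
                    using (var stream = file.OpenRead())
                    {
                        return BitConverter.ToString(crypto.ComputeHash(stream))
                            .Replace("-", string.Empty).ToLower();
                    }
                }))
        // Concat rather than Union: every key is a unique path, so there is
        // nothing to de-duplicate.
        .Concat(
            directory.EnumerateDirectories().SelectMany(
                childDir => GetLocalFileKeys(childDir, currentPath + childDir.Name + "/", crypto)));
}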
One observation about the internal function is that it uses an IEnumerable of KeyValuePair
rather than an actual dictionary. This is because a dictionary cannot have a whole
collection of new pairs added to it at once, which is what we need when calling the function
recursively so that the results are presented as one flat collection.