Ready Room

Miles and Miles of Files

Post by
Peter Lacey

There’s a design principle in the software development game that states “you ain’t gonna need it,” or YAGNI. The intent of YAGNI is to discourage engineers from building functionality today in anticipation of needing it tomorrow. Circumstances change, information is incomplete, and there are always known features of higher priority. In other words, don’t build it now; build it if and when you need it.

It was clear from the beginning that Ready Room needed to deal with files. What was less clear was the scale. How many files? How large? How frequently accessed? Tens of files per inspection totaling hundreds of megabytes was no big deal; we could store those in the database. The idea of using the database as a document store had a lot going for it. It would be robust, secure, reasonably scalable, and easy to implement. Ease of development, in turn, would allow us to move on to other critical features. Based on what we knew, what we didn’t know, and decades of experience, our engineers correctly decided on a database-backed file store for the initial release of the product.

In Ready Room’s first day of production usage our customers uploaded 475 files totaling 1.1 GB.

Not outrageous, but more than anticipated. Extrapolating forward, it was clear that a more scalable solution would be needed sooner rather than later. Fortunately, we already knew what the better solution was; we just had to build it.

Cloud Object Storage

In this day and age, any application that is managing significant amounts of unstructured data is likely storing it in a cloud-based “object store,” such as AWS S3, Google Cloud Storage (GCS), or Azure Storage. These object stores have a number of interesting properties. First and foremost, they provide essentially infinite storage. You can throw terabytes of data at the things, and they will just suck it up. Furthermore, they are extremely durable; once a file is committed to an object store it is not likely to be lost due to hardware failure (human error, of course, is a whole other thing). In fact, Google touts “99.999999999% annual durability.” That’s 11 nines. Which means that if you store 10 million objects in GCS, you can expect to lose one of them every 10,000 years! Since Ready Room was already running on Google Compute Engine, we opted to go with Google Cloud Storage.
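For anyone who wants to check that math, here is a quick back-of-the-envelope sketch. It assumes the durability figure can be read as an independent, per-object chance of loss each year, which is a simplification of whatever model Google actually uses.

```python
from fractions import Fraction

# Eleven nines of annual durability leaves a 1-in-10^11 chance of losing
# any given object in a year (assumed independent per object).
annual_loss_probability = Fraction(1, 10**11)
objects_stored = 10_000_000

expected_losses_per_year = objects_stored * annual_loss_probability
years_per_expected_loss = 1 / expected_losses_per_year

print(expected_losses_per_year)  # 1/10000
print(years_per_expected_loss)   # 10000
```

Exact fractions are used instead of floats so the result comes out as a clean 10,000 rather than something like 9999.999.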

A naive implementation of GCS-backed file management in Ready Room would be to simply modify the backend code to shoot the file over to the object store instead of writing it to the database. That would, indeed, be simple, but it suffers from two big problems. One, it almost doubles the time it takes to upload a file. That is, the file first has to go from the user to Ready Room and then from Ready Room to GCS. And two, it consumes gobs of system resources: RAM, CPU, and bandwidth, to process a file that the system is just going to turn around and jettison. Resources that cost money and are now not available to other users. And all this is true when retrieving a file as well. No, we needed a way to get these files to GCS without proxying them through Ready Room’s servers.

Easy-peasy, you might think. Instead of connecting from the browser to Ready Room when processing files, connect directly to GCS instead. Sure, but how does the user authenticate? After all, this storage isn’t publicly accessible; it contains extremely sensitive information. The user needs to be able to log in before they can read or write a file. And we can’t simply send the authentication credentials to the browser. If we did that, anyone could get to them, opening a security hole the size of a truck. To address this common requirement, the object store vendors all provide for the use of “signed URLs.”

Authentication via Cryptographically Signed URLs

On request, cloud vendors can provide their customers with a private key: a long string of seemingly random bytes that can be used to gain access to the object store. When that key is generated, the vendor will also create a corresponding public key that they hold on to. This public/private key-pair has an interesting mathematical property: data encrypted using the private key can be decrypted with the public key.

In computing there is also the notion of a secure hash function. A secure hash function can generate a fingerprint that is, for all practical purposes, unique to any piece of data. Here, for instance, is a fingerprint for the phrase “Ready Room”:

347058ed03730a16153a7526df80eea0fa3f5cdc419af569108c475c23d7edef

The interesting thing about secure hash algorithms is that they are extremely sensitive to the original input. Change just one bit and you get an entirely different hash. Here’s the fingerprint of “ReadyRoom” (no space):

85fb501e73859297043608c1d96a3d14a89515906469789ea4593a96cdbcd517
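If you want to see this sensitivity for yourself, here is a small sketch using Python’s standard `hashlib`. It assumes SHA-256, which produces 64 hex characters like the fingerprints above, but the exact algorithm and input encoding used for those fingerprints aren’t stated here, so the output may not match them character for character.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return the SHA-256 digest of text as 64 hex characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


with_space = fingerprint("Ready Room")
without_space = fingerprint("ReadyRoom")

print(with_space)
print(without_space)

# Count how many of the 64 hex positions differ between the two digests.
differing = sum(a != b for a, b in zip(with_space, without_space))
print(f"{differing} of {len(with_space)} hex characters differ")
```

The `fingerprint` helper is a name made up for this illustration. The point is simply that two nearly identical inputs produce digests with no visible relationship to each other.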

With these two concepts, public key cryptography and secure hash algorithms, we can solve our problem of authenticating to GCS from a client browser.

When a Ready Room user wants to store a file, information about the inspection, the request, and the file is sent to the backend. Ready Room uses this information to construct a URL, such as:

https://storage.googleapis.com/6c398d6f-345b-462c-8726-a96558bb99eb/23/somefile.pdf

Then it hashes that URL (and a few other choice pieces of information) to generate a fingerprint and encrypts that fingerprint with the private key. The resulting “signed” URL can now be safely sent back to the browser. It looks something like this:

https://storage.googleapis.com/6c398d6f-345b-462c-8726-a96558bb99eb/23/somefile.pdf?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=user%40project.iam.gserviceaccount.com%2F20200609%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20200614T115359Z&X-Goog-Expires=900&X-Goog-SignedHeaders=host&X-Goog-Signature=15fbdd7d7b2dd60ddcae409169f85243da2e201c4ff5d7bd1fcdb199d06c29bad8ee81c9baa4da2310aae16eee436fa54055e7d8ab4c5fdc11c84c5ea65075c8a6bfb6383df18b4f27f6ddc1a9c2774a4d4de4a84cb18e3d954fa311c6e2c494243725a9890f6291d84269fe6aea1555194790020049492c2006dce44a674fb35857433c0111857f8ff6d3b88c77118500daeb16b9274f3d10ecc36f8eed695376e0280b00c772ab0d2c6753314acc80dad6a077956231ba313c6ad214adcfe14db6f217bdaa410b26c5ceace2021b2e4af9ac9bb58e95f83af7c391cb937fa67aff8e07f0d4fe5e98ac130a7ba5ed4a302faca743e73d3643318dc565c06fa9

That string of seemingly random characters at the end is the signature; the magnetic ink if you will. It is the encrypted and encoded fingerprint of the URL components that precede it. Don’t worry, that URL won’t actually resolve to a file. Not only did I monkey with the URL components, thus invalidating the fingerprint, but these signed URLs can also be set to expire. In this case, as you can see if you squint, after 900 seconds, i.e., 15 minutes. It’s also a “write” URL and cannot be used to read a file.
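If squinting isn’t your thing, the expiry can be pulled out of the query string programmatically. The sketch below uses only the Python standard library; the URL is a shortened stand-in with a placeholder bucket and signature, and `expiry_of` is a hypothetical helper, not part of any Google library.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit

# Shortened stand-in for a signed URL; bucket and signature are placeholders.
signed_url = (
    "https://storage.googleapis.com/example-bucket/23/somefile.pdf"
    "?X-Goog-Algorithm=GOOG4-RSA-SHA256"
    "&X-Goog-Date=20200614T115359Z"
    "&X-Goog-Expires=900"
    "&X-Goog-SignedHeaders=host"
    "&X-Goog-Signature=placeholder"
)


def expiry_of(url: str) -> datetime:
    """Return the UTC moment at which the signed URL stops being valid."""
    params = parse_qs(urlsplit(url).query)
    issued = datetime.strptime(params["X-Goog-Date"][0], "%Y%m%dT%H%M%SZ")
    issued = issued.replace(tzinfo=timezone.utc)
    lifetime = timedelta(seconds=int(params["X-Goog-Expires"][0]))
    return issued + lifetime


print(expiry_of(signed_url).isoformat())  # 2020-06-14T12:08:59+00:00
```

Because the timestamp and lifetime are among the signed components, a client can read them but cannot extend them without invalidating the signature.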

Cutting out the middleman

When this URL is returned to the browser, the browser can use it as the actual location to post the file. On receipt, Google will use the corresponding public key to decrypt the signature and get the fingerprint. If that step succeeds, then Google knows that the request came from an authorized account since only the holder of the private key could have encrypted the payload. Google then hashes the plaintext parts of the URL, and if this generates the same fingerprint as was just decrypted, then Google knows that the URL was not modified in transit. Now satisfied that the request is legitimate, GCS will accept the file for storage, consuming zero additional Ready Room resources!
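To make the sign-then-verify dance concrete, here is a deliberately tiny sketch of the idea. It is a toy and not how GCS is actually implemented: it uses textbook RSA with a laughably small key (the primes 61 and 53), no padding, and a fingerprint squeezed down to fit that key. Real signatures use full-size keys and a padding scheme. The function names are made up for illustration.

```python
import hashlib

# Toy RSA key pair built from the primes 61 and 53. Insecure by design.
N = 61 * 53   # public modulus (3233)
E = 17        # public exponent, known to the verifier
D = 2753      # private exponent, kept by the signer: (E * D) % (60 * 52) == 1


def fingerprint(message: str) -> int:
    """SHA-256 of the message, reduced mod N so the toy key can handle it."""
    digest = hashlib.sha256(message.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % N


def sign(message: str) -> int:
    """Signer side: transform the fingerprint with the private exponent."""
    return pow(fingerprint(message), D, N)


def verify(message: str, signature: int) -> bool:
    """Verifier side: undo the transform with the public exponent, then compare."""
    return pow(signature, E, N) == fingerprint(message)


url = "https://storage.googleapis.com/example-bucket/23/somefile.pdf"
signature = sign(url)

print(verify(url, signature))            # True
print(verify(url, (signature + 1) % N))  # False: a forged signature fails

# A modified URL will almost always fail too; with a modulus this tiny,
# an accidental fingerprint collision is possible, so no output is promised.
print(verify(url.replace("somefile", "otherfile"), signature))
```

The verifier never needs the private exponent `D`; the public pair `(N, E)` is enough to check that the signature matches the fingerprint it computes for itself.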

Something similar happens when retrieving a file. The user clicks a link, the client sends the link data to Ready Room, Ready Room generates a signed URL and returns it to the client, the client follows the signed URL, Google validates the signature and sends the file directly from GCS.

Now, I know what you’re probably thinking. You’re probably thinking “What time is it? How long have I been asleep?” But once you’ve shaken off your stupor, you may also be thinking, “All well and good, but what’s in it for me?” Lots! Not only do you get those 11 nines of durability, but uploads and downloads will be faster, files can be larger, and the system as a whole will be more responsive. In fact, because of our switch to GCS, we’ve doubled the maximum allowed file size from 50 MB to 100 MB. As always, there is no limit on the number of files that can be stored.

GCS storage is available now. Get Ready!