Freetrade Head of Engineering, Invest, Tim Drew, shares how we scale our platform using Cloud Firestore
Cloud Firestore is Google’s premier NoSQL document-oriented database. In Google’s own words, it offers:
“automatic multi-region data replication, strong consistency guarantees, atomic batch operations, and real transaction support. We've designed Cloud Firestore to handle the toughest database workloads from the world's biggest apps”
At Freetrade we get an enormous amount of value from this robust, autoscaling database.
Particular points of note:
For all our love of Firestore, it does come with its challenges, many of which relate to the import/export tool. Below is a collection of lessons learnt on the road to productionising our use of Firestore.
Firestore paths are made up of an alternating sequence of collection and document IDs:
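As a rough illustration, the sketch below splits a path into its alternating parts. The path, the IDs in it and the helper name `split_path` are made up for this example; none of this is Firestore client code.

```python
from typing import List, Tuple


def split_path(path: str) -> List[Tuple[str, str]]:
    """Label each segment of a Firestore-style path as a collection or document ID."""
    segments = [segment for segment in path.split("/") if segment]
    # Segments at even positions (0, 2, ...) are collection IDs,
    # segments at odd positions (1, 3, ...) are document IDs.
    return [
        ("collection" if index % 2 == 0 else "document", segment)
        for index, segment in enumerate(segments)
    ]


# Hypothetical path: two collection IDs and two document IDs.
for kind, segment in split_path("/prices/abc/history/xyz"):
    print(f"{kind:<10} {segment}")
```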
It is very easy to make the mistake of thinking that the lower level collection IDs are “fully qualified” and isolated from others of the same name, but that’s not how it works.
Take, for example, the following data structures:
Here we have two history collections, with very different purposes and containing very different data.
At first glance this structure seems reasonable and the basics work fine: if you query /snapshots/<ID>/history you only get back snapshot documents, and if you query /prices/<ID>/history you only get price documents.
Makes sense, the two are totally separate, right? Wrong!
Under the hood in Firestore, these two history collections are actually one and the same. This can lead to a number of unexpected issues, for example:
The solution for all of this is simple, albeit ugly and against my preference for DRY. Your collection IDs should avoid generic names and instead should be unique per use case, like this:
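A rough way to see the difference is to group collection paths by their collection ID (the last path segment), which is the level at which features such as collection group queries operate. This is only a model of the naming, not of Firestore itself, and the `snapshot_history` and `price_history` names are hypothetical examples of use-case-specific IDs.

```python
from collections import defaultdict


def collection_id(collection_path):
    """Return the collection ID: the last segment of a collection path."""
    return collection_path.strip("/").split("/")[-1]


def group_by_collection_id(collection_paths):
    """Group collection paths by collection ID, ignoring their parent documents."""
    groups = defaultdict(list)
    for path in collection_paths:
        groups[collection_id(path)].append(path)
    return dict(groups)


# Generic name: both subcollections share the collection ID "history".
generic = ["/snapshots/s1/history", "/prices/p1/history"]

# Hypothetical use-case-specific names: each subcollection has its own ID.
unique = ["/snapshots/s1/snapshot_history", "/prices/p1/price_history"]

print(group_by_collection_id(generic))
print(group_by_collection_id(unique))
```

With the generic name, both paths land in a single `history` group; with the specific names, each path gets a group of its own.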
Firestore does not provide any kind of automated backup service. The closest thing they have is the import/export tool, which you have to orchestrate yourself.
This tool lets you take a backup of the database relatively easily. However, it is important to realise that it is not a delta: it is a full backup of every document in the database, and you are billed for a read on every one of those documents!
At first glance, the costs of Firestore look pretty affordable (and they are), with reads costing a tiny $0.036 per 100k documents. As your database size increases, though, these costs really start to stack up.
Costs per backup (assuming 1 backup per day):
10M doc backup -> $3.60/day -> $109.20/month
100M doc backup -> $36.00/day -> $1,092.00/month
500M doc backup -> $180.00/day -> $5,460.00/month
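The arithmetic behind these figures can be sketched as follows. The read price is the one quoted above; the month length of 364/12 (about 30.33) days is an assumption on our part, chosen only because it reproduces the monthly totals shown. The function name `backup_cost` is made up for this sketch.

```python
READ_PRICE_PER_100K = 0.036  # USD per 100k document reads, as quoted above
DAYS_PER_MONTH = 364 / 12    # assumption: ~30.33 days, matches the monthly totals above


def backup_cost(doc_count, backups_per_day=1):
    """Return (cost per day, cost per month) in USD for full backups of doc_count documents."""
    # Every backup reads every document once.
    per_day = doc_count / 100_000 * READ_PRICE_PER_100K * backups_per_day
    return per_day, per_day * DAYS_PER_MONTH


for docs in (10_000_000, 100_000_000, 500_000_000):
    per_day, per_month = backup_cost(docs)
    print(f"{docs:>11,} docs: ${per_day:,.2f}/day, ${per_month:,.2f}/month")
```

Passing a larger `backups_per_day` scales the cost linearly, so four backups a day of a 500M document database would come to roughly $720 per day under the same assumptions.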
If, like us, your risk tolerance means you need to back up your data multiple times per day and regulations mean you need to retain data for years, then the multiples can quickly make the cost of keeping everything in Firestore unsustainable, especially if your data design is biased towards lots of small documents.
As a result of the above I would strongly recommend you consider what data you need to hold onto and what you can delete outright. For historical data that needs to be retained, consider developing an archival process where you move documents to a cheaper long-term storage medium.
It has taken non-trivial effort, but by keeping our Firestore dataset to a lean record of current data, we have kept the costs very reasonable. In future we may look at rolling our own streaming backup utility, but I really hope that Google comes up with a managed service before we resort to that.
Some of our initial data structures involved dynamically created collection IDs. On the face of it this seemed a perfectly reasonable design decision, for example partitioning data by date:
In this example 2020-01-01 and 2020-01-02 are distinct collection IDs, with collections of documents that sit underneath them.
This approach works seamlessly and Firestore allows you to implicitly declare collections just by creating the documents underneath them, something Google themselves highlight.
The devil, however, is in the detail. While this kind of structure might seem perfectly functional when you first start using it, there are problems:
We recommend you avoid any kind of dynamic collection naming and keep the number of unique collections in the tens or low hundreds. In some cases this has meant we’ve needed to introduce superfluous intermediate collections and documents to avoid the dynamic collection IDs. For example, the following structure shifts the dynamic date ID to be on a document ID, rather than a collection ID.
Ugly but effective.
You absolutely wouldn’t do this intuitively unless you knew about the issues related to dynamic collection IDs.
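To make the effect of that restructuring concrete, the sketch below counts the distinct collection IDs produced by each layout over a year of daily data. The `days` and `entries` collection names and the placeholder document IDs are hypothetical; our real structure differs in its details.

```python
from datetime import date, timedelta


def dynamic_path(price_id, day, doc_id):
    """Date used as a collection ID: every new day creates a new collection ID."""
    return f"/prices/{price_id}/{day.isoformat()}/{doc_id}"


def static_path(price_id, day, doc_id):
    """Date moved to a document ID under fixed (hypothetical) collection names."""
    return f"/prices/{price_id}/days/{day.isoformat()}/entries/{doc_id}"


def collection_ids(path):
    """Collection IDs are the segments at even positions in the path."""
    segments = path.strip("/").split("/")
    return set(segments[0::2])


days = [date(2020, 1, 1) + timedelta(days=n) for n in range(365)]

dynamic_ids = set()
static_ids = set()
for day in days:
    dynamic_ids |= collection_ids(dynamic_path("p1", day, "doc"))
    static_ids |= collection_ids(static_path("p1", day, "doc"))

print(len(dynamic_ids), "collection IDs with dates as collections")
print(len(static_ids), "collection IDs with dates as documents")
```

In the first layout the number of collection IDs grows by one every day; in the second it stays fixed no matter how much data is added.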
Firestore paths are made up of an alternating sequence of collection and document IDs:
But it is entirely possible and valid that your path might have “missing” documents at the intermediate levels. For example, our event-sourcing data structure looks something like this:
In this example /clients/<CLIENT_ID> is a document that doesn’t actually exist; it is just an intermediate part of the path used to partition the events by client.
This structure works fine when you’re directly addressing an individual client, but is problematic when you want to enumerate all clients.
A key detail about listDocuments is that it does not give you any hooks for pagination; it simply returns all documents. This clearly won’t scale forever, and indeed we saw this method starting to fail once our collections grew to tens of thousands of documents.
This is more dangerous than it might seem at first. If your document IDs aren’t predictable (UUIDs, for example), then you could very easily be left with data in your database that you can’t find.
As a result of the above, we recommend either designing a data structure that doesn’t involve “missing” documents, or maintaining a separate collection of concrete documents that you can paginate through.
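The second option can be sketched with an in-memory stand-in for the database. `FakeStore` is not the Firestore API, and the `/client_index` collection, the page size and the helper names are all hypothetical. The idea is simply that every write of an event also writes a concrete marker document, which can then be paged through with a cursor.

```python
from typing import Iterator, List, Optional


class FakeStore:
    """In-memory stand-in for a document store. This is NOT the Firestore API."""

    def __init__(self):
        self.docs = {}

    def set(self, path, data):
        self.docs[path] = data

    def page(self, collection: str, page_size: int,
             start_after: Optional[str] = None) -> List[str]:
        """Return up to page_size IDs of documents that exist directly under collection."""
        prefix = collection.rstrip("/") + "/"
        ids = sorted(
            path[len(prefix):]
            for path in self.docs
            if path.startswith(prefix) and "/" not in path[len(prefix):]
        )
        if start_after is not None:
            ids = [doc_id for doc_id in ids if doc_id > start_after]
        return ids[:page_size]


def record_event(store, client_id, event_id, payload):
    """Write the event, plus a concrete marker document for the client."""
    store.set(f"/clients/{client_id}/events/{event_id}", payload)
    store.set(f"/client_index/{client_id}", {"exists": True})


def all_clients(store, page_size=2) -> Iterator[str]:
    """Enumerate client IDs one page at a time, using the last ID as the cursor."""
    cursor = None
    while True:
        batch = store.page("/client_index", page_size, start_after=cursor)
        if not batch:
            return
        yield from batch
        cursor = batch[-1]


store = FakeStore()
record_event(store, "c1", "e1", {"type": "created"})
record_event(store, "c1", "e2", {"type": "updated"})
record_event(store, "c2", "e1", {"type": "created"})
record_event(store, "c3", "e1", {"type": "created"})

print(store.page("/clients", 10))  # no concrete documents directly under /clients
print(list(all_clients(store)))    # every client found via the marker collection
```

The cost of this design is an extra write per event (or at least per new client), in exchange for an enumeration that can be paginated and resumed.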
As mentioned at the outset, we at Freetrade get an enormous amount of value from Firestore; it provides so much power out of the box.
As we’ve listed, however, there are a number of gotchas that are not readily apparent when you first start out with the database. Hopefully you can save yourself some pain by learning from our mistakes.