Deciding between CRDTs and OT for data synchronization
CRDTs and operational transformation (OT) algorithms are two of the most popular approaches to automatic conflict resolution. Which one fits your use case best?
Table of contents
Do you even need them?
OT and CRDTs solve the problem of data conflicts (aka concurrent writes), but come with their own set of issues. If it’s possible to avoid data conflicts in the first place, then you should generally do so. In short:
- Data conflicts can’t happen if data only has one owner. A singular database, partitioning, and single-leader replication all fall under this.
- If data has multiple owners, you might still have options:
- Locking. Slow and hard to implement in distributed contexts.
- Simple overwriting strategies, such as last-write-wins (LWW) and node hierarchies. Results in data loss.
How about situations where you do need these algorithms? For example, you might have a multi-leader distributed database system, where each DB server is a node that writes to the same data. Or as another example, you might be writing a collaborative or offline-first application (a la Google Docs). Clients are also considered to be nodes in this system, because they apply their local changes immediately (to a client-side DB) and only later synchronize them with others.
Which one should you use?
Ease of use
I’ll keep this short: CRDT libraries are easier to use. If your application doesn’t have any special requirements and you aren’t sure what to use, pick CRDTs. I’d recommend either Y.js or Automerge, but suggest checking out ElectricSQL and Loro as well.
Open-source OT libraries tend to have a smaller ecosystem, less support, as well as fewer built-in integrations with other technologies. Not to mention that even minimal OT algorithms tend to be really complicated and unintuitive compared to CRDTs.
Of course, this is just an opinion on the internet. The rest of the article focuses on the more nuanced differences in regards to specific use cases.
Complexity and offline support
There are two types of OT algorithms. The more popular ones (usually based on Jupiter) require a central synchronization server. Their complexity is typically , where is the amount of operations to synchronize. There also exist distributed OT algorithms, but they suffer from higher complexity 1. In practice, centralized OT algorithms are most often used between the two (alongside partitioning between OT servers if needed).
Modern CRDTs, on the other hand, tend to enjoy complexities like and when synchronizing changes.
If you need to synchronize data in a distributed manner, then this complexity problem might already be driving you towards CRDTs. However, there’s a great paper in favor of OTs that I recommend reading 2. The authors dispel a lot of myths about OT often made by proponents of CRDTs. Amongst them, they correctly point out that complexity isn’t an issue in real-time synchronization scenarios, because is a really small number.
But I’d emphasize that in scenarios where can in fact grow large, the complexity of OT algorithms may still be a concern 3. Consider situations where clients can lose connection with the other nodes for prolonged periods of time but still continue writing data. An example of this would be applications with offline support, such as PWAs. If you expect a lot of offline changes, CRDTs may be the better option 4.
Performance
There is another interesting point on performance: OTs are faster at upstream updates, but CRDTs are typically faster at downstream updates. Upstream execution time is how long it takes for changes to be applied locally, whilst downstream execution time is the time it takes for changes to be synchronized with other users.
OT upstream updates are an immediate , because all of the transformations required for synchronization happen elsewhere. CRDTs, on the other hand, must transform local operations into the CRDT’s data structure.
Bad upstream responsiveness is more noticable to users compared to it taking long for external changes to arrive and synchronize. Therefore, in most cases, this is a point in favor of OT.
Text and the interleaving anomaly
This is a point concerning those who wish to synchronize text intuitively.
A common problem when it comes to text synchronization is the interleaving anomaly. If one user writes “foo” and another writes “bar” at the same time, then an intuitive result after synchronization would be either “foobar” or “barfoo”.
However, some synchronization algorithms may end up synchronizing the changes as something like “fboaor”, where the two words are interleaved. There are stronger and weaker versions of this anomaly: some can only appear in very certain situations that are unlikely to happen.
The interleaving anomaly appears in both CRDTs and OT algorithms, described in detail here. The strong forms can appear in popular OTs based on Jupiter, adOPTED and SOCT2. Only the weaker forms appear in CRDT libraries such as Automerge. There also exists a newer CRDT library called Loro, which utilizes the Fugue algorithm to avoid the interleaving anomaly entirely.
In this area, CRDTs seem like the clear winner.
Footnotes
-
It’s also a bit of an oversimplification, as more input variables are often involved. See “Evaluating CRDTs for Real-time Document Editing” for complexity analysis of CRDT algorithms. ↩
-
The same authors also wrote the Operational Transformation FAQ, which is a great resource for any questions regarding OT that you may have. You may also have heard of the correctness arguments against OT, which they debunk in both their papers and the FAQ. ↩
-
There are some techniques to alleviate this problem, such as compressing multiple local operations together. Other than synchronization time, clients must also buffer all operations until they’re synchronized, which may pose a problem in environments with storage limitations. ↩
-
Operation-based CRDTs run into a similar problem as OT algorithms: you need to store changes in a local operations buffer until they can be synchronized. Automerge, an operation-based CRDT library, deals with this by compressing its local operations buffer. ↩