Data synchronization


Data synchronization is the process of establishing consistency between data in a source and a target data store, in either direction, and the continuous harmonization of the data over time. It is fundamental to a wide variety of applications, including file synchronization and mobile device synchronization, e.g., for PDAs.
Synchronization can also be useful in encryption, for example to synchronize public key servers.

File-based solutions

Tools for file synchronization, version control, distributed filesystems, and mirroring all attempt to keep sets of files synchronized. However, only version control and file synchronization tools can handle modifications to more than one copy of the files.
Several theoretical models of data synchronization exist in the research literature, and the problem is also related to the problem of Slepian–Wolf coding in information theory. The models are classified based on how they consider the data to be synchronized.

Unordered data

The problem of synchronizing unordered data is modeled as an attempt to compute the symmetric difference between two remote sets of b-bit numbers. Some solutions to this problem are typified by:
Wholesale transfer: In this case all data is transferred to one host for a local comparison.
Timestamp synchronization: In this case all changes to the data are marked with timestamps. Synchronization proceeds by transferring all data with a timestamp later than the previous synchronization.
Mathematical synchronization: In this case data are treated as mathematical objects and synchronization corresponds to a mathematical process.
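As an illustration, the timestamp and mathematical approaches can be sketched in a few lines of Python. The record layout and function names here are hypothetical, not taken from any particular tool:

```python
def timestamp_sync(source, target, last_sync):
    """Timestamp synchronization: copy to `target` every record in
    `source` whose timestamp is later than the previous synchronization."""
    for key, (value, modified) in source.items():
        if modified > last_sync:
            target[key] = (value, modified)

def symmetric_difference(source, target):
    """Mathematical synchronization treats the data as sets and computes
    which keys are held by exactly one of the two hosts."""
    return set(source) ^ set(target)
```

Wholesale transfer, by contrast, would simply copy every record to one host and compare locally, regardless of timestamps.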

Ordered data

In this case, two remote strings need to be reconciled. Typically, it is assumed that these strings differ by up to a fixed number of edits. Data synchronization is then the process of reducing the edit distance between them, down to the ideal distance of zero. This is applied in all filesystem-based synchronizations. Many practical applications of this are discussed or referenced above.
It is sometimes possible to transform the problem to one of unordered data through a process known as shingling.
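For example, a string can be shingled into its set of overlapping k-character substrings, after which the set-based techniques above apply. This minimal sketch assumes a fixed shingle length:

```python
def shingles(s, k=4):
    """Return the set of overlapping k-character substrings of s,
    turning an ordered string into an unordered set."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}
```

Two strings that differ by a few edits still share most of their shingles, so reconciling the shingle sets localizes the edited regions.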

Error handling

In fault-tolerant systems, distributed databases must be able to cope with the loss or corruption of their data. The first step is usually replication, which involves making multiple copies of the data and keeping them all up to date as changes are made. However, it is then necessary to decide which copy to rely on when loss or corruption of an instance occurs.
The simplest approach is to have a single master instance that is the sole source of truth. Changes to it are replicated to other instances, and one of those instances becomes the new master when the old master fails.
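A minimal sketch of this single-master scheme, with invented class and method names, might look like:

```python
class ReplicatedStore:
    """Sketch of single-master replication: all writes go through one
    master instance and are copied to replicas; on failure of the
    master, one replica is promoted in its place."""

    def __init__(self, n_replicas=2):
        self.master = {}
        self.replicas = [{} for _ in range(n_replicas)]

    def write(self, key, value):
        # Changes are applied to the master, the sole source of truth...
        self.master[key] = value
        # ...and then replicated to every other instance.
        for replica in self.replicas:
            replica[key] = value

    def fail_over(self):
        # When the master is lost, promote a replica to be the new master.
        self.master = self.replicas.pop(0)
```

Real systems must also handle writes that were in flight during the failure, which is where the consensus protocols below come in.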
Paxos and Raft are more complex protocols that exist to solve problems with transient effects during failover, such as two instances thinking they are the master at the same time.
Secret sharing is useful if failures of whole nodes are very common. This moves synchronization from an explicit recovery process to being part of each read, where a read of some data requires retrieving encoded data from several different nodes. If corrupt or out-of-date data may be present on some nodes, this approach may also benefit from the use of an error correction code.
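As a toy illustration, an n-of-n XOR splitting scheme (a much simpler relative of threshold schemes such as Shamir's secret sharing) already shows how a read comes to require data from several nodes; the function names here are illustrative:

```python
import functools
import os

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def split(secret, n):
    """Split `secret` into n shares, one per node; reconstruction needs
    all n shares, and any n-1 of them reveal nothing about the secret."""
    shares = [os.urandom(len(secret)) for _ in range(n - 1)]
    shares.append(functools.reduce(xor_bytes, shares, secret))
    return shares

def reconstruct(shares):
    """A read retrieves every share and XORs them back together."""
    return functools.reduce(xor_bytes, shares)
```

A threshold scheme would additionally tolerate missing shares, and an error correction code layered on top can detect or repair corrupt ones.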
Distributed hash tables (DHTs) and blockchains try to solve the problem of synchronization between many nodes.