Parallel-TEBD


The parallel-TEBD is a version of the TEBD algorithm adapted to run on multiple hosts. The task of parallelizing TEBD could be achieved in various ways.
This article introduces the conceptual basis of the implementation, using MPI-based pseudo-code for exemplification, while not restricting itself to MPI - the same basic schema could be implemented with the use of home-grown messaging routines.

Introduction

The TEBD algorithm is a good candidate for parallel computing because the exponential operators used to calculate the time-evolution factorize under the Suzuki-Trotter expansion. A detailed presentation of the way TEBD works is given in the main article. Here we concern ourselves only with its parallel implementation.

Implementation

For our purposes, we will use the canonical form of the MPS as introduced by Guifré Vidal in his original papers. Hence, we will write the function of state as:
This function describes a N-point lattice which we would like to compute on P different compute nodes. Let us suppose, for the sake of simplicity, that N=2k*P, where k is an integer number. This means that if we distribute the lattice points evenly among the compute nodes, an even number of lattice points 2k is assigned to each compute node. Indexing the lattice points from 0 to N-1 and the compute nodes from 0 to P-1, the lattice points would be distributed as follows among the nodes:
NODE 0: 0, 1,..., 2k-1
NODE 1: 2k, 2k+1,..., 4k-1
...
NODE m: m*2k,..., *2k - 1
...
NODE P-1: *2k,..., N-1
Using the canonical form of the MPS, we define as "belonging" to node m if m*2k ≤ l ≤ *2k - 1. Similarly, we use the index l to assign the to a certain lattice point. This means that
and, belong to NODE 0, as well as. A parallel version of TEBD implies that the computing nodes need to exchange information among them. The information exchanged will be the MPS matrices and singular values lying at the border between neighbouring compute nodes. How this is done, it will be explained below.
The TEBD algorithm divides the exponential operator performing the time-evolution into a sequence of two-qubit gates of the form:
Setting the Planck constant to 1, the time-evolution is expressed as:
where H = F + G,
What we can explicitly compute in parallel is the sequence of gates
Each of the compute node can apply most of the two-qubit gates without needing information from its neighbours. The compute nodes need to exchange information only at the borders, where two-qubit gates cross them, or just need information from the other side. We will now consider all three sweeps, two even and one odd and see what information has to be exchanged. Let us see what is happening on node m during the sweeps.

First (even) sweep

The sequence of gates that has to be applied in this sweep is:
Already for computing the first gate, process m needs information from its lowest neighbour, m-1. On the other side, m doesn't need anything from its "higher" neighbour, m+1, because it has all the information it needs to apply the last gate. So the best strategy for m is to send a request to m-1, postponing the calculation of the first gate for later, and continue with the calculation of the other gates. What m does is called non-blocking communication. Let's look at this in detail. The tensors involved in the calculation of the first gate are: