Standalone Solver
MPI provides a mechanism to dynamically spawn worker processes via MPI_Comm_spawn.
By using this feature, it is possible to separate the main application from the solver: the main application (the caller) spawns one or more solver processes and exchanges data with them through an inter-communicator.
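The following is a minimal sketch of the underlying mechanism using the raw MPI C API; the executable name "solver" and the worker count of four are placeholders, not part of the ezp implementation, which relies on the mpl and mpi4py wrappers shown below.

    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        // spawn four worker processes running the (placeholder) "solver" executable;
        // the returned inter-communicator connects the caller with the workers
        MPI_Comm workers;
        MPI_Comm_spawn("solver", MPI_ARGV_NULL, 4, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &workers, MPI_ERRCODES_IGNORE);

        // on the worker side, MPI_Comm_get_parent() returns the same inter-communicator

        MPI_Finalize();
        return 0;
    }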

The ezp solvers are also available as standalone executables that follow this design.
Such a form is particularly beneficial in the following cases.

- The main application is not suitable for the distributed-memory model due to, for example, latency constraints.
- The main application adopts a different parallelism pattern.
- A clear boundary between the main application and the solver(s) is desired.

However, such a clear boundary has drawbacks as well.
For example, the root process needs to hold the full matrices in memory, which is subject to physical limits.
Since the spawned processes are not persistent, solving the same system with different right-hand sides may incur unnecessary data communication.
If the matrices need to be read from IO in a distributed manner, finer control over how the data is prepared may also be required, likely by calling BLACS/ScaLAPACK/MPI functions directly.
Backbone
Here, we use the pgesv solver as an example to illustrate some critical steps in implementing the above architecture.
Instead of the raw MPI functions, the mpl library is used.

It is necessary to obtain the default communicator and the parent inter-communicator.
The latter is only valid if the executable has been spawned by a host application, which the full implementation below checks via parent.is_valid().
solver.pgesv.cpp:52:55

    #include <mpl/mpl.hpp>

    const auto& comm_world{mpl::environment::comm_world()};
    const auto& parent = mpl::inter_communicator::parent();
Since mpl handles the initialisation and finalisation of the MPI environment, it is necessary to tell ezp not to finalise the MPI environment itself.
The function ezp::blacs_env::do_not_manage_mpi() must be called whenever MPI is managed by external tools.
solver.pgesv.cpp:83

    ezp::blacs_env<int>::do_not_manage_mpi();
The parameters of the linear system are first broadcast by the main application.
solver.pgesv.cpp:90:98

    const auto all = mpl::communicator(parent, mpl::communicator::order_high);

    int config[3]{};
    all.bcast(0, config);

    const auto N = config[0];
    const auto NRHS = config[1];
    const auto FLOAT = config[2];
Knowing the problem sizes N and NRHS, the root process can initialise the containers and receive the contents of the two matrices.
The third entry, FLOAT, encodes the data type and is used at the end of the program to dispatch to the matching precision.
solver.pgesv.cpp:60:70

    if(0 == comm_world.rank()) {
        A.resize(N * N);
        B.resize(N * NRHS);

        mpl::irequest_pool requests;
        requests.push(parent.irecv(A, 0, mpl::tag_t{0}));
        requests.push(parent.irecv(B, 0, mpl::tag_t{1}));
        requests.waitall();
    }
Solving the system follows the conventional approach: create a solver object and call the solve method with the data wrapped.
The solution is then sent back to the caller.
solver.pgesv.cpp:72:77

    const auto error = ezp::pgesv<DT, IT>().solve({N, N, A.data()}, {N, NRHS, B.data()});

    if(0 == comm_world.rank()) {
        parent.send(error, 0);
        if(0 == error) parent.send(B, 0);
    }
The Caller
From the above code snippets, one may observe that the caller needs to

- broadcast config,
- send A and B,
- receive the error code and the solution X if no error occurred.

To this end, one can declare the corresponding containers using std::vector.
runner.cpp:38:39

    std::vector<double> A, B(N * NRHS, 1.);
    std::vector<int> config;
The following creates a diagonal matrix for illustration.
runner.cpp:45:50

    solver = "solver.pgesv";
    config = {N, NRHS, 1};

    A.resize(N * N, 0.);
    for(auto I = 0; I < N; ++I) A[I * N + I] = I + 1.;
Now that all the data is ready to be communicated, the actual communication is concise and straightforward.
Note that the caller merges the inter-communicator with order_low while the solver uses order_high, so the caller's processes come first in the merged communicator and rank 0, the broadcast root, belongs to the caller.
runner.cpp:95:110

    const auto& comm_world{mpl::environment::comm_world()};

    const auto worker = comm_world.spawn(0, argc < 3 ? 1 : std::abs(std::stoi(argv[2])), {solver});
    const auto all = mpl::communicator(worker, mpl::communicator::order_low);

    all.bcast(0, config.data(), mpl::contiguous_layout<int>(config.size()));

    mpl::irequest_pool requests;
    requests.push(worker.isend(A, 0, mpl::tag_t{0}));
    requests.push(worker.isend(B, 0, mpl::tag_t{1}));
    requests.waitall();

    int error = -1;
    worker.recv(error, 0);
    if(0 == error) worker.recv(B, 0);
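As a quick sanity check for the diagonal "ge" example built above, the received solution can be compared against the exact values. The following fragment is a sketch only, assumed to be appended after the receive above with <algorithm> and <cmath> available; it is not part of the reference implementation.

    // for the "ge" example, A = diag(1, ..., N) and b = 1, so the exact solution is x[i] = 1 / (i + 1)
    auto max_deviation = 0.;
    for(auto I = 0; I < N; ++I) max_deviation = std::max(max_deviation, std::abs(B[I] - 1. / (I + 1.)));
    printf("max deviation from exact solution: %.3e\n", max_deviation);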
Python Caller
Since there is a clear boundary between the main application and the solver, it is possible to use any other MPI-compatible tool to send/receive the problem.
For example, one can use mpi4py to call the solvers.
It has to be pointed out that the column-major memory layout must be preserved.
runner.py

    from array import array
    import sys

    from mpi4py import MPI
    import numpy


    def run(nprocs: int, N: int, NRHS: int):
        comm = MPI.COMM_WORLD

        # spawn the solver
        worker = comm.Spawn("./solver.pgesv", maxprocs=nprocs)
        all = worker.Merge()

        # broadcast the problem configuration
        all.Bcast(array("i", [N, NRHS, 1]), root=0)

        # send the matrices
        # ! numpy arrays are row-major by default
        # ! column-major order is required by (Sca)LAPACK
        # ! thus need to specify the order as "F"
        A = numpy.zeros((N, N), dtype=numpy.float64, order="F")
        for i in range(N):
            A[i, i] = i + 1
        print("A:\n", A)
        worker.Send(A, dest=0, tag=0)

        B = numpy.ones((N, NRHS), dtype=numpy.float64, order="F")
        for i in range(NRHS):
            B[:, i] = i + 1
        print("B:\n", B)
        worker.Send(B, dest=0, tag=1)

        # receive the error code and the solution if no error
        error = array("i", [-1])
        worker.Recv(error)
        if 0 == error[0]:
            X = numpy.ones((N, NRHS), dtype=numpy.float64, order="F")
            worker.Recv(X)
            print("Solution:\n", X)

        worker.Disconnect()
        all.Disconnect()


    if __name__ == "__main__":
        if len(sys.argv) < 4:
            print("Usage: runner.py <nprocs> <N> <NRHS>")
            sys.exit(1)

        nprocs = int(sys.argv[1])
        N = int(sys.argv[2])
        NRHS = int(sys.argv[3])

        run(nprocs, N, NRHS)
Calling the above script with, for example, N = 6 and NRHS = 2 yields the following output.
Since A is diagonal, each entry of the solution is the corresponding entry of B divided by the matching diagonal entry of A, which is exactly what is printed, so the result is correct.
    A:
     [[1. 0. 0. 0. 0. 0.]
     [0. 2. 0. 0. 0. 0.]
     [0. 0. 3. 0. 0. 0.]
     [0. 0. 0. 4. 0. 0.]
     [0. 0. 0. 0. 5. 0.]
     [0. 0. 0. 0. 0. 6.]]
    B:
     [[1. 2.]
     [1. 2.]
     [1. 2.]
     [1. 2.]
     [1. 2.]
     [1. 2.]]
    Solution:
     [[1.         2.        ]
     [0.5        1.        ]
     [0.33333333 0.66666667]
     [0.25       0.5       ]
     [0.2        0.4       ]
     [0.16666667 0.33333333]]
Full Reference Implementation
The following is a full reference implementation of a standalone solver and the corresponding caller logic.
runner.cpp

    #include <cstdio>
    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <vector>

    #include <mpl/mpl.hpp>

    int main(int argc, char** argv) {
        if(argc < 2) {
            std::cout << "Usage: runner ge|po|gb|db|pb [n]\n";
            std::cout << "Example: runner ge 3\n";
            return 0;
        }

        constexpr auto N = 6, NRHS = 1;

        std::vector<double> A, B(N * NRHS, 1.);
        std::vector<int> config;

        const auto type = std::string(argv[1]);

        std::string solver;
        if("ge" == type) {
            solver = "solver.pgesv";
            config = {N, NRHS, 1};
            A.resize(N * N, 0.);
            for(auto I = 0; I < N; ++I) A[I * N + I] = I + 1.;
        }
        else if("po" == type) {
            solver = "solver.pposv";
            config = {N, NRHS, 1};
            A.resize(N * N, 0.);
            for(auto I = 0; I < N; ++I) A[I * N + I] = I + 1.;
        }
        else if("gb" == type) {
            solver = "solver.pgbsv";
            constexpr auto KL = 1, KU = 1;
            config = {N, KL, KU, NRHS, 1};
            A.resize(N * (KL + KU + 1), 0.);
            for(auto I = 0; I < N; ++I) A[KU + I * (KL + KU + 1)] = I + 1.;
        }
        else if("db" == type) {
            solver = "solver.pdbsv";
            constexpr auto KL = 1, KU = 1;
            config = {N, KL, KU, NRHS, 1};
            A.resize(N * (KL + KU + 1), 0.);
            for(auto I = 0; I < N; ++I) A[KU + I * (KL + KU + 1)] = I + 1.;
        }
        else if("pb" == type) {
            solver = "solver.ppbsv";
            constexpr auto KLU = 1;
            config = {N, KLU, NRHS, 1};
            A.resize(N * (KLU + 1), 0.);
            for(auto I = 0; I < N; ++I) A[I + I * KLU] = I + 1.;
        }
        else {
            std::cout << "Usage: runner ge|po|gb|db|pb [n]\n";
            return 0;
        }

        const auto& comm_world{mpl::environment::comm_world()};

        const auto worker = comm_world.spawn(0, argc < 3 ? 1 : std::abs(std::stoi(argv[2])), {solver});
        const auto all = mpl::communicator(worker, mpl::communicator::order_low);

        all.bcast(0, config.data(), mpl::contiguous_layout<int>(config.size()));

        mpl::irequest_pool requests;
        requests.push(worker.isend(A, 0, mpl::tag_t{0}));
        requests.push(worker.isend(B, 0, mpl::tag_t{1}));
        requests.waitall();

        int error = -1;
        worker.recv(error, 0);
        if(0 == error) worker.recv(B, 0);

        for(auto I = 0; I < N; ++I) printf("x[%d] = %+.6f\n", I, B[I]);

        return 0;
    }
solver.pgesv.cpp

    #include <cstdio>
    #include <vector>

    #include <ezp/pgesv.hpp>
    #include <mpl/mpl.hpp>

    const auto& comm_world{mpl::environment::comm_world()};
    const auto& parent = mpl::inter_communicator::parent();

    template<ezp::data_t DT, ezp::index_t IT> int run(const int N, const int NRHS) {
        std::vector<DT> A, B;

        if(0 == comm_world.rank()) {
            A.resize(N * N);
            B.resize(N * NRHS);

            mpl::irequest_pool requests;
            requests.push(parent.irecv(A, 0, mpl::tag_t{0}));
            requests.push(parent.irecv(B, 0, mpl::tag_t{1}));
            requests.waitall();
        }

        const auto error = ezp::pgesv<DT, IT>().solve({N, N, A.data()}, {N, NRHS, B.data()});

        if(0 == comm_world.rank()) {
            parent.send(error, 0);
            if(0 == error) parent.send(B, 0);
        }

        return 0;
    }

    int main(int, char**) {
        ezp::blacs_env<int>::do_not_manage_mpi();

        if(!parent.is_valid()) {
            printf("This program must be invoked by the host application.\n");
            return 0;
        }

        const auto all = mpl::communicator(parent, mpl::communicator::order_high);

        int config[3]{};
        all.bcast(0, config);

        const auto N = config[0];
        const auto NRHS = config[1];
        const auto FLOAT = config[2];

        if(FLOAT >= 10) return run<complex16, int_t>(N, NRHS);
        if(FLOAT >= 0) return run<double, int_t>(N, NRHS);
        if(FLOAT > -10) return run<float, int_t>(N, NRHS);
        return run<complex8, int_t>(N, NRHS);
    }