As this may be useful for some people, I decided to write a short article about how one can do distributed parameter tests either on a cloud or on a cluster. As a test case, we will solve the Poisson equation for 1000 different conductivities, automatically distributed onto all available computing nodes. Apart from NGSolve, we will use dispy to manage the network communication.
Distributed computing using a cloud
This introduction works if all your computing nodes are in a local network. For use on a cluster, look a bit further down; if the nodes are in different networks, an explanation of how to use dispy in that setting is available in the dispy documentation. First, install both dispy and NGSolve on all nodes. Then execute the dispynode.py script (provided by dispy) on every node you want to use for your computation. Next, we write a compute function which will be executed on the nodes:
def compute(alpha):
    import ngsolve as ngs
    import pickle
    import socket
    import os
    from netgen.geom2d import unit_square

    # dispy schedules one job per core, so restrict NGSolve to a single thread
    ngs.SetNumThreads(1)

    mesh = ngs.Mesh(unit_square.GenerateMesh(maxh=0.1))
    fes = ngs.H1(mesh, order=5, dirichlet="left|top")
    u, v = fes.TrialFunction(), fes.TestFunction()

    # Poisson problem with conductivity alpha
    a = ngs.BilinearForm(fes)
    a += ngs.SymbolicBFI(alpha * ngs.grad(u) * ngs.grad(v))
    f = ngs.LinearForm(fes)
    f += ngs.SymbolicLFI(1 * v)

    gfu = ngs.GridFunction(fes, "u_" + str(alpha))
    with ngs.TaskManager():
        a.Assemble()
        f.Assemble()
        gfu.vec.data = a.mat.Inverse(fes.FreeDofs()) * f.vec
    print("computed solution")

    # pickle mesh and solution and send the file back to the master node
    directory = os.path.expanduser("~") + "/tmp"
    if not os.path.exists(directory):
        os.makedirs(directory)
    filename = directory + "/solution_" + str(alpha) + ".dat"
    with open(filename, "wb") as outfile:
        pickler = pickle.Pickler(outfile)
        pickler.dump([mesh, gfu])
    dispy_send_file(filename)

    host = socket.gethostname()
    return host
This function computes the solution for an input conductivity $\alpha$, pickles the mesh and solution to a file and sends the file back to the master node. On the master, we run some further dispy code which creates and submits the jobs:
if __name__ == '__main__':
    import dispy
    import numpy as np

    cluster = dispy.JobCluster(compute)
    jobs = []
    for alpha in np.linspace(1, 200, 1000):
        job = cluster.submit(alpha)
        job.id = alpha
        jobs.append(job)

    for job in jobs:
        host = job()
        if job.exception:
            print(job.exception)
        else:
            print('%s computed solution for alpha = %s in %s seconds'
                  % (host, job.id, job.end_time - job.start_time))

    cluster.print_status()
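Note that the print output of the compute function (the "computed solution" line) does not show up on the master's terminal automatically; dispy collects it per job. If you want to inspect it, a small sketch of an additional loop (run after the jobs have been collected with job() as above) could look like this:

    # sketch: print the captured stdout of each finished job
    for job in jobs:
        if job.stdout:
            print("output of job", job.id, ":", job.stdout)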
Running dispynode.py on 4 computers in the network and then running simple_ngsolve.py, we get all the pickled solutions back as well as a short report:
 Node | CPUs | Jobs | Sec/Job | Node Time Sec
------------------------------------------------------------------------------
 pc1  |  128 |  264 |   7.884 |      2081.290
 pc2  |   16 |  152 |   1.349 |       205.017
 pc3  |   48 |  300 |   2.115 |       634.399
 pc4  |   80 |  284 |   3.718 |      1055.902

Total job time: 3976.608 sec, wall time: 21.270 sec, speedup: 186.955
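To look at one of the returned results, the file can simply be unpickled again on the master. A minimal sketch (the exact path depends on where dispy stored the transferred files; "solution_1.0.dat" is just the file for the first parameter value):

    import pickle
    import ngsolve as ngs   # needed so the NGSolve classes can be unpickled

    # path and filename are assumptions, adjust to where the files were received
    with open("solution_1.0.dat", "rb") as infile:
        mesh, gfu = pickle.load(infile)

    print("norm of solution vector:", gfu.vec.Norm())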
Note that we told NGSolve to run with only one thread, using the line
ngs.SetNumThreads(1)
and let dispy start one job per available core. For larger computations it may be more efficient (or even necessary because of memory constraints) to do the reverse (have a look at the dispy documentation to see how to run one job per computer instead of one per core).
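As a rough sketch of that per-computer variant (not part of the original scripts): each job now receives a whole batch of alpha values and the TaskManager is allowed to use all cores of the node. The thread count of 20 is only an assumption and has to match your hardware; on the master one would then split np.linspace(1, 200, 1000) into a few chunks and submit one chunk per job.

    def compute_batch(alphas):
        # sketch: one job per computer, solving a batch of conductivities
        import ngsolve as ngs
        from netgen.geom2d import unit_square

        ngs.SetNumThreads(20)   # assumption: 20 cores per node
        mesh = ngs.Mesh(unit_square.GenerateMesh(maxh=0.1))
        fes = ngs.H1(mesh, order=5, dirichlet="left|top")
        u, v = fes.TrialFunction(), fes.TestFunction()
        f = ngs.LinearForm(fes)
        f += ngs.SymbolicLFI(1 * v)

        results = []
        with ngs.TaskManager():
            f.Assemble()
            for alpha in alphas:
                a = ngs.BilinearForm(fes)
                a += ngs.SymbolicBFI(alpha * ngs.grad(u) * ngs.grad(v))
                a.Assemble()
                gfu = ngs.GridFunction(fes)
                gfu.vec.data = a.mat.Inverse(fes.FreeDofs()) * f.vec
                # return only a scalar quantity per alpha instead of the full solution
                results.append((alpha, ngs.Integrate(gfu, mesh)))
        return results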
Parameter Testing using a cluster (by Lukas Kogler)
If you have access to a cluster, you might want to do your parameter tests there. From the Python side, the setup is basically the same; however, you will usually have to deal with some kind of workload manager or queue to start your computations.
We will give a short example of how to set this up on a cluster using the SLURM workload manager. All relevant files are in the attached zip archive.
We want one node running the master script and all the other nodes waiting for jobs via the dispynode.py script.
In this example, each node has 20 cores and we will start one instance of NGSolve per node, each using 20 threads.
The job is submitted with "sbatch script_dispy", which contains the following:
#!/usr/bin/bash
#
#SBATCH --job-name=dispy_test
#SBATCH --partition=medium
#SBATCH --output=/home/lkogler/scripts/dispy/slurm_out
#SBATCH --ntasks 4
#SBATCH -N 4
#SBATCH --time=15:00
#SBATCH --cpus-per-task=20
#SBATCH --tasks-per-node=1

srun --multi-prog multi.conf
This tells SLURM that we want to start a job using 4 nodes (-N 4), running 1 task per node with 20 CPUs assigned to each task.
--multi-prog tells SLURM that you are planning to execute different commands for different tasks. In this case, the list of commands is contained in the file multi.conf: task 0 runs the master script and tasks 1 to 3 wait for jobs.
0 python3 /home/lkogler/scripts/dispy/simple_ngsolve.py
1-3 bash /home/lkogler/scripts/dispy/.pipe_it python3 /home/lkogler/scripts/dispy/dispynode.py --daemon --clean
By default, SLURM writes the output of all tasks into the same file, which can be specified with "--output=..." in the SLURM script.
We want each instance of NGSolve to write into its own file. For this, SLURM provides the environment variable SLURM_NODEID, which we use in the attached file ".pipe_it" to redirect the output to one file per node:
#!/usr/bin/bash
$@ > /scratch/lkogler/output/out_node_$SLURM_NODEID
This is important because otherwise all of the processes would try to write to the same file at the same time; they would constantly block each other and the resulting file would be unreadable.
If you run into problems or need help getting started, leave a comment here or in the forum or contact us directly!