Running COMSOL® in parallel on clusters

Solution Number: 1001
Title: Running COMSOL® in parallel on clusters
Platform:
Applies to: All Products
Categories:
Keywords: solver memory parallel cluster

Problem Description

This solution describes how to enable distributed parallelization (cluster jobs) in COMSOL®.

Solution

COMSOL supports two mutually compatible modes of parallel operation: shared-memory parallel operations and distributed-memory parallel operations, including cluster support. This solution is dedicated to distributed-memory parallel operations. For shared-memory parallel operations, see Solution 1096. COMSOL can distribute computations on compute clusters using MPI: one large problem can be distributed across many compute nodes, or a parametric sweep can be distributed with individual parameter cases assigned to different cluster nodes.

Cluster computing is supported on Windows® (Windows® HPC Server 2008/2012/2016) and on Linux® using SLURM® or Sun Grid Engine (SGE, also known as Univa® Grid Engine). Other common schedulers, including LSF, PBS, and TORQUE, can be used by setting the scheduler type to General in the COMSOL Desktop® or by creating a job script for batch submission from the command line.

NOTE: A Floating Network License (FNL) is required to run COMSOL with distributed memory parallelization (cluster/cloud computing).

At the bottom of this page are some additional resources on how to get started with cluster computing.

Some basic information, useful tips, and troubleshooting guides are provided below.

Fundamentals

The following terms occur frequently when describing the hardware for cluster computing and shared memory parallel computing:

  • Compute node: The compute nodes are where the distributed computing occurs. A COMSOL instance resides in each compute node and communicates with other compute nodes using MPI. A compute node is a process running on the operating system, and multiple compute nodes can be assigned to run on a single host.
  • Host: The host is a physical machine with a network adapter and unique network address. A cluster consists of multiple hosts connected by a network. A host is sometimes referred to as a physical node.
  • Core: One or more physical processor cores are used in shared-memory parallelism by a compute node running on a host with a multicore processor. For example, a host with two quad-core processors has eight available cores.

The following settings are particularly important to specify how COMSOL executes a cluster job on distributed memory hardware. These settings can be found in the Cluster Computing or Cluster Sweep nodes and associated Job Configuration node in the COMSOL GUI.

  • Number of nodes: The total number of compute nodes created across all hosts when the cluster job is executed. This is then also the number of active MPI processes.
  • Host file: The host file is a plain text file that contains the IP address or hostname of each host. A proper host file for COMSOL should list each IP address or hostname only once, with each entry on a separate line (see the example after this list).
  • Number of processes on a host: The number of compute nodes that will run on each host.
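
For illustration, a host file for a four-host cluster could look as follows (the hostnames below are placeholders; use the actual hostnames or IP addresses of your hosts):

node1.mycluster.com
node2.mycluster.com
node3.mycluster.com
node4.mycluster.com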

Similar settings can be provided as execution arguments when running a non-interactive session of COMSOL from a command line.

  • -nn <no. of nodes> | Total number of compute nodes
  • -f <path/hostfile> | Path and file containing the list of hostnames
  • -nnhost <no. of nodes per host> | Number of compute nodes to run on each host
  • -np <no. of cores> | Number of cores to be used by each compute node

Getting started with the command line

A quick way to get started with COMSOL cluster computing is to run a job directly from the command line. For this you need access to a cluster where you have the ability to log into one of the hosts and execute commands. One of the simplest examples is a Beowulf cluster built from normal workstation computers, as described in Building a Beowulf Cluster for Faster Multiphysics Simulations.

For these examples, assume the system consists of four hosts, each with eight cores. In the first example, the input file example.mph could be any COMSOL file to be run with a single solve distributed across the four machines. The most common reason for this would be that the solver requires too much memory to be run on a single machine. The following Linux® operating system execution line will create four compute nodes, with one running on each host whose IP address is listed in hostfile. Each compute node will use all eight cores available on the host.

comsol batch -nn 4 -nnhost 1 -np 8 -f hostfile -inputfile example.mph \
-outputfile example_solved.mph -batchlog logfile.log

As a second example, the input file example.mph could be any COMSOL file with a parametric sweep where solving for each parameter value can be done on a single machine. Distributing the sweep on a cluster allows it to be solved faster. When preparing the model, go to the Parametric Sweep node, turn on Advanced Study Options, and under the Study Extensions settings select the Distribute parametric sweep check box. This configures the COMSOL scheduler to run each parameter value on a different compute node. Using the same Linux® command from the first example would run four parameters at the same time, one on each host. If the sweep has more than four parameters, then when a compute node finishes solving for one parameter it will automatically start on another until they are all solved.

For some parametric sweeps, such as relatively small models solved for many parameters, it may be beneficial to assign more than one compute node to each host. The following Linux® command will create eight compute nodes, with two running on each host listed in hostfile. Each compute node is assigned four cores, since two compute nodes need to share the eight cores available on a host.

comsol batch -nn 8 -nnhost 2 -np 4 -f hostfile -inputfile example.mph \
-outputfile example_solved.mph -batchlog logfile.log

The potential benefit of running multiple parameters simultaneously on a single host is discussed in the blog article Hybrid Computing: Advantages of Shared and Distributed Memory Combined.

Cluster distribution in an interactive COMSOL® session

Another way to run cluster jobs is to add a Cluster Computing or Cluster Sweep node to a study. This is done by right-clicking on the Study node (the Advanced Study Options must be enabled). The Cluster Computing interface distributes the work of the study across all hosts, similar to the first example above. The Cluster Sweep interface is similar to running a distributed parameter sweep, as described in the second example above. Either interface requires the user to specify relevant settings, such as the number of nodes, the host file, and the number of simultaneous jobs.

Example models for cluster computing included in the Model Library:

COMSOL_Multiphysics/Tutorials/micromixer_cluster Demonstrates how the Cluster Computing interface can be used to distribute the work of running a single model to several nodes. If the COMSOL session is running on one of the cluster hosts, then the built-in COMSOL scheduler is used by changing the setting "Scheduler type" to "General", specifying the number of nodes, and providing the path to the host file.

COMSOL_Multiphysics/Tutorials/thermal_actuator_jh_distributed Demonstrates how the Cluster Computing interface can be used with a distributed parametric sweep to have each node run a different parameter. This functionality can also be implemented by replacing the Cluster Computing and Parametric Sweep with a Cluster Sweep, which has additional options for file handling and greater resilience to failure of an individual node.

Cloud Computing

The benefits of cluster computing can also be achieved by running on cloud computing hardware. For additional information, including a list of cloud vendor partners who can help get you set up quickly and easily, please see Running COMSOL® Multiphysics and COMSOL Server™ in the Cloud.

Hardware Recommendations

See the Knowledge Base solution on Selecting hardware for clusters.

Troubleshooting

Your first step is to make sure that you have the latest release installed; it can be downloaded from the COMSOL website. Also run Help > Check for Updates to install the latest software updates, which can likewise be downloaded from the COMSOL website.

Error messages relating to GTK

GLib-GObject-WARNING **: invalid (NULL) pointer instance
GLib-GObject-CRITICAL **: g_signal_connect_data: assertion `G_TYPE_CHECK_INSTANCE (instance)' failed
Gtk-CRITICAL **: gtk_settings_get_for_screen: assertion `GDK_IS_SCREEN (screen)' failed
...

These errors typically occur when the COMSOL® user interface Java® component is trying to display an error message in a graphical window, but there is no graphical display available. The recommended solution is to disable file locking. Add the row

-Dosgi.locking=none 

to three COMSOL *.ini configuration files. Open the following files in a text editor:

/usr/local/comsol53/multiphysics/bin/glnxa64/comsolcluster.ini
/usr/local/comsol53/multiphysics/bin/glnxa64/comsolclustermphserver.ini
/usr/local/comsol53/multiphysics/bin/glnxa64/comsolclusterbatch.ini

In each of these files you will find several -Dosgi.* rows. Add the -Dosgi.locking=none row directly below these. Please note that the options are case sensitive.
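
As an illustration, the end of one of these files could then look similar to the following (the existing -Dosgi.* rows differ between installations and are shown here only as placeholders; only the last row is added):

-Dosgi.<existing option>=<existing value>
-Dosgi.<existing option>=<existing value>
-Dosgi.locking=none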

Check that the nodes can access the license manager

Linux®: Log in to each node and run the command

comsol batch -inputfile /usr/local/comsol53/multiphysics/models/COMSOL_Multiphysics/Equation-Based_Models/point_source.mph -outputfile out.mph

The command above should be issued on one line. /usr/local/comsol53 is assumed to be your COMSOL installation directory. The /usr/local/comsol53/multiphysics/bin directory, where the comsol script is located, is assumed to be included in the system PATH. Make sure you have write permissions for ./out.mph. No error messages should be produced, or you may have a license manager connectivity problem.

Windows® HPCS: Log in to each node with remote desktop and start the COMSOL Desktop GUI. No error messages should be displayed.

Issues with InfiniBand-based Linux® clusters

Update the InfiniBand drivers to the latest software version. If you cannot update at this time, add the command line option -mpifabrics shm:tcp or -mpifabrics tcp. This makes COMSOL use TCP for communication between nodes.
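
For example, the option can be added to a batch command of the same form as the earlier examples in this solution (file names and node counts are placeholders):

comsol batch -nn 4 -nnhost 1 -np 8 -f hostfile -mpifabrics shm:tcp -inputfile example.mph \
-outputfile example_solved.mph -batchlog logfile.log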

For more advice on how to troubleshoot InfiniBand issues, please refer to the section Troubleshooting Distributed COMSOL and MPI in the COMSOL Multiphysics Reference Manual.

Problems with the Cluster Computing feature in the model tree

If you get the error message "Process status indicates that process is running", it means that the *.status file in the batch directory indicates that the previous job is still running. In some cases this can happen even if the job is not actually running, for example if the job halted or was terminated in an uncontrolled way. To work around this problem, perform these steps:

  • Cancel any running jobs in the Windows® HPCS Job manager or other scheduler that you use.
  • In COMSOL, go to the External Process page at the bottom right corner of the COMSOL Desktop.
  • Click the Clear Status button. If the error still remains, manually delete all the files in the batch directory.

Error messages due to communication problems between Linux® nodes

If you get error messages, make sure that the compute nodes can access each other over TCP/IP and that all nodes can access the license manager in order to check out licenses. If you use the ssh protocol between the hosts on a Linux cluster, you need to pregenerate the keys in order to prevent the nodes from asking each other for passwords when communication is initiated:

# generate the keys
ssh-keygen -t dsa
ssh-keygen -t rsa
# copy the public keys to the other machine
ssh-copy-id -i ~/.ssh/id_rsa.pub user@hostname
ssh-copy-id -i ~/.ssh/id_dsa.pub user@hostname

Example of LSF job submission script

The following script specifies a run on four hosts (physical nodes), each running one compute node that uses eight processor cores.

#!/bin/sh
# Rerun process if node goes down, but not if the job crashes
# Cannot be used with interactive jobs
#BSUB -r

# Job name
#BSUB -J comsoltest

# Number of compute cores
#BSUB -n 32

# Use 8 cores per node
#BSUB -R "span[ptile=8]"

# Redirect screen output to output.txt
#BSUB -o output.txt
rm -rf output.txt

# Create hostfile for COMSOL
cat $LSB_DJOB_HOSTFILE | uniq > comsol_hostfile

# Launch the COMSOL batch job
comsol batch -nn 4 -nnhost 1 -np 8 -f comsol_hostfile -inputfile in.mph -outputfile out.mph

Example of PBS job submission script

The following script specifies a run on four hosts (physical nodes), each running one compute node that uses eight processor cores.

#!/bin/bash
# ##############################################################################
#
export nn=4
export np=8
export inputfile="simpleParametricModel.mph"
export outputfile="outfile.mph"
#
qsub -V -l nodes=${nn}:ppn=${np} <<'__EOF__'
#
#PBS -N COMSOL
#PBS -q dp48
#PBS -o $HOME/cluster/job_COMSOL_$$.log
#PBS -e $HOME/cluster/job_COMSOL_$$.err
#PBS -r n
#PBS -m a -M email@domain.com
#
echo "------------------------------------------------------------------------------"
echo "--- Starting job at: `date`"
echo
#
cd ${PBS_O_WORKDIR}
echo "--- Current working directory is: `pwd`"
#
np=$(wc -l < $PBS_NODEFILE)
echo "--- Running on ${np} processes (cores) on the following nodes:"
cat $PBS_NODEFILE
#
cat $PBS_NODEFILE | uniq > comsol_hostfile
echo "--- parallel COMSOL RUN"
comsol batch -nn $nn -nnhost 1 -np $np -f comsol_hostfile -inputfile $inputfile -outputfile $outputfile -batchlog batch_COMSOL__$$.log
echo
echo "--- Job finished at: `date`"
echo "------------------------------------------------------------------------------"
#
__EOF__
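
Example of SLURM job submission script

Since SLURM® is also a supported scheduler, a corresponding submission script can be useful. The following is a minimal sketch for the same layout as above, four hosts each running one compute node that uses eight processor cores. The job name, file names, and resource directives are assumptions and need to be adapted to your cluster and SLURM configuration.

#!/bin/bash
# Job name
#SBATCH -J comsoltest

# Four hosts, one compute node (MPI process) per host
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1

# Eight cores available to each compute node
#SBATCH --cpus-per-task=8

# Redirect screen output to output.txt
#SBATCH -o output.txt

# Create a host file with one host per line
scontrol show hostnames "$SLURM_JOB_NODELIST" > comsol_hostfile

# Launch the COMSOL batch job
comsol batch -nn 4 -nnhost 1 -np 8 -f comsol_hostfile -inputfile in.mph -outputfile out.mph -batchlog logfile.log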

Links and Downloads

Micromixer - Cluster Version

Joule Heating of a Microactuator - Distributed Parameter Version

See Also

For additional information on how COMSOL uses shared-memory parallelism on multicore computers, see COMSOL and Multithreading.


Disclaimer

COMSOL makes every reasonable effort to verify the information you view on this page. Resources and documents are provided for your information only, and COMSOL makes no explicit or implied claims to their validity. COMSOL does not assume any legal liability for the accuracy of the data disclosed. Any trademarks referenced in this document are the property of their respective owners. Consult your product manuals for complete trademark details.