Analytical Services & Materials, Inc.

CFD on Inexpensive Clustered Computers

Steven J. Massey and
Khaled S. Abdol-Hamid

Analytical Services & Materials Inc.
Hampton, VA 23666
(757) 865-7093

November 2, 1998
Revised June 12, 2000

Introduction

In the past decade, advances in vector supercomputer technology drove the rapid growth and commercial acceptance of computational fluid dynamics. However, due to fundamental physical constraints, the speed of these vector machines has plateaued in the last few years, making them prohibitively expensive to own and operate. While individual workstations have taken over much of the work previously done by vector machines, mid- to large-size problems still require far more computing power and memory than the average workstation provides. A solution to this problem is now at hand thanks to several inexpensive new technologies: Fast Ethernet, powerful personal computers, free Unix, and free parallel libraries (the Message Passing Interface, MPI). Using these technologies it is now possible to build supercomputer-class, coarse-grained parallel computers entirely from mass-market, off-the-shelf components. The latest release of the CFD code PAB3D has embraced this cluster technology and has driven as many as 64 processors at efficiencies of over 80%; for smaller problems with less communication, efficiencies are routinely above 95%. This article briefly outlines the software and hardware requirements for harnessing the power of these new machines for CFD. See Sterling et al. [1] for a broader assessment of clustering for scientific applications in general.

Sample Clusters

Since Thomas Sterling's pioneering work creating the PC cluster "Beowulf" at NASA Goddard Space Flight Center in 1993 (see Sterling et al. [2]), the installed base of low-cost clusters, also known as Beowulf-class machines, has continued to grow rapidly.

NASA Goddard Space Flight Center is currently host to two large clusters and several smaller ones. "the HIVE" is a 64 node cluster with two Pentium Pro processors (200 MHz, 256K cache) per node connected by Fast Ethernet, see Figure 1. "the HIVE" is capable of a sustained rate of 7.3 GFLOPS running the PROMETHEUS computer code, which solves the Euler equations for compressible gas dynamics on a structured grid using the Piecewise-Parabolic Method. At a 1997 system price of $210,000, the cost per MFLOPS is $29. The second large cluster at NASA Goddard is a file server consisting of 100 Pentium Pro processors (166 MHz, 2 per node) and a total disk space of 515 GB, see Figure 2.

Very recently, Michael S. Warren of Los Alamos National Laboratory assembled a 140 node DEC Alpha (533 MHz, 2 MB cache) cluster constructed entirely from commodity personal computer technology and freely available software, for a cost of $313,000 including on-site labor, see Figure 3. The machine performs at a peak of 47.7 GFLOPS (Linpack) and sustains 17.6 GFLOPS on a gravitational simulation code. For reference, CFD codes typically run at 325 to 480 MFLOPS per processor on a Cray C-90.

Another notable DEC Alpha cluster is the 160 node cluster at Digital Domain, constructed in 1996 to render scenes for the film Titanic, see Figure 4. Unlike the other clusters discussed so far, the Digital Domain cluster mixed Windows NT and Linux operating systems to meet application software requirements. However, for scientific applications for which source code is available, Linux offers the best performance and lowest cost.

Although the "Beowulf" concept of commodity super computers was founded on Intel PC's, the availability of the powerful 64 bit DEC Alpha processor on standard PC motherboards has made clusters based on the Los Alamos and Digital Domain architecture a viable alternative to the Intel architecture. The exact chip and architecture that results in the best computing value changes daily so the choice of which CPU to buy must be reevaluated at the time of purchase. In October of 1998, the DEC Alpha 533 was the best value, and so for this reason, AS&M has chose it to build a prototype cluster.

AS&M Prototype Cluster

The baseline configuration for the AS&M prototype 3 node cluster consists of the components listed below. The total cost of the system as of October 2, 1998, including 40 hours of setup labor, was approximately $15,000. A further cost savings is that no yearly maintenance contract is needed; such contracts typically run 10% of the original purchase price. Maintenance contracts are a prudent choice for single machines costing $40,000 and multiple-CPU boxes costing well over $100,000. For a clustered computer, however, each node is entirely independent, so the exposure to loss is only the cost of a single node. In addition, each node in our cluster came standard with a 2 year warranty for repair or replacement within 48 hours, a level of support that can only be obtained at a premium for a typical workstation.
Update June 12, 2000:
An Intel based node of similar speed to the Alpha 21164-500, containing a Pentium III 650 overclocked to 805 MHz and 256 MB RAM and booting diskless via Fast Ethernet, can be assembled for approximately $1,500 per compute node.

Hardware

 

Head Node:

 

  • DEC Alpha 533 MHz, 2 MB Cache, 512 MB RAM, 8 GB EIDE Disk
  • PCI Video Card w/2MB DRAM
  • (2) Kingston PCI Ethernet 10/100 Mbps
  • CD-ROM Drive
  • Floppy Drive
  • 6 port Keyboard/Video/Mouse Switch
  • 8 port 10/100 Mbps Hub
  • Monitor
  • Keyboard, mouse
  • APC Back-UPS 650

 

Subordinate Nodes:

 

  • DEC Alpha 533 MHz, 2 MB Cache, 256 MB RAM, 8 GB EIDE Disk
  • PCI Video Card w/2MB DRAM
  • Kingston PCI Ethernet 10/100 Mbps
  • Floppy Drive

 

Software

 

  • Operating System: Red Hat Alpha Linux 5.1
  • Compilers: GNU C, C++, and GNU FORTRAN 77 (an f90 to gcc translator is also available from NAG). More compilers are available for Intel Linux.
  • Parallel Libraries: Argonne MPICH, LAM MPI, PVM (a minimal MPI usage sketch follows this list).
  • Applications: PAB3D, USM3D, CFL3D, OVERFLOW and any other MPI enabled software.
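
The parallel libraries listed above all implement the same standard message-passing interface, so an MPI-enabled code is portable across MPICH, LAM and the cluster hardware described here. The following is a minimal, hypothetical sketch of the kind of block-boundary (ghost cell) exchange a coarse-grained CFD code performs each iteration; it is written in C for brevity and is not an excerpt from PAB3D or any of the codes named above.

/* halo.c -- hypothetical sketch of a ghost-cell exchange with MPI.
 * Build and run (MPICH or LAM):  mpicc -O2 halo.c -o halo
 *                                mpirun -np 3 ./halo                       */
#include <mpi.h>
#include <stdio.h>

#define NCELL 64                    /* interior cells per node (made-up size) */

int main(int argc, char **argv)
{
    double q[NCELL + 2];            /* solution array with one ghost cell on each end */
    int rank, size, left, right, i, it;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* neighbors in a 1-D block decomposition; outer boundaries talk to no one */
    left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (i = 0; i < NCELL + 2; i++)
        q[i] = (double)rank;        /* dummy initial data */

    for (it = 0; it < 10; it++) {
        /* pass the last interior cell to the right neighbor, receive into the
         * left ghost cell, and vice versa -- the per-iteration communication  */
        MPI_Sendrecv(&q[NCELL], 1, MPI_DOUBLE, right, 0,
                     &q[0],     1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, &status);
        MPI_Sendrecv(&q[1],         1, MPI_DOUBLE, left,  1,
                     &q[NCELL + 1], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, &status);

        /* ... interior flux evaluation and time advance would go here ... */
    }

    if (rank == 0)
        printf("completed 10 iterations of ghost-cell exchange on %d nodes\n", size);

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with mpirun -np 3, the same binary runs unchanged on one node or across all three, which is what makes the cluster transparent to the application.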

 

Machine Performance

Based on the performance of the Los Alamos DEC cluster running molecular dynamics codes, we expected a total 3 node performance in the range of 150 to 377 MFLOPS for CFD codes. Test runs produced excellent results of 225-400 MFLOPS, depending on the code, problem size and algorithm. Performance data for various machines and CFD codes are shown in Tables 1-5. Since the CFD problem types and algorithms are not consistent across tables and no consideration is given to solution convergence, only machine-to-machine comparisons within a table are valid. The performance ratios among machines do, however, offer confidence in the test results when they are consistent across codes and test cases. Also note that the results for PAB3D, OVERFLOW and CFL3D are for single precision (real*4) calculations while the USM3D results are for double precision.

Of all the machines listed in the tables, only two should currently be considered for constructing a new cluster: the dual Pentium II 450 (not the Xeon) and the DEC Alpha 533. The low price of the dual Pentium relative to the single-CPU version puts its cost per MFLOPS at less than 4% more than the DEC Alpha, so the choice is a matter of user preference for the platform and available software. Currently the availability and quality of compilers for the Alpha/Linux platform is poor, so codes must be compiled in the DEC/Unix environment and brought over to Linux. This is possible because DEC/Unix binaries are compatible with Linux on the Alpha if the code is compiled with the non-shared library option. For Intel/Linux the compilers are well established and produce code that runs as fast as or faster than the Intel/NT versions.

In summary, the following tables make it clear that the DEC Alpha 533 PC and the dual Pentium II 450 platform offer excellent single and multiprocessor performance that is highly scalable for the average CFD problem, in which there are many more interior cells than boundary cells. Because of this high interior-to-boundary cell ratio, multiprocessor performance is extremely efficient. More tests are currently underway to correlate code performance with the ratio of interior cells to boundary cells, so that a point of diminishing returns can be established for adding nodes to a problem of a given size.

Single CPU

 

Table 1:  Machine Performance for a PAB3D 5 Block 2D Case, Two-Eqn. Turbulence Model, Block Diagonal, Two Direction Viscosity:  24,320 cells
Machine    MFLOPS    time/cell/it (micro sec)    time/cell/it (x C-90)    approx. cost (thousands $)
Cray C-90 Single Processor (f90) 325 15.2 1 1,000
DEC Alpha 21164 533 MHz (Linux) 77 64 4.2 2.5
HP 9000/780 180 MHz 72 68 4.5 20
Sun Ultra-2 200 MHz 49 100 6.6 20
DEC Alpha 21164 333 MHz (Digital UNIX) 48 104 6.8 20
Pentium II 450 MHz (Linux) 45 109 7.2 4
SGI R10000 195 MHz Indigo 2 41 122 8.0 18
Pentium II 300 MHz (Linux/NT 4) 34 148 9.7 2
SGI R10000 175 MHz O2 31 160 10.5 10
Pentium 120 MHz (Win 95) 7 700 46.0 0.5

 

Table 2:  Machine Performance for a PAB3D 9 Block 3D Case, Two-Eqn. Turbulence Model, Block Diagonal, Two Direction Viscosity:  1,213,440 cells
Machine    MFLOPS    time/cell/it (micro sec)    time/cell/it (x C-90)
Cray C-90 Single Processor (f90) 370 20.5 1
DEC Alpha 21164 533 MHz (Linux) 103 74 3.6
HP 9000/780 180 MHz 79 96 4.7
SGI Origin 2000 R10k 195 MHz 78 97 4.7
DEC Alpha 21164 333 MHz (Digital UNIX) 60 127 6.2
Sun Ultra-2 200 MHz 59 129 6.3
SGI R10000 195 MHz Indigo 2 57 134 6.5
Pentium II 300 MHz 47 160 7.8
SGI R10000 175 MHz O2 44 172 8.4


 

Table 3:  Machine Performance for a USM3D 3D Case with k-e Turbulence Model:  338,417 cells
Machine    time/cell/it (micro sec)    time/cell/it (x Alpha Linux)
DEC Alpha 21164 533 MHz (Linux) 152 1
SGI R10000 195 MHz Octane 249 1.6
DEC Alpha 21164 533 MHz (Digital UNIX) 289 1.9


 

Table 4:  Machine Performance for an OVERFLOW Single Block 3D Case with One Eqn Turb., Single Direction Viscosity:  199,920 cells
Machine    MFLOPS    time/cell/it (micro sec)    time/cell/it (x C-90)
Cray C-90 Single Processor  426 8 1
DEC Alpha 21164 533 MHz (Linux) 133 25 3.2
SGI R10000 195 MHz Octane 123 27 3.4
DEC Alpha 21164 533 MHz (Digital UNIX) 96 35 4.4
SGI R10000 195 MHz Indigo 2 72 47 5.9


 

Table 5a:  Machine Performance for a CFL3D Single Block 3D Case, Scalar Diagonalization, Roe Scheme, Laminar, Single Direction Viscosity:  147,456 cells
Machine    MFLOPS    time/cell/it (micro sec)    time/cell/it (x C-90)
Cray C-90 Single Processor (f90) 480 7.8 1
DEC Alpha 21164 533 MHz (Linux) 104 36 4.6
SGI R10000 195 MHz Octane 81 46 5.9
DEC Alpha 21164 533 MHz (Digital UNIX) 80 47 6.0


 

Table 5b:  Machine Performance for a PAB3D Single Block 3D Case, Scalar Diagonalization, Roe Scheme, Laminar, Single Direction Viscosity:  147,456 cells
Machine    MFLOPS    time/cell/it (micro sec)    time/cell/it (x C-90)
Cray C-90 Single Processor (f90) 408 7.4 1
DEC Alpha 21164 533 MHz (Linux) 89 34 4.6
DEC Alpha 21164 533 MHz (Digital UNIX) 69 44 5.9
SGI R10000 195 MHz Octane 64 47 6.4

 

AS&M Cluster Machine Performance

 

Table 6a:  MPI PAB3D Performance for a Balanced 3 Block 3D Case, Scalar Diagonalization, Roe Scheme, Two-Eqn. Turbulence Model, Single Direction Viscosity:  1,018,368 cells
Machine    CPUs    MFLOPS    total time/cell/it (micro sec)    percent speed-up    total time/cell/it (x Single C-90)
Cray C-90 (f90) 1 360 12.0 NA 1
DEC Alpha 21164 533 MHz (Linux) 3 225 19.5 298 1.6
SGI Origin 2000 R10k 195 MHz 3 180 23.5 325 2.0
SGI R10k 195 MHz Octane 3 133 32.6 275 2.7
DEC Alpha 21164 533 MHz (Linux) 1 75 58.1 NA 4.8
SGI Origin 2000 R10k 195 MHz 1 56 76.4 NA 6.4
SGI R10k 195 MHz Octane 1 48 89.8 NA 7.5
Sun Ultra-2 200 MHz 1 43 99.2 NA 8.3


 

Table 6b:  MPI PAB3D Performance for a Balanced 3 Block 3D Case, Scalar Diagonalization, Roe Scheme, Two-Eqn. Turbulence Model, Single Direction Viscosity:  127,296 cells (1/8th of Case in Table 6a)
Machine    CPUs    MFLOPS    total time/cell/it (micro sec)    percent speed-up    total time/cell/it (x Single C-90)
Cray C-90 (f90) 1 320 12 NA 1
DEC Alpha 21164 533 MHz (Linux) 3 229 17 282 1.4
SGI Origin 2000 R10k 195 MHz 3 213 17.5 291 1.5
SGI R10k 195 MHz Octane 3 167 23 278 1.9
DEC Alpha 21164 533 MHz (Linux) 1 80 48 NA 4.0
SGI Origin 2000 R10k 195 MHz 1 74 51 NA 4.3
SGI R10k 195 MHz Octane 1 60 64 NA 5.3
Sun Ultra-2 200 MHz 1 41 93 NA 7.8


 

Table 6c:  MPI PAB3D Performance for a Balanced 3 Block 3D Case, Scalar Diagonalization, Roe Scheme, Laminar, Single Direction Viscosity:  127,296 cells (Table 6b Case without Turbulence)
Machine    CPUs    MFLOPS    total time/cell/it (micro sec)    percent speed-up    total time/cell/it (x Single C-90)
Cray C-90 (f90) 1 350 8 NA 1
DEC Alpha 21164 533 MHz (Linux) 3 241 11.6 293 1.45
SGI Origin 2000 R10k 195 MHz 3 227 12.3 292 1.54
SGI R10k 195 MHz Octane 3 175 16 287 2.0
Pentium II 450 MHz (Linux) 3 141 19.8 252 2.5
DEC Alpha 21164 533 MHz (Linux) 1 81 34 NA 4.3
SGI Origin 2000 R10k 195 MHz 1 78 36 NA 4.5
SGI R10k 195 MHz Octane 1 60 46 NA 5.8
Pentium II 450 MHz (Linux) 1 56 50 NA 6.3
Sun Ultra-2 200 MHz 1 48 58 NA 7.3

 

PAB3D Performance on the Coral Cluster

In December of 1998, ICASE installed the 32 node "Coral" cluster based on Intel Pentium II 400 MHz CPUs. Recently, PAB3D was tested on balanced 2D and 3D grid cases. The 3D case consisted of a 48 block ONERA M6 wing grid with a total of 884,736 cells; the 2D case consisted of a 32 block RAE2822 airfoil grid with a total of 98,304 cells.

ONERA M6 Wing Case

The ONERA M6 grid is made up of 48 blocks, each having a dimension of 36 x 32 x 16 cells, for a total of 884,736 cells. Using the grid sequencing feature of PAB3D, the case was also run at a medium resolution of 1/2 the cells in each direction and a coarse resolution of 1/4 the cells in each direction, reducing the total cell count to 1/8 and 1/64 of the fine grid, respectively. To determine the effects of node scaling and block size on performance, the fine, medium and coarse grids were run on block-balanced configurations of 1, 2, 3, 4, 6, 8, 12, 16 and 24 nodes. The flow was solved using a two-factor scheme, Roe flux splitting, viscosity coupled in two directions and a two-equation turbulence model.
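
The cell counts and the "logical volume to surface" ratios quoted in Tables 7a-7c follow directly from the block dimensions: each sequencing level halves every direction, and the ratio is the number of interior cells in a block divided by the number of boundary-face cells. The short C sketch below simply reproduces that arithmetic; the program itself is ours, written only to make the bookkeeping explicit.

/* vsratio.c -- reproduce the cell counts and logical volume-to-surface
 * ratios quoted in Tables 7a-7c for the 48-block ONERA M6 grid.            */
#include <stdio.h>

int main(void)
{
    const int nblock = 48;
    /* grid-sequencing levels: fine, medium (1/2 per direction), coarse (1/4) */
    const int dims[3][3] = { {36, 32, 16}, {18, 16, 8}, {9, 8, 4} };
    const char *level[3] = { "fine", "medium", "coarse" };
    int lev;

    for (lev = 0; lev < 3; lev++) {
        int i = dims[lev][0], j = dims[lev][1], k = dims[lev][2];
        int vol  = i * j * k;                /* interior cells per block        */
        int surf = 2 * (i*j + j*k + i*k);    /* boundary-face cells per block   */
        printf("%-6s grid: %7d total cells, v/s = %d/%d = %.1f\n",
               level[lev], nblock * vol, vol, surf, (double)vol / surf);
    }
    return 0;
}

Running it gives 884,736, 110,592 and 13,824 total cells with ratios of 4.1, 2.1 and 1.0, matching the values quoted in the table headings.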

Results for each grid resolution are tabulated and plotted against the number of nodes; see Table 7 and Plots 1 and 2. It is clear that both the number of processors and the coarseness of the grid (in terms of the ratio of interior cells to boundary cells) affect the efficiency of the parallelization. In the plots of speed-up and efficiency versus number of nodes, the increasing curvature with grid coarseness indicates that as the grid is coarsened, the penalty for adding nodes grows increasingly nonlinear. The nonlinearity reflects the constant time delay associated with every communication call. The efficiency plots, see Plot 2, show for each grid the penalty of increased network communication as blocks are spread across more machines.
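
A simple cost model makes the role of this fixed per-call delay concrete. If the per-iteration time on N nodes is taken as the single-node compute time divided by N plus a communication term consisting of a fixed start-up latency per call and a per-cell transfer time, speed-up and efficiency can be estimated as in the sketch below. The model and every constant in it are our own illustrative assumptions, deliberately exaggerated so the trend is visible in a few lines of output; none is a measured or fitted parameter from the Coral cluster.

/* effmodel.c -- toy latency + bandwidth model illustrating why coarse grids
 * lose parallel efficiency faster as nodes are added.  All constants are
 * assumed, exaggerated values, not measurements from the Coral cluster.     */
#include <stdio.h>

int main(void)
{
    const double t_cell = 100.0e-6;  /* compute time per interior cell per iteration, s (assumed) */
    const double t_lat  = 20.0e-3;   /* fixed start-up delay per communication call, s (assumed)  */
    const double t_xfer = 10.0e-6;   /* transfer time per boundary cell, s (assumed)              */
    const int    ncalls = 2;         /* boundary-condition calls per node per iteration (assumed) */

    /* interior and boundary-face cell totals for the fine and coarse
     * ONERA M6 grids (48 blocks, Tables 7a and 7c) */
    const double vol[2]  = { 48.0 * 18432.0, 48.0 * 288.0 };
    const double surf[2] = { 48.0 * 4480.0,  48.0 * 280.0 };
    const char  *name[2] = { "fine", "coarse" };
    const int    nodes[] = { 1, 2, 4, 8, 16, 24 };
    int g, i;

    for (g = 0; g < 2; g++) {
        const double t1 = vol[g] * t_cell;              /* single-node iteration time */
        printf("%s grid:\n", name[g]);
        for (i = 0; i < 6; i++) {
            int    n      = nodes[i];
            double t_comp = t1 / n;                     /* interior work divides up   */
            double t_comm = (n > 1) ? ncalls * t_lat + (surf[g] / n) * t_xfer : 0.0;
            double eff    = t_comp / (t_comp + t_comm); /* = speed-up / n             */
            printf("  %2d nodes: efficiency %5.1f%%\n", n, 100.0 * eff);
        }
    }
    return 0;
}

With these assumed numbers the fine-grid efficiency stays above 95% out to 24 nodes while the coarse-grid efficiency falls below 60%, driven mostly by the fixed latency term; the absolute values are meaningless, but the shape of the trend is the same one seen in Tables 7a-7c and Plot 2.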

Using the nitb parameter in PAB3D, boundary condition communication may be performed only every nitb iterations. The effect of reducing the communication by a factor of 2, 3 and 4 was tested on the worst case: 24 nodes on the coarse grid. The results, shown in Table 8, indicate that reducing the communication by a factor of 4 improves the efficiency by another 25 points (a 41% relative increase), to nearly that of the fine grid case. In deciding whether to reduce the boundary communication, one would have to weigh the convergence penalty, which will depend on the physics of the problem as well as the grid structure.
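
In code, the nitb mechanism amounts to guarding the boundary-condition exchange inside the iteration loop. The fragment below is a hypothetical illustration written in C, not an excerpt from PAB3D (which is a Fortran code); the stub routines only count calls so the sketch can be compiled and run on its own.

/* nitb.c -- hypothetical sketch of incremental boundary-condition
 * communication.  Stub routines stand in for the real solver work.          */
#include <stdio.h>

static int comm_calls = 0;

static void exchange_block_boundaries(void) { comm_calls++; }  /* stands in for the MPI ghost-cell exchange */
static void update_interior(void)           { /* flux evaluation and time advance would go here */ }

int main(void)
{
    const int niter = 1000;                      /* iterations (arbitrary) */
    int nitb, it;

    for (nitb = 1; nitb <= 4; nitb++) {          /* communication increments tested in Table 8 */
        comm_calls = 0;
        for (it = 1; it <= niter; it++) {
            if (it % nitb == 0)                  /* communicate only every nitb-th iteration */
                exchange_block_boundaries();
            update_interior();
        }
        printf("nitb = %d: %4d boundary exchanges in %d iterations\n",
               nitb, comm_calls, niter);
    }
    return 0;
}

For nitb = 4 the number of boundary exchanges drops to a quarter, which is where the efficiency gain in Table 8 comes from; the convergence cost of iterating on stale boundary data is what must be weighed against it.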

In summary, the PAB3D results from the Coral cluster demonstrate that as the number of nodes increases, the communication overhead grows nearly linearly for blocks having a volume-to-surface cell ratio greater than or equal to 4. As the blocks are coarsened to a volume-to-surface ratio of 1.0, the communication overhead grows nonlinearly with the number of CPUs because of the fixed overhead of each communication call. Very coarse grids, which are typically used early in a computation to speed convergence, may therefore benefit significantly from incremental boundary condition communication.

Table 7a:  PAB3D Performance on ICASE Coral Cluster: 48 Block 3D Case, Fine Grid 884,736 cells. Logical Volume to Surface Ratio = 18,432/4,480 = 4.1
Number of Intel Pentium II 400 MHz Nodes    time/cell/it (micro sec)    Speed-Up Factor    Percent Efficiency
1 104.4 1.00 100.0
2 53.3 1.96 97.9
3 35.6 2.93 97.8
4 27.6 3.78 95.1
6 18.7 5.60 93.6
8 14.1 7.40 92.3
12 9.7 10.8 90.1
16 7.4 14.1 88.4
24 5.1 20.6 86.1


 

Table 7b:  PAB3D Performance on ICASE Coral Cluster: 48 Block 3D Case, Medium Grid 110,592 cells. Logical Volume to Surface Ratio = 4.1 / 2 = 2.1
Number of Intel Pentium II 400 MHz Nodes    time/cell/it (micro sec)    Speed-Up Factor    Percent Efficiency
1 98.7 1.00 100.0
2 51.3 1.92 95.9
3 34.4 2.87 95.7
4 26.6 3.72 93.1
6 18.4 5.36 89.3
8 14.0 7.07 88.4
12 9.7 10.1 84.4
16 7.6 12.9 80.4
24 5.4 18.3 76.3


 

Table 7c:  PAB3D Performance on ICASE Coral Cluster: 48 Block 3D Case, Coarse Grid 13,824 cells. Logical Volume to Surface Ratio = 4.1 / 4 = 1.0
Number of Intel Pentium II 400 MHz Nodes    time/cell/it (micro sec)    Speed-Up Factor    Percent Efficiency
1 107.0 1.00 100.0
2 57.5 1.86 93.0
3 38.7 2.76 92.2
4 30.7 3.49 88.0
6 21.6 4.96 83.8
8 16.7 6.41 80.4
12 12.4 8.64 73.3
16 10.0 10.7 68.0
24 7.5 14.2 59.7

 

Plot 1:  PAB3D Speed-Up vs. Nodes.

 

Plot 2:  PAB3D Parallel Efficiency vs. Nodes.

 

 

Table 8: Effect of Reduced Boundary Condition Communication on Parallel Efficiency on Coarse Grid.
Boundary Condition Communication Increment (nitb)    Parallel Efficiency    Speed-Up Factor for 24 CPUs
1 59.7 % 14.3
2 72.4 % 17.4
3 80.8 % 19.4
4 84.4 % 20.3

 

RAE2822 Airfoil Case

The RAE 2822 grid is made up of 32 blocks, each having a dimension of 32 x 96 cells, for a total of 98,304 cells. To study the effect of reducing interior cells while keeping the boundary size constant, the case was also run at medium and coarse resolutions of 1/2 and 1/4 the cells in the wrap-around (J) direction. To determine the effects of node scaling and block size on performance, the fine, medium and coarse grids were run on block-balanced configurations of 1, 2, 4, 8, 16 and 32 nodes. The flow was solved using a two-factor scheme, Roe flux splitting, thin layer viscosity and a two-equation turbulence model.

Results for each grid resolution are plotted against the number of nodes in Plots 3 and 4. As in the 3D example, both the number of processors and the coarseness of the grid affect the efficiency of the parallelization. Unlike the wing case, the airfoil case has a much higher ratio of interior to boundary cells: 32, 16 and 8 as opposed to 4, 2 and 1. However, because of the small number of interior cells involved in each iteration, the relative communication overhead is much larger, which is again due to the delays involved in initiating each communication call.

In summary, the PAB3D airfoil results demonstrate that even for very small 2D blocks, high parallel efficiencies can still be obtained, provided that the ratio of interior to boundary cells is on the order of 10.

Plot 3:   PAB3D Speed-Up vs. Nodes.

 

 

Plot 4:   PAB3D Parallel Efficiency vs. Nodes.

 

Conclusions

The current state of mass market hardware and software has allowed the emergence of a new class of supercomputers, the PC cluster. Moreover, these computers can be assembled with relative ease by end users or system administrators. It is clear that both the DEC Alpha 533 PC and the dual Pentium II 450 platform offer excellent single and multiprocessor performance that is highly scalable for CFD, making either architecture an excellent choice for clusters; in either case, the peripheral hardware and software setup is essentially the same. The cost per MFLOPS of the Los Alamos DEC cluster, "Avalon", which ranks 88th on the list of the most powerful supercomputers (Linpack), has been conservatively estimated to be a factor of ten lower than that of an SGI Origin 2000. Initial tests using the MPI version of PAB3D show that 3 DEC Linux nodes actually run 8% faster than 3 Origin 2000 nodes. A similar penalty for sharing memory within a box was seen in the dual Pentium machines: the dual Pentium ran at an efficiency of 84%, while two single Pentium machines ran at 98%. Apparently, the wait time involved in sharing memory is greater than the time required for network communication, and more testing is under way to determine how this ratio changes with problem size. In experiments with grid coarseness and the number of processors on the Coral cluster, the inefficiency inherent in scaling up the number of processors was exaggerated for coarse grids. However, the performance for the fine grid case, which still had a fairly low ratio of interior to boundary cells of 4.1, remained excellent at 86% on 24 nodes.

Another key benefit of the cluster concept is its scalability and reusability. The user does not need to commit to a room full of computers, but can have them if they are needed. A 16 node cluster can be stacked on a single desk and powered from a standard 20 amp office circuit, see Figure 5. Rack mounting is also an option; up to 12 nodes can be mounted in a single 42 inch high cabinet, see Figure 6. As the computers age, the nodes can easily be swapped for newer units, and the old nodes can serve as an upgrade or supplement to the engineering workstation pool. PC cluster technology is therefore truly the "faster, better, cheaper" way to practice CFD given the current state of the art.

References

 

  1. Sterling, T., Cwik, T., Becker, D., Salmon, J., Warren, M., and Nitzberg, B., "An assessment of Beowulf-class computing for NASA requirements: Initial findings from the first NASA workshop on Beowulf-class clustered computing," Proceedings, IEEE Aerospace Conference, March 21-28 1998.
  2. Sterling, T., Becker, D.J., Savarese, D., Dorband, J.E., Ranawak, U.A., and Packer, C.V., "Beowulf: A Parallel Workstation for Scientific Computation," Proceedings, International Conference on Parallel Processing, 1995.
  3. Highly-parallel Integrated Virtual Environment (HIVE) Home page, http://newton.gsfc.nasa.gov/thehive/ .
  4. Avalon: T-CNLS Dec Alpha Cluster Home page, http://cnls.lanl.gov/avalon/.
  5. Strauss, D., "Linux Helps Bring Titanic to Life," Linux Journal, number 46, Feb 1998, p. 2494. See also http://www.ssc.com/lj/issue46/2494.html .

Figure 1:  Back and front views of a portion of NASA Goddard's Intel PC rack mounted cluster, HIVE.

 

Figure 2:  NASA Goddard's PC 512 GB bulk data server.

 

Figure 3:  Back and front views of Los Alamos's Avalon DEC Alpha cluster.

 

Figure 4:  Digital Domain's 160 node DEC Alpha cluster, "Render Ranch", see Strauss[5].

 

Figure 5:  Clemson University's 16 node PC (200 MHz) cluster--the ultimate desktop computer.

 

Figure 6:  10 node rack mount system with integrated keyboard, video and mouse switch. (42 inches high by 19 inches wide).