Analytical Services & Materials, Inc.
Hampton, VA 23666
(757) 865-7093
November 2, 1998
Revised June 12, 2000
NASA Goddard Space Flight Center is currently host to two large clusters and several smaller clusters. "the HIVE" is a 64-node cluster with 2 Pentium Pro processors (200 MHz, 256K cache) per node and Fast Ethernet, see Figure 1. It is capable of a sustained rate of 7.3 GFLOPS running the PROMETHEUS computer code, which solves the Euler equations for compressible gas dynamics on a structured grid using the Piecewise-Parabolic Method. For a 1997 system price of $210,000, the cost per sustained MFLOPS is $29. The second large cluster at NASA Goddard is a file server consisting of 100 Pentium Pro processors at 166 MHz (2 per node) with a total disk space of 515 GB, see Figure 2.
Very recently, Michael S. Warren of Los Alamos National Laboratory assembled a 140-node DEC Alpha (533 MHz, 2 MB cache) cluster constructed entirely from commodity personal computer technology and freely available software, for a cost of $313,000 including on-site labor, see Figure 3. The machine performs at a peak of 47.7 GFLOPS (Linpack) and sustains 17.6 GFLOPS on a gravitational simulation code. As a reference, CFD codes typically run from 325 to 480 MFLOPS per processor on a Cray C-90.
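The price/performance figures quoted for the two clusters can be checked with a few lines of arithmetic. The sketch below uses only the costs and sustained rates stated in the text; the function and variable names are illustrative, and the Los Alamos figure is derived here rather than quoted from the source.

```python
# Price/performance from the costs and sustained rates quoted in the text.
# These are sustained application rates, not peak (Linpack) rates.
clusters = {
    "the HIVE (PROMETHEUS)": {"cost_usd": 210_000, "sustained_gflops": 7.3},
    "Los Alamos Alpha cluster (gravitational code)": {
        "cost_usd": 313_000,
        "sustained_gflops": 17.6,
    },
}


def cost_per_mflops(cost_usd, sustained_gflops):
    """Dollars per sustained MFLOPS (1 GFLOPS = 1000 MFLOPS)."""
    return cost_usd / (sustained_gflops * 1000.0)


for name, figures in clusters.items():
    print(f"{name}: ${cost_per_mflops(**figures):.0f} per MFLOPS")
```

The two rates come from different application codes, so the comparison is only indicative.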
Another notable DEC Alpha cluster is the 160-node cluster at Digital Domain, which was constructed in 1996 to render scenes for the film Titanic, see Figure 4. Unlike the other clusters discussed so far, the Digital Domain cluster mixed Windows NT and Linux operating systems within the cluster to meet application software requirements. However, for scientific applications for which source code is available, Linux offers the best performance and lowest cost.
Although the "Beowulf" concept of commodity supercomputers was founded on Intel PCs, the availability of the powerful 64-bit DEC Alpha processor on standard PC motherboards has made clusters based on the Los Alamos and Digital Domain architecture a viable alternative to the Intel architecture. The exact chip and architecture that results in the best computing value changes constantly, so the choice of which CPU to buy must be reevaluated at the time of purchase. In October of 1998, the DEC Alpha 533 was the best value, and for this reason AS&M has chosen it to build a prototype cluster.
Of all the machines listed in the tables, currently only two should be considered as choices for constructing a new cluster: the dual Pentium II 450 (not the Xeon) and the DEC Alpha 533. The low price of the dual Pentium relative to the single version puts its cost per MFLOPS at less than 4% more than that of the DEC Alpha. Thus, the choice is a matter of user preference for the platform and available software. Currently, the availability and quality of compilers for the Alpha/Linux platform are poor, so codes must be compiled in the DEC/Unix environment and brought to Linux. This is possible because DEC/Unix is binary compatible with DEC/Linux if the code is compiled with the non-shared library option. For Intel/Linux, the compilers are well established and produce code that runs as fast as or faster than the Intel/NT versions.
In summary, from the following tables it is clear that the DEC Alpha 533 PC and dual Pentium II 450 platforms offer excellent single and multiprocessor performance that is highly scalable for the average CFD problem, in which there are many more interior cells than boundary cells. Because of the high interior to boundary cell ratio, multiprocessor performance is extremely efficient. More tests are currently underway to correlate code performance with the ratio of interior cells to boundary cells, so that a point of diminishing returns can be established for adding nodes to a problem of a given size.
| Machine | MFLOPS | time/cell/iteration (microseconds) | time/cell/iteration (relative to C-90) | approx. cost (thousands of $) |
|---|---|---|---|---|
| Cray C-90 Single Processor (f90) | 325 | 15.2 | 1 | 1,000 |
| DEC Alpha 21164 533 MHz (Linux) | 77 | 64 | 4.2 | 2.5 |
| HP 9000/780 180 MHz | 72 | 68 | 4.5 | 20 |
| Sun Ultra-2 200 MHz | 49 | 100 | 6.6 | 20 |
| DEC Alpha 21164 333 MHz (Digital UNIX) | 48 | 104 | 6.8 | 20 |
| Pentium II 450 MHz (Linux) | 45 | 109 | 7.2 | 4 |
| SGI R10000 195 MHz Indigo 2 | 41 | 122 | 8.0 | 18 |
| Pentium II 300 MHz (Linux/NT 4) | 34 | 148 | 9.7 | 2 |
| SGI R10000 175 MHz O2 | 31 | 160 | 10.5 | 10 |
| Pentium 120 MHz (Win 95) | 7 | 700 | 46.0 | 0.5 |
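To make the value comparison concrete, the sketch below ranks the machines in the table above by dollars per MFLOPS. It uses only the tabulated MFLOPS and approximate costs; the helper name is illustrative. The table lists single-processor systems only, so it does not reproduce the dual Pentium II 450 comparison made in the text, whose price is not tabulated.

```python
# (machine, MFLOPS, approximate cost in thousands of dollars), from the table above
table = [
    ("Cray C-90 Single Processor (f90)", 325, 1000),
    ("DEC Alpha 21164 533 MHz (Linux)", 77, 2.5),
    ("HP 9000/780 180 MHz", 72, 20),
    ("Sun Ultra-2 200 MHz", 49, 20),
    ("DEC Alpha 21164 333 MHz (Digital UNIX)", 48, 20),
    ("Pentium II 450 MHz (Linux)", 45, 4),
    ("SGI R10000 195 MHz Indigo 2", 41, 18),
    ("Pentium II 300 MHz (Linux/NT 4)", 34, 2),
    ("SGI R10000 175 MHz O2", 31, 10),
    ("Pentium 120 MHz (Win 95)", 7, 0.5),
]


def dollars_per_mflops(mflops, cost_thousands):
    """Approximate system cost divided by measured MFLOPS."""
    return cost_thousands * 1000.0 / mflops


# Sort from best value (lowest dollars per MFLOPS) to worst.
ranked = sorted(table, key=lambda row: dollars_per_mflops(row[1], row[2]))
for name, mflops, cost in ranked:
    print(f"{name:40s} ${dollars_per_mflops(mflops, cost):7.0f} per MFLOPS")
```

On these figures the DEC Alpha 533 under Linux comes out at roughly $32 per MFLOPS, the lowest in the table.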
| Machine | MFLOPS | time/cell/iteration (microseconds) | time/cell/iteration (relative to C-90) |
|---|---|---|---|
| Cray C-90 Single Processor (f90) | 370 | 20.5 | 1 |
| DEC Alpha 21164 533 MHz (Linux) | 103 | 74 | 3.6 |
| HP 9000/780 180 MHz | 79 | 96 | 4.7 |
| SGI Origin 2000 R10k 195 MHz | 78 | 97 | 4.7 |
| DEC Alpha 21164 333 MHz (Digital UNIX) | 60 | 127 | 6.2 |
| Sun Ultra-2 200 MHz | 59 | 129 | 6.3 |
| SGI R10000 195 MHz Indigo 2 | 57 | 134 | 6.5 |
| Pentium II 300 MHz | 47 | 160 | 7.8 |
| SGI R10000 175 MHz O2 | 44 | 172 | 8.4 |
| Machine | time/cell/iteration (microseconds) | time/cell/iteration (relative to Alpha Linux) |
|---|---|---|
| DEC Alpha 21164 533 MHz (Linux) | 152 | 1 |
| SGI R10000 195 MHz Octane | 249 | 1.6 |
| DEC Alpha 21164 533 MHz (Digital UNIX) | 289 | 1.9 |
| Machine | MFLOPS | time/cell/iteration (microseconds) | time/cell/iteration (relative to C-90) |
|---|---|---|---|
| Cray C-90 Single Processor | 426 | 8 | 1 |
| DEC Alpha 21164 533 MHz (Linux) | 133 | 25 | 3.2 |
| SGI R10000 195 MHz Octane | 123 | 27 | 3.4 |
| DEC Alpha 21164 533 MHz (Digital UNIX) | 96 | 35 | 4.4 |
| SGI R10000 195 MHz Indigo 2 | 72 | 47 | 5.9 |
| Machine | MFLOPS | time/cell/iteration (microseconds) | time/cell/iteration (relative to C-90) |
|---|---|---|---|
| Cray C-90 Single Processor (f90) | 480 | 7.8 | 1 |
| DEC Alpha 21164 533 MHz (Linux) | 104 | 36 | 4.6 |
| SGI R10000 195 MHz Octane | 81 | 46 | 5.9 |
| DEC Alpha 21164 533 MHz (Digital UNIX) | 80 | 47 | 6.0 |
| Machine | MFLOPS | time/cell/iteration (microseconds) | time/cell/iteration (relative to C-90) |
|---|---|---|---|
| Cray C-90 Single Processor (f90) | 408 | 7.4 | 1 |
| DEC Alpha 21164 533 MHz (Linux) | 89 | 34 | 4.6 |
| DEC Alpha 21164 533 MHz (Digital UNIX) | 69 | 44 | 5.9 |
| SGI R10000 195 MHz Octane | 64 | 47 | 6.4 |
| Machine | CPUs | MFLOPS | total time/cell/iteration (microseconds) | percent speed-up | total time/cell/iteration (relative to single C-90) |
|---|---|---|---|---|---|
| Cray C-90 (f90) | 1 | 360 | 12.0 | NA | 1 |
| DEC Alpha 21164 533 MHz (Linux) | 3 | 225 | 19.5 | 298 | 1.6 |
| SGI Origin 2000 R10k 195 MHz | 3 | 180 | 23.5 | 325 | 2.0 |
| SGI R10k 195 MHz Octane | 3 | 133 | 32.6 | 275 | 2.7 |
| DEC Alpha 21164 533 MHz (Linux) | 1 | 75 | 58.1 | NA | 4.8 |
| SGI Origin 2000 R10k 195 MHz | 1 | 56 | 76.4 | NA | 6.4 |
| SGI R10k 195 MHz Octane | 1 | 48 | 89.8 | NA | 7.5 |
| Sun Ultra-2 200 MHz | 1 | 43 | 99.2 | NA | 8.3 |
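The "percent speed up" column is not defined in the text. A minimal sketch, assuming it is the ratio of the single-CPU time to the multi-CPU time expressed in percent, reproduces the tabulated values for the three machines that have both a 1-CPU and a 3-CPU entry in the table above. The function name is illustrative.

```python
def percent_speed_up(single_cpu_time, multi_cpu_time):
    """Speed-up of the multi-CPU run relative to one CPU, in percent.

    Assumed definition: 100 * (1-CPU time) / (n-CPU time).
    """
    return 100.0 * single_cpu_time / multi_cpu_time


# (machine, 1-CPU time, 3-CPU time) in microseconds per cell per iteration
runs = [
    ("DEC Alpha 21164 533 MHz (Linux)", 58.1, 19.5),
    ("SGI Origin 2000 R10k 195 MHz", 76.4, 23.5),
    ("SGI R10k 195 MHz Octane", 89.8, 32.6),
]
for name, t1, t3 in runs:
    print(f"{name}: {percent_speed_up(t1, t3):.0f} percent")
```

Under this definition, 300 percent corresponds to ideal linear scaling on three CPUs.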
| Machine | CPUs | MFLOPS | total time/cell/iteration (microseconds) | percent speed-up | total time/cell/iteration (relative to single C-90) |
|---|---|---|---|---|---|
| Cray C-90 (f90) | 1 | 320 | 12 | NA | 1 |
| DEC Alpha 21164 533 MHz (Linux) | 3 | 229 | 17 | 282 | 1.4 |
| SGI Origin 2000 R10k 195 MHz | 3 | 213 | 17.5 | 291 | 1.5 |
| SGI R10k 195 MHz Octane | 3 | 167 | 23 | 278 | 1.9 |
| DEC Alpha 21164 533 MHz (Linux) | 1 | 80 | 48 | NA | 4.0 |
| SGI Origin 2000 R10k 195 MHz | 1 | 74 | 51 | NA | 4.3 |
| SGI R10k 195 MHz Octane | 1 | 60 | 64 | NA | 5.3 |
| Sun Ultra-2 200 MHz | 1 | 41 | 93 | NA | 7.8 |
| Machine | CPUs | MFLOPS | total time/cell/iteration (microseconds) | percent speed-up | total time/cell/iteration (relative to single C-90) |
|---|---|---|---|---|---|
| Cray C-90 (f90) | 1 | 350 | 8 | NA | 1 |
| DEC Alpha 21164 533 MHz (Linux) | 3 | 241 | 11.6 | 293 | 1.45 |
| SGI Origin 2000 R10k 195 MHz | 3 | 227 | 12.3 | 292 | 1.54 |
| SGI R10k 195 MHz Octane | 3 | 175 | 16 | 287 | 2.0 |
| Pentium II 450 MHz (Linux) | 3 | 141 | 19.8 | 252 | 2.5 |
| DEC Alpha 21164 533 MHz (Linux) | 1 | 81 | 34 | NA | 4.3 |
| SGI Origin 2000 R10k 195 MHz | 1 | 78 | 36 | NA | 4.5 |
| SGI R10k 195 MHz Octane | 1 | 60 | 46 | NA | 5.8 |
| Pentium II 450 MHz (Linux) | 1 | 56 | 50 | NA | 6.3 |
| Sun Ultra-2 200 MHz | 1 | 48 | 58 | NA | 7.3 |
Results for each grid resolution are tabulated and plotted against the number of nodes, see Table 7 and Plots 1 and 2. It is clear that both the number of processors and the coarseness of the grid (in terms of the ratio of the number of internal cells to boundary cells) affect the efficiency of the parallelization. In the plots of speed-up and efficiency versus number of nodes, the curvature increases with grid coarseness, indicating that as the grid is coarsened, the penalty for adding nodes grows more nonlinearly. The nonlinearity is a reflection of the constant time delay associated with any communication call. The efficiency plots for common grids, see Plot 2, show the penalty of increased network communication as blocks are spread across machines.
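The effect of a constant per-call delay can be illustrated with a toy model. This is a sketch under stated assumptions, not a fit to the measurements: each iteration on n nodes is assumed to cost the single-node compute time divided by n, plus one fixed communication delay. The function name and all numbers are hypothetical.

```python
def parallel_efficiency(n_nodes, compute_time, call_delay):
    """Efficiency under a toy model with a fixed communication delay.

    Assumed model: one iteration on n nodes takes
    compute_time / n_nodes + call_delay, with no delay on a single node.
    Efficiency is the speed-up divided by the node count.
    """
    if n_nodes == 1:
        return 1.0
    parallel_time = compute_time / n_nodes + call_delay
    speed_up = compute_time / parallel_time
    return speed_up / n_nodes


# Hypothetical numbers: the coarse grid has 1/8 of the fine grid's work,
# but each communication call costs the same fixed delay.
call_delay = 1.0
for label, compute_time in [("fine", 800.0), ("coarse", 100.0)]:
    effs = [parallel_efficiency(n, compute_time, call_delay) for n in (2, 8, 24)]
    print(label, [f"{100 * e:.1f}%" for e in effs])
```

In this model the efficiency reduces to 1 / (1 + n * call_delay / compute_time), so the same delay costs proportionally more as the work per node shrinks, which is the qualitative trend described above.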
Using the nitb parameter in PAB3D, boundary condition communication may be reduced to once every nitb iterations. The effect of reducing the communication by a factor of 2, 3 and 4 was tested on the worst case of 24 nodes on the very coarse grid. The results, shown in Table 8, indicate that by reducing the communication by a factor of 4, the efficiency improves by another 25 percentage points (a 41% relative increase) to nearly that of the fine grid case. To decide whether or not to reduce the boundary communication, one would have to weigh its convergence penalty. This penalty will depend on the physics of the problem as well as the grid structure.
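The idea of exchanging boundary data only every nitb iterations can be sketched as a simple loop. This is not PAB3D code: only the name `nitb` comes from the text, and the function, the iteration count, and the placeholder comments are hypothetical.

```python
def count_boundary_exchanges(n_iterations, nitb):
    """Count iterations that perform a boundary exchange when communication
    happens only on every nitb-th iteration (illustrative, not PAB3D code)."""
    exchanges = 0
    for iteration in range(1, n_iterations + 1):
        # The interior cell update would run here on every iteration.
        if iteration % nitb == 0:
            # Boundary data would be sent and received here.
            exchanges += 1
    return exchanges


for nitb in (1, 2, 3, 4):
    print(f"nitb={nitb}: {count_boundary_exchanges(1200, nitb)} exchanges")
```

Between exchanges each block would run on stale boundary values, which is the source of the convergence penalty mentioned above.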
In summary, the PAB3D results from the Coral cluster demonstrate that as the number of nodes increases, the communications overhead increases nearly linearly for blocks having a volume to surface (v/s) cell ratio greater than or equal to 4. Furthermore, as the blocks are coarsened to a v/s ratio of 1.0, the communications overhead increases nonlinearly with the number of CPUs because of the basic overhead of the communication calls. Very coarse grids, which are typically used early in a computation to speed convergence, may benefit significantly from incremental boundary condition communication.
| Number of Intel Pentium II 400 MHz Nodes | time/cell/iteration (microseconds) | Speed-Up Factor | Percent Efficiency |
|---|---|---|---|
| 1 | 104.4 | 1.00 | 100.0 |
| 2 | 53.3 | 1.96 | 97.9 |
| 3 | 35.6 | 2.93 | 97.8 |
| 4 | 27.6 | 3.78 | 95.1 |
| 6 | 18.7 | 5.60 | 93.6 |
| 8 | 14.1 | 7.40 | 92.3 |
| 12 | 9.7 | 10.8 | 90.1 |
| 16 | 7.4 | 14.1 | 88.4 |
| 24 | 5.1 | 20.6 | 86.1 |
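The speed-up and efficiency columns appear to follow from the timing column. A minimal sketch, assuming the speed-up factor is the one-node time divided by the n-node time and the efficiency is that factor divided by n, reproduces the first rows of the table above. The function name is illustrative.

```python
def speed_up_and_efficiency(t_single, t_parallel, n_nodes):
    """Return (speed-up factor, percent efficiency) from times per cell per iteration."""
    factor = t_single / t_parallel
    return factor, 100.0 * factor / n_nodes


t_single = 104.4  # one-node time from the table above, in microseconds
for n_nodes, t_parallel in [(2, 53.3), (3, 35.6)]:
    factor, eff = speed_up_and_efficiency(t_single, t_parallel, n_nodes)
    print(f"{n_nodes} nodes: speed-up {factor:.2f}, efficiency {eff:.1f}%")
```

For larger node counts the recomputed values differ slightly from the tabulated ones, presumably because the tabulated times are rounded to one decimal place.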
| Number of Intel Pentium II 400 MHz Nodes | time/cell/iteration (microseconds) | Speed-Up Factor | Percent Efficiency |
|---|---|---|---|
| 1 | 98.7 | 1.00 | 100.0 |
| 2 | 51.3 | 1.92 | 95.9 |
| 3 | 34.4 | 2.87 | 95.7 |
| 4 | 26.6 | 3.72 | 93.1 |
| 6 | 18.4 | 5.36 | 89.3 |
| 8 | 14.0 | 7.07 | 88.4 |
| 12 | 9.7 | 10.1 | 84.4 |
| 16 | 7.6 | 12.9 | 80.4 |
| 24 | 5.4 | 18.3 | 76.3 |
| Number of Intel Pentium II 400 MHz Nodes | time/cell/iteration (microseconds) | Speed-Up Factor | Percent Efficiency |
|---|---|---|---|
| 1 | 107.0 | 1.00 | 100.0 |
| 2 | 57.5 | 1.86 | 93.0 |
| 3 | 38.7 | 2.76 | 92.2 |
| 4 | 30.7 | 3.49 | 88.0 |
| 6 | 21.6 | 4.96 | 83.8 |
| 8 | 16.7 | 6.41 | 80.4 |
| 12 | 12.4 | 8.64 | 73.3 |
| 16 | 10.0 | 10.7 | 68.0 |
| 24 | 7.5 | 14.2 | 59.7 |
| Boundary Condition Communication Increment (nitb) | Parallel Efficiency | Speed-Up Factor for 24 CPUs |
|---|---|---|
| 1 | 59.7 % | 14.3 |
| 2 | 72.4 % | 17.4 |
| 3 | 80.8 % | 19.4 |
| 4 | 84.4 % | 20.3 |
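The figures in the table above are mutually consistent, as a short check shows. The sketch assumes the speed-up factor equals the parallel efficiency times the 24 CPUs, and recomputes the absolute and relative efficiency gain between nitb = 1 and nitb = 4. Variable and function names are illustrative.

```python
N_CPUS = 24
# nitb -> parallel efficiency in percent, from the table above
efficiency_percent = {1: 59.7, 2: 72.4, 3: 80.8, 4: 84.4}


def speed_up(eff_percent, n_cpus=N_CPUS):
    """Speed-up factor implied by a parallel efficiency on n_cpus CPUs."""
    return eff_percent / 100.0 * n_cpus


for nitb, eff in efficiency_percent.items():
    print(f"nitb={nitb}: speed-up {speed_up(eff):.1f}")

base, best = efficiency_percent[1], efficiency_percent[4]
absolute_gain = best - base                     # percentage points
relative_gain = 100.0 * absolute_gain / base    # percent of the nitb=1 value
print(f"gain: {absolute_gain:.1f} points, {relative_gain:.0f}% relative")
```

The gain works out to 24.7 percentage points, or about 41% relative to the nitb = 1 efficiency.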
Results for each grid resolution are plotted against the number of nodes in Plots 3 and 4. As in the 3D example, the number of processors and the coarseness of the grid affect the efficiency of the parallelization. Unlike the wing case, the airfoil case has a much higher ratio of interior to boundary cells: 32, 16 and 8 as opposed to 4, 2 and 1. However, because of the small number of interior cells involved in each iteration, the relative communication overhead is much larger, which is again due to the delays involved in initiating each communication call.
In summary, the PAB3D airfoil results demonstrate that even for very small 2D blocks, high parallel efficiencies can still be obtained, provided that the ratio of interior to boundary cells is on the order of 10.
Another key benefit of the cluster concept is scalability and reusability. The user does not need to commit to a room full of computers, but can have them if they are needed. A 16-node cluster can be stacked on a single desk and powered from a standard 20 amp office circuit, see Figure 5. Rack mounting is also an option: up to 12 nodes can be mounted in a single 42-inch-high cabinet, see Figure 6. As the computers age, the nodes can easily be swapped with newer units, and the old nodes can serve as an upgrade or supplement to the engineering workstation pool. Therefore, PC cluster technology is truly the "faster, better, cheaper" way to practice CFD in the current state of the art.