Efficient parallel I/O with ZioLib in CCSM

Tseng, Y. H. and Ding, C. (2008), 'Efficient parallel I/O in Community Atmosphere Model (CAM)', International Journal of High Performance Computing Applications, 22, 206. DOI: 10.1177/1094342008090914

Introduction

High-resolution, century- or millennium-long global climate simulations using the NCAR Community Climate System Model (CCSM) generate a tremendous amount of data. Efficient I/O is a crucial factor for such large-scale simulations on massively parallel machines, yet CCSM currently performs sequential I/O through a single processor. This will soon become a major bottleneck for even higher-resolution simulations. Here we implement parallel I/O in the Community Atmosphere Model (CAM), a component of CCSM, by combining the Parallel NetCDF (PnetCDF) library with the ZioLib algorithm, to provide efficient and flexible parallel I/O.

In a distributed-memory parallel environment, most applications rely on a serial I/O strategy, in which the global array is gathered onto a single processor and then written out to a file. The I/O performance of this approach is largely limited by the bandwidth of a single PE. Even when parallel I/O is used, satisfactory parallel scaling is not always observed, since parallel I/O rates can depend sensitively on the parallel decomposition. The present approach, which combines the features of PnetCDF and ZioLib, retains flexibility and achieves the maximum I/O rate for all physical decompositions. Our tests show that it greatly improves CAM I/O performance and removes the I/O bottleneck. The maximum speed-up scales roughly with the domain size.
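As a point of reference, the following is a minimal sketch of the gather-then-write pattern described above (C with MPI; the routine name, the equal-sized local blocks, and the raw binary output are illustrative assumptions, not code from CAM):

/* Serial I/O pattern: every rank sends its local block to rank 0, which
 * writes the assembled global array.  Write bandwidth is limited to that
 * of a single PE, and rank 0 must hold the whole global array in memory. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

void gather_and_write(const double *local, int nlocal, const char *path,
                      MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    double *global = NULL;
    if (rank == 0)
        global = malloc((size_t)nlocal * nprocs * sizeof(double));

    /* All data funnels through rank 0. */
    MPI_Gather(local, nlocal, MPI_DOUBLE,
               global, nlocal, MPI_DOUBLE, 0, comm);

    if (rank == 0) {
        FILE *fp = fopen(path, "wb");      /* single-PE write */
        fwrite(global, sizeof(double), (size_t)nlocal * nprocs, fp);
        fclose(fp);
        free(global);
    }
}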

In climate models, a field in CPU-resident memory is often in one index order but stored in a disk file in another. For example, history data for the dynamic variables of NCAR's Community Atmosphere Model (CAM) are held in (longitude, height, latitude) order in memory but must be written out to a file in (longitude, latitude, height) order. Changing index orders complicates a parallel I/O implementation and slows down I/O.
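For illustration, the sketch below shows the kind of index reordering involved, written in C row-major notation where the slowest-varying index comes first (the routine name and dimension arguments are assumptions, not CAM code):

/* Reorder a field from its in-memory order (lat, lev, lon), i.e. longitude
 * fastest and latitude slowest, to the on-disk order (lev, lat, lon),
 * i.e. longitude fastest and level slowest. */
void reorder_xzy_to_xyz(const double *in, double *out,
                        int nlon, int nlev, int nlat)
{
    for (int k = 0; k < nlev; k++)
        for (int j = 0; j < nlat; j++)
            for (int i = 0; i < nlon; i++)
                /* in  is indexed (lat, lev, lon); out is (lev, lat, lon) */
                out[(k * nlat + j) * nlon + i] =
                    in[(j * nlev + k) * nlon + i];
}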

Parallel NetCDF library with the ZioLib algorithm

NetCDF is a simple and widely used file format. The PnetCDF interface provides efficient parallel access to a single netCDF file. However, depending on the array's index order, the I/O rate may not be optimal and the output may not be consistent with that of serial I/O. ZioLib, on the other hand, can flexibly remap a distributed array and output data in parallel. Our strategy is therefore to remap a distributed field into a Z-decomposition on a subset of processors (the "I/O staging processors") using the ZioLib algorithm, and then write it to a disk file using PnetCDF (see the figure below). In this Z-decomposition, the data layout of the remapped array in the staging processors' memory is identical to the layout on disk, so only block data transfers occur during parallel I/O, achieving maximum efficiency.


Writing the global field of a distributed array a(X,Z,Y) to a disk file in (X,Y,Z) index order using three I/O staging processes.
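The write stage of this strategy might look roughly as follows with the PnetCDF C interface, assuming a ZioLib-style remap has already placed a contiguous slab of vertical levels, in on-disk order, on each staging rank (the number of staging ranks, the variable and dimension names, and the file name are illustrative assumptions, not the actual CAM implementation; error checking is omitted):

#include <mpi.h>
#include <pnetcdf.h>

#define NSTAGE 3          /* number of I/O staging processors (assumed) */

void staged_write(const double *slab, MPI_Offset z0, MPI_Offset nzl,
                  MPI_Offset nz, MPI_Offset ny, MPI_Offset nx,
                  MPI_Comm world)
{
    int rank;
    MPI_Comm_rank(world, &rank);

    /* Only the staging processors open the file and perform I/O. */
    MPI_Comm iocomm;
    int is_stager = (rank < NSTAGE);
    MPI_Comm_split(world, is_stager ? 0 : MPI_UNDEFINED, rank, &iocomm);
    if (!is_stager) return;

    int ncid, dimids[3], varid;
    ncmpi_create(iocomm, "history.nc", NC_CLOBBER | NC_64BIT_OFFSET,
                 MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "lev", nz, &dimids[0]);   /* slowest on disk  */
    ncmpi_def_dim(ncid, "lat", ny, &dimids[1]);
    ncmpi_def_dim(ncid, "lon", nx, &dimids[2]);   /* fastest on disk  */
    ncmpi_def_var(ncid, "T", NC_DOUBLE, 3, dimids, &varid);
    ncmpi_enddef(ncid);

    /* Each staging rank writes its contiguous block of nzl levels starting
     * at level z0; because the remapped layout already matches the file
     * layout, this collective write is a pure block transfer. */
    MPI_Offset start[3] = { z0, 0, 0 };
    MPI_Offset count[3] = { nzl, ny, nx };
    ncmpi_put_vara_double_all(ncid, varid, start, count, slab);

    ncmpi_close(ncid);
    MPI_Comm_free(&iocomm);
}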

Advantages of PnetCDF with the ZioLib algorithm

The combined approach retains the flexibility of ZioLib's remapping for arbitrary physical decompositions, writes a single netCDF file, and achieves the maximum I/O rate because only block data transfers occur during the write, removing the single-processor I/O bottleneck.

Source code for testing on different platforms

This parallel algorithm has been tested on the IBM SP, IBM Blue Gene, Cray XT, and clusters. Click here to download the source code for testing the implementation (a gzipped tar file). Please read README.PIO for detailed installation instructions.

The input file can no longer be downloaded here.

Performance on CAM3.0/3.1 history I/O

In the current CAM, a field in CPU-resident memory is in one index order but is stored in a disk file in another. To optimize the I/O performance, we have implemented the PnetCDF with ZioLib algorithm for CAM3.0 and CAM3.1 history I/O and compared it with serial netCDF I/O (i.e., one staging processor) using 2 to 512 MPI tasks on the IBM SP at LBNL/NERSC. All dynamical cores (Eulerian, semi-Lagrangian, and finite-volume) are tested. The parallel implementation in CAM is illustrated below; P0-P4 are the I/O staging processors. In the current version, we use all nodes to obtain the maximum performance. See our poster presentation for further information and other benchmark tests.

Finite-volume D-resolution performance

The I/O speed-up is shown in the following figure. In the finite-volume D-resolution (576x361x26) test, parallel I/O speeds up writes by a factor of more than 13 with respect to single-PE I/O. In the original CAM, processor 0 gathers the distributed data, transposes the global array, and writes it to a file. The speed-up scales roughly with the global domain size.

Number of processors (tasks) vs. bandwidth.

The curves represent the number of processors per MPI task (M4 refers to 4 processors per MPI task).

Comments, Suggestions, Bug Reports, ...

We welcome your comments, suggestions, bug reports, etc. Please send them to Yu-heng Tseng (yhtseng@as.ntu.edu.tw), Helen He (yhe@lbl.gov), or Chris Ding (chqding@uta.edu).

Acknowledgements

This work is supported by the DOE SciDAC climate project and the NERSC Program.