ECMWF Parallel I/O - FDB

Added by George Mozdzynski about 11 years ago

I thought it might be useful to describe an existing example of parallel I/O.
There may be better approaches to take in the future but today this is what we
use at ECMWF, and it appears to work well with little overhead.

We have measured the I/O overhead at about 1 percent of the wall clock time,
sustaining 7.5 GBytes/sec for a T1279 forecast model experiment running hourly
post-processing.

Of course, besides the actual I/O time, time is also spent in several parts of
the IFS to interpolate data, transform it (e.g. spectral, real), gather it into
complete fields (MPI) and encode it to grib. All of this has been measured at about
7-8 percent of job wall clock time for the same forecast model experiment.
Surprisingly, the actual I/O time is not really a problem!

At ECMWF, parallel I/O in our IFS applications (forecast models, 4D-Var, EPS, etc.)
is achieved by what we call the Fields Data Base (FDB).

A presentation (from 2007) describing the basic features of the FDB software is attached.
A key feature of the FDB is that each task can write a field (typically in grib format)
to the FDB and this occurs in parallel with any number of tasks. This does not
involve any tasks or nodes other than those that perform computation.

The per-task FDB buffers are flushed to the GPFS per-node disk cache either explicitly at
the end of a parallel execution, at the end of a post-processing time step, or when some
default threshold (which can be overridden by an environment variable) is reached. This
ensures data is moved to the GPFS disk cache in large multi-MB blocks. Once such a flush
has taken place, fields are immediately available, for example, for product generation.
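
To make the flushing rule concrete, here is a minimal Python sketch of a per-task buffer
that writes out one large block when a threshold is reached or when flush is called
explicitly. This is not the FDB API: the class name TaskBuffer, the sink callable, the
4 MB default and the environment variable SKETCH_FLUSH_BYTES are all made up for
illustration.

```python
import os

class TaskBuffer:
    """Hypothetical per-task write buffer (illustration only, not the FDB API)."""

    def __init__(self, sink, threshold_bytes=None):
        # sink is any callable that accepts one large bytes block,
        # standing in for the write to the node's disk cache.
        self.sink = sink
        if threshold_bytes is None:
            # SKETCH_FLUSH_BYTES is an invented variable name; 4 MB is an assumed default.
            threshold_bytes = int(os.environ.get("SKETCH_FLUSH_BYTES", 4 * 1024 * 1024))
        self.threshold = threshold_bytes
        self.chunks = []
        self.size = 0

    def write_field(self, data):
        """Buffer one encoded field; flush once the threshold is reached."""
        self.chunks.append(data)
        self.size += len(data)
        if self.size >= self.threshold:
            self.flush()

    def flush(self):
        """Hand everything buffered so far to the sink as a single block."""
        if self.chunks:
            self.sink(b"".join(self.chunks))
            self.chunks = []
            self.size = 0
```

A task would call write_field for every field it produces and call flush itself at the
end of a post-processing step, so small fields never reach the file system one by one.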

The layout of the FDB file structure is determined by the FDB configuration file (see the
attached example of such a Config file). This file contains a list of attributes that describe
the data (for each logical database). Attributes can be combined into groups that define which
ones form the directory name(s), the file names or the metadata (the so-called database index file).
Currently we have several logical databases: fdb, pdb (products database), ocean, seas,
ppdidb (dissemination requirements database), etc.

The configuration file can be easily changed, thus changing the FDB structure. The Config
file also allows us to choose the database file structure: we can put all the data we have
into a single file or, at the other extreme, have one file per field.
When accessing FDB data, applications do not need any knowledge of the FDB structure,
just the attribute names.
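
As a rough illustration of the attribute grouping, the Python sketch below turns a request
given purely by attribute names into a directory name, a file name and an index key. The
grouping in LAYOUT, the ":" separator and the function name locate are assumptions made
for this example and do not describe the real on-disk naming.

```python
# Assumed grouping for illustration: which attributes form the directory
# name, which form the file name, and which go into the index.
LAYOUT = {
    "directory": ["class", "stream", "domain", "expver", "date"],
    "file": ["type", "time", "number"],
    "index": ["step", "parameter", "levtype", "levelist"],
}

def locate(request, layout=LAYOUT, sep=":"):
    """Return (directory name, file name, index key) for a request.

    The caller only supplies attribute names and values; attributes that
    are absent from the request are skipped.
    """
    def join(names):
        return sep.join(str(request[n]) for n in names if n in request)

    return join(layout["directory"]), join(layout["file"]), join(layout["index"])

request = {
    "class": "od", "stream": "oper", "domain": "g", "expver": "0001",
    "date": "20120101", "type": "fc", "time": "1200",
    "step": 6, "parameter": 130, "levtype": "ml", "levelist": 91,
}
directory, filename, index_key = locate(request)
```

Changing LAYOUT changes where data ends up without touching the request, which is the
point of describing data by attributes only.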

Enough for now....

The Config is not a binary file, so I will paste it here:

FDBCONFIGURATION ;

%servers ; machines ; 5 {
ifsfdb ; fdb ; hpc ; 4055 ; 1801 ( /fdb, /fdb )
hpcfdb ; fdb ; hpc ; 4055 ; 1801 ( /fdb, /fdb )
serversfdb ; fdb ; servers ; 4055 ; 1801 ( /fdb, /fdb )
ecmwffdb ; fdb ; ecmwf ; 4055 ; 1801 ( /fdb, /fdb )
backupfdb ; fdb ; c1b ; 4055 ; 1801 ( /fdb, /fdb )
localfdb ; fdb ; c1a ; 4055 ; 1801 ( /fdb, /fdb )
c1afdb ; fdb ; c1a ; 4055 ; 1801 ( /fdb, /fdb )
c1bfdb ; fdb ; c1b ; 4055 ; 1801 ( /fdb, /fdb )
hpcffdb ; fdb ; hpcf ; 4055 ; 1801 ( /fdb, /fdb )
pdb ; pdb ; hpc ; 4055 ; 0 ( /prod_fdb, /prod_fdb )
lnxcluster ; fdb ; lnxcl ; 4055 ; 1801 ( /vol/fdb, /vol/fdb )
scores2 ; fdb ; scores2 ; 4055 ; 1801 ( /scores2/data/fdb, /scores2/data/fdb)
scores1 ; fdb ; scores1 ; 4055 ; 1801 ( /scores1/data/fdb, /scores1/data/fdb)
} ;
%filesystems ; {
< /ma_fdb , /ma_fdb >
< /fdb , /fdb >
< /fws1/lb/fdb , /fws1/lb/fdb >
< /fws2/lb/fdb , /fws2/lb/fdb >
< /fws3/lb/fdb , /fws3/lb/fdb >
< /prod_fdb , /prod_fdb >
< /era_fdb , /era_fdb >
< /c1a/ma_fdb , /c1a/ma_fdb >
< /c1a/fdb , /c1a/fdb >
< /c1a/prod_fdb , /c1a/prod_fdb >
< /fdb/msx , /fdb/msx >
< /ms_crit/fdb , /ms_crit/fdb >
< /emos_esuite/ma_fdb , /emos_esuite/ma_fdb >
< /s1a/emos_esuite/ma_fdb , /s1a/emos_esuite/ma_fdb >
< /s1b/emos_esuite/ma_fdb , /s1b/emos_esuite/ma_fdb >
< /var/tmp , /var/tmp >
}
%networkhosts ; {
hpc = ( c1a, hpcf, c1b )
servers = ( lnxcl )
ecmwf = ( hpc, servers )
}
%databasehosts ; {
hpcf : ( hpcf0205, hpcf0105 )
c1a : ( c1a-fdb1, c1a-fdb2, c1a-fdb3 )
c1b : ( c1b-fdb1, c1b-fdb2, c1b-fdb3 )
lnxcl : ( bee28, lclfdb )
scores2 : ( scores2 )
scores1 : ( scores1 )
}
%siblinghosts ; {
c1a > ( hpcf )
c1b > ( c1a )
hpcf > ( c1a )
lnxcl > ( c1a )
} {
%database ; fdb ; 1, 32 ; FieldsDatabase ; 88 , 0 ; PARALLEL=2 ; {
class ; 1 ; 1 ; TYPE=CHAR
stream ; 1 ; 2
domain ; 1 ; 3
expver ; 1 ; 4 ; LEN=4
date ; 1 ; 5 ; LEN=8
hdate ; 1 ; 6 ; LEN=8
refdate ; 1 ; 6 ; LEN=8

type        ; 2 ; 1
time ; 2 ; 2 ; LEN=4
number ; 2 ; 3 ; TYPE=INT
ensemble ; 2 ; 3 ; TYPE=INT
cluster ; 2 ; 3 ; TYPE=INT
probability ; 2 ; 3 ; TYPE=INT
quantile ; 2 ; 3 ; TYPE=INT
fnumber ; 2 ; 3 ; LEN=1
reference ; 2 ; 4 ; TYPE=INT
diagnostic ; 2 ; 4 ; LEN=1
iteration ; 2 ; 5 ; LEN=2
origin ; 2 ; 6
leg ; 2 ; 7 ; TYPE=INT ; STAT=READOVER
newtwo ; 2 ; 8
newthree ; 2 ; 9
step           ; 0 ; 1 ; TYPE=FLOAT
parameter ; 0 ; 2 ; TYPE=INT
representation ; 0 ; 3 ; TYPE=CHAR ; LEN=8; STAT=IGNORE
levtype ; 0 ; 4 ; LEN=8
levelist ; 0 ; 5 ; TYPE=FLOAT
channel ; 0 ; 6 ; TYPE=INT
frequency ; 0 ; 7 ; TYPE=INT
direction ; 0 ; 8 ; TYPE=INT
ident ; 0 ; 9 ; TYPE=INT

}
%database ; eps ; 1, 32 ; FieldsDatabase ; 88 , 0 ; PARALLEL=2 ; {
class ; 1 ; 1 ; TYPE=CHAR
stream ; 1 ; 2
domain ; 1 ; 3
expver ; 1 ; 4 ; LEN=4
date ; 1 ; 5 ; LEN=8
hdate ; 1 ; 6 ; LEN=8
refdate ; 1 ; 6 ; LEN=8

type        ; 2 ; 1
time ; 2 ; 2 ; LEN=4
number ; 2 ; 3 ; TYPE=INT
ensemble ; 2 ; 3 ; TYPE=INT
cluster ; 2 ; 3 ; TYPE=INT
probability ; 2 ; 3 ; TYPE=INT
quantile ; 2 ; 3 ; TYPE=INT
fnumber ; 2 ; 3 ; LEN=1
reference ; 2 ; 4 ; TYPE=INT
diagnostic ; 2 ; 4 ; LEN=1
iteration ; 2 ; 5 ; LEN=2
origin ; 2 ; 6
leg ; 2 ; 7 ; TYPE=INT ; STAT=READOVER
newtwo ; 2 ; 8
newthree ; 2 ; 9
step           ; 0 ; 1 ; TYPE=FLOAT
parameter ; 0 ; 2 ; TYPE=INT
representation ; 0 ; 3 ; TYPE=CHAR ; LEN=8; STAT=IGNORE
levtype ; 0 ; 4 ; LEN=8
levelist ; 0 ; 5 ; TYPE=FLOAT
channel ; 0 ; 6 ; TYPE=INT
frequency ; 0 ; 7 ; TYPE=INT
direction ; 0 ; 8 ; TYPE=INT

}
%database ; ppdidb ; RequirementsDatabase ; 32 , 0 ; {
stream ; 1 ; 1 ; LEN=4
domain ; 1 ; 2 ; LEN=1
expver ; 1 ; 3 ; LEN=4
mode ; 1 ; 4 ; TYPE=CHAR

file   ; 2 ; 1 ; LEN=7
time ; 2 ; 2 ; LEN=2
reqst ; 2 ; 3 ; LEN=4
job ; 2 ; 4 ; LEN=1
sequence ; 0 ; 1 ; TYPE=INT
step ; 0 ; 2 ; TYPE=INT
name ; 0 ; 3 ; TYPE=CHAR ; LEN=8
number ; 0 ; 4 ; TYPE=INT

}
%database ; pdb ; 1 , 32 ; ProductsDatabase ; 0 ; {
class ; 1 ; 1 ; LEN=2
stream ; 1 ; 2 ; LEN=4
domain ; 1 ; 3 ; LEN=1
expver ; 1 ; 4 ; LEN=4
date ; 1 ; 5 ; LEN=8
mode ; 1 ; 6 ; TYPE=CHAR

type ; 2 ; 1 ; LEN=2
time ; 2 ; 2 ; LEN=2
step ; 2 ; 3 ; LEN=4
country         ; 3 ;  1 ; LEN=2
streamid ; 3 ; 2 ; LEN=1
basedatetime ; 3 ; 3 ; LEN=8
validatdatetime ; 3 ; 4 ; LEN=8
fexpver ; 3 ; 5 ; LEN=4
dot ; 3 ; 6 ; LEN=1
destination ; 3 ; 7 ; LEN=3
priority ; 3 ; 8 ; LEN=2
opt ; 3 ; 9 ; LEN=1
format ; 3 ; 10 ; LEN=2
method ; 3 ; 11 ; LEN=3
jobno ; 3 ; 12 ; LEN=3

}
%database ; wpdb ; 1, 32 ; WeatherParameterDataBase ; 48 , 0 ; PARALLEL=2 ; {
class ; 1 ; 1 ; TYPE=CHAR
stream ; 1 ; 2 ; TYPE=CHAR
domain ; 1 ; 3 ; TYPE=CHAR
expver ; 1 ; 4 ; LEN=4
date ; 1 ; 5 ; LEN=8
mode ; 1 ; 6

type        ; 2 ; 1
time ; 2 ; 2 ; LEN=4
step ; 2 ; 3 ; TYPE=INT
destination ; 2 ; 4 ; LEN=3
disstream     ; 0 ;  1 ; LEN=2
north ; 0 ; 2 ; TYPE=INT
west ; 0 ; 3 ; TYPE=INT
parameter ; 0 ; 4 ; TYPE=INT
level ; 0 ; 5 ; TYPE=INT
stepincrement ; 0 ; 6 ; TYPE=INT
leveltype ; 0 ; 7 ; LEN=1

}
%database ; seas ; 1, 32 ; SeasonalDatabase ; 80 , 0 ; PARALLEL=2 ; {
class ; 1 ; 1 ; TYPE=CHAR
stream ; 1 ; 2
domain ; 1 ; 3
expver ; 1 ; 4 ; LEN=4
date ; 1 ; 5 ; LEN=8
hdate ; 1 ; 6 ; LEN=8
refdate ; 1 ; 6 ; LEN=8

type     ; 2 ; 1
time ; 2 ; 2 ; LEN=4
number ; 2 ; 3 ; TYPE=INT
quantile ; 2 ; 3 ; TYPE=INT
system ; 2 ; 4 ; TYPE=INT
method ; 2 ; 5 ; TYPE=INT
origin ; 2 ; 6
step           ; 0 ; 1 ; TYPE=FLOAT
parameter ; 0 ; 2 ; TYPE=INT
representation ; 0 ; 3 ; TYPE=CHAR ; LEN=8; STAT=IGNORE
levtype ; 0 ; 4 ; LEN=8
levelist ; 0 ; 5 ; TYPE=FLOAT
frequency ; 0 ; 6 ; TYPE=INT
direction ; 0 ; 7 ; TYPE=INT
fcmonth ; 0 ; 8 ; TYPE=INT
fcperiod ; 0 ; 9 ; TYPE=INT

}
%database ; ocean ; 1, 32 ; OceanDatabase ; 80 , 0 ; PARALLEL=2 ; {
class ; 1 ; 1
stream ; 1 ; 2
domain ; 1 ; 3
expver ; 1 ; 4 ; LEN=4
type ; 1 ; 5
refdate ; 1 ; 6 ; LEN=8
hdate ; 1 ; 6 ; LEN=8
origin ; 1 ; 7

product ; 2 ; 1
number ; 2 ; 2 ; TYPE=INT
system ; 2 ; 3 ; TYPE=INT
method ; 2 ; 4 ; TYPE=INT
range ; 2 ; 5 ; TYPE=INT
time ; 2 ; 6 ; LEN=4
date ; 2 ; 7 ; LEN=8
step       ; 0 ; 1  ; TYPE=FLOAT
parameter ; 0 ; 2 ; TYPE=INT
levtype ; 0 ; 3 ; LEN=8
section ; 0 ; 4 ; LEN=8
levelist ; 0 ; 5 ; TYPE=FLOAT
latitude ; 0 ; 6 ; TYPE=FLOAT
longitude ; 0 ; 7 ; TYPE=FLOAT
fcperiod ; 0 ; 8 ; TYPE=INT
}
}
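
For anyone who wants to look at the attribute lines programmatically, here is a small
Python sketch that splits lines such as "expver ; 1 ; 4 ; LEN=4" into their parts. Reading
the second column as a group number and the third as a position within that group is my
interpretation for this example, and the function names are hypothetical.

```python
def parse_attribute(line):
    """Split one attribute line on ';' into name, two integers and KEY=VALUE options."""
    parts = [p.strip() for p in line.split(";") if p.strip()]
    name, group, position = parts[0], int(parts[1]), int(parts[2])
    options = dict(p.split("=", 1) for p in parts[3:])
    return {"name": name, "group": group, "position": position, "options": options}

def group_attributes(lines):
    """Collect parsed attributes by their (assumed) group number, ordered by position."""
    groups = {}
    for line in lines:
        if not line.strip():
            continue  # skip the blank lines between groups
        attr = parse_attribute(line)
        groups.setdefault(attr["group"], []).append(attr)
    for attrs in groups.values():
        attrs.sort(key=lambda a: a["position"])  # stable, so ties keep file order
    return groups

sample = [
    "class ; 1 ; 1 ; TYPE=CHAR",
    "expver ; 1 ; 4 ; LEN=4",
    "",
    "type        ; 2 ; 1",
    "representation ; 0 ; 3 ; TYPE=CHAR ; LEN=8; STAT=IGNORE",
]
groups = group_attributes(sample)
```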

Replies (2)

RE: ECMWF Parallel I/O - FDB - Added by Luis Kornblueh about 11 years ago

I still did not get one point: how do you aggregate the grib records? Is that done in FDB or before on any of the compute nodes?

RE: ECMWF Parallel I/O - FDB - Added by George Mozdzynski about 11 years ago

We use MPI_Alltoallv to gather fields in parallel over as many MPI tasks as possible.
Today, for reasons of efficiency, this number of tasks is typically 91 (the number of
vertical levels in IFS today) for the multi-level fields, or a lower number for surface fields.
These complete fields are then grib encoded in parallel in the same tasks that received
them in the above operation. The final step is to do an FDB write call.
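
To show the data movement, here is a serial Python sketch of that gather. It does not use
MPI: the function alltoall is a plain-Python stand-in for the exchange, and the layout
(every task holding one piece of every level, with level k assembled on task k) is a
simplification I am assuming for the example.

```python
def alltoall(send):
    """Serial stand-in for an all-to-all exchange: recv[dst][src] = send[src][dst]."""
    ntasks = len(send)
    return [[send[src][dst] for src in range(ntasks)] for dst in range(ntasks)]

def gather_levels(local, nlevels):
    """Assemble complete levels from per-task pieces.

    local[task][level] is the list of grid-point values that task holds for
    that level. Level k is assembled on task k, so nlevels may not exceed
    the number of tasks; tasks beyond nlevels receive nothing.
    """
    ntasks = len(local)
    assert nlevels <= ntasks
    send = [
        [local[src][dst] if dst < nlevels else [] for dst in range(ntasks)]
        for src in range(ntasks)
    ]
    recv = alltoall(send)
    # Each receiving task concatenates the pieces in source-task order.
    return [[v for piece in recv[dst] for v in piece] for dst in range(nlevels)]

# 4 tasks, 2 levels, 2 grid points per task and level.
local = [[[100 * k + 2 * t, 100 * k + 2 * t + 1] for k in range(2)] for t in range(4)]
fields = gather_levels(local, 2)
```

After the exchange, each of the first nlevels tasks holds one complete field and can encode
and write it independently of the others.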
