ECMWF Parallel I/O - FDB - RAPS - DKRZ Project Management Service

ECMWF Parallel I/O - FDB

Added by Anonymous over 13 years ago

I thought it might be useful to describe an existing example of parallel I/O.
There may be better approaches to take in the future but today this is what we
use at ECMWF, and it appears to work well with little overhead.

We have measured the I/O overhead at about 1 percent of the wall clock time,
sustaining 7.5 GBytes/sec for a T1279 forecast model experiment running hourly
post-processing.

Of course, besides the actual I/O time, there is the time accumulated in several
parts of the IFS to either interpolate data, transform it (e.g. spectral, real),
gather into a complete field (MPI), encode to grib, and all this has been measured
at about 7-8 percent of job wall clock time for this same forecast model experiment.
Surprisingly, the actual I/O time is not really a problem!

At ECMWF, parallel I/O in our IFS applications (forecast models, 4D-Var, EPS, etc.)
is achieved by what we call the Fields Data Base (FDB).

A presentation (from 2007) describing the basic features of FDB s/w is attached.
A key feature of the FDB is that each task can write a field (typically in grib format)
to the FDB and this occurs in parallel with any number of tasks. This does not
involve any tasks or nodes other than those that perform computation.

The fdb per task buffers are flushed to GPFS per node disk cache either explicitly at the
end of a parallel execution or at the end of a post processing time-step or when some
default threshold (which can be overridden by environment variable) is reached, ensuring
data is moved to the GPFS disk cache in large multi MB blocks. Once such a flush has
taken place, fields are immediately available, for example, for products generation.

The layout of FDB file structure is determined by the FDB configuration file (see attached
example of such a Config file). This file contains list of attributes that describe the data
(for each logical database). Attributes can be adjoint into groups defining which ones form
directory name(s), file names or meta data (so called database index file).
Currently we have several logical databases fdb, pdb (products database), ocean, seas,
ppdidb (dissemination requirements database), etc.

The configuration file can be be easily changed, thus changing the FDB structure. The Config
file also allows us to chose the database file structure, we can put all data we have into a
single file or at the other extreme have one file per field.
When accessing FDB data, applications do not have to have any knowledge of the FDB structure,
just attribute names.

Enough for now....

The Config is not a binary file!!!!, I will paste here....

FDBCONFIGURATION ;

%servers ; machines ; 5 {
ifsfdb ; fdb ; hpc ; 4055 ; 1801 ( /fdb, /fdb )
hpcfdb ; fdb ; hpc ; 4055 ; 1801 ( /fdb, /fdb )
serversfdb ; fdb ; servers ; 4055 ; 1801 ( /fdb, /fdb )
ecmwffdb ; fdb ; ecmwf ; 4055 ; 1801 ( /fdb, /fdb )
backupfdb ; fdb ; c1b ; 4055 ; 1801 ( /fdb, /fdb )
localfdb ; fdb ; c1a ; 4055 ; 1801 ( /fdb, /fdb )
c1afdb ; fdb ; c1a ; 4055 ; 1801 ( /fdb, /fdb )
c1bfdb ; fdb ; c1b ; 4055 ; 1801 ( /fdb, /fdb )
hpcffdb ; fdb ; hpcf ; 4055 ; 1801 ( /fdb, /fdb )
pdb ; pdb ; hpc ; 4055 ; 0 ( /prod_fdb, /prod_fdb )
lnxcluster ; fdb ; lnxcl ; 4055 ; 1801 ( /vol/fdb, /vol/fdb )
scores2 ; fdb ; scores2 ; 4055 ; 1801 ( /scores2/data/fdb, /scores2/data/fdb)
scores1 ; fdb ; scores1 ; 4055 ; 1801 ( /scores1/data/fdb, /scores1/data/fdb)
} ;
%filesystems ; {
< /ma_fdb , /ma_fdb >
< /fdb , /fdb >
< /fws1/lb/fdb , /fws1/lb/fdb >
< /fws2/lb/fdb , /fws2/lb/fdb >
< /fws3/lb/fdb , /fws3/lb/fdb >
< /prod_fdb , /prod_fdb >
< /era_fdb , /era_fdb >
< /c1a/ma_fdb , /c1a/ma_fdb >
< /c1a/fdb , /c1a/fdb >
< /c1a/prod_fdb , /c1a/prod_fdb >
< /fdb/msx , /fdb/msx >
< /ms_crit/fdb , /ms_crit/fdb >
< /emos_esuite/ma_fdb , /emos_esuite/ma_fdb >
< /s1a/emos_esuite/ma_fdb , /s1a/emos_esuite/ma_fdb >
< /s1b/emos_esuite/ma_fdb , /s1b/emos_esuite/ma_fdb >
< /var/tmp , /var/tmp >
}
%networkhosts ; {
hpc = ( c1a, hpcf, c1b )
servers = ( lnxcl )
ecmwf = ( hpc, servers )
}
%databasehosts ; {
hpcf : ( hpcf0205, hpcf0105 )
c1a : ( c1a-fdb1, c1a-fdb2, c1a-fdb3 )
c1b : ( c1b-fdb1, c1b-fdb2, c1b-fdb3 )
lnxcl : ( bee28, lclfdb )
scores2 : ( scores2 )
scores1 : ( scores1 )
}
%siblinghosts ; {
c1a > ( hpcf )
c1b > ( c1a )
hpcf > ( c1a )
lnxcl > ( c1a )
} {
%database ; fdb ; 1, 32 ; FieldsDatabase ; 88 , 0 ; PARALLEL=2 ; {
class ; 1 ; 1 ; TYPE=CHAR
stream ; 1 ; 2
domain ; 1 ; 3
expver ; 1 ; 4 ; LEN=4
date ; 1 ; 5 ; LEN=8
hdate ; 1 ; 6 ; LEN=8
refdate ; 1 ; 6 ; LEN=8

type        ; 2 ; 1
   time        ; 2 ; 2 ; LEN=4
   number      ; 2 ; 3 ; TYPE=INT
   ensemble    ; 2 ; 3 ; TYPE=INT
   cluster     ; 2 ; 3 ; TYPE=INT
   probability ; 2 ; 3 ; TYPE=INT
   quantile    ; 2 ; 3 ; TYPE=INT
   fnumber     ; 2 ; 3 ; LEN=1
   reference   ; 2 ; 4 ; TYPE=INT
   diagnostic  ; 2 ; 4 ; LEN=1
   iteration   ; 2 ; 5 ; LEN=2
   origin      ; 2 ; 6
   leg         ; 2 ; 7 ; TYPE=INT ; STAT=READOVER
   newtwo      ; 2 ; 8
   newthree    ; 2 ; 9

step           ; 0 ; 1 ; TYPE=FLOAT
   parameter      ; 0 ; 2 ; TYPE=INT
   representation ; 0 ; 3 ; TYPE=CHAR ; LEN=8; STAT=IGNORE
   levtype        ; 0 ; 4 ; LEN=8
   levelist       ; 0 ; 5 ; TYPE=FLOAT
   channel        ; 0 ; 6 ; TYPE=INT
   frequency      ; 0 ; 7 ; TYPE=INT
   direction      ; 0 ; 8 ; TYPE=INT
   ident          ; 0 ; 9 ; TYPE=INT

}
%database ; eps ; 1, 32 ; FieldsDatabase ; 88 , 0 ; PARALLEL=2 ; {
class ; 1 ; 1 ; TYPE=CHAR
stream ; 1 ; 2
domain ; 1 ; 3
expver ; 1 ; 4 ; LEN=4
date ; 1 ; 5 ; LEN=8
hdate ; 1 ; 6 ; LEN=8
refdate ; 1 ; 6 ; LEN=8

type        ; 2 ; 1
   time        ; 2 ; 2 ; LEN=4
   number      ; 2 ; 3 ; TYPE=INT
   ensemble    ; 2 ; 3 ; TYPE=INT
   cluster     ; 2 ; 3 ; TYPE=INT
   probability ; 2 ; 3 ; TYPE=INT
   quantile    ; 2 ; 3 ; TYPE=INT
   fnumber     ; 2 ; 3 ; LEN=1
   reference   ; 2 ; 4 ; TYPE=INT
   diagnostic  ; 2 ; 4 ; LEN=1
   iteration   ; 2 ; 5 ; LEN=2
   origin      ; 2 ; 6
   leg         ; 2 ; 7 ; TYPE=INT ; STAT=READOVER
   newtwo      ; 2 ; 8
   newthree    ; 2 ; 9

step           ; 0 ; 1 ; TYPE=FLOAT
   parameter      ; 0 ; 2 ; TYPE=INT
   representation ; 0 ; 3 ; TYPE=CHAR ; LEN=8; STAT=IGNORE
   levtype        ; 0 ; 4 ; LEN=8
   levelist       ; 0 ; 5 ; TYPE=FLOAT
   channel        ; 0 ; 6 ; TYPE=INT
   frequency      ; 0 ; 7 ; TYPE=INT
   direction      ; 0 ; 8 ; TYPE=INT

}
%database ; ppdidb ; RequirementsDatabase ; 32 , 0 ; {
stream ; 1 ; 1 ; LEN=4
domain ; 1 ; 2 ; LEN=1
expver ; 1 ; 3 ; LEN=4
mode ; 1 ; 4 ; TYPE=CHAR

file   ; 2 ; 1 ; LEN=7
   time   ; 2 ; 2 ; LEN=2
   reqst  ; 2 ; 3 ; LEN=4
   job    ; 2 ; 4 ; LEN=1

sequence ; 0 ; 1 ; TYPE=INT
   step     ; 0 ; 2 ; TYPE=INT
   name     ; 0 ; 3 ; TYPE=CHAR ; LEN=8
   number   ; 0 ; 4 ; TYPE=INT

}
%database ; pdb ; 1 , 32 ; ProductsDatabase ; 0 ; {
class ; 1 ; 1 ; LEN=2
stream ; 1 ; 2 ; LEN=4
domain ; 1 ; 3 ; LEN=1
expver ; 1 ; 4 ; LEN=4
date ; 1 ; 5 ; LEN=8
mode ; 1 ; 6 ; TYPE=CHAR

type ; 2 ; 1 ; LEN=2
   time ; 2 ; 2 ; LEN=2
   step ; 2 ; 3 ; LEN=4

country         ; 3 ;  1 ; LEN=2
   streamid        ; 3 ;  2 ; LEN=1
   basedatetime    ; 3 ;  3 ; LEN=8
   validatdatetime ; 3 ;  4 ; LEN=8
   fexpver         ; 3 ;  5 ; LEN=4
   dot             ; 3 ;  6 ; LEN=1
   destination     ; 3 ;  7 ; LEN=3
   priority        ; 3 ;  8 ; LEN=2
   opt             ; 3 ;  9 ; LEN=1
   format          ; 3 ; 10 ; LEN=2
   method          ; 3 ; 11 ; LEN=3
   jobno           ; 3 ; 12 ; LEN=3

}
%database ; wpdb ; 1, 32 ; WeatherParameterDataBase ; 48 , 0 ; PARALLEL=2 ; {
class ; 1 ; 1 ; TYPE=CHAR
stream ; 1 ; 2 ; TYPE=CHAR
domain ; 1 ; 3 ; TYPE=CHAR
expver ; 1 ; 4 ; LEN=4
date ; 1 ; 5 ; LEN=8
mode ; 1 ; 6

type        ; 2 ; 1
   time        ; 2 ; 2 ; LEN=4
   step        ; 2 ; 3 ; TYPE=INT
   destination ; 2 ; 4 ; LEN=3

disstream     ; 0 ;  1 ; LEN=2
   north         ; 0 ;  2 ; TYPE=INT
   west          ; 0 ;  3 ; TYPE=INT
   parameter     ; 0 ;  4 ; TYPE=INT
   level         ; 0 ;  5 ; TYPE=INT
   stepincrement ; 0 ;  6 ; TYPE=INT
   leveltype     ; 0 ;  7 ; LEN=1

}
%database ; seas ; 1, 32 ; SeasonalDatabase ; 80 , 0 ; PARALLEL=2 ; {
class ; 1 ; 1 ; TYPE=CHAR
stream ; 1 ; 2
domain ; 1 ; 3
expver ; 1 ; 4 ; LEN=4
date ; 1 ; 5 ; LEN=8
hdate ; 1 ; 6 ; LEN=8
refdate ; 1 ; 6 ; LEN=8

type     ; 2 ; 1
   time     ; 2 ; 2 ; LEN=4
   number   ; 2 ; 3 ; TYPE=INT
   quantile ; 2 ; 3 ; TYPE=INT
   system   ; 2 ; 4 ; TYPE=INT
   method   ; 2 ; 5 ; TYPE=INT
   origin   ; 2 ; 6

step           ; 0 ; 1 ; TYPE=FLOAT
   parameter      ; 0 ; 2 ; TYPE=INT
   representation ; 0 ; 3 ; TYPE=CHAR ; LEN=8; STAT=IGNORE
   levtype        ; 0 ; 4 ; LEN=8
   levelist       ; 0 ; 5 ; TYPE=FLOAT
   frequency      ; 0 ; 6 ; TYPE=INT
   direction      ; 0 ; 7 ; TYPE=INT
   fcmonth        ; 0 ; 8 ; TYPE=INT
   fcperiod       ; 0 ; 9 ; TYPE=INT

}
%database ; ocean ; 1, 32 ; OceanDatabase ; 80 , 0 ; PARALLEL=2 ; {
class ; 1 ; 1
stream ; 1 ; 2
domain ; 1 ; 3
expver ; 1 ; 4 ; LEN=4
type ; 1 ; 5
refdate ; 1 ; 6 ; LEN=8
hdate ; 1 ; 6 ; LEN=8
origin ; 1 ; 7

product ; 2 ; 1
   number  ; 2 ; 2  ; TYPE=INT
   system  ; 2 ; 3  ; TYPE=INT
   method  ; 2 ; 4  ; TYPE=INT
   range   ; 2 ; 5  ; TYPE=INT
   time    ; 2 ; 6  ; LEN=4
   date    ; 2 ; 7  ; LEN=8

step       ; 0 ; 1  ; TYPE=FLOAT
   parameter  ; 0 ; 2  ; TYPE=INT
   levtype    ; 0 ; 3  ; LEN=8
   section    ; 0 ; 4  ; LEN=8
   levelist   ; 0 ; 5  ; TYPE=FLOAT
   latitude   ; 0 ; 6  ; TYPE=FLOAT
   longitude  ; 0 ; 7  ; TYPE=FLOAT
   fcperiod   ; 0 ; 8  ; TYPE=INT
}
}

Download all files

fdb.2007.pdf (1.52 MB) fdb.2007.pdf
Config (7.36 KB) Config

Replies (2)

RE: ECMWF Parallel I/O - FDB - Added by Luis Kornblueh over 13 years ago

I still did not get one point: how do you aggregate the grib records? Is that done in FDB or before on any of the compute nodes?

RE: ECMWF Parallel I/O - FDB - Added by Anonymous over 13 years ago

We use MPI_alltoallv for gathering fields in parallel over as many MPI tasks as possible.
Today, for reasons of efficiency this number of tasks is typically 91 (the number of
vertical levels in IFS today) for the multi-level fields, or a lower number for surface fields.
Then these complete fields are grib encoded in parallel in the same receiving tasks of
the above operation. The final step is to do a FDB write call.

(1-2/2)

Project

General

Profile

RAPS