P2P scalability tests

Summary

A typical system cannot reliably handle more than 30,000 catalogs. This limit is imposed not by the catalogs themselves but by the files they contain, so it makes no difference whether the catalogs are used to manage atomic datasets or other units.

TDS 4.2.10 (latest):
max 2,000 catalogs per GB of memory for normal CMIP5 catalogs (75,000 files)
max 20,000 catalogs per GB of memory if the files are reused (this tests the catalog level itself)

The publisher will always run into problems before this limit is reached. The limiting factor is the number of files, since all of their attributes are stored separately. Again, this does not depend on the coarseness of the datasets, but on the total number of files that must be handled.

TDS

(re-using the same files for the dummy catalog generation)

It takes about 500 MB of memory to store 10,000 catalogs and about 45 s to process them; see the table and the sizing sketch below.

Catalogs   JVM max heap (MB)   Processing time (s)   Memory used (MB)
       1                1024                     1                 57
    1000                1024                    10                150
   10000                1024                    45                570
   15000                1024                    83                790
   16000                1024                   120                830
   10000                2048                    45                570
   20000                2048                    90               1000
   30000                2048                   140               1500
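
As a rough back-of-the-envelope check (a sketch only, fitted to the measurements above rather than derived from TDS internals), memory grows roughly linearly at about 50 MB per 1,000 catalogs on top of a ~60 MB baseline:

BASELINE_MB = 60            # memory observed with a single catalog
MB_PER_1000_CATALOGS = 50   # approximate slope between 1,000 and 30,000 catalogs

def estimated_heap_mb(n_catalogs, headroom=1.3):
    """Rough JVM heap estimate for serving n_catalogs dummy catalogs."""
    used = BASELINE_MB + MB_PER_1000_CATALOGS * n_catalogs / 1000.0
    return used * headroom  # extra room for GC and request handling

for n in (10000, 20000, 30000):
    print(n, "catalogs -> about", int(round(estimated_heap_mb(n))), "MB heap")

With the default 1024 MB heap this predicts trouble somewhere around 15,000 catalogs, which matches the jump in processing time seen in the table.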

ESG Publisher (esgcet)

These are the row counts of some of the tables used by the publisher at our replication node for CMIP5:

catalog                22382
dataset                20623
dataset_version        22398
file                  915295
file_version          915295
file_variable        8393508
file_attr           23961353
file_var_attr       29564288
filevar_dimension   14133348
var_attr             3218266
var_dimension        1139604
variable              464501

Every dataset expands to over 3,700 rows in the DB in total (note that the largest counts are all in the file* tables); see the quick check below.
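
A quick sanity check of that figure (a sketch: it simply divides the total row count above by the number of datasets, ignoring which rows are shared between datasets):

row_counts = {
    'catalog': 22382, 'dataset': 20623, 'dataset_version': 22398,
    'file': 915295, 'file_version': 915295, 'file_variable': 8393508,
    'file_attr': 23961353, 'file_var_attr': 29564288,
    'filevar_dimension': 14133348, 'var_attr': 3218266,
    'var_dimension': 1139604, 'variable': 464501,
}
total_rows = sum(row_counts.values())               # about 82.8 million
print(round(total_rows / row_counts['dataset']))    # about 4,000 rows per dataset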

Even a simple count already takes about 13 s:

EXPLAIN ANALYZE  SELECT count(*) from file_var_attr;
                                                            QUERY PLAN                                  
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=656307.70..656307.71 rows=1 width=0) (actual time=13381.250..13381.251 rows=1 loops=1)
   ->  Seq Scan on file_var_attr  (cost=0.00..582150.76 rows=29662776 width=0) (actual time=0.143..7566.273 rows=29564288 loops=1)
 Total runtime: 13381.579 ms
(3 rows)
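
For comparison, the planner's estimate stored in pg_class is returned instantly and is usually close enough for monitoring table growth. A minimal sketch (psycopg2; the connection string is an assumption, not the node's actual configuration):

import psycopg2

conn = psycopg2.connect("dbname=esgcet")  # assumed DSN, adjust for the real node
cur = conn.cursor()
cur.execute("SELECT reltuples::bigint FROM pg_class WHERE relname = %s",
            ("file_var_attr",))
print(cur.fetchone()[0])  # about 29.6 million rows, without a sequential scan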

Postgres itself should not have any memory issues, but the SQLAlchemy interface might.
Additionally, the SQLAlchemy framework probably issues more calls to the DB than strictly required, which increases the lag further.
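
A minimal sketch of what that can look like (hypothetical two-table model, not the actual esgcet schema): with the ORM's default lazy loading, touching each file's attributes issues one extra SELECT per file (the classic N+1 pattern), while an eager-load option collapses this into a single JOINed query.

from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import joinedload, relationship, sessionmaker

Base = declarative_base()

# Hypothetical, simplified models; the real esgcet schema has more tables and columns.
class File(Base):
    __tablename__ = 'file'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    attributes = relationship('FileAttr', backref='file')

class FileAttr(Base):
    __tablename__ = 'file_attr'
    id = Column(Integer, primary_key=True)
    file_id = Column(Integer, ForeignKey('file.id'))
    name = Column(String)
    value = Column(String)

engine = create_engine('sqlite://')       # in-memory stand-in for the Postgres DB
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# Default lazy loading: 1 query for the files plus 1 query per file touched.
for f in session.query(File).limit(100):
    values = [a.value for a in f.attributes]

# Eager loading: one JOINed query returns files and their attributes together.
for f in session.query(File).options(joinedload(File.attributes)).limit(100):
    values = [a.value for a in f.attributes]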