P2P scalability tests¶
Summary¶
A normal system cannot reliably handle more than 30,000 catalogs. This limit is not imposed by the catalogs themselves but by the files they contain, so it makes no difference whether the catalogs are used for managing atomic datasets or other units.
TDS 4.2.10 (newest):
- max 2,000 catalogs per GB of memory for normal CMIP5 catalogs (75,000 files)
- max 20,000 catalogs per GB of memory if re-using the same files (this tests the catalog level itself)
The publisher will always run into problems before this limit is reached. The limiting factor is the number of files, since all of their attributes are stored separately. Again, this does not depend on the coarseness of the datasets but on the total number of files that have to be handled.
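As a rough capacity-planning aid, the two ratios above can be turned into a heap estimate. This is a minimal sketch assuming the measured ratios scale linearly (they are observations from the tests below, not guarantees):

```python
# Rough JVM heap estimate from the measured catalogs-per-GB ratios above.
# Assumption: memory use grows linearly with the number of catalogs.

def estimated_heap_gb(n_catalogs, catalogs_per_gb=2000):
    """Approximate JVM heap (GB) needed for n_catalogs."""
    return n_catalogs / float(catalogs_per_gb)

print(estimated_heap_gb(30000))         # CMIP5-style catalogs -> 15.0 GB
print(estimated_heap_gb(30000, 20000))  # catalogs re-using files -> 1.5 GB
```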
TDS¶
(re-using the same files for the dummy catalog generation; a sketch of the generation script follows the table below)
It takes about 500 MB of memory to store 10,000 catalogs and about 45 s to process them.
| Catalogs # | JVM Max Heap (MB) | Processing time (s) | Memory (MB) |
|------------|-------------------|---------------------|-------------|
| 1          | 1024              | 1                   | 57          |
| 1000       | 1024              | 10                  | 150         |
| 10000      | 1024              | 45                  | 570         |
| 15000      | 1024              | 83                  | 790         |
| 16000      | 1024              | 120                 | 830         |
| 10000      | 2048              | 45                  | 570         |
| 20000      | 2048              | 90                  | 1000        |
| 30000      | 2048              | 140                 | 1500        |
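The dummy catalogs used for these measurements can be produced by a small script; the following is only a sketch of that idea, assuming a THREDDS content root and a fixed list of shared file paths (all names and paths are illustrative, not the actual test harness):

```python
# Sketch: write N minimal THREDDS catalogs that all point at the same files,
# so only the catalog layer is stressed. Paths and names are illustrative.
import os

CONTENT_ROOT = "/usr/local/tds/content/thredds"              # assumption
SHARED_FILES = ["data/shared_%d.nc" % i for i in range(10)]  # re-used files

TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         name="dummy catalog %(idx)d">
%(datasets)s
</catalog>
"""

def write_dummy_catalogs(n, out_dir=os.path.join(CONTENT_ROOT, "dummy")):
    os.makedirs(out_dir, exist_ok=True)
    for idx in range(n):
        datasets = "\n".join(
            '  <dataset name="file%d" urlPath="%s"/>' % (i, path)
            for i, path in enumerate(SHARED_FILES)
        )
        with open(os.path.join(out_dir, "catalog_%d.xml" % idx), "w") as fh:
            fh.write(TEMPLATE % {"idx": idx, "datasets": datasets})

# write_dummy_catalogs(10000)  # then reference them from the root catalog
```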
ESG Publisher (esgcet)¶
These are the row counts of some of the tables used by the publisher at our replication node for CMIP5:
| Table             | Rows     |
|-------------------|----------|
| catalog           | 22382    |
| dataset           | 20623    |
| dataset_version   | 22398    |
| file              | 915295   |
| file_version      | 915295   |
| file_variable     | 8393508  |
| file_attr         | 23961353 |
| file_var_attr     | 29564288 |
| filevar_dimension | 14133348 |
| var_attr          | 3218266  |
| var_dimension     | 1139604  |
| variable          | 464501   |
Every dataset expands to over 3,700 rows in total in the DB (note that the largest counts are all in the file* tables).
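That per-dataset figure follows from the counts above; a quick back-of-the-envelope check (dividing all rows by the dataset count slightly overstates the shared variable* tables, so this is only an approximation):

```python
# Approximate per-dataset row expansion from the counts listed above.
rows = {
    "catalog": 22382, "dataset": 20623, "dataset_version": 22398,
    "file": 915295, "file_version": 915295, "file_variable": 8393508,
    "file_attr": 23961353, "file_var_attr": 29564288,
    "filevar_dimension": 14133348, "var_attr": 3218266,
    "var_dimension": 1139604, "variable": 464501,
}
total = sum(rows.values())        # ~82.8 million rows
print(total // rows["dataset"])   # ~4,000 rows per dataset
```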
Even a simple count already takes 13 s:
EXPLAIN ANALYZE SELECT count(*) from file_var_attr;
                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=656307.70..656307.71 rows=1 width=0) (actual time=13381.250..13381.251 rows=1 loops=1)
   ->  Seq Scan on file_var_attr  (cost=0.00..582150.76 rows=29662776 width=0) (actual time=0.143..7566.273 rows=29564288 loops=1)
 Total runtime: 13381.579 ms
(3 rows)
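When only an approximate figure is needed (e.g. to monitor table growth), the planner statistics in pg_class return in milliseconds instead of triggering the sequential scan above. A sketch using SQLAlchemy; the connection URL is an assumption and has to be adapted to the local esgcet database:

```python
# Approximate row count from Postgres planner statistics instead of count(*).
# The connection URL below is an assumption.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://esgcet@localhost/esgcetdb")

with engine.connect() as conn:
    approx = conn.execute(
        text("SELECT reltuples::bigint FROM pg_class WHERE relname = :tab"),
        {"tab": "file_var_attr"},
    ).scalar()
    print(approx)  # roughly 29.6 million, without the 13 s seq scan
```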
Postgres itself should not have any memory issues, but the SQLAlchemy interface might. In addition, SQLAlchemy probably issues more calls to the DB than strictly required, which adds further latency.
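A typical source of such extra calls is lazy relationship loading (the N+1 query pattern), which eager loading collapses into a single JOINed query. The Dataset/files names below are hypothetical stand-ins, not the actual esgcet ORM classes:

```python
# Sketch: avoid one-query-per-row lazy loading by eager-loading the
# collection. Dataset and Dataset.files are hypothetical ORM names.
from sqlalchemy.orm import joinedload

def files_per_dataset(session, Dataset):
    # Default lazy loading: 1 query for datasets + 1 per dataset for files.
    # joinedload: everything comes back in a single JOINed query.
    query = session.query(Dataset).options(joinedload(Dataset.files))
    return {ds.name: len(ds.files) for ds in query}
```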