Bug #441: F3200 filter - CMIP5-QC - DKRZ Project Management Service

Bug #441

open

F3200 filter

Added by Anonymous almost 14 years ago. Updated almost 14 years ago.

Status:

In Progress

Priority:

Normal

Assignee:

Category:

QC_data_checker

Start date:

09/14/2011

Due date:

% Done:

Estimated time:

Description

This is a feature request, not a bug.

I have seen a large number of F3200 "suspecting replicated data" warnings, most of them spurious. HDH eliminated about 90% of the unnecessary F3200 warnings by providing an option to issue no F3200's for 0-D variables. Even so, in my experience the majority of F3200's have been spurious.

The F3200 exception is based on comparing a few indicators, such as mean and maximum. For my own use, I wrote (in Python) a filter that reads the suspect data a second time and compares the entire data arrays, item by item. For the comparison I used the method favored by numerical analysts and implemented by the function numpy.ma.allclose(): scalars a and b are close if |a-b| <= atol + rtol*|b|. My default tolerances are atol=1.e-9 and rtol=1.e-6, but any small numbers would work. So far I have sent hundreds of data sets through this filter, and now I have zero spurious warnings.
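A minimal sketch of what such a second-pass comparison might look like. The original filter was not posted, so the function name records_replicated and the shape check are assumptions made for illustration; only numpy.ma.allclose() and the tolerances come from the description above.

```python
import numpy as np

def records_replicated(rec_a, rec_b, rtol=1.e-6, atol=1.e-9):
    """Second-pass check: True only if the two records agree item by item.

    Hypothetical helper; the tolerances are the defaults quoted in the post.
    """
    a = np.ma.asarray(rec_a)
    b = np.ma.asarray(rec_b)
    if a.shape != b.shape:
        # Records of different shape cannot be replicas of each other.
        return False
    # Masked (missing) cells are treated as equal (masked_equal=True by default).
    return bool(np.ma.allclose(a, b, rtol=rtol, atol=atol))

# A record compared with a tiny perturbation of itself counts as replicated.
x = np.array([1.0, 2.0, 3.0])
print(records_replicated(x, x + 1.e-12))

# Same mean and maximum, but different data: not replicated.
y = np.array([1.0, 3.0, 2.0])
print(records_replicated(x, y))
```

The check would only be applied to the pairs of records that the first pass has already flagged, so its cost stays small.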

I think that a better place for this second-pass check would be the QC tool.

Updated by Anonymous almost 14 years ago

Status changed from New to In Progress

I agree that the approach used in the QC to detect replicated records is not robust in an absolute sense.
Even two almost identical records with only two grid-cell values swapped would already make it fail.
It was implemented this way firstly to be quick and secondly to work across different sub-temporal files.
The hope was that global fluctuations would be large enough to prevent false alarms. Unfortunately, there are variables with
only a few valid grid-cell values, e.g. some kinds of snow/ice properties.
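The swapped-values case can be illustrated with a small sketch. The indicators function is hypothetical and uses only mean and maximum; the indicators actually computed by the QC may differ.

```python
import numpy as np

def indicators(record):
    """Summary values assumed for illustration: mean and maximum only."""
    data = np.ma.asarray(record)
    return float(data.mean()), float(data.max())

# Two records with few valid cells (NaN marks missing data), differing
# only in two swapped grid-cell values.
rec1 = np.ma.masked_invalid(np.array([np.nan, 0.5, 0.0, np.nan, 2.0]))
rec2 = np.ma.masked_invalid(np.array([np.nan, 0.0, 0.5, np.nan, 2.0]))

# The indicator comparison cannot tell the records apart ...
same_indicators = indicators(rec1) == indicators(rec2)
# ... while an item-by-item comparison can.
same_data = bool(np.ma.allclose(rec1, rec2))

print(same_indicators, same_data)
```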

The numpy.ma.allclose algorithm is robust, as are others. However, it does not work across different sub-temporal
files, which are quite common; the QC knows only the sub-temporal file it is currently checking, not the others. Additionally,
opening all previous records again for each current record would cost a lot of time.

I would like to propose calculating a checksum (md5) of each record and comparing these. The checksums of previous
sub-temporal files would additionally be stored in the qc<filename>.nc result file.
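A rough sketch of the checksum idea, under stated assumptions: record_checksum and find_replicated are hypothetical names, and a plain dict stands in for the checksums that would be stored in the qc<filename>.nc result file. Reading and writing that file is not shown.

```python
import hashlib
import numpy as np

def record_checksum(record):
    """md5 over the values and the missing-data mask of one record."""
    rec = np.ma.asarray(record, dtype=np.float64)
    # Fill masked cells with a fixed value so that arbitrary data hidden
    # under the mask cannot change the checksum.
    data = np.ascontiguousarray(rec.filled(0.0))
    mask = np.ascontiguousarray(np.ma.getmaskarray(rec))
    digest = hashlib.md5()
    digest.update(data.tobytes())
    digest.update(mask.tobytes())
    return digest.hexdigest()

def find_replicated(records, known=None):
    """Return (pairs, checksums).

    pairs holds (earlier_index, current_index) for records with equal
    checksums; known is a dict of checksums from previous sub-temporal
    files (an assumption standing in for the stored result file).
    """
    seen = dict(known or {})
    pairs = []
    for index, record in enumerate(records):
        key = record_checksum(record)
        if key in seen:
            pairs.append((seen[key], index))
        else:
            seen[key] = index
    return pairs, seen

a = np.array([1.0, 2.0])
b = np.array([2.0, 1.0])
pairs, checksums = find_replicated([a, b, a.copy()])
print(pairs)
```

Each record is read and hashed once, and only the short checksums have to be carried over from one sub-temporal file to the next.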

Updated by Anonymous almost 14 years ago

Two comments:

The existing replication check works well as a first pass. I have found the detailed direct check useful only as a second pass, for removing spurious results. I agree that it would be impossibly slow to do a detailed comparison of all pairs of records! I was only suggesting the use of a detailed check as the second pass of a two-pass algorithm.

Checksums are a good idea. The tricky part is for floating-point numbers where bit-for-bit agreement may not be the only relevant kind of identity. Maybe some kind of rounding could be part of a checksum-based algorithm.
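One possible way to combine rounding with a checksum is sketched below. The function rounded_checksum and the choice of six decimals are assumptions for illustration only, and rounding does not fully solve the problem: two values that lie very close together but on opposite sides of a rounding boundary still produce different checksums.

```python
import hashlib
import numpy as np

def rounded_checksum(record, decimals=6):
    """md5 of a record after rounding to a fixed number of decimals."""
    rec = np.ma.asarray(record, dtype=np.float64)
    # Adding 0.0 turns any -0.0 produced by rounding into +0.0, so both
    # signs of zero hash to the same bytes.
    data = np.round(rec.filled(0.0), decimals) + 0.0
    mask = np.ma.getmaskarray(rec)
    digest = hashlib.md5()
    digest.update(np.ascontiguousarray(data).tobytes())
    digest.update(np.ascontiguousarray(mask).tobytes())
    return digest.hexdigest()

a = np.array([1.0, 2.5, 0.0])
# Differences far below the rounding step no longer change the checksum.
print(rounded_checksum(a) == rounded_checksum(a + 1.e-9))
# Values straddling a rounding boundary still differ.
print(rounded_checksum(np.array([0.49e-6])) == rounded_checksum(np.array([0.51e-6])))
```

Because of the boundary effect, a rounded checksum can miss some near-identical records, so it would complement a tolerance-based second pass rather than replace it.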
