Bug #441: F3200 filter

Added by Anonymous over 12 years ago. Updated over 12 years ago.

Status: In Progress
Priority: Normal
Assignee: -
Category: QC_data_checker
Start date: 09/14/2011
Due date: -
% Done: 0%
Estimated time: -

Description

This is a feature request, not a bug.

I have seen a large number of F3200 "suspecting replicated data" warnings, most of them spurious. HDH eliminated about 90% of the unnecessary F3200 warnings by providing an option to suppress F3200s for 0-D variables. Even so, in my experience the majority of the remaining F3200s have been spurious.

The F3200 exception is based on comparing a few indicators, such as the mean and maximum. For my own use, I wrote (in Python) a filter that reads the suspect data a second time and compares the entire data arrays, item by item. For the comparison I used the method favored by numerical analysts and implemented by the function numpy.ma.allclose(): scalars a and b are close if |a - b| <= atol + rtol*|b|. My default tolerances are atol=1.e-9 and rtol=1.e-6, but any small values would work. So far I have sent hundreds of data sets through this filter, and I now have zero spurious warnings.

I think that a better place for this second-pass check would be the QC tool.
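
A minimal sketch of what such a second-pass filter could look like, assuming the data sit in a netCDF file read with the netCDF4 library; the function name, file, variable, and record indices below are illustrative only, not the reporter's actual code:

    # Re-read the two suspect records and compare them element by element.
    import numpy.ma as ma
    from netCDF4 import Dataset

    def records_really_replicated(path, varname, i, j,
                                  atol=1.0e-9, rtol=1.0e-6):
        """True only if records i and j agree item by item within the
        given tolerances: |a - b| <= atol + rtol*|b|, element-wise,
        with masked (missing) values honoured by numpy.ma.allclose."""
        with Dataset(path) as nc:
            a = nc.variables[varname][i]   # record i, read a second time
            b = nc.variables[varname][j]   # record j
        return ma.allclose(a, b, atol=atol, rtol=rtol)

    # Example: confirm or dismiss an F3200 warning for records 5 and 17.
    # if not records_really_replicated("tas_day.nc", "tas", 5, 17):
    #     print("spurious F3200: records differ on direct comparison")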

Actions #1

Updated by Anonymous over 12 years ago

  • Status changed from New to In Progress

The approach used in the QC to detect replicated records is not robust in an absolute sense, I agree. Two almost identical records with only two grid-cell values swapped would already defeat it. The reasons for implementing it this way were, firstly, to be quick and, secondly, to work across different sub-temporal files. The hope was that global fluctuations would be large enough to prevent false alarms. Unfortunately, there are variables with only a few valid grid-cell values, e.g. some kinds of snow/ice properties.
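
To make the point concrete, a tiny hypothetical demonstration: swapping two values leaves the usual indicators untouched, so an indicator-based comparison would falsely report replication even though the data differ.

    import numpy as np

    a = np.array([1.0, 2.0, 3.0, 4.0])
    b = a.copy()
    b[[0, 1]] = b[[1, 0]]                 # two grid-cell values swapped

    print(a.mean() == b.mean(), a.max() == b.max())   # True True
    print(np.array_equal(a, b))                       # False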

The numpy.ma.allclose algorithm is robust, as are others. But it does not work across different sub-temporal files, which are quite common; the QC knows nothing about other sub-temporal files, only the one it is currently checking. Additionally, the cost of re-opening all previous records for every current record would be rather high.

I would like to propose calculating an MD5 checksum of each record and comparing those. The checksums of previous sub-temporal files would additionally be stored in the qc<filename>.nc result file.
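
A minimal sketch of such a per-record checksum, assuming records arrive as numpy arrays; the helper name is hypothetical, and masked values would need consistent filling first:

    import hashlib
    import numpy as np

    def record_checksum(rec):
        """MD5 digest of one record's raw bytes (bit-for-bit identity).
        Masked arrays would need rec.filled(fill_value) first so that
        missing values hash consistently."""
        data = np.ascontiguousarray(rec)   # stable, contiguous byte layout
        return hashlib.md5(data.tobytes()).hexdigest()

Comparing digests across sub-temporal files then needs only the stored checksums, never the original data.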

Actions #2

Updated by Anonymous over 12 years ago

Two comments:

The existing replication check works well as a first pass. I have found the detailed direct check useful only as a second pass, for removing spurious results. I agree that it would be impossibly slow to do a detailed comparison of all pairs of records! I was only suggesting the use of a detailed check as the second pass of a two-pass algorithm.

Checksums are a good idea. The tricky part is floating-point numbers, where bit-for-bit agreement may not be the only relevant kind of identity. Maybe some kind of rounding could be part of a checksum-based algorithm.
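
One possible shape of that rounding idea, as a sketch only (the helper name and the choice of decimals are arbitrary): quantise the values before hashing, so that records equal up to that precision share a checksum.

    import hashlib
    import numpy as np

    def rounded_checksum(rec, decimals=6):
        """MD5 digest after rounding, so differences below about
        10**-decimals no longer change the checksum. Caveat: values
        straddling a rounding boundary can still hash apart, so this
        is a heuristic, not a strict tolerance."""
        q = np.round(np.asarray(rec, dtype=np.float64), decimals)
        return hashlib.md5(np.ascontiguousarray(q).tobytes()).hexdigest()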
