This is a feature request, not a bug.
I have seen a large number of F3200 “suspecting replicated data” warnings, most of them spurious. HDH eliminated about 90% of the unnecessary F3200 warnings by providing an option to issue no F3200’s for 0-D variables. Even so, in my experience the majority of F3200’s have been spurious.
The F3200 exception is based on comparing a few indicators such as mean and maximum. For my own use, I wrote (in Python) a filter to read the suspect data a second time and compare the entire data arrays, item by item. For the comparison I used the method favored by numerical analysts and implemented by the function numpy.ma.allclose(): scalars a and b are close if |a-b|<atol+rtol*max(a,b). My default tolerances are atol=1.e-9 and rtol=1.e-6, but any small numbers would work. So far I have sent hundreds of data sets through this filter, and now I have zero spurious warnings.
I think that a better place for this second-pass check would be the QC tool.
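A minimal sketch of the second-pass comparison described above, not the actual filter. The function name records_replicated and the extra mask comparison are assumptions made for illustration; the tolerances are the defaults quoted in the post. Note that NumPy documents the test used by numpy.ma.allclose() as |a-b| <= atol + rtol*|b|, which differs slightly from the max(a,b) form given above.

```python
import numpy as np


def records_replicated(rec_a, rec_b, rtol=1.0e-6, atol=1.0e-9):
    """Return True if two records agree item by item within tolerance.

    Hypothetical helper: the name and the mask check are illustrative
    assumptions, not part of the QC tool.
    """
    a = np.ma.asarray(rec_a)
    b = np.ma.asarray(rec_b)
    if a.shape != b.shape:
        return False
    # Assumption: records whose missing-value patterns differ are not replicas.
    if not np.array_equal(np.ma.getmaskarray(a), np.ma.getmaskarray(b)):
        return False
    # Element-wise closeness test on the unmasked values.
    return bool(np.ma.allclose(a, b, masked_equal=True, rtol=rtol, atol=atol))


# Two records with the same mean and maximum but two values swapped.
first = np.ma.masked_invalid([1.0, 2.0, np.nan, 4.0])
swapped = np.ma.masked_invalid([2.0, 1.0, np.nan, 4.0])
print(records_replicated(first, first + 1.0e-12))  # nearly identical copy
print(records_replicated(first, swapped))          # same indicators, different data
```

The second pair shows why the indicator-based first pass can raise a spurious warning: mean and maximum match although the arrays differ.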
#1 Updated by Heinz-Dieter Hollweg over 8 years ago
- Status changed from New to In Progress
The approach used in the QC to detect replicated records is not robust in an absolute sense, I agree.
For example, two records that are identical except for two swapped grid-cell values would be falsely flagged as replicated.
The reason for implementing it this way was firstly to be quick and secondly to hold across different sub-temporal files.
The hope was that global fluctuations would be large enough to prevent false alarms. Unfortunately, there are variables with
only a few valid grid-cell values, e.g. some kinds of snow/ice properties.
The numpy.ma.allclose algorithm is robust, as are others. However, it does not work across different sub-temporal
files, which are quite common; the QC knows only the file it is currently checking, not the other sub-temporal files. Additionally,
the time cost of opening all previous records again for each current record would be rather high.
I would like to propose to calculate a checksum (md5) of each record and compare these. The checksums of previous
sub-temporal files would be stored additionally in the qc<filename>.nc result file.
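The checksum proposal above could look roughly like the following sketch. It is only an illustration under stated assumptions: the names record_checksum and find_replicated, the record labels, and the fill value 1.0e20 are hypothetical, and the "known" dictionary merely stands in for checksums that would be read back from the result file of a previous sub-temporal file.

```python
import hashlib

import numpy as np


def record_checksum(record, fill_value=1.0e20):
    """MD5 hex digest of one record (hypothetical helper)."""
    # Replace masked cells by a fixed value so the bytes are well defined.
    arr = np.ma.filled(np.ma.asarray(record, dtype=np.float64), fill_value)
    arr = np.ascontiguousarray(arr)
    md5 = hashlib.md5()
    md5.update(str(arr.shape).encode("ascii"))  # shape is part of the identity
    md5.update(arr.tobytes())
    return md5.hexdigest()


def find_replicated(records, known=None):
    """Return (duplicates, checksums) for an iterable of (label, record).

    'known' maps checksum -> label for records of earlier files
    (assumed here to have been stored by a previous run).
    """
    seen = dict(known or {})
    duplicates = []
    for label, rec in records:
        digest = record_checksum(rec)
        if digest in seen:
            duplicates.append((label, seen[digest]))
        else:
            seen[digest] = label
    return duplicates, seen


# First sub-temporal file: no replication.
dups_1, stored = find_replicated([("file1:rec0", [1.0, 2.0, 3.0]),
                                  ("file1:rec1", [1.0, 2.0, 4.0])])
# Second file: its first record repeats a record of the first file.
dups_2, stored = find_replicated([("file2:rec0", [1.0, 2.0, 3.0])], known=stored)
print(dups_1, dups_2)
```

Only the digests need to be carried from file to file, so earlier files never have to be reopened.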
#2 Updated by Jeffrey Painter over 8 years ago
The existing replication check works well as a first pass. I have found the detailed direct check useful only as a second pass, for removing spurious results. I agree that it would be impossibly slow to do a detailed comparison of all pairs of records! I was only suggesting the use of a detailed check as the second pass of a two-pass algorithm.
Checksums are a good idea. The tricky part is for floating-point numbers where bit-for-bit agreement may not be the only relevant kind of identity. Maybe some kind of rounding could be part of a checksum-based algorithm.
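One possible reading of the rounding idea above, as a sketch only: round each value to a fixed number of significant bits before hashing, so that tiny floating-point differences usually yield the same checksum. The function name rounded_checksum, the choice of 20 bits, and the fill value are assumptions for illustration, not something proposed in this thread.

```python
import hashlib

import numpy as np


def rounded_checksum(record, keep_bits=20, fill_value=1.0e20):
    """MD5 hex digest of a record after rounding to keep_bits significant bits."""
    arr = np.ma.filled(np.ma.asarray(record, dtype=np.float64), fill_value)
    # Split into mantissa in [0.5, 1) and a power-of-two exponent.
    mantissa, exponent = np.frexp(arr)
    scale = float(2 ** keep_bits)
    mantissa = np.round(mantissa * scale) / scale
    # Reassemble; adding 0.0 turns -0.0 into +0.0 so both hash identically.
    rounded = np.ldexp(mantissa, exponent) + 0.0
    return hashlib.md5(np.ascontiguousarray(rounded).tobytes()).hexdigest()


base = np.array([1.0, 273.15, 1.0e-5])
print(rounded_checksum(base) == rounded_checksum(base * (1.0 + 1.0e-12)))
print(rounded_checksum(base) == rounded_checksum(base * (1.0 + 1.0e-3)))
```

A caveat: two values that lie very close together but on opposite sides of a rounding boundary still get different checksums, so this cannot fully replace a tolerance-based comparison. It might serve as a cheap first pass, with a detailed check reserved for the doubtful cases.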