It seems that the ground truth (GT) images contain a substantial amount of sensor noise. Therefore, even a clean image that is a better solution (as per qualitative/subjective evaluation) can obtain a poor PSNR score in the objective measurement. Also, the noisy GT images encourage a solution to produce noisy outputs rather than a plausible clean image. How should this be compensated for during the evaluation phase? Here is one example: https://drive.google.com/file/d/1cCl7Ck3VascKnK8IOtaaLA2erhTJAxBL/view?usp=sharing
Left Image: GT (8-bit visualization), Middle Image: Solution-X, Right Image: Input.
Here is one practical example: https://drive.google.com/file/d/1v3fdPuPQHr1fKXYaocQ1ON2jo5CG6GQN/view?usp=sharing
Posted by: sharif_apu @ Feb. 4, 2021, 5:45 a.m.
Dear Sharif Apu,
The cameras used for capturing those sequences (Alexa ARRI) have very low noise profiles, and additionally there is a two-image fusion stage (which should have some denoising effect as well). Unfortunately, a certain amount of sensor noise will always get through, especially in very low-light regions of the image (also note that in the case of tonemapped images that noise might be further amplified by the mu-law function). As you rightly mention, a few images in the dataset have some noise in the GT - however, those are few in number, and the noise in the input is several orders of magnitude higher. We decided to keep them in the dataset despite the related limitations on denoising evaluation, as low-light images are a very challenging and interesting playground for HDR imaging as well.
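To make the remark about mu-law amplification concrete, here is a minimal sketch of the standard mu-law curve, T(H) = log(1 + mu*H) / log(1 + mu), and its local slope. The value mu = 5000 is an assumption for illustration (it is not stated in this thread), and the function names are hypothetical; this is not the challenge's evaluation code.

```python
import math

MU = 5000.0  # assumed compression parameter; not stated in this thread


def mu_law(h, mu=MU):
    """Mu-law tonemapping of a linear value h in [0, 1]."""
    return math.log(1.0 + mu * h) / math.log(1.0 + mu)


def mu_law_gain(h, mu=MU):
    """Derivative of mu_law at h: the factor by which a small
    perturbation (e.g. sensor noise) around h is scaled."""
    return mu / ((1.0 + mu * h) * math.log(1.0 + mu))


# Compare how the same small perturbation is scaled at different brightness levels.
for h in (0.0, 0.001, 0.01, 0.1, 1.0):
    print(f"linear={h:<6} tonemapped={mu_law(h):.4f} gain={mu_law_gain(h):.2f}")
```

The slope is largest near zero and falls off quickly, so under this assumption the same amount of linear-domain noise is stretched far more in dark regions than in bright ones.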
I would like to note, however, that the case of the example image you are showing is actually slightly different. The GT image is almost perfectly clean, so the gap in PSNR is probably not caused by computing PSNR against a noisy ground truth, but rather by the over-smoothing of features in "Solution-X" compared with "Solution-Y", where Y is clearly noisier than GT but also clearly retains more texture and detail in very dark areas than X (e.g. note all the details lost around the kick drum and metal structures). This is a common trade-off in HDR imaging and denoising (clean and over-smooth with loss of details vs. more noise present but small details better recovered/preserved), and PSNR is behaving here as expected.
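The trade-off described above can be reproduced on a toy 1-D signal. This is only an illustrative sketch: the signal, the texture amplitude, and the noise level are made-up values (not taken from the dataset), and the `psnr` helper is a plain textbook definition rather than the challenge's scoring script.

```python
import math
import random


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length sequences."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 10_000
base, texture_amp, noise_sigma = 0.2, 0.05, 0.02  # hypothetical values

# Clean ground truth: a dark flat level plus fine alternating texture.
gt = [base + texture_amp * (1 if i % 2 == 0 else -1) for i in range(n)]

# "Over-smooth" estimate: noise-free, but the fine texture is averaged away.
over_smooth = [base] * n

# "Noisy but detailed" estimate: texture kept, Gaussian noise added on top.
noisy_detailed = [g + rng.gauss(0.0, noise_sigma) for g in gt]

psnr_smooth = psnr(gt, over_smooth)
psnr_noisy = psnr(gt, noisy_detailed)
print(f"over-smooth: {psnr_smooth:.2f} dB, noisy but detailed: {psnr_noisy:.2f} dB")
```

With these assumed values, the error of the over-smooth estimate equals the lost texture (about 26 dB), while the noisy estimate is penalized only for its smaller noise (roughly 34 dB), so the visually cleaner result scores lower even though the reference itself is noise-free.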
All in all, we encourage all the participants to search for an optimal solution given the sometimes imperfect data.
Thanks for your detailed explanation.