Welcome to the Australian Centre for Robotic Vision (ACRV) probabilistic object detection (PrOD) challenge continuous evaluation server!
Probabilistic object detection is only the first of the ACRV robotic vision challenges. For more robotic vision challenges, visit roboticvisionchallenge.org
This server will stay up-to-date with the latest official PrOD challenge.
If you wish to participate in the next official challenge (IROS 2019) and be in the running to win prize money, please participate in the main challenge (link)
In this challenge, participants have to:
In contrast to traditional object detection challenges (such as COCO), this challenge is scored with our new probability-based detection quality (PDQ) measure, which rewards accurate uncertainty estimates and penalises both overconfident and underconfident detections.
This server is for quickly testing your algorithms on a small set of validation data (Validation Phase) or to compare and improve algorithms on data equivalent to the most current probabilistic object detection challenge (Test-Dev Phase).
We thank the Australian Centre for Robotic Vision for sponsoring and supporting this challenge.
If you want to quickly get started with the validation phase, just follow the procedure below.
For robotics applications, detections must state not just where and what an object is, but also how certain the detector is about each of these. Failing to do so can lead to catastrophic consequences from over- or under-confident detections. To encourage researchers to provide accurate estimates of both spatial and semantic uncertainty, our object detection challenge provides a new detection format and an evaluation measure suited to analysing detections which provide this uncertainty.
The continuous challenge comprises two datasets. The validation dataset is smaller and useful for rapid development, while the test-dev set is larger and closer in style to the official challenge test set.
Both sets use new high-fidelity simulated data. You can find further information and download the full sets of data in the "Participate" tab under the heading "Get Data". In summary, our data contains:
Playlist of all test-dev image sequences in this challenge
To enable detections with spatial uncertainty, we provide a new detection format which:
Example of generating spatial probability heatmap for PBox defined by two 2D Gaussian corners
To evaluate detections in a manner which incorporates both the semantic and spatial uncertainty which should be included in a robotic vision system, we developed the probability-based detection quality (PDQ) measure for this challenge. Full implementation details are supplied here and a summary is provided under the "Evaluation" heading.
The benefits of PDQ as a measure are that it:
This competition is evaluated using the probability-based detection quality (PDQ) measure.
We provide a summary here but direct you to the following paper for full details on implementation http://arxiv.org/abs/1811.10800. The evaluation code is publicly available here: https://github.com/jskinn/rvchallenge-evaluation.
For more tools related to PDQ evaluation (which may differ from those used in competitions), see the following repository: https://github.com/david2611/pdq_evaluation
New PDQ Update! Since the original paper we have implemented one change to PDQ to further encourage label calibration. Details can be found under "False Positive Cost" below.
In brief, PDQ is calculated as follows:
Pairwise PDQ is measured between any given detection and ground-truth object and is defined as the geometric mean of the spatial quality and label quality of the detection with respect to the ground-truth object.
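As a minimal sketch of this definition (the function name `pairwise_pdq` is hypothetical and not taken from the evaluation code), the geometric mean of the two qualities could be computed as:

```python
import math

def pairwise_pdq(spatial_quality: float, label_quality: float) -> float:
    """Geometric mean of spatial and label quality, both assumed to lie in [0, 1]."""
    return math.sqrt(spatial_quality * label_quality)

# A detection that is good spatially but weak on the label.
print(pairwise_pdq(0.81, 0.49))
```

Because it is a geometric mean, a zero in either quality gives a pPDQ of zero regardless of the other.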
Spatial quality is a measure of how well a detection captures the spatial extent of a ground-truth object, taking into account the spatial probabilities for individual pixels as calculated by the detector. It is a merged measure of foreground loss (loss from missing the detected object's pixels) and background loss (loss from detecting background pixels as foreground). The less uncertainty that a detection provides for an incorrect prediction (e.g. background pixels detected as foreground) the higher the loss which is incurred.
Note that as we are considering box-style detections, we ignore pixels which are not a part of the ground-truth object but are within the bounding box which encapsulates it when calculating either type of spatial loss. This can be seen visually below.
Example of foreground (white) and background (red) regions evaluated when calculating spatial quality. Detection pixels (orange) within the ground-truth object's bounding box (blue) that are not part of the foreground are simply ignored.
Final spatial quality will always be between zero and one.
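The sketch below shows one way the two losses could be merged into a value between zero and one. It assumes, as our reading of the paper rather than something stated on this page, that each loss is a negative log-probability averaged over the number of ground-truth pixels and that spatial quality is the exponential of the negated total loss. The function name and the `eps` clamp are hypothetical.

```python
import math

def spatial_quality(fg_probs, bg_probs, eps=1e-14):
    """
    fg_probs: spatial probabilities the detection gave to ground-truth pixels.
    bg_probs: spatial probabilities the detection gave to background pixels
              (pixels outside the ground-truth bounding box).
    Assumed form: exp(-(foreground loss + background loss)).
    """
    n_fg = len(fg_probs)
    if n_fg == 0:
        raise ValueError("a ground-truth object needs at least one pixel")
    # Foreground loss: penalise low probability on true object pixels.
    fg_loss = -sum(math.log(max(p, eps)) for p in fg_probs) / n_fg
    # Background loss: penalise high probability on background pixels.
    bg_loss = -sum(math.log(max(1.0 - p, eps)) for p in bg_probs) / n_fg
    return math.exp(-(fg_loss + bg_loss))

# A confident mistake on a background pixel costs more than an uncertain one.
print(spatial_quality([1.0], [0.9]), spatial_quality([1.0], [0.1]))
```

Under this assumed form, a detection that gives probability one to every object pixel and covers no background scores exactly one.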
Label quality measures how effectively a detection identifies what the object is. Within a given detection's label probability distribution, label quality is defined as the probability the detection gave to the correct class of the object being identified.
Note that this is irrespective of whether the class has the highest rank within the label probability distribution.
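A small illustration of this definition, using a hypothetical helper `label_quality` and an invented three-class example:

```python
def label_quality(classes, label_probs, true_class):
    """Probability the detection assigned to the ground-truth class."""
    return label_probs[classes.index(true_class)]

classes = ["cat", "bat", "ball"]
label_probs = [0.5, 0.3, 0.2]

# 'cat' is ranked highest, but the object is a 'bat', so label quality is 0.3.
print(label_quality(classes, label_probs, "bat"))
```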
For each image, every detection is optimally assigned to a single ground-truth object. This is done using the Hungarian algorithm based upon the pPDQ scores between all detections and ground-truth objects. Any detection which is either unassigned or assigned to a ground-truth object with zero pPDQ (no association) is considered a false positive (FP). Any ground-truth object which is either unassigned or assigned to a detection with zero pPDQ is considered a false negative (FN). All other detection-object pairings with some level of association are called "true positives" (TPs) for simplicity in terminology.
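To illustrate the assignment step, the sketch below finds the pairing with the highest total pPDQ and then labels TPs, FPs, and FNs. It is not the evaluation code: for readability it replaces the Hungarian algorithm with a brute-force search over permutations, which gives the same optimum but is only practical for a handful of detections. The function name and the example scores are hypothetical.

```python
from itertools import permutations

def assign_detections(ppdq, n_det, n_gt):
    """ppdq[d][g] is the pPDQ between detection d and ground-truth object g."""
    size = max(n_det, n_gt)
    # Pad to a square matrix with zeros so every row and column can be paired.
    padded = [[ppdq[d][g] if d < n_det and g < n_gt else 0.0
               for g in range(size)] for d in range(size)]
    # Brute-force stand-in for the Hungarian algorithm: maximise total pPDQ.
    best = max(permutations(range(size)),
               key=lambda cols: sum(padded[d][cols[d]] for d in range(size)))
    # Only real pairs with non-zero pPDQ count as true positives.
    true_pos = [(d, g) for d, g in enumerate(best)
                if d < n_det and g < n_gt and padded[d][g] > 0]
    matched_det = {d for d, _ in true_pos}
    matched_gt = {g for _, g in true_pos}
    false_pos = [d for d in range(n_det) if d not in matched_det]
    false_neg = [g for g in range(n_gt) if g not in matched_gt]
    return true_pos, false_pos, false_neg

# Three detections, two ground-truth objects. Detections 0 and 1 both overlap
# object 0; detection 2 overlaps nothing; nothing usefully covers object 1.
scores = [[0.8, 0.1],
          [0.6, 0.0],
          [0.0, 0.0]]
print(assign_detections(scores, n_det=3, n_gt=2))
```

For realistic problem sizes a proper solver is preferable, for example `scipy.optimize.linear_sum_assignment` with `maximize=True`.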
As an addition to the original paper, based on feedback from the previous iteration of this challenge we now introduce false positive cost as a new evaluation term. For any detection which has not been assigned to a ground-truth object through the assignment process (false positives), the false positive cost is equal to the maximum class label probability provided in the detection's probability distribution. In the final evaluation statistics we report the average false positive label quality which is the average of (1 - the false positive cost) for all FP detections. Total false positive cost is used in the calculation of the final PDQ score.
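A minimal sketch of these two quantities, with hypothetical function names. Returning 1.0 when there are no false positives follows the convention stated elsewhere on this page.

```python
def false_positive_cost(label_probs):
    """Cost of one FP detection: its maximum class label probability."""
    return max(label_probs)

def average_fp_label_quality(fp_label_probs):
    """Average of (1 - FP cost) over all FP detections; 1.0 if there are none."""
    if not fp_label_probs:
        return 1.0
    total = sum(1.0 - false_positive_cost(p) for p in fp_label_probs)
    return total / len(fp_label_probs)

false_positives = [
    [0.3, 0.3, 0.3, 0.1],  # uncertain FP: cost 0.3
    [1.0, 0.0, 0.0, 0.0],  # one-hot FP: cost 1.0
]
print(average_fp_label_quality(false_positives))
```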
The final PDQ score across a set of ground-truth objects and detections is the total pPDQ of all optimally assigned pairings across all images, divided by the sum of the number of TPs, the number of FNs, and the total false positive cost. Differing from the original PDQ formulation of the previous competition and original paper, we no longer consider all FPs as equally incorrect. Instead, the effect of false positive detections is related to their label over-confidence, decreasing PDQ more for higher levels of FP over-confidence. This final scoring gives a result between zero and one. Note however that for ease of visualisation on the challenge leaderboard graph, PDQ scores are multiplied by 100 and so the final score provided will be between 0 and 100.
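The final score can be sketched as follows. This assumes the denominator is the sum of the TP count, the FN count, and the total false positive cost, which is how we read the description here; the function name and the example numbers are hypothetical.

```python
def pdq_score(tp_ppdqs, num_fn, fp_costs):
    """
    tp_ppdqs: pPDQ of every true-positive pairing, across all images.
    num_fn:   number of false-negative ground-truth objects.
    fp_costs: false positive cost of every FP detection.
    """
    denominator = len(tp_ppdqs) + num_fn + sum(fp_costs)
    if denominator == 0:
        return 0.0  # nothing to evaluate; a convention chosen for this sketch
    return sum(tp_ppdqs) / denominator

score = pdq_score(tp_ppdqs=[0.8, 0.6], num_fn=1, fp_costs=[0.9, 0.3])
print(score)        # value in [0, 1]
print(100 * score)  # value as shown on the leaderboard
```

Since each pPDQ is at most one, the numerator can never exceed the number of TPs, so the score stays between zero and one.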
After submitting results in the format outlined in the "Submission" tab, evaluation shall be performed using the PDQ measure across all frames of every video sequence of our test data. This PDQ score is multiplied by 100 for ease of visualisation on the challenge leaderboard graph. For more information about the test data please go to the "Participate" tab under the heading "Get Data". Please note that as the PDQ performs pixel-wise evaluations on probabilistic output across many detections and ground-truths in every image, this process may take some time. Please be patient.
Alongside the main PDQ score (listed as "Score" in the leaderboard) we also provide the pPDQ, spatial quality, and label quality scores averaged across all TP detections, the false positive label quality averaged across all FP detections, and the number of TPs, FPs, and FNs. In the leaderboard these are named "Average pPDQ", "Average Spatial Quality", "Average Label Quality", "Average FP Label Quality", "True Positives", "False Positives", and "False Negatives" respectively.
Error messages and further evaluation information shall be supplied within the output file to help with debugging. For more information about interpreting error messages, please see the "Troubleshooting" tab.
If the sum of the final label probability distribution for a detection exceeds 1.0, the probability distribution shall be re-normalized as part of the evaluation process.
If the sum of the final label probability distribution for a detection is less than 0.5, the detection is not used in evaluation.
Note that during evaluation we ignore objects which are on-screen but have height or width of less than 10 pixels or a total area of less than 100 pixels.
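The rule above could be expressed as the following check. The function name is hypothetical, and the sketch assumes that "total area" means the number of pixels belonging to the object rather than the area of its bounding box.

```python
def is_evaluated(width: int, height: int, area: int) -> bool:
    """True if a ground-truth object is large enough to be evaluated."""
    return width >= 10 and height >= 10 and area >= 100

print(is_evaluated(width=40, height=9, area=300))   # too short
print(is_evaluated(width=12, height=12, area=80))   # too few pixels
print(is_evaluated(width=30, height=25, area=500))  # evaluated
```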
While the overall PDQ value described above under "Evaluation measure" is the primary measure within this competition (listed as "Score" in the leaderboard) the other measures provided with your results can help you understand the strengths and weaknesses of your detector.
This measure provides your average pPDQ (see above) over all detections which matched a ground-truth object. It ranges between zero and one, where higher values indicate better results.
When the score is high, it shows that whenever a detection was matched to a ground-truth object, it accurately described that object both spatially and semantically. A detector with a high Average pPDQ may miss many objects or produce detections that don't match anything, but when it does detect an object, its detections are accurate and correctly labelled.
When the score is low, it shows that even when detections were optimally assigned to a particular ground-truth object, they poorly described that object, spatially, semantically, or both.
This measure provides your average spatial quality over all detections which were optimally assigned to a ground-truth object, irrespective of the assigned labels. It ranges between zero and one, where higher values indicate better results.
When the score is high, it shows that whenever a detection was matched to a ground-truth object, it accurately described the object's spatial layout. That is, when the detector correctly detects an object, it places an accurate box around that object (but may not label it correctly).
When the score is low, it shows that when detections were matched to ground-truth objects, they poorly described the object's spatial layout.
This measure provides your average label quality over all detections which were optimally assigned to a ground-truth object, irrespective of the spatial quality. It ranges between zero and one, where higher values indicate better results.
When the score is high, it shows that whenever a detection was matched to a ground-truth object, it accurately predicted the class of the object.
When the score is low, it shows that when detections were matched to ground-truth objects, they poorly predicted the object's class.
This measure provides the average label quality for all false positive detections (detections which were not assigned to a ground-truth object). It ranges between zero and one, where higher values indicate better results.
When the score is high, it shows that whenever a false positive detection was made, the label confidence of that detection was low.
When the score is low, it shows that when a false positive detection was made it was very confident of some class label, overconfident that it was detecting something.
A score of 1 is achieved if no FPs are present.
This measure shows how many detections were optimally assigned to a ground-truth object within the image.
This measure shows how many detections were not assigned to a ground-truth object within the challenge.
False positive detections occur when a detection has a lower pPDQ with all objects in an image than other detections, usually zero. Multiple detections of the same object will count as false positives.
This measure shows how many ground-truth objects were unmatched with detections within the challenge.
False negatives occur when there are fewer detections than ground truth objects, and all detections are better assigned to other ground-truth objects.
The data provided for this challenge along with this website belong to the ARC Centre of Excellence in Robotic Vision and are licensed under a Creative Commons Attribution 4.0 License.
Copyright (c) 2018, John Skinner, David Hall, Niko Sünderhauf, and Feras Dayoub, ARC Centre of Excellence for Robotic Vision, Queensland University of Technology
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
To enter the competition, first download the test data (see the "Participate" tab under the heading "Get Data"). The downloaded image data consists of several folders, each containing a sequence of images that the robot sees.
You may also want to download the starter kit, located here: https://github.com/jskinn/rvchallenge-starter-kit
The submission format is a zipped folder containing a json file for each sequence ('000000.json', '000001.json', etc.). These .json files can either be in the root directory or in some sub-directory within the zip file submission.
Each json file must contain the following structure:
{
  "classes": [<an ordered list of class names>],
  "detections": [
    [
      {
        "bbox": [x1, y1, x2, y2],
        "covars": [
          [[xx1, xy1],[xy1, yy1]],
          [[xx2, xy2],[xy2, yy2]]
        ],
        "label_probs": [<an ordered list of probabilities for each class>]
      },
      ...
    ],
    ...
  ]
}
In the starter kit, 'submission_builder.py' contains helper code to generate json files in this format; see the comments there for more examples.
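For readers not using the starter kit, the following is a minimal standard-library sketch of building one sequence's result. The helper names are hypothetical and are not the `submission_builder.py` API, and the class list is shortened for illustration.

```python
import json

CLASSES = ["bottle", "cup", "person"]  # shortened, illustrative class list

def make_detection(bbox, label_probs, covars=None):
    """One detection dictionary; 'covars' is only added when supplied."""
    detection = {"bbox": list(bbox), "label_probs": list(label_probs)}
    if covars is not None:
        detection["covars"] = covars
    return detection

def make_sequence_result(classes, detections_per_image):
    """One entry in 'detections' per image, even if that entry is empty."""
    return {"classes": list(classes),
            "detections": [list(dets) for dets in detections_per_image]}

sequence = make_sequence_result(CLASSES, [
    # Image 0: one probabilistic detection with a covariance for each corner.
    [make_detection([10, 20, 110, 220], [0.1, 0.7, 0.2],
                    covars=[[[4.0, 0.0], [0.0, 4.0]],
                            [[9.0, 0.0], [0.0, 9.0]]])],
    # Image 1: no detections, but the empty list must still be present.
    [],
])

text = json.dumps(sequence)
# To create a submission file, write `text` to e.g. '000000.json'.
print(text)
```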
The "classes" element is a list of class name strings for each class which the detector can detect and provides probability outputs for. The list is ordered to match the order of the label probabilities ("label_probs") output provided for each detection. For example if your classes are ['cat', 'bat', 'ball'], and your detection provides a label probability distribution of [0.1, 0.2, 0.3], it is assumed the detection is providing a 0.1 probability of 'cat', 0.2 of 'bat', and 0.3 of 'ball'.
The classes evaluated within this challenge are a subset of 30 classes from the COCO detection challenge. The full list of classes is as follows:
[bottle, cup, knife, bowl, wine glass, fork, spoon, banana, apple, orange, cake, potted plant, mouse, keyboard, laptop, cell phone, book, clock, chair, dining table, couch, bed, toilet, television, microwave, toaster, refrigerator, oven, sink, person]
It is recommended that the classes list in the submission file matches this list of classes, however, there are provisions in place for handling different class lists if provided.
If a detector returns more classes than the 30 being evaluated, such classes will not be considered when evaluated on our data, so if the system were trained on the full set of 80 COCO classes and returns probability distributions as appropriate, probabilities not associated with our 30 classes shall be removed from the distribution.
If the class list is in a different order than the one provided, this shall also be handled appropriately.
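The handling described above can be pictured with the sketch below. It is not the server's code: the function name is hypothetical, and giving a probability of 0.0 to an evaluated class that is missing from the submission is an assumption made for this example.

```python
def remap_label_probs(submitted_classes, label_probs, evaluated_classes):
    """Reorder probabilities to the evaluated class order, dropping extras."""
    prob_of = dict(zip(submitted_classes, label_probs))
    return [prob_of.get(name, 0.0) for name in evaluated_classes]

submitted_classes = ["dog", "cup", "bottle"]  # 'dog' is not evaluated
label_probs = [0.5, 0.3, 0.2]
evaluated_classes = ["bottle", "cup"]         # shortened evaluation list

# 'dog' is dropped and the rest is reordered to match the evaluation list.
print(remap_label_probs(submitted_classes, label_probs, evaluated_classes))
```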
We allow for some synonyms of classes in the submitted "classes" list to be converted to match the naming conventions in the main class list. Conversions are as follows:
The "detections" element is a list of lists of detections (represented by dictionaries). Each entry in the outer list represents an image, and each entry in the inner list represents a detection present in that image. Note that there must be an entry in the detections list for every image in the sequence, even if the inner list is empty (no detections for that image).
Individual detections are described by dictionaries containing the keys "bbox", "covars", and "label_probs" representing bounding box corners, covariance matrices, and label probability distributions respectively.
The "bbox" value for a detection dictionary is a 4-element list describing the locations of the corners of the predicted bounding box.
The format of this is [x1, y1, x2, y2] where x1, y1, x2, and y2 are the far left, top, right, and bottom coordinates of the box respectively.
The "covars" value for a detection dictionary is a list of two 2D lists giving the covariance matrices for the top-left and bottom-right corners of the detected bounding box respectively.
Each covariance matrix is a list of lists formatted as [[xx, xy], [xy, yy]] where xx is the variance along the x axis, yy is the variance along the y axis and xy is the covariance between x and y. Covariance matrices are supplied for both the top-left and bottom-right corner of the bounding box in turn.
Any supplied covariance matrix must be symmetric and positive semi-definite.
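A 2x2 covariance matrix can be checked before submission with the sketch below. The function name and tolerance are hypothetical; the test used is the standard one for a symmetric 2x2 matrix, which is positive semi-definite exactly when both diagonal entries and the determinant are non-negative.

```python
def is_valid_covariance(cov, tol=1e-9):
    """Check that a 2x2 matrix [[xx, xy], [yx, yy]] is symmetric and PSD."""
    (xx, xy), (yx, yy) = cov
    symmetric = abs(xy - yx) <= tol
    # For a symmetric 2x2 matrix, PSD means all principal minors are >= 0.
    positive_semi_definite = (xx >= -tol and yy >= -tol
                              and xx * yy - xy * yx >= -tol)
    return symmetric and positive_semi_definite

print(is_valid_covariance([[4.0, 1.0], [1.0, 9.0]]))  # valid
print(is_valid_covariance([[1.0, 2.0], [2.0, 1.0]]))  # negative determinant
print(is_valid_covariance([[1.0, 0.5], [0.4, 1.0]]))  # not symmetric
```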
Unlike the other keys in the detection dictionary, covars is optional for reasons outlined below.
If a detection does not contain the "covars" key, a standard bounding box (BBox) without any uncertainty is utilised within evaluation. Note that these are treated as having no spatial uncertainty and any pixel within the BBox will have 100% spatial probability of describing the detected object (corners inclusive).
Detections with "covars" provided are used to generate probabilistic bounding box (PBox) detections. The corners of PBox detections are 2D Gaussians depicting where BBox corners which fully encapsulate the given object may exist. This provides spatial uncertainty, particularly as it relates to the extremes of the detected object with spatial uncertainty increasing for pixels further from the centre of the detection. Full details about how PBoxes are implemented can be found in the following paper http://arxiv.org/abs/1811.10800. An example of a PBox generated from two Gaussian corners can be seen below.
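The sketch below gives a rough picture of how two Gaussian corners can produce a per-pixel spatial probability. It is a simplification and not the evaluation code: it assumes that a pixel's probability is the chance that the top-left corner lies above and to the left of it, multiplied by the chance that the bottom-right corner lies below and to the right of it, and it ignores the off-diagonal covariance terms so that only 1D normal CDFs are needed. All function names are hypothetical.

```python
import math

def normal_cdf(x, mean, var):
    """CDF of a 1D Gaussian; a zero variance collapses to a step function."""
    if var <= 0:
        return 1.0 if x >= mean else 0.0
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def pbox_pixel_probability(px, py, bbox, covars):
    """Spatial probability of pixel (px, py), using diagonal covariances only."""
    x1, y1, x2, y2 = bbox
    (xx1, _), (_, yy1) = covars[0]  # top-left corner variances
    (xx2, _), (_, yy2) = covars[1]  # bottom-right corner variances
    # Probability that the top-left corner is up and to the left of the pixel.
    p_top_left = normal_cdf(px, x1, xx1) * normal_cdf(py, y1, yy1)
    # Probability that the bottom-right corner is down and to the right of it.
    p_bottom_right = ((1.0 - normal_cdf(px, x2, xx2))
                      * (1.0 - normal_cdf(py, y2, yy2)))
    return p_top_left * p_bottom_right

bbox = [0, 0, 100, 100]
covars = [[[4.0, 0.0], [0.0, 4.0]], [[4.0, 0.0], [0.0, 4.0]]]
print(pbox_pixel_probability(50, 50, bbox, covars))  # centre: close to 1
print(pbox_pixel_probability(0, 0, bbox, covars))    # at a corner mean: lower
```

In this simplified model the probability falls off smoothly around each corner, which matches the description of uncertainty increasing towards the extremes of the box.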
Example of generating spatial probability heatmap for PBox defined by two 2D Gaussian corners
Please note that for reasons of speed, the calculation of probability heatmaps after submission is an approximation which contains some low level of error.
The "label_probs" value for a detection dictionary is a list of class probabilities, one for each class listed in the top-level "classes" element of the result. The "label_probs" list in each detection must match the "classes" list at the top level, such that the i-th entry in "label_probs" is the probability of the i-th class in the class list as outlined previously under Classes.
If the sum of the final label probability distribution for a detection (after removing unevaluated class probabilities) exceeds 1.0, the probability distribution shall be re-normalized as part of the evaluation process.
If the sum of the final label probability distribution for a detection (after removing unevaluated class probabilities) is less than 0.5, the detection is not used in evaluation.
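The two rules above can be sketched as a single function. The name `finalise_label_probs` is hypothetical, and the input is assumed to be the distribution after unevaluated classes have already been removed.

```python
def finalise_label_probs(label_probs):
    """Return the distribution used in evaluation, or None if it is discarded."""
    total = sum(label_probs)
    if total < 0.5:
        return None  # detection is not used in evaluation
    if total > 1.0:
        return [p / total for p in label_probs]  # re-normalise to sum to 1
    return list(label_probs)

print(finalise_label_probs([0.1, 0.2, 0.1]))  # sum 0.4: discarded
print(finalise_label_probs([0.1, 0.2, 0.3]))  # sum 0.6: kept unchanged
print(finalise_label_probs([1.0, 1.0]))       # sum 2.0: re-normalised
```

Note that distributions summing to a value between 0.5 and 1.0 are left as they are, so the missing probability mass is not redistributed.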
This page provides updates that have been made to the challenge in reverse chronological order (most recent posts at the top).
Since the original challenge and paper were released, we have added a FP cost to detection analysis. This is to discourage one-hot detectors (detectors with 100% label quality for the maximum-confidence class) that were prevalent in the original challenge.
The FP cost grows with the confidence that the FP detection places in a single class (e.g. an FP detection with 100% chair confidence is worse than an FP detection with 30% chair, 30% person, 30% potted plant, and 10% TV).
PDQ is no longer the sum of TP pPDQ scores divided by the number of TPs, FPs, and FNs. PDQ is now the sum of TP pPDQ scores divided by the number of TPs and FNs and the total FP cost.
FP Cost is also now represented in the final score summary statistics as average FP Label Quality which is the average of (1 - FP cost) across all FP detections with higher scores representing lower overall FP costs.
Maximum score of 1.0 FP Label Quality occurs when there are no FP detections.
The standard output within this challenge has several types of error messaging to help with debugging submissions. What follows are the different error and warning messages that can be raised and what information they provide.
If any of these messages get raised your code will be stopped and you shall not receive a score for your submission.
If this error has occurred, no detections were provided for a given sequence. This means you have not provided a json results file for that given set of sequences. In the afteramble you will be provided with the sequences which have no detections provided.
If this error has occurred, the submission has provided more than one json file for the given sequence. In the preamble you will be provided the name of the sequence with duplicate .json files and after the message you will be provided with the locations where the duplicate detections exist.
If this error has occurred, there is no detection entry for each image in the given sequence. The list of detections for a sequence must have an entry for every image, even if that entry is an empty list (no detections made for that image). The message will include which detections file is missing detections, how many entries were provided and how many entries should have been supplied.
If this error has occurred, a sequence's json results file was found not to contain the key 'classes'. Most likely, this means you have not supplied the class list that corresponds to your probability distributions for your results for that given sequence. Which sequence has caused the error will be supplied in the error message preamble.
If this error has occurred, there are none of the recognised classes in the .json file for the given sequence. Which sequence has caused the error will be supplied in the error message preamble.
If this error has occurred, a sequence's json results file was found not to contain the key 'detections'. Most likely, this means you have not supplied the list of detections within your detections result for that given sequence. Which sequence has caused the error will be supplied in the error message preamble.
If this error has occurred it means that a given detection does not contain the key 'label_probs'. Most likely this means that you have not supplied a given detection with a probability distribution. In the preamble you will be provided with the sequence name, image index, and detection index where this error has occurred.
If this error has occurred, the label probability distribution provided by a given detection does not match the size of the class list provided with the sequence's json results file. All label probability distributions need to be a full probability distribution over all classes which the detector may output. In the preamble you will be provided with the sequence name, image index, and detection index where this error has occurred.
If this error has occurred it means that a given detection does not contain the key 'bbox'. Most likely this means you have not provided the bounding box corner locations for the given detection. In the preamble you will be provided with the sequence name, image index, and detection index where this error has occurred.
If this error has occurred, a given detection's bounding box has been provided in an invalid format. The list of bounding box values must be a list containing four elements [x1, y1, x2, y2] denoting the upper-left and lower-right corners of the bounding box. In the preamble you will be provided with the sequence name, image index, and detection index where this error has occurred.
If this error has occurred, a given detection's bounding box's x1 (left) coordinate was bigger than the box's x2 (right) coordinate. x1 must always be less than or equal to x2 as we use a coordinate frame where (0,0) is the upper-left corner of the image. Therefore the leftmost coordinate can never be a larger number than the rightmost one. If the two are equal, you have a box with a width of one pixel. In the preamble you will be provided with the sequence name, image index, and detection index where this error has occurred.
If this error has occurred, a given detection's bounding box's y1 (upper) coordinate was bigger than the box's y2 (lower) coordinate. y1 must always be less than or equal to y2 as we use a coordinate frame where (0,0) is the upper-left corner of the image. Therefore the uppermost coordinate must be a smaller number than the lowermost one. If the two are equal, you have a box with a height of one pixel. In the preamble you will be provided with the sequence name, image index, and detection index where the error has occurred.
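The bounding box conditions described above can be checked locally before uploading. This is a minimal sketch with a hypothetical function name, and the messages it returns are illustrative rather than the server's own wording.

```python
def bbox_problems(bbox):
    """Return a list of problems found with a [x1, y1, x2, y2] bounding box."""
    if not isinstance(bbox, (list, tuple)) or len(bbox) != 4:
        return ["bbox must be a list of four values [x1, y1, x2, y2]"]
    x1, y1, x2, y2 = bbox
    problems = []
    # (0, 0) is the upper-left image corner, so left <= right and top <= bottom.
    if x1 > x2:
        problems.append("x1 is greater than x2")
    if y1 > y2:
        problems.append("y1 is greater than y2")
    return problems

print(bbox_problems([10, 20, 110, 220]))  # no problems
print(bbox_problems([110, 20, 10, 220]))  # corners swapped along x
print(bbox_problems([10, 20, 110]))       # wrong number of values
```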
If this error has occurred, the covariances you have supplied for a given detection are invalid. If a detection has a 'covars' key, it must supply 2 covariance matrices. One for the upper-left corner of the PBox and one for the lower-right corner of the PBox. The format of the content within 'covars' for a given detection must have a shape (2, 2, 2). In the preamble you will be provided with the sequence name, image index, and detection index where this error has occurred.
If this error has occurred, at least one of the covariances provided for a given detection is invalid. All covariance matrices provided for generating PBoxes must be symmetric. In the preamble you will be provided with the sequence name, image index, and detection index where this error has occurred.
If this error has occurred, the upper-left covariance provided for a given detection was invalid. All covariance matrices provided for generating PBoxes must be positive semi-definite. In the preamble you will be provided with the sequence name, image index, and detection index where this error has occurred.
If this error has occurred, the lower-right covariance provided for a given detection was invalid. All covariance matrices provided for generating PBoxes must be positive semi-definite. In the preamble you will be provided with the sequence name, image index, and detection index where this error has occurred.
If any of these messages get raised your code will still run and finish, but there may be a condition you should be made aware of that might affect your performance.
If this warning occurs, more results have been provided than there are sequences to be evaluated. This shall not cause an error but any extra results provided will not be evaluated. The afteramble will outline which named sequences shall not be evaluated as they do not correspond with any ground-truth sequences.
Start: Jan. 1, 2019, midnight
Description: Validation Phase: Create models that detect objects in images and provide estimates of spatial and semantic uncertainty for each detection. This phase is evaluated on a small set of data for more rapid code evaluation.
Start: July 25, 2019, midnight
Description: Test-Dev Phase: Create models that detect objects in images and provide estimates of spatial and semantic uncertainty for each detection. Tested on set of data equivalent to that used in main challenge. For ranking detection systems when challenge is not active.