YouMakeup VQA Challenge

Organized by LinliYao


Overview of YouMakeup VQA Challenge

 

Project website: https://languageandvision.github.io/youmakeup_vqa/index.html

Dataset download: https://github.com/AIM3-RUC/YouMakeup

    In recent years, video semantic understanding has attracted increasing research attention. However, most work is limited to coarse-grained understanding, such as action recognition over broad categories, which does not require models to distinguish actions with subtle differences or to understand the temporal relations among the steps of an activity. To improve fine-grained action understanding in videos, we propose the YouMakeup Video Question Answering challenge, based on the newly collected fine-grained instructional video dataset YouMakeup.

 

YouMakeup Dataset

    YouMakeup is a large-scale multimodal instructional video dataset introduced in YouMakeup: A Large-Scale Domain-Specific Multimodal Dataset for Fine-Grained Semantic Comprehension (EMNLP 2019). It contains 2,800 videos from YouTube, spanning more than 420 hours in total. Each video is annotated with a sequence of steps, including the temporal boundaries, grounded facial areas, and natural language description of each step. In this challenge, we design two video question answering tasks, namely the Facial Image Ordering Sub-Challenge and the Step Ordering Sub-Challenge.

 

 

Facial Image Ordering

    The task is to sort a set of facial images from a video into the correct order according to the given step descriptions. The goal of this task is to understand the changes that an action described in natural language will cause to a face. The effects of action descriptions on facial appearance can vary greatly, depending not only on the text description but also on the previous state of the face. Some actions bring obvious facial changes, such as "apply red lipstick on the lips", while others cause only slight differences, such as "apply foundation on the face with brush", which are easier to detect when the previous appearance is known. Therefore, fine-grained multimodal analysis of visual faces and textual actions is necessary to tackle this task.

[Figure: an example of the facial image ordering task]

 

Step Ordering 

    The task is to sort a set of action descriptions into the order in which the actions are performed in the video. It evaluates a model's ability to align semantics across the visual and textual modalities. Compared with previous video-text cross-modal localization, the novelty of this task is threefold. First, different actions share similar background contexts, so the model must focus specifically on the actions and action-related objects rather than on correlated but irrelevant context. Second, since different actions can be very similar in visual appearance, the task demands particularly fine-grained discrimination. Finally, the task goes beyond localizing a single text in a single video and requires long-term temporal action reasoning and textual understanding.

[Figure: an example of the step ordering task]

Requirements

1. Participants should stick to the defined training, validation, and test partitions in order to allow a fair comparison of different approaches.

2. The Challenge is a team-based contest. Each team can have one or more members, and an individual cannot be a member of multiple teams.

3. Each team can submit at most two trials a day for each sub-challenge on the test partition.

4. At the end of the Challenge, all teams will be ranked based on the evaluation criteria described below. The top teams will receive award certificates.

 

Evaluation Criteria

    To unify the evaluation of the two tasks, we cast both as multiple-choice questions. For each query, we provide four candidate answers, of which exactly one is the ground truth. Accuracy on these multiple-choice questions is the sole evaluation metric.
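Scoring under this metric can be sketched in a few lines; this is a minimal illustration in Python (the official evaluation script may differ), where both predictions and ground truth map question ids to a chosen candidate ordering:

```python
def accuracy(predictions, groundtruth):
    """Fraction of questions whose chosen candidate matches the ground truth."""
    correct = sum(1 for qid, answer in groundtruth.items()
                  if predictions.get(qid) == answer)
    return correct / len(groundtruth)

# Toy example with two questions: one answered correctly, one not.
preds = {"1": [4, 1, 2, 5, 3], "2": [5, 2, 4, 1, 3]}
gold = {"1": [4, 1, 2, 5, 3], "2": [2, 4, 5, 1, 3]}
print(accuracy(preds, gold))  # 0.5
```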

 

Facial Image Ordering

    We choose five facial images of different steps from a video to form a question. Note that these facial images are extracted from the beginning or end of each makeup step, not the middle, to make sure that each step action either has or has not yet been applied to the face. We also manually check the extracted facial images to ensure their quality. We then set the original order of these facial images as the positive answer, and three random shuffles as negative answers. In total, we generate 1,200 questions for 280 validation videos and 1,500 questions for 420 test videos. An example query is as follows:

{
  "question_id": 1,
  "video_id": "-9GYpCvGIgM",
  "step_caption": [
    "Use moisturizer on face",
    "Use primer on face",
    "Use foundation on face with brush",
    "Use concealer on under-eye",
    "Use powder on face with brush",
    "Use blush on cheek with brush",
    "Use brow gel on eyebrow",
    "Use eyeshadow on eyelid with brush",
    "Use curler on eyelash",
    "Use mascara on eyelash",
    "Use lipstick on lip",
    "Use lip gloss on lip"
  ],
  "groundtruth": [4, 1, 2, 5, 3],
  "candidate_answer": [[4, 1, 2, 5, 3], [5, 2, 4, 1, 3], [2, 1, 3, 5, 4], [2, 5, 4, 3, 1]]
}

    The corresponding images are provided in files named by the question id.
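Each query can be read with standard JSON tooling. The sketch below parses one query; the field names follow the example above, while everything else (the trivial "choose the first candidate" step) is purely illustrative:

```python
import json

# Parse one Facial Image Ordering query (fields as in the example above)
# and return its id and the four candidate orderings.
def parse_question(line):
    q = json.loads(line)
    return q["question_id"], q["candidate_answer"]

line = ('{"question_id": 1, "video_id": "-9GYpCvGIgM", '
        '"candidate_answer": [[4, 1, 2, 5, 3], [5, 2, 4, 1, 3], '
        '[2, 1, 3, 5, 4], [2, 5, 4, 3, 1]]}')
qid, candidates = parse_question(line)
print(qid, len(candidates))  # 1 4
```

A model would score each candidate ordering against the extracted facial images and submit the highest-scoring one.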

 

Step Ordering

    We select videos with more than four steps to generate questions. For each question, we provide a video and five step descriptions from it. The positive answer is the original order of the five descriptions, while the negative answers are random shuffles. We construct 1,200 questions for 280 validation videos and 3,200 questions for 420 test videos. Note that the test videos of the step ordering task do not overlap with those of the image ordering task, to avoid leaking information. An example query is as follows:

{
  "question_id": 1,
  "video_id": "-2FjMSPITn8",
  "step_caption": {
    "1": "Apply highlighters on cheeks and nose with brush",
    "2": "Apply lip blam on lips",
    "3": "Apply foundation on face with brush",
    "4": "Contour the cheeks with brush",
    "5": "Apply concealer on the under-eye area and nose with brush"
  },
  "groundtruth": [3, 5, 4, 1, 2],
  "candidate_answer": [[1, 4, 3, 5, 2], [3, 5, 4, 1, 2], [2, 4, 3, 5, 1], [3, 1, 4, 2, 5]]
}
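Since every question has four candidates with exactly one correct, random guessing scores about 25% accuracy, which is the natural floor for both sub-challenges. A minimal chance-baseline sketch (question structure as in the examples above; the function and seed are illustrative):

```python
import random

# Pick one of the four candidate orderings uniformly at random per question.
def chance_baseline(questions, seed=0):
    rng = random.Random(seed)
    return {str(q["question_id"]): rng.choice(q["candidate_answer"])
            for q in questions}

questions = [{"question_id": 1,
              "candidate_answer": [[3, 5, 4, 1, 2], [1, 4, 3, 5, 2],
                                   [2, 4, 3, 5, 1], [3, 1, 4, 2, 5]]}]
answers = chance_baseline(questions)
print(len(answers))  # 1
```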

Terms and Conditions

If you use this dataset, please cite 

@inproceedings{wang2019youmakeup,
  title={YouMakeup: A Large-Scale Domain-Specific Multimodal Dataset for Fine-Grained Semantic Comprehension},
  author={Wang, Weiying and Wang, Yongcheng and Chen, Shizhe and Jin, Qin},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
  pages={5136--5146},
  year={2019}
}

If you use our baseline, please cite 

@inproceedings{chen2020vqabaseline,
  title={YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in Domain-Specific Videos},
  author={Chen, Shizhe and Wang, Weiying and Ruan, Ludan and Yao, Linli and Jin, Qin},
  year={2020}
}

Submission Procedure

    To use CodaLab, please first apply for entry on the results submission page; we will approve your application the first time.

    To enter the competition, you need to generate the answer choices locally on your machine: run your trained model on the dev and test queries, then submit the chosen options in the required JSON format.
    To submit, name your JSON file "answer.json" and zip it directly (do not put it inside a folder), then upload the zip file to the corresponding track: Facial Image Ordering (Dev), Facial Image Ordering (Test), Facial Step Ordering (Dev), or Facial Step Ordering (Test).
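The packaging step above can be sketched as follows; the answers dictionary is a placeholder, and only the requirement that answer.json sits at the zip root comes from the rules above:

```python
import json
import zipfile

# Write answer.json, then zip it at the archive root (no enclosing folder),
# as the submission rules require. The answers here are placeholders.
answers = {"1": [4, 1, 2, 5, 3], "2": [5, 2, 4, 1, 3]}
with open("answer.json", "w") as f:
    json.dump(answers, f)
with zipfile.ZipFile("submission.zip", "w") as zf:
    zf.write("answer.json", arcname="answer.json")
```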

 

Submission Format

    Submissions must be a valid JSON dictionary containing one chosen option for each query, and must cover all queries in the dev or test set. Each entry is a (key=question_id, value=chosen option) pair. The saved JSON file looks as follows:

{
  "1": [4, 1, 2, 5, 3],
  "2": [5, 2, 4, 1, 3],
  "3": [2, 1, 3, 5, 4],
  ...
}
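Since a submission missing any query id is invalid, it is worth checking coverage before zipping. A small illustrative check (the id lists are placeholders; in practice they would come from the query files):

```python
# Check that a submission dictionary covers exactly the expected query ids.
def validate(answers, expected_ids):
    missing = set(expected_ids) - set(answers)
    extra = set(answers) - set(expected_ids)
    return sorted(missing), sorted(extra)

answers = {"1": [4, 1, 2, 5, 3], "2": [5, 2, 4, 1, 3]}
missing, extra = validate(answers, ["1", "2", "3"])
print(missing, extra)  # ['3'] []
```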

Dev(Facial Image Ordering)

Start: April 6, 2020, midnight

Description: This phase evaluates algorithms on the YouMakeup validation set of the image ordering task. We recommend using this Dev phase for algorithm validation; it should not be used for reporting results in papers. To be valid, a submission must contain chosen answers for all 1,200 validation questions.

Dev(Step Ordering)

Start: April 6, 2020, midnight

Description: This phase evaluates algorithms on the YouMakeup validation set of the step ordering task. We recommend using this Dev phase for algorithm validation; it should not be used for reporting results in papers. To be valid, a submission must contain chosen answers for all 1,200 validation questions.

Test(Facial Image Ordering)

Start: April 8, 2020, midnight

Description: This phase evaluates algorithms on the YouMakeup public test set of the image ordering task. We recommend using this phase for reporting comparison numbers in academic papers. To be valid, a submission must contain chosen answers for all 1,500 test questions. This phase is intended for the final evaluation of the model, and creating multiple submissions through multiple teams is not allowed.

Test(Step Ordering)

Start: April 8, 2020, midnight

Description: This phase evaluates algorithms on the YouMakeup public test set of the step ordering task. We recommend using this phase for reporting comparison numbers in academic papers. To be valid, a submission must contain chosen answers for all 3,200 test questions. This phase is intended for the final evaluation of the model, and creating multiple submissions through multiple teams is not allowed.

Competition Ends

June 1, 2020, 11 p.m. UTC

Rank | Username  | Score
1    | acdart    | 0.69067
2    | LinliYao  | 0.58933
3    | RUCer-RLD | 0.40000