This challenge is divided into two categories.

Category 1: Surgical tool classification and localization

Teams in this category must train weakly supervised models. The model should classify the tools present in each frame of the test-set video clips and localize them with bounding boxes, training on the noisy tool presence labels provided in the training set.
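To make the gap between the training signal and the expected output concrete, here is a minimal sketch of how the two could be represented. Everything in it is an assumption made for illustration: the tool names, the `Box` fields, the pixel coordinate convention, and the choice of a multi-hot presence vector are hypothetical and are not the challenge's actual data or submission format.

```python
from dataclasses import dataclass

# Hypothetical tool vocabulary; the real class list comes from the challenge data.
TOOL_CLASSES = ["tool_a", "tool_b", "tool_c"]


@dataclass
class Box:
    """One predicted tool in one frame (assumed pixel coordinates)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def presence_vector(tools_present):
    """Encode a set of tool names as a multi-hot vector over TOOL_CLASSES."""
    return [1 if name in tools_present else 0 for name in TOOL_CLASSES]


def frame_presence(boxes):
    """Collapse per-frame box predictions back to a presence vector."""
    return presence_vector({box.label for box in boxes})


# Weak (and possibly noisy) training signal: which tools are present, no boxes.
train_label = presence_vector({"tool_a", "tool_c"})

# Expected model output at test time: boxes with class labels for a frame.
frame_prediction = [Box("tool_a", 10.0, 20.0, 110.0, 140.0)]
```

The point of the sketch is the asymmetry: training labels carry no spatial information, so a predicted box can only be compared with a label after it is collapsed to presence, as `frame_presence` does.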

Category 2: Surgical visual question answering

Teams in this category must also train weakly supervised models. Here the model should generate answers to open-ended questions about 30-second video clips. The training labels are tool presence labels, surgical steps, and a description of the surgical step categories.