Evaluation of submissions will be performed on a private test set containing individual surgical video files sampled at 1 FPS, via the grand-challenge automated Docker submission and evaluation system. The test data and labels are hidden from participants. The following metrics will be used to evaluate algorithm performance and will be displayed on the leaderboards.
Category 1: Surgical tool detection
The standard COCO bounding box detection evaluation metric (mAP@[0.5:0.05:0.95], see https://cocodataset.org/#detection-eval) will be used for evaluation and for determining the winners.
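mAP@[0.5:0.05:0.95] averages precision over ten intersection-over-union (IoU) thresholds from 0.50 to 0.95 in steps of 0.05. The matching criterion underneath is the IoU between a predicted and a ground-truth box. As a rough sketch of that criterion in COCO's [x, y, width, height] box format (for the full metric, use the official pycocotools package rather than re-implementing it):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in COCO [x, y, width, height] format."""
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    # Overlap rectangle (clamped at zero when the boxes do not intersect)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

# The ten IoU thresholds averaged by mAP@[0.5:0.05:0.95]
thresholds = [0.5 + 0.05 * i for i in range(10)]
```

A detection is counted as a true positive at a given threshold only if its IoU with an unmatched ground-truth box meets that threshold; averaging over the ten thresholds rewards tight localization.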
Category 2: Surgical visual question answering
For each question, five different answers will be provided as ground truth. For example, the question "Was a large needle driver used during the surgery?" may have the following list of ground-truth answers:
[ "No", "No, a large needle driver was not used.", "A large needle driver was not used.", "No large needle driver was utilized.", "There is no indication a large needle driver was used." ]
You can download 10 video samples with questions and answers, in the same format used for evaluation, to use as a reference here.
Your algorithm should generate a single predicted answer for each question. The evaluation container will then compute the BLEU score between your predicted answer and each of the five ground-truth answers. The highest BLEU score among these five comparisons will be used as the final score for that question. Below is an example of an output from the evaluation container:
Candidate answer (output from your algorithm): no
reference tokens ['no']
candidate tokens ['kakaka', 'force'] reference tokens ['no', 'a', 'large', 'needle', 'driver', 'was', 'not', 'used']
candidate tokens ['kakaka', 'force'] reference tokens ['a', 'large', 'needle', 'driver', 'was', 'not', 'used']
candidate tokens ['kakaka', 'force'] reference tokens ['no', 'large', 'needle', 'driver', 'was', 'utilized']
candidate tokens ['kakaka', 'force'] reference tokens ['there', 'is', 'no', 'indication', 'a', 'large', 'needle', 'driver', 'was', 'used']
BLEU scores [0.1778279410038923, 0.00016215809237314185, 0, 0.0011981952414407235, 2.1945711360427958e-05]
Max BLEU score calculated: 0.1778279410038923
The platform will then take the mean of the max BLEU scores calculated for all Q&A pairs. For example:
{
  "results": [
    { "bleu_score": 0 },
    { "bleu_score": 0.1778279410038923 },
    { "bleu_score": 0.1778279410038923 }
  ],
  "aggregates": {
    "bleu_score": 0.11855196066926153
  }
}
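The per-question max-BLEU step and the mean aggregation can be sketched as follows, using NLTK's `sentence_bleu` (the same function used in the sample snippet below). Note this is only a sketch: the tokens in the log above suggest the evaluation container lowercases and strips punctuation before tokenizing, while this sketch uses a plain whitespace split.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

WEIGHTS = (0.25, 0.25, 0.25, 0.25)          # uniform 1- to 4-gram weights
SMOOTHING = SmoothingFunction().method1     # handles missing higher-order n-grams

def max_bleu(candidate, references):
    """Best sentence-level BLEU of one predicted answer over all reference answers."""
    candidate_tokens = candidate.split()
    scores = [
        sentence_bleu([ref.split()], candidate_tokens,
                      weights=WEIGHTS, smoothing_function=SMOOTHING)
        for ref in references
    ]
    return max(scores)

references = [
    "No",
    "No, a large needle driver was not used.",
    "A large needle driver was not used.",
    "No large needle driver was utilized.",
    "There is no indication a large needle driver was used.",
]

per_question = [max_bleu("No", references)]        # one entry per Q&A pair
aggregate = sum(per_question) / len(per_question)  # mean of the per-question max scores
```

With the candidate "No" matching the first reference exactly, the max score is 0.1778… (the same value as in the log above: only the unigram matches, and the three smoothed higher-order precisions contribute 0.1 each).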
The BLEU score is a metric commonly used to evaluate the quality of text generated by models, such as in machine translation tasks. In our evaluation, it measures how closely your model’s predicted answer matches a set of reference answers by comparing sequences of words, or n-grams.
We use uniform BLEU weights of (0.25, 0.25, 0.25, 0.25), which means the score takes into account unigrams, bigrams, trigrams, and 4-grams equally. This encourages your model to generate responses that not only contain the right words but also preserve natural and coherent word order across short phrases.
A smoothing function is also applied to handle edge cases such as short outputs or missing n-grams.
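To see why smoothing matters, compare a short candidate answer scored with and without it (again a sketch with NLTK; the evaluation container's exact behavior may differ). A three-token candidate has no 4-grams at all, so without smoothing the geometric mean collapses to (effectively) zero, while method1 substitutes a small epsilon for the missing n-gram counts and keeps a usable positive score:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["no", "large", "needle", "driver", "was", "used"]]
candidate = ["no", "large", "driver"]  # only 3 tokens: no trigram match, no 4-grams

weights = (0.25, 0.25, 0.25, 0.25)
# Default (no smoothing): zero higher-order precisions wipe out the score
unsmoothed = sentence_bleu(reference, candidate, weights=weights)
# method1 smoothing: missing n-gram counts are replaced by a small epsilon
smoothed = sentence_bleu(reference, candidate, weights=weights,
                         smoothing_function=SmoothingFunction().method1)
```

Here `unsmoothed` is effectively zero despite the candidate sharing all three of its words with the reference, while `smoothed` remains a small positive value, so partially correct short answers still earn partial credit.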
References:
Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (pp. 311-318). Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/P02-1040.pdf
Sample BLEU evaluation code snippet:
```python
import pandas as pd
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Load the CSV file.
# Assume the CSV has columns 'Question' and 'Answer'; for each question,
# the first answer is treated as the candidate and the rest as references.
df = pd.read_csv('answers.csv')

# Group answers by question
grouped = df.groupby('Question')['Answer'].apply(list).reset_index()

def calculate_score(group):
    answers = group['Answer']
    tokens = [answer.split() for answer in answers]
    candidate_tokens = tokens[0] if tokens else []
    references_tokens = tokens[1:]
    if not candidate_tokens or not references_tokens:
        return 0.0
    # BLEU score with uniform 1- to 4-gram weights and method1 smoothing
    smoothing_function = SmoothingFunction().method1
    weights = (0.25, 0.25, 0.25, 0.25)
    return sentence_bleu(references_tokens, candidate_tokens,
                         weights=weights, smoothing_function=smoothing_function)

# Apply the scoring function to each group
grouped['BLEU Score'] = grouped.apply(calculate_score, axis=1)

# Display the results
print(grouped[['Question', 'BLEU Score']])
```