Evaluation of the submissions will be performed on a private test data set containing individual surgical video files sampled at 1 FPS, through the grand-challenge automated Docker submission and evaluation system. The test data and labels are hidden from the participants. The following metrics will be used to evaluate algorithm performance and will be displayed on the leaderboards.

Category 1: Surgical tool detection

The standard COCO dataset bounding box detection evaluation metric (mAP@[0.5:0.05:0.95] - https://cocodataset.org/#detection-eval) will be used for evaluation and for determining the winners.
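
For local sanity checks, below is a minimal sketch of how this metric can be computed with the pycocotools package, assuming your ground truth and detections are stored in COCO-format JSON files; the file names are placeholders, not files provided by the challenge.

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder file names: COCO-format ground-truth annotations and a list of
# detections with {"image_id", "category_id", "bbox", "score"} entries
coco_gt = COCO('ground_truth.json')
coco_dt = coco_gt.loadRes('detections.json')

evaluator = COCOeval(coco_gt, coco_dt, iouType='bbox')
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

# stats[0] is AP averaged over IoU thresholds 0.50:0.05:0.95 (the ranking metric)
print(f"mAP@[0.5:0.05:0.95]: {evaluator.stats[0]:.4f}")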

Category 2: Surgical visual question answering

The BLEU score is a metric used to evaluate the quality of text generated by models, originally developed for machine translation. In our evaluation, it will be applied to measure how closely your generated answers match a set of reference answers by comparing overlapping sequences of words (n-grams). We use the weights (0.5, 0.5, 0, 0), i.e. BLEU-2: unigram and bigram precision are weighted equally and higher-order n-grams are ignored, so your model's output will be assessed on how well it reproduces the individual words and word pairs found in the reference answers.
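
As a minimal illustration of these weights (the sentences below are made up for this example and do not come from the challenge data), a single generated answer can be scored against one reference answer as follows:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = 'the grasper is retracting the gallbladder'.split()   # reference answer (illustrative)
candidate = 'the grasper retracts the gallbladder'.split()        # generated answer (illustrative)

# BLEU-2: unigram and bigram precision weighted equally, higher orders ignored
score = sentence_bleu(
    [reference],
    candidate,
    weights=(0.5, 0.5, 0, 0),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-2: {score:.4f}")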

References:

Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (pp. 311-318). Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/P02-1040.pdf

Sample BLEU evaluation code snippet:

import pandas as pd
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Load the CSV file
# Assume the CSV has columns 'Question' and 'Answer'
df = pd.read_csv('answers.csv')

# Group answers by question
grouped = df.groupby('Question')['Answer'].apply(list).reset_index()

def calculate_scores(group):
    # Tokenize every answer for this question by whitespace
    answers = group['Answer']
    references_tokens = [answer.split() for answer in answers]

    # In this sample, the first answer is treated as the candidate and the
    # remaining answers as references; in the actual evaluation, your model's
    # generated answer is the candidate and the hidden ground-truth answers
    # are the references.
    candidate_tokens = references_tokens[0] if references_tokens else []
    reference_tokens = references_tokens[1:]

    # A question with no reference answers cannot be scored
    if not candidate_tokens or not reference_tokens:
        return 0.0

    # BLEU-2: unigram and bigram precision weighted equally
    smoothing_function = SmoothingFunction().method1
    weights = (0.5, 0.5, 0, 0)
    return sentence_bleu(reference_tokens, candidate_tokens, weights=weights,
                         smoothing_function=smoothing_function)

# Apply the scoring function to each group
grouped['BLEU Score'] = grouped.apply(calculate_scores, axis=1)

# Display the results
print(grouped[['Question', 'BLEU Score']])