Evaluation of the submissions will be performed on a private test data set containing individual surgical video files sampled at 1 FPS, through the grand-challenge.org automated Docker submission and evaluation system. The test data and labels are hidden from the participants. The following metrics will be used to evaluate algorithm performance and will be displayed on the leaderboards.
Category 1: Surgical tool detection
The standard COCO dataset bounding box detection evaluation metric (mAP@[0.5:0.05:0.95] - https://cocodataset.org/#detection-eval) will be used for evaluation and for determining the winners.
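For reference, the sketch below shows how this metric is typically computed with the pycocotools library. The file names ground_truth.json and predictions.json are placeholders (assumptions for illustration, not the challenge's actual file layout), and both files are assumed to follow the standard COCO format:

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder file names; both files are assumed to be in COCO JSON format.
coco_gt = COCO('ground_truth.json')            # ground-truth annotations
coco_dt = coco_gt.loadRes('predictions.json')  # detections with 'score' fields

# 'bbox' selects bounding-box evaluation (vs. segmentation or keypoints)
evaluator = COCOeval(coco_gt, coco_dt, iouType='bbox')
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

# stats[0] is AP averaged over IoU thresholds 0.5:0.05:0.95 (the leaderboard metric)
print('mAP@[0.5:0.05:0.95]:', evaluator.stats[0])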
Category 2: Surgical visual question answering
The BLEU score is a metric used to evaluate the quality of text generated by models, most commonly in machine translation. In our evaluation, it will be applied to measure how closely your generated answers match a set of reference answers. This is done by comparing overlapping word sequences (n-grams), using the weights (0.5, 0.5, 0, 0), i.e. BLEU-2: single words (unigrams) and word pairs (bigrams) contribute equally, and longer n-grams are ignored. Your model's output will therefore be assessed on how well it reproduces the words and word pairs found in the reference answers.
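As a toy illustration (the sentences here are invented, not challenge data), the following shows how a single candidate/reference pair would be scored with NLTK under these weights:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the surgeon uses a grasper".split()
candidate = "the surgeon uses a scissor".split()

# 4/5 unigrams and 3/4 bigrams match, so BLEU-2 = sqrt(0.8 * 0.75) ~= 0.775
score = sentence_bleu([reference], candidate,
                      weights=(0.5, 0.5, 0, 0),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-2: {score:.3f}")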
References:
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311-318). Association for Computational Linguistics. https://www.aclweb.org/anthology/P02-1040.pdf
Sample BLEU evaluation code snippet:
import pandas as pd
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Load the CSV file.
# Assume the CSV has columns 'Question' and 'Answer'.
df = pd.read_csv('answers.csv')

# Group answers by question.
grouped = df.groupby('Question')['Answer'].apply(list).reset_index()

def calculate_scores(group):
    answers = group['Answer']
    references_tokens = [answer.split() for answer in answers]
    # The first answer is treated as the candidate; the remaining
    # answers serve as references.
    candidate_tokens = references_tokens[0] if len(references_tokens) > 0 else []

    # BLEU score with equal unigram/bigram weights (standard BLEU-2)
    smoothing_function = SmoothingFunction().method1
    weights = (0.5, 0.5, 0, 0)
    bleu = sentence_bleu(references_tokens[1:], candidate_tokens,
                         weights=weights,
                         smoothing_function=smoothing_function)
    return bleu

# Apply the scoring function to each group.
grouped['BLEU Score'] = grouped.apply(calculate_scores, axis=1)

# Display the results.
print(grouped[['Question', 'BLEU Score']])