microsoft/Florence-2-large · Trying to get bounding box confidence values for object detection

I am trying to get the bounding boxes, confidence scores, and labels for object detection predictions on an image.
The code I am using is below.

        image = Image.open(image_path)

        inputs = self.processor(text=self.prompt, images=image, return_tensors="pt")

        generated_ids = self.model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=128,
            num_beams=2,
            do_sample=False,
            return_dict_in_generate=True,
            output_scores=True,
        )
        generated_text = self.processor.batch_decode(
            generated_ids.sequences, skip_special_tokens=False
        )[0]

        parsed_answer = self.processor.post_process_generation(
            generated_text, task=self.prompt, image_size=(image.width, image.height)
        )

        transition_scores = self.model.compute_transition_scores(
            generated_ids.sequences,
            generated_ids.scores,
            generated_ids.beam_indices,
            normalize_logits=False,
        )

        bounding_box_tokens = generated_ids.sequences[0][4:-1].numpy()
        bounding_box_scores = transition_scores[0][3:-1].numpy()

        bounding_box_indexs = np.where(
            np.logical_and(bounding_box_tokens >= 50269, bounding_box_tokens <= 51268)
        )

        bounding_box_scores = bounding_box_scores[bounding_box_indexs]

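        # Group the scores four per box and take exp(mean(score)). If the
        # transition scores are log-probabilities, this is the geometric
        # mean of the four location-token probabilities.
        score_split_arrays = np.exp(
            np.mean(
                np.array_split(bounding_box_scores, len(bounding_box_scores) // 4),
                axis=1,
            )
        )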

        return (
            torch.Tensor(parsed_answer[self.prompt]["bboxes"]),
            torch.Tensor(score_split_arrays),
            parsed_answer[self.prompt]["labels"],
        )

As you can see, this is very similar to the example implementation. The key question is whether my approach to isolating the tokens is correct.

I trim the start and end of both the token sequence and the scores, since those positions seem to belong to special tokens that mark the boundaries of the sequence. I then find the positions in the token sequence where the token ID is between 50269 and 51268, which I believe is the range of the location tokens, and use those positions to pick out the corresponding scores. This assumes that, once both are sliced as above, each score sits at the same index as its token. Is that the case? And more generally, does this approach actually do what I think it does?
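To make the alignment assumption easier to test in isolation, here is a minimal sketch of the grouping step on synthetic data, with no model involved. The helper name `box_confidences` is hypothetical, and the location-token range 50269 to 51268 is only my assumption from above, not something I have confirmed against the tokenizer.

```python
import numpy as np

# Assumed range of Florence-2 location tokens (taken from the question, unverified).
LOC_TOKEN_MIN = 50269
LOC_TOKEN_MAX = 51268


def box_confidences(token_ids, token_log_probs):
    """Return one confidence per box from aligned tokens and log-probabilities.

    token_log_probs[i] must be the log-probability of token_ids[i].
    Each box is assumed to consist of exactly four location tokens.
    """
    token_ids = np.asarray(token_ids)
    token_log_probs = np.asarray(token_log_probs, dtype=float)
    if token_ids.shape != token_log_probs.shape:
        raise ValueError("tokens and scores are not aligned one-to-one")

    # Keep only the scores whose token falls in the assumed location range.
    is_loc = (token_ids >= LOC_TOKEN_MIN) & (token_ids <= LOC_TOKEN_MAX)
    loc_log_probs = token_log_probs[is_loc]
    if loc_log_probs.size % 4 != 0:
        raise ValueError("location tokens do not form complete boxes")

    # exp(mean(log p)) is the geometric mean of the four token probabilities.
    return np.exp(loc_log_probs.reshape(-1, 4).mean(axis=1))


# Synthetic sequence: label token, four location tokens, label token, four location tokens.
tokens = [100, 50300, 50400, 50500, 50600, 200, 50269, 51268, 50700, 50800]
log_probs = np.log([0.9, 0.5, 0.5, 0.5, 0.5, 0.8, 1.0, 1.0, 1.0, 1.0])
confidences = box_confidences(tokens, log_probs)
print(confidences)
```

The shape check is the part that matters for my question. The `compute_transition_scores` example in the Transformers documentation takes `outputs.sequences[:, 1:]` as the generated tokens for encoder-decoder models, which would explain the one-position offset between my two slices. I am assuming Florence-2 behaves the same way, and comparing `generated_ids.sequences.shape` with `transition_scores.shape` should confirm or refute that.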