Any documentation for the inputs to the vision model?

#2
by PaulTheHuman - opened

For the vision part of the model only (phi-3-v-128k-instruct-vision.onnx), is there any documentation for the inputs?

For example, it has two inputs: image_sizes and pixel_values.

The pixel_values input has a dimension called max_num_crops. Can you explain what this number does and, if it is set to a value greater than 1, what data should be sent in pixel_values? Is it a selection of crops from the same image? I'm a bit confused by this.

Also, what are the restrictions on the image sizes? Do the width and height have to be multiples of 336 pixels? What are the smallest and largest sizes allowed?
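To make the multiple-of-336 question concrete, here is a minimal sketch of the arithmetic involved. It only shows what rounding an image size up to the next multiple of 336 would look like; the value 336 comes from the question itself, and whether the model actually requires this alignment is not confirmed here. The function name `round_up_to_tile` is hypothetical and is not part of any Phi-3 or ONNX API.

```python
# Tile size taken from the question; NOT confirmed as a model requirement.
TILE = 336


def round_up_to_tile(width, height, tile=TILE):
    """Return (aligned_width, aligned_height, num_tiles) for the smallest
    size that is a multiple of `tile` and at least as large as the input."""
    if width <= 0 or height <= 0 or tile <= 0:
        raise ValueError("width, height and tile must be positive")
    cols = -(-width // tile)   # integer ceiling division
    rows = -(-height // tile)
    return cols * tile, rows * tile, cols * rows


if __name__ == "__main__":
    # A 500x700 image would need 2 columns and 3 rows of 336-pixel tiles.
    print(round_up_to_tile(500, 700))
```

If the model does work on 336-pixel tiles, the third value returned here would be the number of tiles such an image covers; how that relates to max_num_crops is part of what this question is asking.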
