The person detection model (MobileNet v1) takes a 96x96 input image. Models like this usually need only a small input image to work well, so the 640x480 image has to be downscaled to 96x96. This isn’t hard to change, and the code for it is already in the design (it is just not shown in the diagram). For the OV2640 camera we use, we put the camera in QQVGA mode, which is 160x120, and then downscale to 96x96. The downscaling is just a few counters that skip some pixels and lines. The model also works on monochrome images, so you only need to keep the Y (luma) component of the incoming YCbCr data.
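As a rough software model of the counter-based downscaling described above, here is a minimal Python sketch. It is not the code from the design: the function name `downscale_luma`, the accumulator scheme, and the input layout (one `(y, cb, cr)` tuple per pixel) are assumptions made for illustration. It also assumes the whole 160x120 frame is squeezed to 96x96 without cropping, which changes the aspect ratio; the actual design may crop instead.

```python
SRC_W, SRC_H = 160, 120  # QQVGA frame size from the camera
DST_W, DST_H = 96, 96    # input size expected by the model


def downscale_luma(frame):
    """Decimate a YCbCr frame to DST_W x DST_H luma values.

    frame: SRC_H rows, each a sequence of SRC_W (y, cb, cr) tuples
           (hypothetical layout, chosen for clarity).
    Returns DST_H rows of DST_W luma (Y) values.
    """
    out = []
    line_acc = 0  # line counter: decides which lines are kept
    for row in frame:
        line_acc += DST_H
        if line_acc < SRC_H:
            continue  # skip this line
        line_acc -= SRC_H

        out_row = []
        pix_acc = 0  # pixel counter: decides which pixels are kept
        for (y, _cb, _cr) in row:
            pix_acc += DST_W
            if pix_acc < SRC_W:
                continue  # skip this pixel
            pix_acc -= SRC_W
            out_row.append(y)  # keep only the Y component
        out.append(out_row)
    return out


if __name__ == "__main__":
    # Synthetic test frame: luma is a simple function of position.
    frame = [[((r * 7 + c) % 256, 128, 128) for c in range(SRC_W)]
             for r in range(SRC_H)]
    small = downscale_luma(frame)
    print(len(small), len(small[0]))
```

With these sizes the counters keep 4 of every 5 lines and 3 of every 5 pixels per line, which is the kind of pattern a pair of small hardware counters can produce without any multipliers or frame buffering.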