Quick Brief:
NEO-ov is a single, encoder-free vision-language model that takes pixels and text directly into one transformer, learning pixel–word and cross-frame relations end-to-end for images, multi-image inputs, and video.
Why It Matters:
Shows that fully native multimodal transformers can scale and compete with the standard “vision encoder + LLM” design. Points toward future VLMs that better preserve low-level visual detail and temporal structure, especially for video and perception-heavy tasks.