A system that takes raw input (like an image) and produces final output (like structured text) in one unified model, rather than chaining multiple separate tools together.