A multi-modal AI system for video