A step towards a General World Model (GWM) that can simulate complex video scenarios with natural-language actions.