text-to-video

Hybrid multimodal video generation that combines audio, video, image, and text inputs in any mix. It suits music-to-video, borrowing camera motion and rhythm from a reference video, and blending multiple sources into one result. At least one media input (image, video, or audio) must be provided; text-only generation is not supported.

meitu video-multimodal-generate

Usage Examples

# Music-driven video (required parameters + audio)
meitu video-multimodal-generate \
  --reference_audio_list ./music.mp3 \
  --prompt "Visuals that follow the rhythm of the music" \
  --json

# Image + video driven
meitu video-multimodal-generate \
  --image_list ./style.jpg \
  --reference_video_list ./camera.mp4 \
  --prompt "Camera motion reference + reference image style" \
  --json

# Multiple sources with full parameters and result download
meitu video-multimodal-generate \
  --image_list ./ref1.jpg \
  --reference_video_list ./ref.mp4 \
  --reference_audio_list ./bgm.mp3 \
  --prompt "Multi-source blended generation" \
  --video_duration 8 \
  --ratio 16:9 \
  --sound on \
  --json \
  --download-dir ./output
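
When a script consumes the `--json` output, it usually wants only the result URLs. The helper below is a minimal sketch of that step, not part of the meitu CLI. It assumes the JSON payload carries an array of URL strings at `data.result.urls` (the path named in the `--output` parameter description) and that `jq` is installed. The function name `extract_result_urls` is hypothetical.

```shell
# Print one result URL per line from JSON read on stdin.
# Assumption: the payload has an array of strings at .data.result.urls.
# The trailing "?" makes jq print nothing (instead of failing) when the path is absent.
extract_result_urls() {
  jq -r '.data.result.urls[]?'
}
```

If the CLI writes its JSON to standard output (also an assumption), the output of any of the commands above run with `--json` could be piped straight into `extract_result_urls`.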

Parameters

| Parameter | Required | Description |
| --- | --- | --- |
| --image_list | No | Type: string[]; optional reference images (up to 9) |
| --reference_video_list | No | Type: string[]; optional reference videos (up to 3; total duration max 15 seconds) |
| --reference_audio_list | No | Type: string[]; optional audio drive (up to 3; total duration max 15 seconds) |
| --prompt | Yes | Type: string; generation description. At least one media type must also be provided; total assets max 12 |
| --video_duration | No | Type: number; default: -1 (auto); generated video duration |
| --ratio | No | Type: string; aspect ratio |
| --sound | No | Type: string; options: on / off; whether to include audio |
| --resolution | No | Type: string; output resolution |
| --download-dir | No | Type: string; downloads result files to the specified local directory |
| --output | No | Type: string[]; specifies output file paths, mapped in order to data.result.urls |
| --json | No | Outputs results in JSON format for script or agent parsing |
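
The count limits in the table (up to 9 images, up to 3 videos, up to 3 audio files, at most 12 assets in total, and at least one media input) can be checked before a request is sent. The function below is a sketch of such a pre-flight check. The name `check_asset_counts` is hypothetical and is not a feature of the meitu CLI. It does not check the 15-second total duration limits, which would need a media probing tool.

```shell
# Pre-flight check for the documented asset count limits (illustrative only).
# Usage: check_asset_counts <num_images> <num_videos> <num_audios>
# Returns 0 when the counts are within the limits, 1 otherwise.
check_asset_counts() {
  local images=$1 videos=$2 audios=$3
  local total=$((images + videos + audios))

  if [ "$total" -lt 1 ]; then
    echo "at least one image, video, or audio file is required" >&2
    return 1
  fi
  if [ "$images" -gt 9 ]; then
    echo "too many images: $images (max 9)" >&2
    return 1
  fi
  if [ "$videos" -gt 3 ]; then
    echo "too many reference videos: $videos (max 3)" >&2
    return 1
  fi
  if [ "$audios" -gt 3 ]; then
    echo "too many audio files: $audios (max 3)" >&2
    return 1
  fi
  if [ "$total" -gt 12 ]; then
    echo "too many assets in total: $total (max 12)" >&2
    return 1
  fi
  return 0
}
```

The total cap matters because the per-type maximums add up to 15: `check_asset_counts 9 3 3` fails, while `check_asset_counts 9 3 0` passes.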