When user requirements exceed a single mode's limits, suggest combining modes or switching to a more capable model before starting. Ask the user to confirm the approach.
kf2v (first+last frame) and vace are capped at 5s. If the user wants a longer video:
1. Preferred: Use wan2.6-t2v or wan2.6-i2v, both of which natively support durations up to 15s. If the user has a first-frame image, i2v is a direct replacement for kf2v with longer duration support.
2. If kf2v is essential (user needs both first AND last frame control): Generate the 5s kf2v video first, then extend it using vace video_extension — trim the kf2v output to a ≤3s tail clip and use it as first_clip_url to generate the next 5s segment. Repeat for additional segments, then concatenate all segments (see merge-media.md).
3. For any mode: To produce videos >15s, chain multiple generations — use the last frame or tail clip of each segment as input for the next. Concatenate final segments (see merge-media.md). Inform the user this is a multi-step workflow.
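The chaining step in option 3 can be planned before any generation starts. The following is a minimal sketch, assuming whole-second durations and a 15s per-generation cap; `plan_segments` is a hypothetical helper, not part of any SDK, and the actual model may only accept specific duration values.

```python
def plan_segments(target_s: int, max_segment_s: int = 15) -> list[int]:
    """Split a target duration into consecutive segments of at most max_segment_s."""
    if target_s <= 0 or max_segment_s <= 0:
        raise ValueError("durations must be positive")
    segments = []
    remaining = target_s
    while remaining > 0:
        seg = min(max_segment_s, remaining)  # never exceed the per-generation cap
        segments.append(seg)
        remaining -= seg
    return segments


# A 40s request becomes three generations that are chained and concatenated.
plan = plan_segments(40)
print(f"{len(plan)} generations: {plan}")  # 3 generations: [15, 15, 10]
```

The segment count is also a convenient number to show the user when explaining that the request is a multi-step workflow.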
kf2v generates a 5s transition between the first and last frames. By design, the model interpolates motion between the two frames: it respects the visual content of both but relies heavily on the prompt to determine how the transition happens. If the generated video doesn't match the input images well:
1. Turn off prompt_extend — Smart rewriting can drift from original intent. Set "prompt_extend": false to keep the model closer to your literal description.
2. Write transition-focused prompts — Instead of describing static scenes, describe the motion/change: "camera gradually rises from eye level", "person slowly turns around". The prompt should bridge the visual gap between the two frames.
3. Minimize visual difference between frames. If the first and last frames are radically different (different scene, different character), the model may produce artifacts. Keep both frames within the same scene and subject.
4. Use higher resolution — "resolution": "1080P" (wan2.2-kf2v-flash) produces more detail-faithful output than 480P/720P.
5. Consider alternatives when image fidelity matters most:
- i2v (wan2.6-i2v) — Uses only the first frame but respects it more faithfully, supports up to 15s + audio. Best when the starting image must be preserved exactly.
- vace image_reference — Lets you mark images as subject (obj) or background (bg), giving explicit control over how each image is referenced.
- r2v — For character-consistent multi-reference. Preserves character identity across scenes.
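Tips 1, 2, and 4 above map onto a handful of request fields. The sketch below collects them in one place; the `prompt_extend` and `resolution` names and the `wan2.2-kf2v-flash` model come from the tips, while the flat layout, the `model` and `prompt` keys, and the `kf2v_fidelity_params` helper are assumptions for illustration. The real request body may nest these fields differently.

```python
import json


def kf2v_fidelity_params(transition_prompt: str) -> dict:
    """Collect the fidelity-oriented kf2v settings (hypothetical flat layout)."""
    return {
        "model": "wan2.2-kf2v-flash",   # the model named in tip 4 for 1080P
        "prompt": transition_prompt,    # describe the motion, not a static scene
        "prompt_extend": False,         # tip 1: disable smart rewriting
        "resolution": "1080P",          # tip 4: more detail-faithful output
    }


params = kf2v_fidelity_params("camera gradually rises from eye level")
print(json.dumps(params, indent=2))
```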
kf2v and vace produce silent video only. If the user also wants audio:
1. Switch mode: If the user can relax the first+last frame constraint, recommend wan2.6-i2v (first-frame + audio, up to 15s) or wan2.6-t2v (text + audio, up to 15s). These support audio_url for custom audio or auto-dubbing.
2. Post-process: Generate the silent video, then add an audio track (see merge-media.md for ffmpeg/moviepy recipes). If no audio file exists yet, create one with a separate TTS step before merging (see qianwen-audio-tts for speech synthesis).
3. Always inform the user about the silent limitation before generating, and propose the alternative.
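For the post-process route, the merge itself is a single ffmpeg call. This is a minimal sketch that only builds the command, assuming ffmpeg is installed and that the file names are placeholders; it is not taken from merge-media.md, which remains the reference for the actual recipes.

```python
import shlex


def mux_audio_cmd(video_path: str, audio_path: str, out_path: str) -> list[str]:
    """Build an ffmpeg command that adds an audio track to a silent video."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,      # input 0: silent video from kf2v or vace
        "-i", audio_path,      # input 1: narration or music
        "-map", "0:v:0",       # take the video stream from input 0
        "-map", "1:a:0",       # take the audio stream from input 1
        "-c:v", "copy",        # keep the video stream without re-encoding
        "-c:a", "aac",         # encode audio to AAC for MP4 output
        "-shortest",           # stop at the end of the shorter stream
        out_path,
    ]


cmd = mux_audio_cmd("silent.mp4", "narration.mp3", "final.mp4")
print(shlex.join(cmd))
# To execute: subprocess.run(cmd, check=True)
```

Because of `-shortest`, narration longer than the video is cut off, so it is worth checking the audio length against the planned video duration first.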
wan2.6 models (t2v, i2v, r2v) natively support multi-shot video — multiple camera angles and scenes in a single generation, up to 15s. This is the recommended approach for cinematic storytelling.
How to use:
1. Set shot_type: "multi" AND prompt_extend: true in the request.
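2. Structure the prompt using the 第N个镜头[X-Y秒] format (Chinese for "Shot N [X-Y seconds]"), or describe sequential scenes clearly. The example below follows a detective along a rainy street at night, through the door of an old building, and to a letter found in a dimly lit room: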
第1个镜头[0-3秒] 雨夜街头,侦探快步前行,镜头从远处跟随。
第2个镜头[3-5秒] 侦探推开老旧建筑的大门,镜头推近特写。
第3个镜头[5-8秒] 室内昏暗灯光下,侦探发现一封信,低头阅读。
3. For multi-character stories, combine with r2v mode: provide reference videos/images for each character, use character1/character2 in prompt.
Limits: Single generation max 15s. For longer narratives, generate multiple multi-shot segments and chain them (use the last frame of segment N as the first frame of segment N+1 via i2v).
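The shot-list format above is easy to generate and validate programmatically. The following is a sketch under stated assumptions: `build_multishot_prompt` is a hypothetical helper, the requirement that shots be contiguous is an assumption made here for safety rather than a documented rule, and the flat request layout with a `prompt` key is illustrative only.

```python
def build_multishot_prompt(shots, max_total_s=15):
    """Format (start_s, end_s, description) tuples as a multi-shot prompt."""
    lines = []
    prev_end = 0
    for i, (start, end, desc) in enumerate(shots, start=1):
        # Assumption: each shot begins where the previous one ended.
        if start != prev_end or end <= start:
            raise ValueError(f"shot {i} must start at {prev_end}s and have positive length")
        lines.append(f"第{i}个镜头[{start}-{end}秒] {desc}")
        prev_end = end
    if prev_end > max_total_s:
        raise ValueError(f"total {prev_end}s exceeds the {max_total_s}s single-generation limit")
    return "\n".join(lines)


prompt = build_multishot_prompt([
    (0, 3, "雨夜街头,侦探快步前行,镜头从远处跟随。"),
    (3, 5, "侦探推开老旧建筑的大门,镜头推近特写。"),
])
# Both flags are required for multi-shot generation, per step 1 above.
request_fields = {"shot_type": "multi", "prompt_extend": True, "prompt": prompt}
print(prompt)
```

Rejecting a shot list that runs past 15s up front signals that the narrative has to be split into several chained segments.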
VACE functions can be chained as post-processing steps after initial video generation. Common pipelines:
| Pipeline | Steps | Use case |
|----------|-------|----------|
| Generate + Extend | t2v/i2v → vace video_extension (×N) | Make a 5s video into 10s, 15s, ... |
| Generate + Outpaint | t2v/i2v → vace video_outpainting | Expand 16:9 to 9:16 (horizontal to vertical) |
| Generate + Edit | t2v/i2v → vace video_edit | Replace a character or remove an object |
| Generate + Repaint | t2v/i2v → vace video_repainting | Change art style while keeping motion |
| kf2v + Extend + Audio | kf2v → vace video_extension → concatenate segments → add audio | Controlled transition, longer, with sound |
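Every pipeline that includes video_extension needs a short tail clip cut from the previous segment. Here is a minimal sketch of that trimming step, assuming ffmpeg is available; `tail_clip_cmd` and the file names are hypothetical, and merge-media.md remains the reference recipe. It uses ffmpeg's `-sseof` input option, which seeks relative to the end of the file.

```python
import shlex


def tail_clip_cmd(src: str, dst: str, tail_s: float = 3.0) -> list[str]:
    """Build an ffmpeg command that keeps roughly the last tail_s seconds of src."""
    if not 0 < tail_s <= 3:
        raise ValueError("video_extension expects an input clip of at most 3s")
    return [
        "ffmpeg", "-y",
        "-sseof", f"-{tail_s:g}",  # seek to tail_s seconds before end of input
        "-i", src,
        dst,                       # re-encoded with ffmpeg's defaults for the container
    ]


cmd = tail_clip_cmd("segment_01.mp4", "segment_01_tail.mp4")
print(shlex.join(cmd))
# To execute: subprocess.run(cmd, check=True)
```

The clip is re-encoded rather than stream-copied because a stream copy can only cut on keyframes and may therefore come out longer than 3s.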
Chaining rules:
- video_extension: input clip must be ≤3s. If extending a 5s video, trim the last 2-3s to prepare first_clip_url (see merge-media.md).

For requests like "generate a kf2v transition, extend it to 15s, and add narration":
1. Plan the full pipeline with estimated step count and total time