Creating AI-Generated Music Videos: A Fun Experiment

I recently experimented with creating AI-generated music videos by combining several AI tools, and published the results in this YouTube playlist. The outcome was quite interesting; check out this example:

Here’s a quick overview of the process:

1. Image Generation with Flux-dev

First, I used Flux-dev on Replicate to generate anime-style images. Here’s an example prompt I used:

An anime-style illustration of a young girl standing outside a stadium with a bold 'Napoli' graffiti in the background. The girl has her hair styled in a unique braided ponytail, wearing a red sports jacket with white stripes, a gray sweatshirt, and a gold chain necklace. Her expression is introspective, and she is looking downward, creating a thoughtful mood. The background includes a detailed depiction of the stadium structure under a soft blue sky with clouds.
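To script this step, a call through Replicate's Python client looks roughly like the sketch below. The model slug `black-forest-labs/flux-dev` and the input parameter names follow Replicate's usual conventions, but treat them as assumptions and check the model page before relying on them:

```python
def build_flux_input(prompt: str, aspect_ratio: str = "16:9") -> dict:
    """Assemble the input payload for a Flux-dev run on Replicate.

    The parameter names (prompt, aspect_ratio, num_outputs) are
    assumptions based on Replicate's common model interfaces.
    """
    return {
        "prompt": prompt,
        "aspect_ratio": aspect_ratio,
        "num_outputs": 1,
    }


def generate_image(prompt: str) -> str:
    """Run Flux-dev on Replicate and return the first output URL.

    Requires the replicate package and a REPLICATE_API_TOKEN in the
    environment; this function is a sketch, not tested end to end.
    """
    import replicate

    output = replicate.run(
        "black-forest-labs/flux-dev",  # model slug assumed
        input=build_flux_input(prompt),
    )
    return str(output[0])
```

Separating the payload builder from the API call keeps the prompt-assembly logic easy to reuse later, e.g. when generating prompts automatically from lyrics.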

2. Music Generation with Suno

For the music, I used Suno to generate the audio tracks. While they don’t officially offer an API yet, there is an unofficial API available (though I haven’t tested it).

3. Image to Video with KLING 1.6

I then used KLING 1.6 to transform the static images into dynamic videos, adding subtle movements and transitions.
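I drove this step through a hosted platform rather than a documented public API, so as an illustration only, a request payload for an image-to-video call might look like this. The field names here are hypothetical, not a real KLING schema; check your provider's documentation:

```python
def build_kling_request(image_url: str, motion_prompt: str,
                        duration_seconds: int = 5) -> dict:
    """Assemble a hypothetical image-to-video request for KLING 1.6.

    All field names are illustrative assumptions. KLING typically
    generates short clips (around 5 or 10 seconds), hence the check.
    """
    if duration_seconds not in (5, 10):
        raise ValueError("expected a 5- or 10-second clip")
    return {
        "image": image_url,
        "prompt": motion_prompt,  # describes the desired camera movement
        "duration": duration_seconds,
        "mode": "standard",
    }
```

The motion prompt is where the "subtle movements and transitions" get specified, e.g. "slow camera pan upward, hair moving gently in the wind".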

4. Combining Audio and Video

Finally, I wrote a simple Python script using moviepy to combine the generated audio and video. The script loops the video for the duration of the audio track:

from moviepy.editor import VideoFileClip, ImageClip, AudioFileClip

# ... rest of the code ...
async def create_video(
    self,
    audio_file: UploadFile,
    visual_file: UploadFile,
    bottom_crop_percent: int = 0
) -> str:
    """
    Create a video from an audio file and either an image or video file.

    Args:
        audio_file: The uploaded audio file
        visual_file: The uploaded image or video file
        bottom_crop_percent: Percentage to crop from bottom (default: 0)

    Returns:
        str: Path to the generated video file
    """
    # The omitted code saves the uploads to audio_path / visual_path,
    # sets is_image from the file extension, and chooses output_path.

    # Load the audio file and read its duration
    audio_clip = AudioFileClip(audio_path)
    audio_duration = audio_clip.duration

    # Create the video clip, looping it if it is shorter than the audio
    if is_image:
        visual_clip = ImageClip(visual_path).set_duration(audio_duration)
    else:
        visual_clip = VideoFileClip(visual_path)
        if visual_clip.duration < audio_duration:
            visual_clip = visual_clip.loop(duration=audio_duration)

    # Attach the audio track and write the result to disk
    final_clip = visual_clip.set_audio(audio_clip)
    final_clip.write_videofile(
        output_path,
        codec='libx264',
        audio_codec='aac'
    )
    # ... rest of the code ...

Reflections and Future Improvements

It’s crazy how easy it’s becoming to create cool things with AI, with no ML expertise required, just basic programming skills.

Regarding the experiment, an interesting next step would be to automatically generate image prompts based on song lyrics. This could create a more cohesive narrative in the music videos. The process would look something like this:

  1. Split lyrics into time-based chunks
  2. For each chunk, generate:
    • Text-to-image prompts for scene composition
    • Image-to-video prompts for camera movements and transitions
  3. Use these prompts to create a sequence of scenes that match the song’s narrative
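The first two steps above can be sketched in code. Assuming the lyrics come with line-level timestamps (as karaoke-style formats provide), the chunking might look like this; the 10-second chunk length and the placeholder prompt fields are arbitrary choices of mine:

```python
def chunk_lyrics(timed_lines: list[tuple[float, str]],
                 chunk_seconds: float = 10.0) -> list[dict]:
    """Group timestamped lyric lines into fixed-length time chunks.

    timed_lines: (start_time_in_seconds, text) pairs, sorted by time.
    Returns one dict per non-empty chunk with its time window and the
    combined lyrics, ready to feed into a prompt-generation step.
    """
    buckets: dict[int, list[str]] = {}
    for start, text in timed_lines:
        buckets.setdefault(int(start // chunk_seconds), []).append(text)
    return [
        {
            "start": idx * chunk_seconds,
            "end": (idx + 1) * chunk_seconds,
            "lyrics": " ".join(lines),
            # Placeholders: a later step (e.g. an LLM call) would fill
            # these with scene and camera-movement prompts.
            "image_prompt": None,
            "motion_prompt": None,
        }
        for idx, lines in sorted(buckets.items())
    ]
```

Each chunk would then drive one image generation and one image-to-video generation, so the scenes track the song's narrative over time.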

This could lead to even more engaging and contextually relevant music videos!

Feel free to check out the example videos and let me know what you think!