Video Art via TensorFlow and Transfer Learning

github repo

Did you wake up this morning with a hankering to morph your YouTube videos into animated classical paintings? Or into something wilder? If so, you've come to the right place. In this brief tutorial I'll demonstrate how to translate a video into animated art of your choice.

For example, here's some modern art:

Using that as our desired base style, let's transform a park waterfall. The first half of this video is the original and the second half is the styled version.

Here's another example, this time of a more urban scene. I've styled this video with a Monet painting followed by an Escher print.

The general technique behind these renderings is transfer learning. Transfer learning lets us leverage pre-trained neural networks for completely different tasks. Here we repurpose the VGG19 network, which was trained on millions of images to classify each into one of 1,000 categories. We're not interested in those classifications, but we are interested in what the network has learned along the way in its hidden layers. These hidden layers encode image knowledge (perception of edges, colors, styles, etc.) at increasing levels of abstraction. The task of the styling algorithm is to reuse that information in generating new images.
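To make the "hidden layers" idea concrete, here's a minimal sketch (not part of the project code) that taps intermediate VGG19 layers with Keras. The layer choices are illustrative, and weights=None skips the pretrained download, so the features here are random, but the wiring is the same as in real style transfer.

```python
import tensorflow as tf

# Build VGG19 without its classification head. weights=None avoids
# downloading the pretrained weights; for actual style transfer
# you'd use weights='imagenet'.
vgg = tf.keras.applications.VGG19(include_top=False, weights=None)

# Illustrative layer choices: shallow layers capture style
# (edges, colors, textures), deeper layers capture content.
layer_names = ['block1_conv1', 'block3_conv1', 'block5_conv2']
outputs = [vgg.get_layer(name).output for name in layer_names]
extractor = tf.keras.Model(inputs=vgg.input, outputs=outputs)

# Run a dummy 224x224 RGB image through the extractor and
# inspect the feature shapes at each tapped layer.
image = tf.random.uniform((1, 224, 224, 3))
features = extractor(image)
for name, feature in zip(layer_names, features):
    print(name, feature.shape)
```

Note how the spatial resolution shrinks and the channel depth grows as we go deeper: the network trades pixel detail for abstraction.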

Read this for a presentation of the idea.

The implementation in this paper requires a fair bit of work. However, our friends over at Google have simplified things for us by creating a TensorFlow Hub module that encapsulates that logic. You'll find this module here. Given this styling module, all we need to do is load our original image and the target style image, and then run them through the hub. The result is an image transformed by the given style.

In our case we're interested in videos not images. Therefore we'll need to first turn our videos into sequences of images, style those images via the hub styling model, and then rebuild the video from the resulting styled frames.

Let's take a look at the code.

The main function is below. The flow goes like this: we take as inputs the original video and the image we want to use as the style (such as a Monet or Escher jpeg). We extract the audio track from the input video and then unpack the video into a sequence of images, running the styling module on each image as we go. The result is a directory of styled images. Finally, we recombine those images back into a video file and reattach the audio track.

if len(sys.argv) != 3:
    print('usage: video style')
    sys.exit(1)

name_original = sys.argv[1]
name_style = sys.argv[2]

# load and cache the styling dnn
hub_module = hub.load(HUB_URL)

# extract audio from the video
extract_mp3(PATH_VIDEOS + name_original)

# extract all frames from the video, style them
# and put results into tmp
generate_frames(PATH_VIDEOS + name_original, 
    PATH_STYLES + name_style)

# regenerate the video from the styled frames
output_name = (os.path.splitext(name_original)[0] +
    '.' + os.path.splitext(name_style)[0] + '.mp4')
generate_video(PATH_OUTPUTS + output_name)

# recombine the extracted audio into the newly-styled video
input_name = output_name
output_name = (os.path.splitext(name_original)[0] +
    '.' + os.path.splitext(name_style)[0] + '.audio.mp4')

add_mp3(PATH_OUTPUTS + input_name, PATH_OUTPUTS + output_name)
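The output naming above simply splices the two base names together. For example (the filenames here are hypothetical):

```python
import os

name_original = 'waterfall.mp4'
name_style = 'monet.jpg'

# waterfall.mp4 styled with monet.jpg becomes waterfall.monet.mp4
output_name = (os.path.splitext(name_original)[0] +
    '.' + os.path.splitext(name_style)[0] + '.mp4')
print(output_name)  # waterfall.monet.mp4
```

This keeps the outputs self-describing: each file records both its source video and the style applied to it.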

Now let's take a look at the underlying functions.

First, here's the function that extracts the mp3. We call ffmpeg directly via subprocess for this work, as there isn't a good Python binding. The result is an mp3 file that we store for later.

def extract_mp3(path_video):

    print('Extracting audio: ', path_video, PATH_TMP_MP3)

    command = 'ffmpeg -i {0} -f mp3 -ab 192000 -vn {1}'.format(
        path_video, PATH_TMP_MP3)
    subprocess.call(command, shell=True)

I use the OpenCV package to extract all the images from the video. We capture the video and then unpack each frame into a working directory, applying the styling module to each image as we go. The result is a directory containing the original image frames for the video along with the styled frames.

def generate_frames(path_input_video, path_image_style):

    video_capture = cv2.VideoCapture(path_input_video)
    image_style = load_image(path_image_style)

    for count in range(MAX_FRAMES):

        success, image = video_capture.read()
        if not success: break

        path_frame = PATH_TMP + str(count).zfill(5) + '.jpg'
        path_converted_frame = (PATH_TMP +
            'x' + str(count).zfill(5) + '.jpg')

        cv2.imwrite(path_frame, image)

        image = load_image(path_frame)
        results = hub_module(tf.constant(image), tf.constant(image_style))
        image = tf.squeeze(results[0], axis=0)
        mpl.image.imsave(path_converted_frame, image)
        print(count, path_frame, path_converted_frame)

    video_capture.release()
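The load_image helper used above isn't shown in this listing. Here's a plausible sketch of what it could look like (an assumption, not the project's exact code): decode the file to a float32 tensor, scale its longest side down to a modest size, and add the batch dimension the hub module expects.

```python
import tensorflow as tf

def load_image(path, max_dim=512):
    # Read and decode to a float32 tensor with values in [0, 1]
    image = tf.io.read_file(path)
    image = tf.image.decode_image(image, channels=3, expand_animations=False)
    image = tf.image.convert_image_dtype(image, tf.float32)

    # Scale the longest side to max_dim; style transfer works
    # best at modest resolutions
    shape = tf.cast(tf.shape(image)[:-1], tf.float32)
    scale = max_dim / tf.reduce_max(shape)
    new_shape = tf.cast(shape * scale, tf.int32)
    image = tf.image.resize(image, new_shape)

    # The styling module expects a batch dimension
    return image[tf.newaxis, :]
```

The max_dim default of 512 matches the style-image resolution that worked well in my tests, discussed further below.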

Now we have all the styled images for the video. The next step is to iterate through these styled images and convert them back into video. This is done as follows:

def generate_video(path_output_video):

    image_list = []
    count = 0
    path_converted_frame = (PATH_TMP + 'x' +
        str(count).zfill(5) + '.jpg')

    image = cv2.imread(path_converted_frame)
    height, width, layers = image.shape
    size = (width, height)
    print('size: ', size)

    converted_files = sorted(file_name for file_name in
        os.listdir(PATH_TMP) if file_name.startswith('x'))

    for file_name in converted_files:

        path_converted_frame = PATH_TMP + file_name
        image = cv2.imread(path_converted_frame)
        image_list.append(image)

    video_writer = cv2.VideoWriter(path_output_video,
        cv2.VideoWriter_fourcc(*'mp4v'), VIDEO_FPS, size)

    for i in range(len(image_list)):
        video_writer.write(image_list[i])
    video_writer.release()

    print('video generated: ', path_output_video)
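One subtlety worth noting: the frames must be written back to the video in their original order, and the zero-padded names produced by zfill(5) are what make that reliable, since padded names sort lexicographically in the same order as their frame numbers. Unpadded names don't:

```python
# Zero-padded frame names sort in numeric order.
padded = ['x' + str(n).zfill(5) + '.jpg' for n in (0, 2, 10, 100)]
print(sorted(padded) == padded)  # True

# Without padding, lexicographic order diverges from numeric order.
unpadded = ['x' + str(n) + '.jpg' for n in (0, 2, 10, 100)]
print(sorted(unpadded))  # ['x0.jpg', 'x10.jpg', 'x100.jpg', 'x2.jpg']
```

Five digits of padding caps us at 100,000 frames, which is plenty for short clips; longer videos would need a wider pad.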

Now we have a styled video. The final step is to reattach the audio track. We apply ffmpeg again for this task:

def add_mp3(path_input_video, path_output_video):

    print('Adding audio: ', PATH_TMP_MP3, path_output_video)

    command = ('ffmpeg -i {0} -i {1} -c:v copy -c:a aac '
        '-strict experimental {2}'.format(path_input_video,
            PATH_TMP_MP3, path_output_video))
    subprocess.call(command, shell=True)

And that's it! You can see the full code listing here.

Compared to most ML projects, this code is relatively fast. You can do useful things even on slow machines. For example, I generated these videos on a down-market t2.large ec2 instance. No GPU, no TPU, no nothing. In that environment a 15-second video typically renders in less than an hour. Impressive given the amount of processing involved. Faster machines will of course blow right past that benchmark.

I've found that some styles work better than others. Colors, stroke style and shading transfer reasonably well, while larger-scale structures are minimized or lost. Therefore, something like a Monet painting will translate fairly nicely. In contrast, a Picasso or an Escher will lose character due to their unusual geometries. That may or may not matter to you. Also, it's important to keep the style images at reasonable resolutions. For my test cases, I found that anything around 512x512 worked well enough. Your mileage might vary.

Lastly, it's worth pausing for a moment to appreciate the general coolness of transfer learning. It's a powerful technique with broad and sometimes non-obvious applications. Consider: we just took a deep neural network trained to classify images and then used that network to create styled videos. That's quite a leap. I expect this technique to become even more powerful and pervasive over time.

Christopher Minson

© 2021 Christopher Minson