Table of Contents

My humble Ph.D. project is emotion recognition on videos using deep learning. As a newbie, I find that these three keywords (videos, deep learning, and emotion recognition) together make the data processing a real pain.

First, a video is usually much "larger" than an image. Video data can be viewed as an array of 2D images, so saving and loading it costs considerably more time and space than image data.

Second, in the deep learning context, data are usually mini-batched for each epoch before being fed to neural networks. In the ideal case, the data are entirely loaded into memory for mini-batching, resulting in neater and more readable code. If the data are too large for the available memory, one has to slice them into several groups and load one group at a time. The ideal and awkward cases are summarized in the pseudo-Python code below.

# The ideal case
dataloader = Dataset(entire_data)
for video, label in dataloader:
	# do something

# The awkward case
for some_data in entire_data:
	dataloader = Dataset(some_data)

	for video, label in dataloader:
		# do something

Third, in the context of emotion recognition and neuroscience, the data are usually subject-wise. Also, downsampling, windowing, and cross-validation are usually required.

The following scenarios show how painful it could be when the three keywords appear together.

  • Your dataset has 100 videos.
  • Each video is sized $10000\times3\times120\times120$ (length, channel, width, height).
  • Each video of uint8 type theoretically requires $10000\times3\times120\times120$ bytes $= 432$ MB of memory to be entirely loaded.
    • Each video requires $432\times4 = 1728$ MB, i.e., about $1.7$ GB, in float32 type.
    • You can hardly load 100 videos ($\sim173$ GB in float32) into memory, even on a fancy server with $250$ GB of memory, once intermediate tensors are counted.
  • Usually you are not meant to load an entire video, but only some frames of it, say, the first frame of every $n$ frames, to meet the downsampling requirement.
    • Should you load one video completely and then index it?
    • Should you preprocess the video into several smaller clips so that you can load some of them completely instead?
      • You will have to generate an appropriate index for this idea.
      • You will have to generate another index to restore the windowed output to its original position for visualization and analysis, which can be much more complex in this case.
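The indexing idea above can be sketched as follows (all names and sizes are illustrative, not from the benchmark code): build the downsampling index once, then reuse the very same index to scatter windowed outputs back to their original positions.

```python
import numpy as np

# Illustrative sizes: a 20-frame "video", keeping the first of every 4 frames.
num_frames, step = 20, 4
sample_idx = np.arange(0, num_frames, step)   # frames to load: 0, 4, 8, 12, 16

video = np.arange(num_frames)                 # stand-in for real frame data
sampled = video[sample_idx]                   # the downsampled subset

# Restore: scatter (pretend) model outputs back to their original positions,
# reusing sample_idx instead of computing a second index.
restored = np.full(num_frames, -1)
restored[sample_idx] = sampled * 10
```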

How can we deal with all these complications with neat code while achieving a good space/time trade-off?


Return to Table of Contents

In this post, I will compare several data formats in terms of:

  • the disk space usage, and
  • the memory and time cost when loading.

Basically, the data will be created in each format and then loaded; the disk space, memory, and time costs will be recorded.
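As a minimal sketch of one way to take such measurements (the exact harness may differ; memory_profiler, imported in the code at the end, reports per-line memory via its @profile decorator instead): time via time.perf_counter and peak memory via the standard-library tracemalloc.

```python
import time
import tracemalloc

def measure(load_fn):
    """Return (elapsed seconds, peak traced memory in MB) for one call of load_fn."""
    tracemalloc.start()
    start = time.perf_counter()
    load_fn()
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak / 1e6

# A 10 MB allocation stands in for a real loader.
elapsed, peak_mb = measure(lambda: bytearray(10_000_000))
```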

The data formats include:

  • static images,
  • mp4,
  • binary,
  • mat,
  • pickle,
  • hdf5/h5 (the same format, different extensions),
  • npy.

For the static images option, we simply create 10000 jpg images with random values. For the other formats, a 4D numpy array (length, channel, width, height) is randomly initialized.

For the hdf5/h5 format, I will create the data both with and without chunking, so that we can see whether it is a good idea to chunk the data frame-wise.

For the npy format, I will load it both with and without memory mapping.


Return to Table of Contents

The code is shown at the end of the post.

Creating Data

I simply create a video of white noise and save it in all the data formats mentioned above.

Loading Data

I assume that only the first frame of every four frames of each video is needed. This is equivalent to downsampling the data so that only the $\{0, 4, 8, \ldots, 9992, 9996\}$-th frames are sampled.


Return to Table of Contents

Table 1. The disk, memory, and time costs for loading a downsampled video in various formats. The original video has 10000 frames. In the experiment, only the first frame of every 4 frames is sampled. The first column lists the data format of the video (10000 images). The second column lists the disk usage of the video in each data format. The third and fourth columns list the memory and time usage to load part of the video (2500 images, i.e., $1/4$ of the video). h5 1 denotes the h5 file saved without chunking; h5 2 denotes the h5 file saved with frame-wise chunking. npy 1 denotes the npy file loaded without memory mapping; npy 2 denotes the same npy file loaded with memory mapping.
Format   Disk Space (MB)   Memory (MB)   Time (s)
jpg      118.000           102.0         4.358
mp4      36.088            102.6         1.578
binary   432.000           411.9         0.335
mat      432.000           412.1         0.358
pickle   432.000           412.0         0.864
h5 1     432.361           416.0         0.804
h5 2     442.021           881.1         2.644
npy 1    432.000           412.0         0.309
npy 2    432.000           ~0.0          0.004


Return to Table of Contents

It is not possible to save memory merely by indexing the part of the data matrix to be loaded. In the code, I attempted to load only the first frame of every $4$ frames, yet the binary, mat, pickle, and h5 formats all consumed around 412 MB, which is exactly $400\%$ of the frame-wise loading of the jpg and mp4 formats.

The npy format with memory map mode enabled performs amazingly well on both memory and time cost. We should use it more often!

When the mmap mode is enabled, the file is memory-mapped. A memory-mapped array is kept on disk. However, it can be accessed and sliced like any ndarray. Memory mapping is especially useful for accessing small fragments of large files without reading the entire file into memory.

In our context, the npy format with memory mapping is particularly welcome because no matter how large the data are on disk, we can open them instantly; frames are only read from disk when they are actually accessed, as long as the disk space is affordable. In my Ph.D. project, a dataset in npy format is usually as large as $30$-$50$ GB, which is not an issue at all, even on a PC.

The jpg format saves memory because we manually control the loading in the for loop; the same holds for the mp4 format, which I read frame by frame while keeping only the needed frames. Although the jpg and mp4 formats save disk space compared with the other formats, their time cost is not cool at all.

The binary, mat, pickle, h5, and npy formats all have the same disk space usage. Except for the memory-mapped npy and the inappropriately chunked h5, they also have almost the same memory usage.

For the h5 format, it is definitely not a good idea to chunk the data frame by frame. Considering the power-user-only instructions on chunking in the official h5py documentation, the brain-burning calculation of chunk sizes, and the changing downsampling interval, why should we use the h5 format when the npy format is available? (Maybe I just made a bold claim, but I choose to use the npy format :)
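For completeness, a hedged sketch of what a more sensible chunk shape could look like (all sizes illustrative): chunking in blocks along the time axis instead of per frame, so a strided read touches far fewer chunks.

```python
import os
import tempfile
import h5py
import numpy as np

path = os.path.join(tempfile.gettempdir(), "demo_chunks.h5")
video = np.random.randint(0, 254, (256, 3, 16, 16), dtype=np.uint8)

with h5py.File(path, "w") as f:
    # 64-frame chunks along the time axis: a [::4] read touches 4 chunks,
    # whereas frame-wise chunking would touch 64 of them.
    f.create_dataset("video", data=video, chunks=(64, 3, 16, 16))

with h5py.File(path, "r") as f:
    sampled = f["video"][::4]

os.remove(path)
```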


Return to Table of Contents

I worked with static images at the beginning and found it extremely slow to load the data; most of the elapsed time went to data loading, not computing. Then I moved to the mp4 format and was happy with its amazingly small disk usage, until I realized that it was also slow to load and memory-hungry. I had to save the video of a trial as several small mp4 clips on the hard disk and load them according to a list. Later, when the output was yielded, I had to use another list to place it back at its original position with respect to the complete video of that trial.

My life became easier when I met the npy format. Thanks to it, I can save a single video as a single npy file, load all the npy files without any memory issue, and resample them according to a specific index for each data item. To restore, I do not need to compute a new index; I simply reuse the same index as for loading.

Overall, by using the npy format, the training is about 2-3 times faster and the code is dramatically simplified (hundreds of lines removed). And I spent nearly five months just to figure this out…


Return to Table of Contents

The following packages are installed via pip in a Python 3.x virtual environment:

pip install pillow numpy scipy opencv-python memory_profiler h5py

The following is the code.

import cv2
import numpy as np
import os
import pickle
import time
import h5py
from PIL import Image
from tqdm import tqdm
import scipy.io as sio
from memory_profiler import profile

def creating_data(num_image=10000, dim=120, channel=3):

    # Create 10000 static images
    os.makedirs("data/static", exist_ok=True)
    for i in tqdm(range(num_image)):
        image = np.random.randint(0, 254, (dim, dim, channel), dtype=np.uint8)
        Image.fromarray(image).save("data/static/" + str(i) + ".jpg")
    print("Data in static jpg format has been created!")

    # Create a video with 10000 frames
    # In mp4 format
    codec = cv2.VideoWriter_fourcc('m', 'p', '4', 'v')
    writer = cv2.VideoWriter(filename="data/mp4.mp4", fourcc=codec, fps=64, frameSize=(dim, dim), isColor=True)
    for i in tqdm(range(num_image)):
        frame = np.random.randint(0, 254, (dim, dim, channel), dtype=np.uint8)
        writer.write(frame)
    writer.release()
    print("Data in mp4 format has been created!")

    # In plain hdf5/h5 format
    video = np.random.randint(0, 254, (num_image, channel, dim, dim), dtype=np.uint8)
    with h5py.File("data/h5_plain.h5", "w") as data_file:
        data_file.create_dataset(name="video", data=video, compression="gzip", compression_opts=9)
    print("Data in plain hdf5/h5 format has been created!")

    # In chunked hdf5/h5 format
    hdf5_file = h5py.File('data/h5_chunked.h5', 'w')

    first_frame = np.random.randint(0, 254, (channel, dim, dim), dtype=np.uint8)
    hdf5_dataset = hdf5_file.create_dataset('video', data=first_frame[None, ...], maxshape=(
        None, first_frame.shape[0], first_frame.shape[1], first_frame.shape[2]), chunks=True, compression="gzip", compression_opts=9)

    for i in tqdm(range(num_image-1)):
        image = np.random.randint(0, 254, (channel, dim, dim), dtype=np.uint8)
        hdf5_dataset.resize(hdf5_dataset.len() + 1, axis=0)
        hdf5_dataset[hdf5_dataset.len() - 1] = image
    print("Data in chunked hdf5/h5 format has been created!")

    # In mat format
    sio.savemat("data/mat.mat", {'video': video})
    print("Data in mat format has been created!")

    # In binary format
    with open("data/binary.bin", "wb") as file:
        video.tofile(file)
    print("Data in binary format has been created!")

    # In pickle format
    with open('data/pickle.pkl', 'wb') as handle:
        pickle.dump(video, handle, protocol=pickle.HIGHEST_PROTOCOL)
    print("Data in plain pickle format has been created!")

    # In npy format'data/npy.npy', video)
    print("Data in npy format has been created!")

def load_static():
    data_matrix = np.zeros((2500, 3, 120, 120), dtype=np.uint8)
    # Sort numerically; os.listdir() returns files in arbitrary order.
    files = sorted(os.listdir("data/static"), key=lambda f: int(f.split(".")[0]))
    for i, file in enumerate(files):
        if i % 4 == 0:
            filename = os.path.join("data/static", file)
            data_matrix[i // 4] = np.asarray(, 0, 1)

def load_mp4():
    data_matrix = np.zeros((2500, 3, 120, 120), dtype=np.uint8)
    cap = cv2.VideoCapture('data/mp4.mp4')
    for i in range(10000):
        ret, frame =
        if i % 4 == 0:
            data_matrix[i // 4] = frame.transpose((2, 0, 1))

def load_plain_h5():
    with h5py.File("data/h5_plain.h5", "r") as f:
        data_matrix = f['video'][()][::4]

def load_chunked_h5():
    with h5py.File("data/h5_chunked.h5", "r") as f:
        data_matrix = f['video'][()][::4]

def load_pickle():
    with open('data/pickle.pkl', 'rb') as f:
        data_matrix = pickle.load(f)[::4]

def load_mat():
    mat_content = sio.loadmat("data/mat.mat")
    data_matrix = mat_content['video'][::4]

def load_binary():
    # Specify dtype and shape; np.fromfile() defaults to float64 otherwise.
    data_matrix = np.fromfile("data/binary.bin", dtype=np.uint8).reshape(10000, 3, 120, 120)[::4]

def load_npy():
    data_matrix = np.load("data/npy.npy")[::4]

def load_npy_with_memory_map_mode():
    data_matrix = np.load("data/npy.npy", mmap_mode="c")[::4]

if __name__ == '__main__':
    # creating_data(num_image=10000, dim=120, channel=3)

    start = time.time()
    load_static()
    elapsed = time.time() - start
    print("Elapsed for loading static: {:4f}s\n".format(elapsed))

    start = time.time()
    load_mp4()
    elapsed = time.time() - start
    print("Elapsed for loading mp4: {:4f}s\n".format(elapsed))

    start = time.time()
    load_binary()
    elapsed = time.time() - start
    print("Elapsed for loading binary: {:4f}s\n".format(elapsed))

    start = time.time()
    load_mat()
    elapsed = time.time() - start
    print("Elapsed for loading mat: {:4f}s\n".format(elapsed))

    start = time.time()
    load_pickle()
    elapsed = time.time() - start
    print("Elapsed for loading pickle: {:4f}s\n".format(elapsed))

    start = time.time()
    load_plain_h5()
    elapsed = time.time() - start
    print("Elapsed for loading plain hdf5/h5: {:4f}s\n".format(elapsed))

    start = time.time()
    load_chunked_h5()
    elapsed = time.time() - start
    print("Elapsed for loading chunked hdf5/h5: {:4f}s\n".format(elapsed))

    start = time.time()
    load_npy()
    elapsed = time.time() - start
    print("Elapsed for loading npy: {:4f}s\n".format(elapsed))

    start = time.time()
    load_npy_with_memory_map_mode()
    elapsed = time.time() - start
    print("Elapsed for loading npy with memory map mode: {:4f}s\n".format(elapsed))