How Disney Improved Activity Recognition Through Multimodal Approaches with PyTorch
Among the many things Disney Media & Entertainment Distribution (DMED) is responsible for is the management and distribution of a huge array of media assets, including news, sports, entertainment and features, episodic programs, marketing and advertising, and more.
Our team focuses on media annotation as part of DMED Technology’s content platforms group. In our day-to-day work, we automatically analyze a variety of content that constantly challenges the efficiency of our machine learning workflow and the accuracy of our models.
Several of our colleagues recently discussed the workflow efficiencies that we achieved by switching to an end-to-end video analysis pipeline using PyTorch, as well as how we approach animated character recognition. We invite you to read more about both in this previous post.
While the conversion to an end-to-end PyTorch pipeline is a solution that any company might benefit from, animated character recognition was a uniquely Disney concept and solution.
In this article we will focus on activity recognition, which is a general challenge across industries but offers specific opportunities in the media production field, because we can combine audio, video, and subtitles to provide a solution.
- Experimenting with Multimodality
Working on a multimodal problem adds more complexity to the usual training pipelines. Having multiple information modes for each example means that the multimodal pipeline has to have specific implementations to process each mode in the dataset. Usually after this processing step, the pipeline has to merge or fuse the outputs.
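The process-then-fuse pattern described above can be sketched in PyTorch as follows. This is a minimal illustration, not the pipeline used by the team: the class name, the feature sizes, and the choice of a single linear layer per modality are all assumptions, and each modality is assumed to arrive as one pre-extracted feature vector per example.

```python
import torch
from torch import nn


class ConcatFusionClassifier(nn.Module):
    """One branch per modality, followed by concatenation-based fusion."""

    def __init__(self, video_dim, audio_dim, text_dim, hidden_dim, num_classes):
        super().__init__()
        # Hypothetical per-modality branches: each maps features of its own
        # size to a shared hidden size.
        self.encoders = nn.ModuleDict({
            "video": nn.Sequential(nn.Linear(video_dim, hidden_dim), nn.ReLU()),
            "audio": nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU()),
            "text": nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU()),
        })
        # The fused vector is the three hidden vectors laid side by side.
        self.classifier = nn.Linear(3 * hidden_dim, num_classes)

    def forward(self, batch):
        # Step 1: process each mode with its own implementation.
        encoded = [
            self.encoders[name](batch[name])
            for name in ("video", "audio", "text")
        ]
        # Step 2: fuse the outputs by concatenating along the feature axis.
        fused = torch.cat(encoded, dim=-1)
        return self.classifier(fused)


torch.manual_seed(0)
model = ConcatFusionClassifier(512, 128, 300, hidden_dim=64, num_classes=10)
batch = {
    "video": torch.randn(4, 512),
    "audio": torch.randn(4, 128),
    "text": torch.randn(4, 300),
}
logits = model(batch)
print(logits.shape)  # one row of 10 class scores per example: (4, 10)
```

Concatenation is the simplest fusion choice; it keeps every modality's features but leaves it to the classifier to learn how they interact.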
- Multimodal Transformers
Working from a workbench based on MMF, we started with a model that concatenated the features from each modality and evolved it into a pipeline that included a Transformer-based fusion module to combine the different input modes.
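To make the idea of a Transformer-based fusion module concrete, here is a hedged sketch built only from standard PyTorch layers. It does not use MMF and is not the module described in this post: the class name `TransformerFusion`, the learned modality embedding, the mean pooling, and all sizes are assumptions. Each modality is assumed to be already encoded as a sequence of tokens with a common feature size.

```python
import torch
from torch import nn


class TransformerFusion(nn.Module):
    """Fuse per-modality token sequences with a shared Transformer encoder."""

    def __init__(self, dim, num_classes, num_modalities=3, num_heads=4, num_layers=2):
        super().__init__()
        # Learned vector added to every token of a modality so the encoder
        # can tell the modalities apart (an illustrative choice).
        self.modality_embed = nn.Parameter(torch.zeros(num_modalities, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim,
            nhead=num_heads,
            dim_feedforward=4 * dim,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, sequences):
        # sequences: list of tensors, each shaped (batch, tokens_i, dim).
        tagged = [seq + self.modality_embed[i] for i, seq in enumerate(sequences)]
        # Join all modalities into one long sequence; self-attention then
        # lets every token attend to tokens from every other modality.
        tokens = torch.cat(tagged, dim=1)
        fused = self.encoder(tokens)
        # Average over tokens to get one vector per example.
        return self.head(fused.mean(dim=1))


torch.manual_seed(0)
fusion = TransformerFusion(dim=64, num_classes=5)
video = torch.randn(2, 8, 64)   # 8 video tokens per example
audio = torch.randn(2, 6, 64)   # 6 audio tokens per example
text = torch.randn(2, 12, 64)   # 12 subtitle tokens per example
scores = fusion([video, audio, text])
print(scores.shape)  # (2, 5)
```

Compared with plain concatenation of feature vectors, this kind of fusion can model interactions between individual tokens of different modalities, at the cost of more parameters to train.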
- Searching for less data-hungry solutions
Searching for less data-hungry solutions, our team started studying MLP-Mixer. This new architecture was proposed by the Google Brain team, and it provides an alternative to well-established de facto architectures like convolutions or self-attention for computer vision tasks.
- Activity Recognition reinterpreting the MLP-Mixer
Our proposal takes the core idea of the MLP-Mixer, using multiple multi-layer perceptrons on a sequence and on the transposed sequence, and extends it into a multimodal framework that allows us to process video, audio, and text with the same architecture.
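The core Mixer idea, and one possible way to apply it to several modalities, can be sketched as follows. This is an illustration under stated assumptions rather than the architecture proposed here: the text above does not say how the modalities are combined, so projecting each modality to a common size and concatenating along the token axis is an assumption, as are the names `MixerBlock` and `MultimodalMixer` and all sizes. Because the token-mixing MLP has one weight per token position, the sketch also assumes a fixed number of tokens per modality.

```python
import torch
from torch import nn


def mlp(dim, hidden_dim):
    # Two-layer perceptron that maps `dim` features back to `dim` features.
    return nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))


class MixerBlock(nn.Module):
    """One Mixer block: an MLP across tokens, then an MLP across channels."""

    def __init__(self, num_tokens, dim, token_hidden=64, channel_hidden=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = mlp(num_tokens, token_hidden)
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = mlp(dim, channel_hidden)

    def forward(self, x):  # x: (batch, tokens, dim)
        # Token mixing: transpose so the MLP acts along the token axis.
        y = self.norm1(x).transpose(1, 2)            # (batch, dim, tokens)
        x = x + self.token_mlp(y).transpose(1, 2)    # back to (batch, tokens, dim)
        # Channel mixing: the MLP acts along the feature axis of each token.
        return x + self.channel_mlp(self.norm2(x))


class MultimodalMixer(nn.Module):
    """Hypothetical multimodal variant: one shared Mixer over all modalities."""

    def __init__(self, input_dims, tokens_per_modality, dim, num_classes, depth=2):
        super().__init__()
        # One linear projection per modality into a common feature size.
        self.projections = nn.ModuleList([nn.Linear(d, dim) for d in input_dims])
        num_tokens = sum(tokens_per_modality)
        self.blocks = nn.Sequential(*[MixerBlock(num_tokens, dim) for _ in range(depth)])
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, sequences):
        projected = [proj(seq) for proj, seq in zip(self.projections, sequences)]
        tokens = torch.cat(projected, dim=1)         # (batch, total_tokens, dim)
        tokens = self.blocks(tokens)
        return self.head(self.norm(tokens).mean(dim=1))


torch.manual_seed(0)
mixer = MultimodalMixer(input_dims=(512, 128, 300), tokens_per_modality=(8, 6, 12),
                        dim=64, num_classes=5)
video = torch.randn(2, 8, 512)
audio = torch.randn(2, 6, 128)
text = torch.randn(2, 12, 300)
predictions = mixer([video, audio, text])
print(predictions.shape)  # (2, 5)
```

In this sketch the token-mixing MLP is what lets information flow between modalities, since video, audio, and text tokens share one token axis.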