CDML
19 August 2021
Collaborative Deep Metric Learning for Video Understanding (KDD 2018)
https://dl.acm.org/doi/pdf/10.1145/3219819.3219856
Motivation
- challenges in video understanding
- large video files: prohibitive to download and store
- computationally expensive
- costly labels
→ can be tackled with metric learning
- goal
- learn embedding function → project video onto a low-dimensional space
- related videos: close to each other
- unrelated videos: far apart from each other
- embedding function should generalize well across different video understanding tasks
Approach
- source of information
- raw video content
- extract image and audio features using SOTA deep neural networks
- user behavior - collaborative filtering information
- construct a graph - edge if co-watched by many users
- content-aware embedding: a metric space over video content embeddings, trained to reconstruct the CF information
- mapping from video content to CF signals
- capture high-level semantic relationship between videos
- semi-supervised
Methodology
- Video Features
- extract visual / audio features using pre-trained models

- visual features
- frames sampled at 1 FPS, fed to Inception-v3 trained on the JFT dataset
- apply PCA to the last hidden layer activations
- frame-level features → video-level: average pooling
- audio features
- VGG-inspired model
- non-overlapping 960 ms frames
- average pooled into video level
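A minimal sketch of the pooling step above, assuming the frame-level visual features (1 FPS, Inception-v3 + PCA) and the 960 ms audio-frame features are already extracted; array shapes and names are my own, not the paper's:

```python
import numpy as np

def video_level_features(frame_feats, audio_feats):
    """Average-pool frame-level features into fixed-size video-level vectors.

    frame_feats: (num_frames, d_visual) Inception-v3 features (1 FPS, after PCA)
    audio_feats: (num_audio_frames, d_audio) VGG-style features (960 ms frames)
    """
    visual = frame_feats.mean(axis=0)  # (d_visual,) video-level visual feature
    audio = audio_feats.mean(axis=0)   # (d_audio,) video-level audio feature
    return visual, audio
```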
- Collaborative Deep Metric Learning
- construct a graph
- nodes: videos
- edges: connect co-watched videos; weight: co-watch frequency
- objective: co-watched videos → close in the embedding space
- ranking triplet loss: each training data point is a triplet of three videos
- anchor: should be more relevant to the positive than to the negative
- positive: a related (co-watched) video
- negative: an unrelated video
- loss: hinge loss ← minimize
\[\mathcal{L}_{hinge}\big(f_\theta(x_i^a), f_\theta(x_i^p), f_\theta(x_i^n)\big) = \Big[\,\|f_\theta(x_i^a) - f_\theta(x_i^p)\|_2^2 - \|f_\theta(x_i^a) - f_\theta(x_i^n)\|_2^2 + \alpha\,\Big]_+\]
- minimize the distance between anchor and positive, maximize the distance between anchor and negative
- $\alpha$: margin parameter, $=0$ in this experiment
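The loss translates directly into code; a sketch in PyTorch, assuming batched embeddings (the batch layout is an assumption, not from the paper):

```python
import torch
import torch.nn.functional as F

def triplet_hinge_loss(anchor, positive, negative, margin=0.0):
    """Hinge triplet loss over a batch of (anchor, positive, negative) embeddings.

    anchor/positive/negative: (B, d) outputs of the embedding network f_theta.
    margin: alpha in the paper (0 in their experiments).
    """
    d_ap = (anchor - positive).pow(2).sum(dim=1)  # squared L2 anchor-positive
    d_an = (anchor - negative).pow(2).sum(dim=1)  # squared L2 anchor-negative
    return F.relu(d_ap - d_an + margin).mean()    # [.]_+ hinge, averaged
```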
- extracted features → embedding network

- early fusion
- concatenate input features → FC
- late fusion
- each input feature → FC layers → element-wise product
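A sketch of the two fusion variants in PyTorch; the hidden sizes and layer counts are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate visual and audio features, then shared FC layers."""
    def __init__(self, d_visual, d_audio, d_embed, d_hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_visual + d_audio, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_embed),
        )

    def forward(self, visual, audio):
        return self.net(torch.cat([visual, audio], dim=1))

class LateFusion(nn.Module):
    """One FC tower per modality, combined by element-wise product."""
    def __init__(self, d_visual, d_audio, d_embed, d_hidden=1024):
        super().__init__()
        self.visual_net = nn.Sequential(
            nn.Linear(d_visual, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_embed))
        self.audio_net = nn.Sequential(
            nn.Linear(d_audio, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_embed))

    def forward(self, visual, audio):
        return self.visual_net(visual) * self.audio_net(audio)
```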
Experiments
- semi-hard negative mining within a mini-batch of 7,200 triplets (see the sketch below)
- re-sample the negative from within the mini-batch
- semi-hard: a negative not too far from the anchor
- specifically, the closest negative that is still farther from the anchor than the positive
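A sketch of the in-batch re-sampling, assuming the other embeddings in the batch serve as the candidate negative pool (the pool layout is my assumption):

```python
import torch

def semi_hard_negative_indices(anchor, positive, candidates):
    """Pick, per (anchor, positive) pair, the semi-hard negative: the candidate
    closest to the anchor among those still farther away than the positive.

    anchor, positive: (B, d); candidates: (N, d) in-batch negative pool.
    """
    d_ap = (anchor - positive).pow(2).sum(dim=1, keepdim=True)  # (B, 1)
    d_an = torch.cdist(anchor, candidates).pow(2)               # (B, N)
    # discard candidates closer to the anchor than the positive
    d_an = d_an.masked_fill(d_an <= d_ap, float("inf"))
    # note: if a row is all inf (no semi-hard candidate), argmin returns 0;
    # a real implementation would fall back to a random negative
    return d_an.argmin(dim=1)
```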
Task 1) Video Retrieval
- How?
- compute a similarity score between the query and each candidate (cosine similarity; see the sketch after the results)
- query: video
- candidates: co-watched videos
- Dataset
- 278M YouTube videos with 1,000 or more views
- Evaluation
- NDCG (normalized discounted cumulative gain)
- MAP (mean average precision)
- Results
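A sketch of the retrieval scoring referenced above (embedding matrices and names are assumptions):

```python
import numpy as np

def retrieve(query_emb, candidate_embs, top_k=10):
    """Rank candidates by cosine similarity to the query video's embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity per candidate
    return np.argsort(-scores)[:top_k]  # indices of the top-k candidates
```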

Task 2) Video Recommendation
- How?
- query: a user, represented by their recent watch history
- for each candidate video, compute cosine similarities to the history videos → recommend the candidates with the highest arithmetic mean similarity (see the sketch after the results)
- Dataset
- MovieLens 100K + YouTube trailers
- trailers found for 25,141 of the 26,733 unique movies (94%)
- users with fewer than 10 test ratings excluded
- Evaluation
- NDCG@10
- MAP
- Results
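A sketch of the recommendation scoring referenced above: each candidate is scored by the arithmetic mean of its cosine similarities to the user's recently watched videos (array names are assumptions):

```python
import numpy as np

def recommend(history_embs, candidate_embs, top_k=10):
    """Score candidates by mean cosine similarity to the user's watch history."""
    h = history_embs / np.linalg.norm(history_embs, axis=1, keepdims=True)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = (c @ h.T).mean(axis=1)     # mean similarity over the history
    return np.argsort(-scores)[:top_k]  # indices of the top-k candidates
```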

Task 3) Video Annotation / Classification
- How?
- multi-labeled classification problem
- feature vector of the video → binary label vector, one entry per class
- Dataset
- YouTube-8M (large-scale video classification challenge)
- MovieLens-20M (movie trailer → movie tag classification)
- Evaluation
- GAP (Global Average Precision): average precision computed over the pooled top 20 predictions per example (see the sketch after the results)
- MAP
- Results
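A sketch of the GAP metric mentioned above, following the YouTube-8M challenge convention as I understand it (pool each example's top-20 predictions into one global list, then compute average precision over that list):

```python
import numpy as np

def global_average_precision(scores, labels, top_k=20):
    """GAP: average precision over the pooled top-k predictions of all examples.

    scores, labels: (num_examples, num_classes); labels are 0/1.
    """
    conf, hits = [], []
    for s, y in zip(scores, labels):
        top = np.argsort(-s)[:top_k]        # this example's top-k classes
        conf.extend(s[top])
        hits.extend(y[top])
    order = np.argsort(-np.asarray(conf))   # sort the global list by confidence
    hits = np.asarray(hits, dtype=float)[order]
    precision = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    total_pos = labels.sum()                # total ground-truth labels
    return float((precision * hits).sum() / total_pos)
```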


References
- https://dl.acm.org/doi/pdf/10.1145/3219819.3219856
- Prof. Joonseok Lee's MLVU lecture notes
