GCML
23 August 2021
Large Scale Video Representation Learning via Relational Graph Clustering (CVPR 2020)
Previous Work: CDML (Collaborative Deep Metric Learning for Video Understanding - KDD 2018)
- contributions: learns video embeddings from a video relational graph (e.g., co-watch relations) with a triplet loss
- limitations
- triplet loss → requires online negative mining → requires large batch size
- why? negatives are sampled at random
- randomly sampled negatives are usually too far from the anchor → too easy to contribute much to training
Motivation
- relational graph → learn representations that preserve relationships between videos
- node: video
- edge: similarity score between videos (binary or real-valued)
- sparsely observed
- hierarchical clustering

- approach 1) smart triplets
- generate training triplets
- guarantee negatives with a proper difficulty level → reduces the training inefficiency of CDML
- approach 2) pseudo-classification
- classification model: cluster membership → target pseudo-label
- treat clusters as pseudo-labels
- semi-supervised learning: does not require any labeled data other than the relational graph
Methodology
- Graph Clustering
- relational graph → efficient data sampling
- any clustering algorithm is applicable; the paper uses affinity clustering (see the sketch after this list)
- builds a hierarchy from the graph's minimum spanning tree → nearest-neighbor (Borůvka-style) merging
- hierarchy gives multiple levels → controls the difficulty of the generated triplets
- affinity tree with 3 iterations
- intermediate nodes: clusters of the lower-level nodes
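A minimal sketch of how a Borůvka-style affinity clustering could build the hierarchy, assuming the similarity graph fits in memory as a dict of edge weights (the paper's implementation is distributed and MST-based; all names here are illustrative):

```python
# Borůvka-style affinity clustering sketch over a similarity graph.
# Assumption (not from the paper): edges = {(u, v): similarity}, nodes are 0..num_nodes-1.
def affinity_clustering(edges, num_nodes, num_rounds=3):
    """Returns one cluster assignment per level of the hierarchy."""
    parent = list(range(num_nodes))

    def find(x):                        # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    labels = list(range(num_nodes))     # level 0: every video is its own cluster
    hierarchy = [labels[:]]
    for _ in range(num_rounds):
        # each cluster picks its most similar neighboring cluster
        best = {}
        for (u, v), w in edges.items():
            cu, cv = find(u), find(v)
            if cu == cv:
                continue
            if cu not in best or w > best[cu][1]:
                best[cu] = (cv, w)
            if cv not in best or w > best[cv][1]:
                best[cv] = (cu, w)
        # merge clusters along the chosen nearest-neighbor links
        for c, (nbr, _) in best.items():
            parent[find(c)] = find(nbr)
        labels = [find(v) for v in range(num_nodes)]
        hierarchy.append(labels[:])
    return hierarchy
```

Each round merges every cluster with its most similar neighboring cluster, so three rounds produce the three-level affinity tree mentioned above.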

- [MODEL 1] Graph Clustering Metric Learning
- Smart Negative Sampling
- anchor: random video
- positive: chosen among neighbors of the anchor on the relational graph
- negative: chosen from the anchor’s sibling clusters (share the same parent)
- chosen at a desired level → can adjust difficulty level of the sampled negatives
- negatives not too far from the anchor: can be more informative for model training
- previous works: randomly sampled
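A hedged sketch of how such a triplet could be sampled from the affinity tree; `graph[v]` (neighbors of v) and `hierarchy[level][v]` (cluster id of v at a level, as returned by the clustering sketch above) are assumed data layouts, not the paper's code:

```python
import random

def sample_triplet(graph, hierarchy, level):
    anchor = random.choice(list(graph))
    positive = random.choice(list(graph[anchor]))   # neighbor on the relational graph
    # negatives come from sibling clusters: same parent at `level + 1`,
    # but a different cluster than the anchor at `level`
    parent = hierarchy[level + 1][anchor]
    own = hierarchy[level][anchor]
    candidates = [v for v in graph
                  if hierarchy[level + 1][v] == parent
                  and hierarchy[level][v] != own]
    # assumes the sibling set is non-empty; a lower `level` means siblings under
    # a finer-grained parent → harder (more informative) negatives
    negative = random.choice(candidates)
    return anchor, positive, negative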
- Training with Triplets
- triplet loss: relevant videos → closer, less related → further
- dist(anchor, positive) < dist(anchor, negative)
\[\min \sum_{i=1}^{N} \Big[ \|f(x_i^a) - f(x_i^p)\|^2 - \|f(x_i^a) - f(x_i^n)\|^2 + \alpha \Big]_+\]
- online semi-hard negative mining
- re-sample negatives within each mini-batch: closest ones that are farther than positives from the anchors
- smart negatives would be consistently chosen
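A hedged sketch of the triplet loss with online semi-hard mining over a mini-batch, assuming the batch arrives as three aligned [B, D] embedding tensors (layout and names are assumptions):

```python
import torch

def semi_hard_triplet_loss(anchor, positive, negative, margin=0.2):
    """anchor/positive/negative: [B, D] L2-normalized embeddings."""
    d_ap = (anchor - positive).pow(2).sum(dim=1)                # [B]
    # squared distances from each anchor to every negative in the batch
    d_an_all = torch.cdist(anchor, negative).pow(2)             # [B, B]
    # semi-hard: negatives farther than the positive; take the closest such one
    mask = d_an_all > d_ap.unsqueeze(1)
    d_masked = torch.where(mask, d_an_all,
                           torch.full_like(d_an_all, float('inf')))
    d_an, _ = d_masked.min(dim=1)                               # [B]
    # fall back to the hardest available negative if no semi-hard one exists
    d_an = torch.where(torch.isinf(d_an), d_an_all.max(dim=1).values, d_an)
    return torch.clamp(d_ap - d_an + margin, min=0).mean()
```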
- [MODEL 2] Cluster Label Classification
- training objective: classify which cluster each video belongs to
- classification model with sampled softmax
- sampled softmax: only a sampled subset of classes is scored in each iteration (see the sketch below)
- benefit of classification
- no need to sample hard negatives
- removes dependency on batch size → more scalable
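A minimal sketch of cluster-label classification with a sampled softmax, assuming a learnable class-weight matrix with one row per cluster; it omits the sampled-softmax probability (logQ) correction, and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(embeddings, cluster_ids, class_weights, num_sampled=1000):
    """embeddings: [B, D]; cluster_ids: [B] (long); class_weights: [num_clusters, D]."""
    num_clusters = class_weights.size(0)
    sampled = torch.randint(0, num_clusters, (num_sampled,),
                            device=embeddings.device)
    # candidate classes = true clusters of the batch + randomly sampled clusters
    candidates = torch.unique(torch.cat([cluster_ids, sampled]))
    logits = embeddings @ class_weights[candidates].t()         # [B, |candidates|]
    # map each true cluster id to its position among the candidates
    targets = (candidates.unsqueeze(0) == cluster_ids.unsqueeze(1)).float().argmax(dim=1)
    return F.cross_entropy(logits, targets)
```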
Experiments
Architecture

- Audio-visual Features
- features extracted with pre-trained models
- FPS: 1
- model: Inception-v2, pretrained on JFT dataset
- dimension reduction with PCA → 1500
- average pooling: frame level → video level
- audio: modified ResNet-50
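A sketch of how precomputed frame-level features could be reduced with PCA and average-pooled into a video-level vector (the frame-level CNNs themselves are assumed to be given; names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

def video_feature(frame_features, pca):
    """frame_features: [num_frames, raw_dim] array from the frame-level CNN (1 FPS)."""
    reduced = pca.transform(frame_features)   # [num_frames, 1500]
    return reduced.mean(axis=0)               # average pool: frame level → video level

# the PCA would be fit once on a large sample of frame features, e.g.:
# pca = PCA(n_components=1500).fit(sample_of_frame_features)
```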
- Embedding Network
- two FC layers
- two-tower model: one tower each for the visual and audio features
- aggregation: element-wise multiplication → L2 normalization
- loss: triplet loss or cross entropy loss
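A hedged sketch of the two-tower embedding network described above; hidden/output sizes and the audio feature dimension are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerEmbedding(nn.Module):
    def __init__(self, visual_dim=1500, audio_dim=128, hidden_dim=2048, embed_dim=256):
        super().__init__()
        # two FC layers per modality tower
        self.visual_tower = nn.Sequential(
            nn.Linear(visual_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim))
        self.audio_tower = nn.Sequential(
            nn.Linear(audio_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim))

    def forward(self, visual, audio):
        v = self.visual_tower(visual)
        a = self.audio_tower(audio)
        fused = v * a                              # element-wise multiplication
        return F.normalize(fused, p=2, dim=-1)     # L2 normalization
```

The output embedding is then trained with either the triplet loss (GCML) or the cross-entropy objective (cluster-label classification).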
Task 1) Related Video Retrieval
- How?
- compute cosine similarity between the query video's embedding and each candidate's embedding (see the sketch after this list)
- candidates: based on relational graph
- Dataset
- YouTube-8M dataset (same as in CDML)
- Evaluation
- MAP
- NDCG@60
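A small sketch of scoring and ranking candidates by cosine similarity of the learned embeddings (names are illustrative):

```python
import torch
import torch.nn.functional as F

def rank_candidates(query_emb, candidate_embs):
    """query_emb: [D]; candidate_embs: [N, D]. Returns indices, most related first."""
    scores = F.cosine_similarity(query_emb.unsqueeze(0), candidate_embs, dim=-1)  # [N]
    return scores.argsort(descending=True)
```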
Results

Task 2) Video Annotation
- How?
- video embedding → FC layer → multi-label classifier (see the sketch after this list)
- Dataset
- YouTube-8M
- Sports-1M
- Evaluation
- GAP, MAP
- Hit@1, Hit@5
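A hedged sketch of the annotation head: one FC layer on top of the video embedding, trained as a multi-label classifier (label-vocabulary size and training setup are assumptions):

```python
import torch
import torch.nn as nn

class AnnotationHead(nn.Module):
    def __init__(self, embed_dim=256, num_labels=3862):   # 3862 = YouTube-8M vocabulary (illustrative)
        super().__init__()
        self.fc = nn.Linear(embed_dim, num_labels)

    def forward(self, video_embedding):
        return self.fc(video_embedding)   # logits; apply sigmoid per label at inference

# training: nn.BCEWithLogitsLoss() against the multi-hot label vector
```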
Results


