Large Scale Video Representation Learning via Relational Graph Clustering

By VIP-Lab Comment

GCML

23 August 2021

Large Scale Video Representation Learning via Relational Graph Clustering (CVPR 2020)

https://openaccess.thecvf.com/content_CVPR_2020/papers/Lee_Large_Scale_Video_Representation_Learning_via_Relational_Graph_Clustering_CVPR_2020_paper.pdf

Previous Work: CDML (Collaborative Deep Metric Learning for Video Understanding - KDD 2018)

  • contributions
  • limitations
    • triplet loss → requires online negative mining →requires large batch size
      • why? random initailization of the negatives
      • randomly sampled negatives too far from the anchor

Motivation

  • relational graph → learn representations that preserves relationships between videos
    • node: video
    • edge: similarity score between videos - binary / real numbers
      • sparsely observed
    • hierarchical clustering

    Untitled

  • approach 1) smart triplets
    • generate training triplets
    • guarantee negatives with proper difficulty level

      → reduce training inefficiency of CDML

  • approach 2) pseudo-classification
    • classification model: cluster membership → target pseudo-label
    • treat clusters as pseudo-labels
    • semi-supervised learning: does not require any labeled data other than the relational graph

Methodology

  1. Graph Clustering
    • relational graph → efficient data sampling
    • any clustering algorithm is applicable: used affinity clustering
      • hierarchical graph using minimum spanning tree → nearest neighbor clustering
      • hierarchical: control difficulty of triplets to generate multiple levels
    • affinity tree with 3 iterations
      • intermediate nodes: clusters of the lower-level nodes

      Untitled 1

  2. [MODEL 1] Graph Clustering Metric Learning
    • Smart Negative Sampling
      • anchor: random video
      • positive: chosen among neighbors of the anchor on the relational graph
      • negative: chosen from the anchor’s sibling clusters (share the same parent)
        • chosen at a desired level → can adjust difficulty level of the sampled negatives
        • negatives not too far from the anchor: can be more informative for model training
        • previous works: randomly sampled
    • Training with Triplets
      • triplet loss: relevant videos → closer, less related → further
      • dist(anchor, positive) < dist(anchor, negative)

        \[min\sum_{i=1}^N[ ||f(x_i^a)-f(x_i^p)||^2 - ||f(x_i^a) - f(x_i^n)||^2 + \alpha]_+\]
      • online semi-hard negative mining
        • re-sample negatives within each mini-batch: closest ones that are farther than positives from the anchors
        • smart negatives would be consistently chosen
  3. [MODEL 2] Cluter Labels Classification
    • training objective: classify which cluster each video belongs to
      • classification model with sampled softmax
      • sampled softmax: subset of sampled classes in each iteration
    • benefit of classification
      • no need to sample hard negatives
      • removes dependency on batch size → more scalable

Experiments

Architecture

Untitled 2

  • Audio-visual Features
    • features extracted with pre-trained models
    • FPS: 1
    • model: Inception-v2, pretrained on JFT dataset
    • dimension reduction with PCA → 1500
    • average pooling: frame level → video level
    • audio: modified ResNet-50
  • Embedding Network
    • two FC layers
    • two-tower model: each for visual and audio features
    • aggregation: element-wise multiplication → L2 normaization
    • loss: triplet loss or cross entropy loss

Task 1) Related Video Retrieval

  1. How?
    • compute similarity score between two candidates (cosine similarity)
    • candidates: based on relational graph
  2. Dataset
    • YouTube-8M dataset (same as in CDML)
  3. Evalutation
    • MAP
    • NDCG@60
  4. Results

    Untitled 3

Task 2) Video Annotation

  1. How?
    • video embedding → FC layer → multi-label classifier
  2. Dataset
    • YouTube-8M
    • Sports-1M
  3. Evaluation
    • GAP, MAP
    • Hit@1, Hit@5
  4. Results

    Untitled 4

    Untitled 5

References

  1. https://openaccess.thecvf.com/content_CVPR_2020/papers/Lee_Large_Scale_Video_Representation_Learning_via_Relational_Graph_Clustering_CVPR_2020_paper.pdf
  2. 이준석 교수님 MLVU 강의노트