Unsupervised Learning of View-invariant Action Representations

Junnan Li; Qi Zhao; Yongkang Wong; Mohan S. Kankanhalli

Computer Science and Engineering

Research output: Contribution to journal › Conference article › peer-review

66 Scopus citations

Abstract

Recent successes in human action recognition with deep learning methods mostly adopt the supervised learning paradigm, which requires a significant amount of manually labeled data to achieve good performance. However, label collection is an expensive and time-consuming process. In this work, we propose an unsupervised learning framework that exploits unlabeled data to learn video representations. Unlike previous work in video representation learning, our unsupervised learning task is to predict 3D motion in multiple target views using the video representation from a source view. By learning to extrapolate cross-view motions, the representation can capture view-invariant motion dynamics that are discriminative for the action. In addition, we propose a view-adversarial training method to enhance the learning of view-invariant features. We demonstrate the effectiveness of the learned representations for action recognition on multiple datasets.

Original language	English (US)
Pages (from-to)	1254-1264
Number of pages	11
Journal	Advances in Neural Information Processing Systems
Volume	2018-December
State	Published - 2018
Event	32nd Conference on Neural Information Processing Systems, NeurIPS 2018, Montreal, Canada, Dec 2 2018 to Dec 8 2018

Bibliographical note

Publisher Copyright:
© 2018 Curran Associates Inc. All rights reserved.

Cite this

@article{3e6bddc4d4794b59a85523a0a051019f,

title = "Unsupervised Learning of View-invariant Action Representations",

abstract = "The recent success in human action recognition with deep learning methods mostly adopt the supervised learning paradigm, which requires significant amount of manually labeled data to achieve good performance. However, label collection is an expensive and time-consuming process. In this work, we propose an unsupervised learning framework, which exploits unlabeled data to learn video representations. Different from previous works in video representation learning, our unsupervised learning task is to predict 3D motion in multiple target views using video representation from a source view. By learning to extrapolate cross-view motions, the representation can capture view-invariant motion dynamics which is discriminative for the action. In addition, we propose a view-adversarial training method to enhance learning of view-invariant features. We demonstrate the effectiveness of the learned representations for action recognition on multiple datasets.",

author = "Junnan Li and Qi Zhao and Yongkang Wong and Kankanhalli, {Mohan S.}",

note = "Publisher Copyright: {\textcopyright} 2018 Curran Associates Inc. All rights reserved.; 32nd Conference on Neural Information Processing Systems, NeurIPS 2018 ; Conference date: 02-12-2018 Through 08-12-2018",

year = "2018",

language = "English (US)",

volume = "2018-December",

pages = "1254--1264",

journal = "Advances in Neural Information Processing Systems",

issn = "1049-5258",

}

TY - JOUR

T1 - Unsupervised Learning of View-invariant Action Representations

AU - Li, Junnan

AU - Zhao, Qi

AU - Wong, Yongkang

AU - Kankanhalli, Mohan S.

PY - 2018

Y1 - 2018

AB - The recent success in human action recognition with deep learning methods mostly adopt the supervised learning paradigm, which requires significant amount of manually labeled data to achieve good performance. However, label collection is an expensive and time-consuming process. In this work, we propose an unsupervised learning framework, which exploits unlabeled data to learn video representations. Different from previous works in video representation learning, our unsupervised learning task is to predict 3D motion in multiple target views using video representation from a source view. By learning to extrapolate cross-view motions, the representation can capture view-invariant motion dynamics which is discriminative for the action. In addition, we propose a view-adversarial training method to enhance learning of view-invariant features. We demonstrate the effectiveness of the learned representations for action recognition on multiple datasets.

UR - http://www.scopus.com/inward/record.url?scp=85064811240&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85064811240&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:85064811240

SN - 1049-5258

VL - 2018-December

SP - 1254

EP - 1264

JO - Advances in Neural Information Processing Systems

JF - Advances in Neural Information Processing Systems

T2 - 32nd Conference on Neural Information Processing Systems, NeurIPS 2018

Y2 - 2 December 2018 through 8 December 2018

ER -
