News

  • 10/2018

    I accepted the invitation to serve as an Associate Editor for the Elsevier journal Measurement.

  • 10/2018

    I delivered a talk at the PhD Colloquium Series at RIT.

  • 10/2018

    I accepted the invitation to serve as a reviewer for CVPR 2019.

  • 09/2018

    Our paper titled "Adversarial Action Prediction Networks" was accepted by IEEE TPAMI. Congratulations! [PDF]

  • 08/2018

    Our paper titled "Clustered Lifelong Learning via Representative Task Selection" was accepted by ICDM 2018. Congratulations to Gan Sun!

  • 05/2018

    I accepted the invitation to serve on the Senior Program Committee of AAAI 2019.

  • 02/2018

    I accepted the invitation to serve on the Program Committee of ICMLA 2018.

  • 02/2018

    Our paper "Residual Dense Network for Image Super-Resolution" was accepted by CVPR 2018 as an oral paper. Congratulations to Yulun! [PDF] [code]

  • 10/2017

    Our paper "Action Prediction from Videos via Memorizing Hard-to-Predict Samples" was accepted by AAAI 2018. Congratulations! [PDF] [code]

  • 07/2017

    Our paper "Deep Active Learning Through Cognitive Information Parcels" was accepted by ACM Multimedia 2017. Congratulations to Wencang! [PDF] [code]

  • 06/2018

    A survey on human action recognition and prediction is available on arXiv. [PDF]

  • 04/2017

    Our paper "Multi-Stream Deep Similarity Learning Networks for Visual Tracking" was accepted by IJCAI 2017. Congratulations to Kunpeng! [PDF] [code]

  • 03/2017

    Our paper "Deep Sequential Context Networks for Action Prediction" was accepted by CVPR 2017. Congratulations! [PDF] [code]

Selective Research Projects

  • 2016-2018
    Human Action Prediction in Videos

    Yu Kong, Zhiqiang Tao, Yun Fu. Adversarial Action Prediction Networks. IEEE T-PAMI, 2018. (IF=8.329) [PDF]
    Yu Kong, Zhiqiang Tao, Yun Fu. Deep Sequential Context Networks for Action Prediction. CVPR 2017. [PDF]

    This paper proposes efficient and powerful deep networks for action prediction from partially observed videos containing temporally incomplete action executions. Different from after-the-fact action recognition, action prediction task requires action labels to be predicted from these partially observed videos. Our approach exploits abundant sequential context information to enrich the feature representations of partial videos. We reconstruct missing information in the features extracted from partial videos by learning from fully observed action videos. The amount of the information is temporally ordered for the purpose of modeling temporal orderings of action segments. Label information is also used to better separate the learned features of different categories. We develop a new learning formulation that enables efficient model training. Extensive experimental results on UCF101, Sports-1M and BIT datasets demonstrate that our approach remarkably outperforms state-of-the-art methods, and is up to 300× faster than these methods. Results also show that actions differ in their prediction characteristics; some actions can be correctly predicted even though only the beginning 10% portion of videos is observed.
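
    A minimal sketch of the core idea (not the released implementation): a small network maps the feature of a partially observed video toward the feature of the corresponding full video, while a shared classifier keeps the enriched feature discriminative. The dimensions, architecture, and toy data below are illustrative assumptions.

```python
# Illustrative sketch only; FEAT_DIM, NUM_CLASSES, and the random data are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, NUM_CLASSES = 512, 101

# Maps a partial-video feature toward the corresponding full-video feature.
enricher = nn.Sequential(nn.Linear(FEAT_DIM, FEAT_DIM), nn.ReLU(),
                         nn.Linear(FEAT_DIM, FEAT_DIM))
classifier = nn.Linear(FEAT_DIM, NUM_CLASSES)
opt = torch.optim.Adam(list(enricher.parameters()) + list(classifier.parameters()), lr=1e-3)

partial = torch.randn(8, FEAT_DIM)             # features of partial videos
full = torch.randn(8, FEAT_DIM)                # features of the same videos, fully observed
labels = torch.randint(0, NUM_CLASSES, (8,))   # action labels

for _ in range(5):
    enriched = enricher(partial)
    # Pull the enriched feature toward the full-video feature and keep it
    # discriminative with a standard classification loss.
    loss = F.mse_loss(enriched, full) + F.cross_entropy(classifier(enriched), labels)
    opt.zero_grad(); loss.backward(); opt.step()
```
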
  • 2017-2018
    Core Machine Learning

    Gan Sun, Yang Cong, Yu Kong, and Xiaowei Xu. Clustered Lifelong Learning via Representative Task Selection. International Conference on Data Mining (ICDM), 2018. [PDF]
    Yu Kong, Zhengming Ding, Jun Li, and Yun Fu, Deeply Learned View-Invariant Features for Cross-View Action Recognition, IEEE Transactions on Image Processing (T-IP), 26(6):3028-3037, 2017. [PDF]
    Yu Kong, Ming Shao, Kang Li, and Yun Fu. Probabilistic Low-Rank Multi-Task Learning. IEEE Trans. Neural Networks and Learning Systems (T-NNLS), 2017. [PDF]

    Machine learning algorithms are widely applied to address practical problems in various scenarios. I am particularly interested in learning from cross-modality data and from additional sources of data in order to improve data utility and performance. I have done research in multi-view learning, multi-task learning, and lifelong learning, and I am excited to develop machine learning algorithms that solve more challenging problems.
  • 2015-2016
    Multi-Modality Learning for Action Recognition

    Yu Kong, Yun Fu. Max-Margin Heterogeneous Information Machine for RGB-D Action Recognition. IJCV 2017. (IF=11.541) [PDF]
    Yu Kong, Yun Fu. Bilinear Heterogeneous Information Machine for RGB-D Action Recognition. CVPR 2015. [PDF]

    We propose a novel approach, max-margin heterogeneous information machine (MMHIM), for human action recognition from RGB-D videos. MMHIM fuses heterogeneous RGB visual features and depth features, and learns effective action classifiers using the fused features. Rich heterogeneous visual and depth data are effectively compressed and projected to a learned shared space and independent private spaces, in order to reduce noise and capture useful information for recognition. Knowledge from various sources can then be shared with others in the learned space to learn cross-modal features. This guides the discovery of valuable information for recognition. To capture complex spatiotemporal structural relationships in visual and depth features, we represent both RGB and depth data in a matrix form. We formulate the recognition task as a low-rank bilinear model composed of row and column parameter matrices. The rank of the model parameter is minimized to build a low-rank classifier, which is beneficial for improving the generalization power. We also extend MMHIM to a structured prediction model that is capable of making structured outputs. Extensive experiments on a new RGB-D action dataset and two other public RGB-D action datasets show that our approaches achieve state-of-the-art results. Promising results are also shown if RGB or depth data are missing in training or testing procedure.
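
    A minimal numpy sketch of the low-rank bilinear scoring idea, under assumed shapes: an action video is represented as a feature matrix X, and the score of class c is tr(U_c^T X V_c) with factorized (hence low-rank) row and column parameters.

```python
# Illustrative sketch only; dimensions, rank, and random data are assumptions.
import numpy as np

rng = np.random.default_rng(0)
D, T, R, C = 64, 30, 5, 4            # feature dim, frames, rank, number of classes

X = rng.standard_normal((D, T))      # one video (RGB or depth features) as a matrix
U = rng.standard_normal((C, D, R))   # row parameters, one D x R block per class
V = rng.standard_normal((C, T, R))   # column parameters, one T x R block per class

def bilinear_score(X, U_c, V_c):
    # tr(U_c^T X V_c): a rank-R bilinear classifier applied to the matrix input.
    return np.trace(U_c.T @ X @ V_c)

scores = np.array([bilinear_score(X, U[c], V[c]) for c in range(C)])
print("predicted class:", scores.argmax())
```
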
  • 2014-2016
    Max-margin Action Prediction Machine

    Yu Kong, Yun Fu. Max-Margin Action Prediction Machine. IEEE T-PAMI, 2016. (IF=8.329) [PDF]
    Yu Kong, Dmitry Kit, Yun Fu. A Discriminative Model with Multiple Temporal Scales for Action Prediction. ECCV 2014 [PDF]

    The speed with which intelligent systems can react to an action depends on how soon it can be recognized. The ability to recognize ongoing actions is critical in many applications, for example, spotting criminal activity. It is challenging, since decisions have to be made based on partial videos of temporally incomplete action executions. In this paper, we propose a novel discriminative multi-scale kernelized model for predicting the action class from a partially observed video. The proposed model captures temporal dynamics of human actions by explicitly considering all the history of observed features as well as features in smaller temporal segments. A compositional kernel is proposed to hierarchically capture the relationships between partial observations as well as the temporal segments, respectively. We develop a new learning formulation, which elegantly captures the temporal evolution over time, and enforces the label consistency between segments and corresponding partial videos. We prove that the proposed learning formulation minimizes the upper bound of the empirical risk. Experimental results on four public datasets show that the proposed approach outperforms state-of-the-art action prediction methods.
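
    A minimal sketch of a compositional kernel over partially observed videos: a coarse kernel on the whole observed portion is combined with finer kernels on its temporal segments. The RBF base kernel, the segment count, and the random frame features are illustrative assumptions rather than the paper's exact kernel.

```python
# Illustrative sketch only; the base kernel and data are assumptions.
import numpy as np

rng = np.random.default_rng(1)

def rbf(x, y, gamma=0.1):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def segment_means(frames, n_segments=4):
    # Split observed frames into equal temporal segments and average each segment.
    return [seg.mean(axis=0) for seg in np.array_split(frames, n_segments)]

def compositional_kernel(frames_a, frames_b, n_segments=4):
    # Coarse term over the whole observed portion plus fine per-segment terms.
    k = rbf(frames_a.mean(axis=0), frames_b.mean(axis=0))
    for sa, sb in zip(segment_means(frames_a, n_segments),
                      segment_means(frames_b, n_segments)):
        k += rbf(sa, sb)
    return k

video_a = rng.standard_normal((24, 128))   # 24 observed frames, 128-dim features
video_b = rng.standard_normal((18, 128))   # a shorter partial observation
print(compositional_kernel(video_a, video_b))
```
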
  • 2011-2014
    Human Interaction Recognition

    Yu Kong, Yunde Jia, Yun Fu. Interactive Phrases: Semantic Descriptions for Human Interaction Recognition. IEEE T-PAMI, 2014. (IF=8.329) [PDF]
    Yu Kong, Yunde Jia, Yun Fu. Learning Human Interaction by Interactive Phrases. ECCV 2012 [PDF]

    This paper addresses the problem of recognizing human interactions from videos. We propose a novel approach that recognizes human interactions by the learned high-level descriptions, interactive phrases. Interactive phrases describe motion relationships between interacting people. These phrases naturally exploit human knowledge and allow us to construct a more descriptive model for recognizing human interactions. We propose a discriminative model to encode interactive phrases based on the latent SVM formulation. Interactive phrases are treated as latent variables and are used as mid-level features. To complement manually specified interactive phrases, we also discover data-driven phrases from data in order to find potentially useful and discriminative phrases for differentiating human interactions. An information-theoretic approach is employed to learn the data-driven phrases. The interdependencies between interactive phrases are explicitly captured in the model to deal with motion ambiguity and partial occlusion in the interactions. We evaluate our method on the BIT-Interaction data set, UT-Interaction data set, and Collective Activity data set. Experimental results show that our approach achieves superior performance over previous approaches.
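
    A minimal sketch of the latent-variable scoring behind this model: interactive phrases are binary latent variables, pairwise terms capture their interdependencies, and the score of an interaction class is maximized over phrase assignments. The phrase count, weights, and features below are illustrative assumptions.

```python
# Illustrative sketch only; weights and features are random placeholders.
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
N_PHRASES, FEAT = 6, 32

x = rng.standard_normal(FEAT)                         # motion features of a video
w_unary = rng.standard_normal((N_PHRASES, FEAT))      # class-specific phrase weights
w_pair = rng.standard_normal((N_PHRASES, N_PHRASES))  # phrase interdependencies

def class_score(x, w_unary, w_pair):
    # Enumerate binary phrase assignments (feasible only for a tiny phrase set)
    # and keep the best-scoring one; the pairwise term couples the phrases.
    best = -np.inf
    for z in product([0, 1], repeat=N_PHRASES):
        unary = sum(z[j] * (w_unary[j] @ x) for j in range(N_PHRASES))
        pair = sum(w_pair[i, j] * z[i] * z[j]
                   for i in range(N_PHRASES) for j in range(i + 1, N_PHRASES))
        best = max(best, unary + pair)
    return best

print(class_score(x, w_unary, w_pair))
```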

People in the Computer Vision Lab

  • Prof. Yu Kong

    B. Thomas Golisano College of Computing and Information Sciences, Rochester Institute of Technology

    Office: Room 1079, Golisano Hall
    Phone: (585) 475-5673
    Email: yu.kong AT rit.edu

  • Junwen Chen

    PhD student (2018 Fall - present)

    Research interests: computer vision and deep learning, visual perception, autonomous driving

    Project: Group action prediction

  • Yongshun Xu (Visiting student)

    Email: xuys9401 AT gmail.com

    I am a visiting student in GCCIS at Rochester Institute of Technology, Rochester, NY, advised by Prof. Yu Kong. I received my M.S. degree in Operations Research from Northeastern University, Boston, MA in 2018, and my B.E. degree from the University of Science and Technology of China in 2015. My research interests include deep learning and computer vision.

  • Congcong Jin

    Visiting student

    Congcong Jin is currently a visiting student at RIT under the supervision of Prof. Kong, focusing on human action prediction. She is a master's student at Xi'an Jiaotong University. Her research interests mainly include computer vision and deep learning.

Alumni

  • Tie Liu, visiting student (August 2018), now at Beihang University, China

Research Projects

  • Human Action Prediction from Videos

    In some scenarios, predicting the action label before the action execution ends is extremely important. For example, it would be very helpful if an intelligent system on a vehicle could predict a traffic accident before it happens, as opposed to recognizing the dangerous event only after the fact. In this project, we propose computational approaches that discover the temporal regularities in human actions and improve the discriminative power of features extracted from partially observed actions.

    • Yu Kong, Zhiqiang Tao, Yun Fu. Adversarial Action Prediction Networks. IEEE T-PAMI, 2018. (IF=8.329) [PDF]
    • Yu Kong, Shangqian Gao, Bin Sun, and Yun Fu. Action Prediction from Videos via Memorizing Hard-to-Predict Samples. AAAI Conference on Artificial Intelligence (AAAI), 2018. [PDF]
    • Yu Kong, Zhiqiang Tao, Yun Fu. Deep Sequential Context Networks for Action Prediction. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [PDF]
    • Yu Kong and Yun Fu. Max-Margin Action Prediction Machine. IEEE Trans. Pattern Analysis and Machine Intelligence (T-PAMI), 38(9):1844-1858, 2016. [PDF]
    • Yu Kong, Dmitry Kit, and Yun Fu. A Discriminative Model with Multiple Temporal Scales for Action Prediction. European Conference on Computer Vision (ECCV), pp. 596-611, 2014. [PDF]

  • Deep Convolutional Neural Networks

    Convolutional neural networks (CNNs) are powerful tools for learning highly expressive image features in a variety of computer vision tasks. In this project, we investigate the architecture of CNNs, and propose novel loss functions to create efficient and powerful CNNs.

    • Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, Yun Fu. Residual Dense Network for Image Super-Resolution. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. [PDF]
    • Kunpeng Li, Yu Kong, and Yun Fu. Multi-Stream Deep Similarity Learning Networks for Visual Tracking. International Joint Conference on Artificial Intelligence (IJCAI), pp. 2166-2172, 2017. [PDF]
    • Yue Wu, Jun Li, Yu Kong, Yun Fu. Deep Convolutional Neural Network with Independent Softmax for Large Scale Face Recognition. ACM Multimedia (ACM MM), pp. 1063-1067, 2016. [PDF]

  • Core Machine Learning

    Machine learning algorithms are widely applied to address practical problems in various scenarios. I am particularly interested in learning from cross-modality data and from additional sources of data in order to improve data utility and performance. I have done research in multi-view learning, multi-task learning, and lifelong learning, and I am excited to develop machine learning algorithms that solve more challenging problems.

    • Gan Sun, Yang Cong, Yu Kong, and Xiaowei Xu. Clustered Lifelong Learning via Representative Task Selection. International Conference on Data Mining (ICDM), 2018. [PDF]
    • Yu Kong, Zhengming Ding, Jun Li, and Yun Fu, Deeply Learned View-Invariant Features for Cross-View Action Recognition, IEEE Transactions on Image Processing (T-IP), 26(6):3028-3037, 2017. [PDF]
    • Yu Kong, Ming Shao, Kang Li, and Yun Fu. Probabilistic Low-Rank Multi-Task Learning. IEEE Trans. Neural Networks and Learning Systems (T-NNLS), 2017. [PDF]
  • RGB-D Action Recognition

    In addition to the RGB visual data captured by conventional cameras, RGB-D cameras provide depth data that encode rich 3D structural information about the entire scene. In this research, we study the problem of recognizing human actions from RGB-D videos, and propose effective approaches that handle the case where the RGB or depth modality is missing.

    • Yu Kong and Yun Fu. Max-Margin Heterogeneous Information Machine for RGB-D Action Recognition. International Journal of Computer Vision (IJCV), 123(3):350-371, 2017. [PDF]
    • Yu Kong and Yun Fu. Discriminative Relational Representation Learning for RGB-D Action Recognition. IEEE Trans. Image Processing (T-IP), 25(6):2856-2865, 2016. [PDF]
    • Yu Kong and Yun Fu. Bilinear Heterogeneous Information Machine for RGB-D Action Recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1054-1062, 2015. [PDF]
    • Yu Kong, Behnam Satarboroujeni, and Yun Fu. Learning Hierarchical 3D Kernel Descriptors for RGB-D Action Recognition. Computer Vision and Image Understanding (CVIU), 144:14-23, 2016. [PDF]
    • Yu Kong, Behnam Sattar, and Yun Fu. Hierarchical 3D Kernel Descriptors for Action Recognition Using Depth Sequences. IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2015. [PDF]
    • Chengcheng Jia, Yu Kong, Zhengming Ding, and Yun Fu. Latent Tensor Transfer Learning for RGB-D Action Recognition. ACM Multimedia (ACM MM), pp. 87-96, 2014 (long paper). [PDF]

  • Human Interaction Recognition

    Recognizing human activities is a fundamental problem in the computer vision community and a key step toward the automatic understanding of scenes. Compared with single-person actions, human interactions are typical real-world activities and have received much attention in the community. In this project, we develop interaction representation methods that are robust to the frequent occlusions and large appearance variations in human interactions.

    • Yu Kong and Yun Fu. Close Human Interaction Recognition using Patch-aware Models. IEEE Trans. Image Processing (T-IP), 25(1):167-178, 2015. [PDF]
    • Yu Kong, Yunde Jia, and Yun Fu. Interactive Phrases: Semantic Descriptions for Human Interaction Recognition. IEEE Trans. Pattern Analysis and Machine Intelligence (T-PAMI), 36(9):1775-1788, 2014. [PDF]
    • Yu Kong and Yun Fu. Modeling Supporting Regions for Close Human Interaction Recognition. European Conference on Computer Vision workshop, pp. 29-44, 2014. [PDF]
    • Yu Kong, Yunde Jia, and Yun Fu. Learning Human Interactions by Interactive Phrases. European Conference on Computer Vision (ECCV), pp. 300-313, 2012
    • Yu Kong and Yunde Jia. A Hierarchical Model for Human Interaction Recognition. IEEE International Conference on Multimedia and Expo (ICME), pp. 1-6, 2012, oral. [PDF]
  • Social Media Analytics

    Assigning GPS coordinates (i.e., latitude and longitude) to images using image content is challenging due to the huge appearance variations of visual features across the world. This project studies a computational framework that learns similar feature representations for geographically close images and distinct feature representations for geographically distant images.

    • Shuhui Jiang, Yu Kong, Yun Fu. Deep Geo-constrained Auto-encoder for Non-landmark GPS Estimation. IEEE Trans. Big Data, 2017. [PDF]
    • Dmitry Kit, Yu Kong, and Yun Fu. Efficient Image Geotagging Using Large Databases. IEEE Trans. Big Data, 2(4):325-338, 2016. [PDF]
    • Dmitry Kit, Yu Kong, and Yun Fu. LASOM: Location Aware Self-Organizing Map for discovering similar and unique visual features of geographical locations. International Joint Conference on Neural Networks (IJCNN), pp. 263-270, 2014. [PDF]

2018

Adversarial Action Prediction Networks

Yu Kong, Zhiqiang Tao, Yun Fu
Journal Paper IEEE Trans. Pattern Analysis and Machine Intelligence (T-PAMI), 2018

Abstract

Different from after-the-fact action recognition, action prediction task requires action labels to be predicted from partially observed videos containing incomplete action executions. It is challenging because these partial videos have insufficient discriminative information, and their temporal structure is damaged. We study this problem in this paper, and propose an efficient and powerful deep network for learning representative and discriminative features for action prediction. Our approach exploits abundant sequential context information in full videos to enrich the feature representations of partial videos. This information is encoded in latent representations using a variational autoencoder (VAE), which are encouraged to be progress-invariant. Decoding such latent representations using another VAE, we can reconstruct missing information in the features extracted from partial videos. An adversarial learning scheme is adopted to differentiate the reconstructed features from the features directly extracted from full videos in order to well align their distributions. A multi-class classifier is also used to encourage the features to be discriminative. Our network jointly learns features and classifiers, and generates the features particularly optimized for action prediction. Extensive experimental results on UCF101, Sports-1M and BIT datasets demonstrate that our approach remarkably outperforms state-of-the-art methods, and shows significant speedup over these methods. Results also show that actions differ in their prediction characteristics; some actions can be correctly predicted even though only the beginning 10% portion of videos is observed.
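
A minimal sketch of the adversarial ingredient described above, under assumed dimensions and toy data (this is not the released implementation): a generator reconstructs full-video information from partial-video features, and a discriminator tries to tell the reconstructions from real full-video features, which pushes the two feature distributions to align.

```python
# Illustrative sketch only; architectures, sizes, and data are assumptions.
import torch
import torch.nn as nn

FEAT = 256
generator = nn.Sequential(nn.Linear(FEAT, FEAT), nn.ReLU(), nn.Linear(FEAT, FEAT))
discriminator = nn.Sequential(nn.Linear(FEAT, 128), nn.ReLU(), nn.Linear(128, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

partial = torch.randn(16, FEAT)   # features of partially observed videos
full = torch.randn(16, FEAT)      # features of the same videos, fully observed

for _ in range(3):
    # Discriminator step: real = full-video features, fake = reconstructions.
    fake = generator(partial).detach()
    d_loss = bce(discriminator(full), torch.ones(16, 1)) + \
             bce(discriminator(fake), torch.zeros(16, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: make reconstructions indistinguishable from full-video features.
    g_loss = bce(discriminator(generator(partial)), torch.ones(16, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```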

Clustered Lifelong Learning via Representative Task Selection

Gan Sun, Yang Cong, Yu Kong, and Xiaowei Xu
Conference Paper International Conference on Data Mining (ICDM), 2018

Abstract

Consider the lifelong machine learning problem where the objective is to learn new consecutive tasks depending on previously accumulated experiences, i.e., knowledge library. In comparison with most state-of-the-arts which adopt knowledge library with prescribed size, in this paper, we propose a new incremental clustered lifelong learning model with two libraries: feature library and model library, called Clustered Lifelong Learning (CL3), in which the feature library maintains a set of learned features common across all the encountered tasks, and the model library is learned by identifying and adding representative models (clusters). When a new task arrives, the original task model can be firstly reconstructed by representative models measured by capped l2-norm distance, i.e., effectively assigning the new task model to multiple representative models under feature library. Based on this assignment knowledge of new task, the objective of our CL3 model is to transfer the knowledge from both feature library and model library to learn the new task. The new task 1) with a higher outlier probability will then be judged as a new representative, and used to refine both feature library and representative models over time; 2) with lower outlier probability will only update the feature library. For the model optimisation, we cast this problem as an alternating direction minimization problem. To this end, the performance of CL3 is evaluated through comparing with most lifelong learning models, even some batch clustered multi-task learning models.

Action Prediction from Videos via Memorizing Hard-to-Predict Samples

Yu Kong, Shangqian Gao, Bin Sun, Yun Fu
Conference Paper AAAI Conference on Artificial Intelligence (AAAI), 2018

Abstract

Action prediction based on video is an important problem in computer vision field with many applications, such as preventing accidents and criminal activities. It’s challenging to predict actions at the early stage because of the large variations between early observed videos and complete ones. Besides, intra-class variations cause confusions to the predictors as well. In this paper, we propose a mem-LSTM model to predict actions in the early stage, in which a memory module is introduced to record several “hard-to-predict” samples and a variety of early observations. Our method uses Convolution Neural Network (CNN) and Long Short-Term Memory (LSTM) to model partial observed video input. We augment LSTM with a memory module to remember challenging video instances. With the memory module, our mem-LSTM model not only achieves impressive performance in the early stage but also makes predictions without the prior knowledge of observation ratio. Information in future frames is also utilized using a bi-directional layer of LSTM. Experiments on UCF-101 and Sports-1M datasets show that our method outperforms state-of-the-art methods.
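
A minimal sketch of the memory-augmented prediction idea, with assumed sizes and random placeholder data: an LSTM summarizes the observed frames, an external memory stores features and labels of previously seen hard examples, and the final prediction blends the classifier output with labels retrieved by similarity.

```python
# Illustrative sketch only; all sizes and tensors are placeholder assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT, HID, NUM_CLASSES, MEM_SIZE = 512, 256, 101, 50

lstm = nn.LSTM(FEAT, HID, batch_first=True)
classifier = nn.Linear(HID, NUM_CLASSES)

# Memory of previously stored hard examples: hidden-state keys and their labels.
mem_keys = torch.randn(MEM_SIZE, HID)
mem_labels = torch.randint(0, NUM_CLASSES, (MEM_SIZE,))

frames = torch.randn(4, 20, FEAT)          # 4 partial videos, 20 observed frames each
_, (h_n, _) = lstm(frames)
query = h_n[-1]                            # (4, HID) summary of each partial video

# Retrieve labels of similar memory slots via cosine similarity.
sim = F.normalize(query, dim=1) @ F.normalize(mem_keys, dim=1).t()
attn = F.softmax(sim, dim=1)               # (4, MEM_SIZE)
mem_vote = attn @ F.one_hot(mem_labels, NUM_CLASSES).float()

# Blend the classifier's prediction with the memory's vote.
probs = 0.5 * F.softmax(classifier(query), dim=1) + 0.5 * mem_vote
print(probs.argmax(dim=1))
```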

Residual Dense Network for Image Super-Resolution

Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, Yun Fu.
Conference Paper IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

Abstract

A very deep convolutional neural network (CNN) has recently achieved great success for image super-resolution (SR) and offered hierarchical features as well. However, most deep CNN based SR models do not make full use of the hierarchical features from the original low-resolution (LR) images, thereby achieving relatively-low performance. In this paper, we propose a novel residual dense network (RDN) to address this problem in image SR. We fully exploit the hierarchical features from all the convolutional layers. Specifically, we propose residual dense block (RDB) to extract abundant local features via dense connected convolutional layers. RDB further allows direct connections from the state of preceding RDB to all the layers of current RDB, leading to a contiguous memory (CM) mechanism. Local feature fusion in RDB is then used to adaptively learn more effective features from preceding and current local features and stabilizes the training of wider network. After fully obtaining dense local features, we use global feature fusion to jointly and adaptively learn global hierarchical features in a holistic way. Extensive experiments on benchmark datasets with different degradation models show that our RDN achieves favorable performance against state-of-the-art methods.
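
A minimal sketch of a residual dense block (RDB): densely connected convolutional layers, local feature fusion with a 1x1 convolution, and a local residual connection. The channel count, growth rate, and layer count below are illustrative assumptions rather than the paper's exact configuration.

```python
# Illustrative sketch only; hyper-parameters are assumptions.
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    def __init__(self, channels=64, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels + i * growth, growth, 3, padding=1),
                          nn.ReLU(inplace=True))
            for i in range(num_layers)])
        # Local feature fusion: 1x1 conv back to the block's channel count.
        self.fuse = nn.Conv2d(channels + num_layers * growth, channels, 1)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))   # dense connections
        return x + self.fuse(torch.cat(feats, dim=1))      # local residual learning

block = ResidualDenseBlock()
print(block(torch.randn(1, 64, 32, 32)).shape)             # torch.Size([1, 64, 32, 32])
```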

Hierarchical and Spatio-Temporal Sparse Representation for Human Action Recognition

Yi Tian, Yu Kong, Qiuqi Ruan, Gaoyun An, Yun Fu.
Journal Paper IEEE Transactions on Image Processing (T-IP), 2018. Accepted

2017

Deeply Learned View-Invariant Features for Cross-View Action Recognition

Yu Kong, Zhengming Ding, Jun Li, and Yun Fu
Journal Paper IEEE Transactions on Image Processing (T-IP), 26(6):3028-3037, 2017

Abstract

Classifying human actions from varied views is challenging due to huge data variations in different views. The key to this problem is to learn discriminative view-invariant features robust to view variations. In this paper, we address this problem by learning view-specific and view-shared features using novel deep models. View-specific features capture unique dynamics of each view while view-shared features encode common patterns across views. A novel sample-affinity matrix is introduced in learning shared features, which accurately balances information transfer within the samples from multiple views and limits the transfer across samples. This allows us to learn more discriminative shared features robust to view variations. In addition, the incoherence between the two types of features is encouraged to reduce information redundancy and exploit discriminative information in them separately. The discriminative power of the learned features is further improved by encouraging features in the same categories to be geometrically closer. Robust view-invariant features are finally learned by stacking several layers of features. Experimental results on three multi-view data sets show that our approaches outperform the state-of-the-art approaches.
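
A minimal sketch of the view-shared/view-specific separation, with assumed encoders, dimensions, and toy data: a shared encoder is applied to every view, each view has its own specific encoder, shared codes of the same sample are pulled together across views, and a simple incoherence penalty discourages redundancy between the two kinds of codes.

```python
# Illustrative sketch only; encoders, sizes, and data are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT, CODE = 256, 64
shared_enc = nn.Linear(FEAT, CODE)                    # common across views
specific_enc = nn.ModuleList([nn.Linear(FEAT, CODE)   # one encoder per view
                              for _ in range(2)])
opt = torch.optim.Adam(list(shared_enc.parameters()) +
                       list(specific_enc.parameters()), lr=1e-3)

x_view0 = torch.randn(8, FEAT)   # the same 8 actions observed from view 0
x_view1 = torch.randn(8, FEAT)   # ... and from view 1

for _ in range(5):
    s0, s1 = shared_enc(x_view0), shared_enc(x_view1)
    p0, p1 = specific_enc[0](x_view0), specific_enc[1](x_view1)
    # Shared codes of the same sample should agree across views; shared and
    # view-specific codes should stay (roughly) incoherent.
    loss = F.mse_loss(s0, s1) + (s0 * p0).mean().abs() + (s1 * p1).mean().abs()
    opt.zero_grad(); loss.backward(); opt.step()
```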

Deep Geo-constrained Auto-encoder for Non-landmark GPS Estimation

Shuhui Jiang, Yu Kong, Yun Fu
Journal Paper IEEE Trans. Big Data, 2017

Probabilistic Low-Rank Multi-Task Learning

Yu Kong, Ming Shao, Kang Li, and Yun Fu
Journal Paper IEEE Trans. Neural Networks and Learning Systems (T-NNLS), 2017

Abstract

In this paper, we consider the problem of learning multiple related tasks simultaneously with the goal of improving the generalization performance of individual tasks. The key challenge is to effectively exploit the shared information across multiple tasks as well as preserve the discriminative information for each individual task. To address this, we propose a novel probabilistic model for multitask learning (MTL) that can automatically balance between low-rank and sparsity constraints. The former assumes a low-rank structure of the underlying predictive hypothesis space to explicitly capture the relationship of different tasks and the latter learns the incoherent sparse patterns private to each task. We derive and perform inference via variational Bayesian methods. Experimental results on both regression and classification tasks on real-world applications demonstrate the effectiveness of the proposed method in dealing with the MTL problems.

Max-Margin Heterogeneous Information Machine for RGB-D Action Recognition

Yu Kong and Yun Fu.
Journal Paper International Journal of Computer Vision (IJCV), 123(3):350-371, 2017

Abstract

We propose a novel approach, max-margin heterogeneous information machine (MMHIM), for human action recognition from RGB-D videos. MMHIM fuses heterogeneous RGB visual features and depth features, and learns effective action classifiers using the fused features. Rich heterogeneous visual and depth data are effectively compressed and projected to a learned shared space and independent private spaces, in order to reduce noise and capture useful information for recognition. Knowledge from various sources can then be shared with others in the learned space to learn cross-modal features. This guides the discovery of valuable information for recognition. To capture complex spatiotemporal structural relationships in visual and depth features, we represent both RGB and depth data in a matrix form. We formulate the recognition task as a low-rank bilinear model composed of row and column parameter matrices. The rank of the model parameter is minimized to build a low-rank classifier, which is beneficial for improving the generalization power. We also extend MMHIM to a structured prediction model that is capable of making structured outputs. Extensive experiments on a new RGB-D action dataset and two other public RGB-D action datasets show that our approaches achieve state-of-the-art results. Promising results are also shown if RGB or depth data are missing in training or testing procedure.

Deep Sequential Context Networks for Action Prediction

Yu Kong, Zhiqiang Tao, Yun Fu.
Conference Paper IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Abstract

This paper proposes efficient and powerful deep networks for action prediction from partially observed videos containing temporally incomplete action executions. Different from after-the-fact action recognition, action prediction task requires action labels to be predicted from these partially observed videos. Our approach exploits abundant sequential context information to enrich the feature representations of partial videos. We reconstruct missing information in the features extracted from partial videos by learning from fully observed action videos. The amount of the information is temporally ordered for the purpose of modeling temporal orderings of action segments. Label information is also used to better separate the learned features of different categories. We develop a new learning formulation that enables efficient model training. Extensive experimental results on UCF101, Sports-1M and BIT datasets demonstrate that our approach remarkably outperforms state-of-the-art methods, and is up to 300× faster than these methods. Results also show that actions differ in their prediction characteristics; some actions can be correctly predicted even though only the beginning 10% portion of videos is observed.

Deep Active Learning Through Cognitive Information Parcels

Wencang Zhao, Yu Kong, Zhengming Ding, and Yun Fu
Conference Paper ACM Multimedia (ACM-MM), 2017

Abstract

In deep learning scenarios, a lot of labeled samples are needed to train the models. However, in practical application fields, since the objects to be recognized are complex and non-uniformly distributed, it is difficult to get enough labeled samples at one time. Active learning can actively improve the accuracy with fewer training labels, which is one of the promising solutions to tackle this problem. Inspired by human being’s cognition process to acquire additional knowledge gradually, we propose a novel deep active learning method through Cognitive Information Parcels (CIPs) based on the analysis of model’s cognitive errors and expert’s instruction. The transformation of the cognitive parcels is defined, and the corresponding representation feature of the objects is obtained to identify the model’s cognitive error information. Experiments prove that the samples, selected based on the CIPs, can benefit the target recognition and boost the deep model’s performance efficiently. The characterization of cognitive knowledge can avoid the other samples’ disturbance to the cognitive property of the model effectively. We believe that our work could provide a trial of thought about the cognitive knowledge used in deep learning field.

Multi-Stream Deep Similarity Learning Networks for Visual Tracking

Kunpeng Li, Yu Kong, and Yun Fu
Conference Paper International Joint Conference on Artificial Intelligence (IJCAI), pp. 2166-2172, 2017

Abstract

Visual tracking has achieved remarkable success in recent decades, but it remains a challenging problem due to appearance variations over time and complex cluttered background. In this paper, we adopt a tracking-by-verification scheme to overcome these challenges by determining the patch in the subsequent frame that is most similar to the target template and distinctive to the background context. A multi-stream deep similarity learning network is proposed to learn the similarity comparison model. The loss function of our network encourages the distance between a positive patch in the search region and the target template to be smaller than that between positive patch and the background patches. Within the learned feature space, even if the distance between positive patches becomes large caused by the appearance change or interference of background clutter, our method can use the relative distance to distinguish the target robustly. Besides, the learned model is directly used for tracking with no need of model updating, parameter fine-tuning and can run at 45 fps on a single GPU. Our tracker achieves state-of-the-art performance on the visual tracking benchmark compared with other recent real-time-speed trackers, and shows better capability in handling background clutter, occlusion and appearance change.
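
A minimal sketch of the similarity-comparison objective, with an assumed embedding network, margin, and toy data: the distance between a positive patch and the target template is pushed below its distance to background patches from the search region.

```python
# Illustrative sketch only; the embedding network and patches are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 128))
opt = torch.optim.Adam(embed.parameters(), lr=1e-4)

template = torch.randn(1, 1024)    # target template patch (flattened features)
positive = torch.randn(1, 1024)    # patch overlapping the target in a new frame
background = torch.randn(8, 1024)  # background patches from the search region

for _ in range(5):
    t, p, b = embed(template), embed(positive), embed(background)
    d_pos = F.pairwise_distance(p, t)                 # distance to the template
    d_neg = F.pairwise_distance(p.expand_as(b), b)    # distances to the background
    # Hinge: positive-template distance smaller than every background distance.
    loss = F.relu(d_pos - d_neg + 1.0).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```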

Sparse Subspace Clustering by Learning Approximation l0 Codes

Jun Li, Yu Kong, Yun Fu
Conference Paper AAAI Conference on Artificial Intelligence (AAAI), 2017

Abstract

Subspace clustering has been widely applied to detect meaningful clusters in high-dimensional data spaces. A main challenge in subspace clustering is to quickly calculate a “good” affinity matrix. l0, l1, l2 or nuclear norm regularization is used to construct the affinity matrix in many subspace clustering methods because of their theoretical guarantees and empirical success. However, they suffer from the following problems: (1) l2 and nuclear norm regularization require very strong assumptions to guarantee a subspace-preserving affinity; (2) although l1 regularization can be guaranteed to give a subspace-preserving affinity under certain conditions, it needs more time to solve a large-scale convex optimization problem; (3) l0 regularization can yield a tradeoff between computationally efficient and subspace-preserving affinity by using the orthogonal matching pursuit (OMP) algorithm, but this still takes more time to search the solution in OMP when the number of data points is large. In order to overcome these problems, we first propose a learned OMP (LOMP) algorithm to learn a single hidden neural network (SHNN) to fast approximate the l0 code. We then exploit a sparse subspace clustering method based on l0 code which is fast computed by SHNN. Two sufficient conditions are presented to guarantee that our method can give a subspace-preserving affinity. Experiments on handwritten digit and face clustering show that our method not only quickly computes the l0 code, but also outperforms the relevant subspace clustering methods in clustering results. In particular, our method achieves the state-of-the-art clustering accuracy (94.32%) on MNIST.

2016

Introduction

Yu Kong, and Yun Fu
Book Chapter Human Activity Recognition and Prediction, Springer, 2016

Action Recognition

Yu Kong, and Yun Fu
Book Chapter Human Activity Recognition and Prediction, Springer, 2016

Activity Prediction

Yu Kong, and Yun Fu
Book Chapter Human Activity Recognition and Prediction, Springer, 2016

RGB-D Action Recognition

Chengcheng Jia, Zhengming Ding, Yu Kong, and Yun Fu
Book Chapter Human Activity Recognition and Prediction, Springer, 2016

Efficient Image Geotagging Using Large Databases

Dmitry Kit, Yu Kong, and Yun Fu
Journal Paper IEEE Trans. Big Data, 2(4):325-338, 2016

Abstract

There are now billions of images stored on photo sharing websites. These images contain visual cues that reflect the geographical location of where the photograph was taken (e.g., New York City). Linking visual features in images to physical locations has many potential applications, such as tourism recommendation systems. However, the size and nature of these databases pose great challenges. For example, the distribution of images around the world is highly biased towards popular regions. This results in high redundancy in certain locations, while under-representing the features in other regions. Many density estimation methods are unable to handle such datasets. In this paper we employ an on-line unsupervised clustering method, Location Aware Self-Organizing Map (LASOM), to compress a large image database and learn similarity relationships between different geographical locations. Our method achieves promising results when used on a dataset containing approximately 900,000 images. We further show that the learned representation results in minimal information loss as compared to using k-Nearest Neighbor method. The noise reduction property of LASOM allows for superior performance when combining multiple features. The final part of the paper explores clothing as a new information source that may assist in geolocation of images.

Learning Fast Low-rank Projection for Image Classification

Jun Li, Yu Kong, Handong Zhao, Jian Yang, and Yun Fu
Journal Paper IEEE Trans. Image Processing (T-IP), 25(10):4803-4814, 2016

Abstract

Rooted in a basic hypothesis that a data matrix is strictly drawn from some independent subspaces, the low-rank representation (LRR) model and its variations have been successfully applied in various image classification tasks. However, this hypothesis is very strict to the LRR model as it cannot always be guaranteed in real images. Moreover, the hypothesis also prevents the sub-dictionaries of different subspaces from collaboratively representing an image. Fortunately, in supervised image classification, low-rank signal can be extracted from the independent label subspaces (ILS) instead of the independent image subspaces (IIS). Therefore, this paper proposes a projective low-rank representation (PLR) model by directly training a projective function to approximate the LRR derived from the labels. To the best of our knowledge, PLR is the first attempt to use the ILS hypothesis to relax the rigorous IIS hypothesis in the LRR models. We further prove a low-rank effect that the representations learned by PLR have high intraclass similarities and large interclass differences, which are beneficial to the classification tasks. The effectiveness of our proposed approach is validated by the experimental results on three databases.

Discriminative Relational Representation Learning for RGB-D Action Recognition

Yu Kong and Yun Fu
Journal Paper IEEE Trans. Image Processing (T-IP), 25(6):2856-2865, 2016

Abstract

This paper addresses the problem of recognizing human actions from RGB-D videos. A discriminative relational feature learning method is proposed for fusing heterogeneous RGB and depth modalities, and classifying the actions in RGB-D sequences. Our method factorizes the feature matrix of each modality, and enforces the same semantics for them in order to learn shared features from multimodal data. This allows us to capture the complex correlations between the two modalities. To improve the discriminative power of the relational features, we introduce a hinge loss to measure the classification accuracy when the features are employed for classification. This essentially performs supervised factorization, and learns discriminative features that are optimized for classification. We formulate the recognition task within a maximum margin framework, and solve the formulation using a coordinate descent algorithm. The proposed method is extensively evaluated on two public RGB-D action data sets. We demonstrate that the proposed method can learn extremely low-dimensional features with superior discriminative power, and outperforms the state-of-the-art methods. It also achieves high performance when one modality is missing in testing or training.

Max-Margin Action Prediction Machine

Yu Kong and Yun Fu
Journal Paper IEEE Trans. Pattern Analysis and Machine Intelligence (T-PAMI), 38(9):1844-1858, 2016

Abstract

The speed with which intelligent systems can react to an action depends on how soon it can be recognized. The ability to recognize ongoing actions is critical in many applications, for example, spotting criminal activity. It is challenging, since decisions have to be made based on partial videos of temporally incomplete action executions. In this paper, we propose a novel discriminative multi-scale kernelized model for predicting the action class from a partially observed video. The proposed model captures temporal dynamics of human actions by explicitly considering all the history of observed features as well as features in smaller temporal segments. A compositional kernel is proposed to hierarchically capture the relationships between partial observations as well as the temporal segments, respectively. We develop a new learning formulation, which elegantly captures the temporal evolution over time, and enforces the label consistency between segments and corresponding partial videos. We prove that the proposed learning formulation minimizes the upper bound of the empirical risk. Experimental results on four public datasets show that the proposed approach outperforms state-of-the-art action prediction methods.

Deep Convolutional Neural Network with Independent Softmax for Large Scale Face Recognition

Yue Wu, Jun Li, Yu Kong, Yun Fu
Conference Paper ACM Multimedia (ACM MM), pp. 1063-1067, 2016

Abstract

In this paper, we present our solution to the MS-Celeb-1M Challenge. This challenge aims to recognize 100k celebrities at the same time. The huge number of celebrities is the bottleneck for training a deep convolutional neural network of which the output is equal to the number of celebrities. To solve this problem, an independent softmax model is proposed to split the single classifier into several small classifiers. Meanwhile, the training data are split into several partitions. This decomposes the large scale training procedure into several medium training procedures which can be solved separately. Besides, a large model is also trained and a simple strategy is introduced to merge the two models. Extensive experiments on the MSR-Celeb-1M dataset demonstrate the superiority of the proposed method. Our solution ranks the first and second in two tracks of the final evaluation.
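
A minimal sketch of the independent-softmax idea under assumed partition sizes: the very large classifier is split into several smaller heads, each responsible for one partition of identities, and their scores are concatenated at test time.

```python
# Illustrative sketch only; partition sizes, features, and data are assumptions.
import torch
import torch.nn as nn

FEAT = 512
partition_sizes = [30000, 30000, 40000]          # 100k identities in 3 partitions
heads = nn.ModuleList([nn.Linear(FEAT, n) for n in partition_sizes])

features = torch.randn(4, FEAT)                  # face features for 4 images

# Each head is trained independently on its own partition of identities;
# at inference the partition scores are concatenated to rank all identities.
scores = torch.cat([head(features) for head in heads], dim=1)   # (4, 100000)
print(scores.argmax(dim=1))
```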

2015

Close Human Interaction Recognition using Patch-aware Models

Yu Kong and Yun Fu
Journal Paper IEEE Trans. Image Processing (T-IP), 25(1):167-178, 2015

Abstract

This paper addresses the problem of recognizing human interactions with close physical contact from videos. Due to ambiguities in feature-to-person assignments and frequent occlusions in close interactions, it is difficult to accurately extract the interacting people. This degrades the recognition performance. We, therefore, propose a hierarchical model, which recognizes close interactions and infers supporting regions for each interacting individual simultaneously. Our model associates a set of hidden variables with spatiotemporal patches and discriminatively infers their states, which indicate the person that the patches belong to. This patch-aware representation explicitly models and accounts for discriminative supporting regions for individuals, and thus overcomes the problem of ambiguities in feature assignments. Moreover, we incorporate the prior for the patches to deal with frequent occlusions during interactions. Using the discriminative supporting regions, our model builds cleaner features for individual action recognition and interaction recognition. Extensive experiments are performed on the BIT-Interaction data set and the UT-Interaction data set (Set #1 and Set #2), and validate the effectiveness of our approach.

Learning Hierarchical 3D Kernel Descriptors for RGB-D Action Recognition

Yu Kong, Behnam Satarboroujeni, and Yun Fu
Journal Paper Computer Vision and Image Understanding (CVIU), 144:14-23, 2015

Abstract

Human action recognition is an important and challenging task due to intra-class variation and complexity of actions which is caused by diverse style and duration in performed action. Previous works mostly concentrate on either depth or RGB data to build an understanding about the shape and movement cues in videos but fail to simultaneously utilize rich information in both channels. In this paper we study the problem of RGB-D action recognition from both RGB and depth sequences using kernel descriptors. Kernel descriptors provide an unified and elegant framework to turn pixel-level attributes into descriptive information about the performed actions in video. We show how using simple kernel descriptors over pixel attributes in video sequences achieves a great success compared to the state-of-the-art complex methods. Following the success of kernel descriptors (Bo, et al., 2010) on object recognition task, we put forward the claim that using 3D kernel descriptors could be an effective way to project the low-level features on 3D patches into a powerful structure which can effectively describe the scene. We build our system upon the 3D Gradient kernel descriptor and construct a hierarchical framework by employing efficient match kernel (EMK) (Bo, and Sminchisescu, 2009) and hierarchical kernel descriptors (HKD) as higher levels to abstract the mid-level features for classification. Through extensive experiments we demonstrate the proposed approach achieves superior performance on four standard RGB-D sequences benchmarks.

Bilinear Heterogeneous Information Machine for RGB-D Action Recognition

Yu Kong, and Yun Fu
Conference Paper IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1054-1062, 2015

Abstract

This paper proposes a novel approach to action recognition from RGB-D cameras, in which depth features and RGB visual features are jointly used. Rich heterogeneous RGB and depth data are effectively compressed and projected to a learned shared space, in order to reduce noise and capture useful information for recognition. Knowledge from various sources can then be shared with others in the learned space to learn cross-modal features. This guides the discovery of valuable information for recognition. To capture complex spatiotemporal structural relationships in visual and depth features, we represent both RGB and depth data in a matrix form. We formulate the recognition task as a low-rank bilinear model composed of row and column parameter matrices. The rank of the model parameter is minimized to build a low-rank classifier, which is beneficial for improving the generalization power. The proposed method is extensively evaluated on two public RGB-D action datasets, and achieves state-of-the-art results. It also shows promising results if RGB or depth data are missing in training or testing procedure.

Hierarchical 3D Kernel Descriptors for Action Recognition Using Depth Sequences

Yu Kong, Behnam Sattar, and Yun Fu
Conference Paper IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2015.

Abstract

Action recognition is a challenging task due to intra-class motion variation caused by diverse style and duration in performed action videos. Previous works on action recognition task are more focused on hand-crafted features, treat different sources of information independently, and simply combine them before classification. In this paper we study action recognition from depth sequences captured by RGB-D cameras using kernel descriptors. Kernel descriptors provide an elegant way for combining a variety of information sources and can be easily applied to a hierarchical structure. We show how using kernel descriptors over pixel-level attributes in video sequences gains a great success compared to state-of-the-art methods. Following the success of kernel descriptors on object recognition tasks, we employ 3D kernel descriptors, which are a unified framework for capturing pixel-level attributes and turning them into discriminative low-level features on individual 3D patches. We use efficient match kernel (EMK) as the next level of our hierarchical structure to abstract the mid-level features for classification. Through extensive experiments we demonstrate using pixel-level attributes in the hierarchical architecture of our 3D kernel descriptor and EMK achieves superior performance on the standard depth sequences benchmarks.

2014

Interactive Phrases: Semantic Descriptions for Human Interaction Recognition

Yu Kong, Yunde Jia, and Yun Fu
Journal Paper IEEE Trans. Pattern Analysis and Machine Intelligence (T-PAMI), 36(9):1775-1788, 2014

Abstract

This paper addresses the problem of recognizing human interactions from videos. We propose a novel approach that recognizes human interactions by the learned high-level descriptions, interactive phrases. Interactive phrases describe motion relationships between interacting people. These phrases naturally exploit human knowledge and allow us to construct a more descriptive model for recognizing human interactions. We propose a discriminative model to encode interactive phrases based on the latent SVM formulation. Interactive phrases are treated as latent variables and are used as mid-level features. To complement manually specified interactive phrases, we also discover data-driven phrases from data in order to find potentially useful and discriminative phrases for differentiating human interactions. An information-theoretic approach is employed to learn the data-driven phrases. The interdependencies between interactive phrases are explicitly captured in the model to deal with motion ambiguity and partial occlusion in the interactions. We evaluate our method on the BIT-Interaction data set, UT-Interaction data set, and Collective Activity data set. Experimental results show that our approach achieves superior performance over previous approaches.

Recognizing Human Interaction from Videos by a Discriminative Model

Yu Kong, Wei Liang, and Yunde Jia
Journal Paper IET Computer Vision, 8(4):277-286, 2014

Learning a discriminative mid-level feature for action recognition

Cuiwei Liu, Mingtao Pei, Xinxiao Wu, Yu Kong, and Yunde Jia
Journal Paper Science China Information Sciences, 57(5):1-13, 2014

Modeling Supporting Regions for Close Human Interaction Recognition

Yu Kong, and Yun Fu
Conference Paper European Conference on Computer Vision workshop, pp. 29-44, 2014

Latent Tensor Transfer Learning for RGB-D Action Recognition

Chengcheng Jia, Yu Kong, Zhengming Ding, and Yun Fu
Conference Paper ACM Multimedia (ACM-MM), pp. 87-96, 2014

Abstract

This paper proposes a method to compensate RGB-D images from the original target RGB images by transferring the depth knowledge of source data. Conventional RGB databases (e.g., UT-Interaction database) do not contain depth information since they are captured by the RGB cameras. Therefore, the methods designed for RGB databases cannot take advantage of depth information, which proves useful for simplifying intra-class variations and background subtraction. In this paper, we present a novel transfer learning method that can transfer the knowledge from depth information to the RGB database, and use the additional source information to recognize human actions in RGB videos. Our method takes full advantage of 3D geometric information contained within the learned depth data, thus, can further improve action recognition performance. We treat action data as a fourth-order tensor (row, column, frame and sample), and apply latent low-rank transfer learning to learn shared subspaces of the source and target databases. Moreover, we introduce a novel cross-modality regularizer that plays an important role in finding the correlation between RGB and depth modalities, and then more depth information from the source database can be transferred to that of the target. Our method is extensively evaluated on publicly available databases. Results on two action datasets show that our method outperforms existing methods.

A Discriminative Model with Multiple Temporal Scales for Action Prediction

Yu Kong, Dmitry Kit, and Yun Fu
Conference Paper European Conference on Computer Vision (ECCV), pp. 596-611, 2014

Abstract

The speed with which intelligent systems can react to an action depends on how soon it can be recognized. The ability to recognize ongoing actions is critical in many applications, for example, spotting criminal activity. It is challenging, since decisions have to be made based on partial videos of temporally incomplete action executions. In this paper, we propose a novel discriminative multi-scale model for predicting the action class from a partially observed video. The proposed model captures temporal dynamics of human actions by explicitly considering all the history of observed features as well as features in smaller temporal segments. We develop a new learning formulation, which elegantly captures the temporal evolution over time, and enforces the label consistency between segments and corresponding partial videos. Experimental results on two public datasets show that the proposed approach outperforms state-of-the-art action prediction methods.

LASOM: Location Aware Self-Organizing Map for discovering similar and unique visual features of geographical locations

Dmitry Kit, Yu Kong, and Yun Fu
Conference Paper International Joint Conference on Neural Networks (IJCNN), pp. 263-270, 2014

Abstract

Can a machine tell us if an image was taken in Beijing or New York? Automated identification of the geographical coordinates based on image content is of particular importance to data mining systems, because geolocation provides a large source of context for other useful features of an image. However, successful localization of unannotated images requires a large collection of images that cover all possible locations. Brute-force searches over the entire databases are costly in terms of computation and storage requirements, and achieve limited results. Knowing what visual features make a particular location unique or similar to other locations can be used for choosing a better match between spatially distance locations. However, doing this at global scales is a challenging problem. In this paper we propose an on-line, unsupervised, clustering algorithm called Location Aware Self-Organizing Map (LASOM), for learning the similarity graph between different regions. The goal of LASOM is to select key features in specific locations so as to increase the accuracy in geotagging untagged images, while also reducing computational and storage requirements. Different from other Self-Organizing Map algorithms, LASOM provides the means to learn a conditional distribution of visual features, conditioned on geospatial coordinates. We demonstrate that the generated map not only preserves important visual information, but provides additional context in the form of visual similarity relationships between different geographical areas. We show how this information can be used to improve geotagging results when using large databases.
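
A minimal sketch of a location-aware self-organizing map update (not the exact LASOM update rule), with assumed map size, learning rate, and data: each node keeps a visual prototype and a geographic coordinate, and an incoming geotagged image updates the best-matching node and its neighbors, with the update strength also shrinking with geographic distance.

```python
# Illustrative sketch only; sizes, rates, and data are placeholder assumptions.
import numpy as np

rng = np.random.default_rng(3)
N_NODES, FEAT = 50, 128

proto = rng.standard_normal((N_NODES, FEAT))              # visual prototypes
geo = rng.uniform([-90, -180], [90, 180], (N_NODES, 2))   # node latitude/longitude

def lasom_update(proto, geo, x, x_geo, lr=0.1, sigma_feat=1.0, sigma_geo=30.0):
    # Best-matching unit by visual distance.
    bmu = np.argmin(np.linalg.norm(proto - x, axis=1))
    # Neighborhood weights in prototype space and in geographic space.
    h_feat = np.exp(-np.linalg.norm(proto - proto[bmu], axis=1) ** 2 / sigma_feat ** 2)
    h_geo = np.exp(-np.linalg.norm(geo - x_geo, axis=1) ** 2 / sigma_geo ** 2)
    w = (lr * h_feat * h_geo)[:, None]
    proto += w * (x - proto)     # in place: pull prototypes toward the image
    geo += w * (x_geo - geo)     # ... and node coordinates toward its location

lasom_update(proto, geo, rng.standard_normal(FEAT), np.array([40.7, -74.0]))
print(proto.shape, geo.shape)
```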

Prior Work

Learning Human Interactions by Interactive Phrases

Yu Kong, Yunde Jia, and Yun Fu
Conference Paper European Conference on Computer Vision (ECCV), pp. 300-313, 2012

Abstract

In this paper, we present a novel approach for human interaction recognition from videos. We introduce high-level descriptions called interactive phrases to express binary semantic motion relationships between interacting people. Interactive phrases naturally exploit human knowledge to describe interactions and allow us to construct a more descriptive model for recognizing human interactions. We propose a novel hierarchical model to encode interactive phrases based on the latent SVM framework where interactive phrases are treated as latent variables. The interdependencies between interactive phrases are explicitly captured in the model to deal with motion ambiguity and partial occlusion in interactions. We evaluate our method on a newly collected BIT-Interaction dataset and UT-Interaction dataset. Promising results demonstrate the effectiveness of the proposed method.

A Hierarchical Model for Human Interaction Recognition

Yu Kong and Yunde Jia
Conference Paper IEEE International Conference on Multimedia and Expo (ICME), pp. 1-6, 2012, Oral

Activity Recognition by Learning Structural and Pairwise Mid-level Features Using Random Forest

Jie Hu, Yu Kong, and Yun Fu
Conference Paper Automatic Face and Gesture Recognition (FG), 2013

Decomposed Contour Prior for Shape Recognition

Zhi Yang, Yu Kong, and Yun Fu
Conference Paper International Conference on Pattern Recognition (ICPR), pp. 767-770, 2012

Contour-HOG: A Stub Feature based Level Set Method for Learning Object Contour

Zhi Yang, Yu Kong, and Yun Fu
Conference Paper British Machine Vision Conference (BMVC), pp. 1-11, 2012

Action Recognition with Discriminative Middle Level Features

Cuiwei Liu, Yu Kong, Xinxiao Wu, and Yunde Jia
Conference Paper International Conference on Pattern Recognition (ICPR), pp. 1-13, 2012, oral

Adaptive Learning Codebook for Action Recognition

Yu Kong, Xiaoqin Zhang, Weiming Hu, Yunde Jia
Journal Paper Pattern Recognition Letters. 32(8):1178-1186, 2011

Recognizing Human Interaction by Multiple Features

Zhen Dong, Yu Kong, Cuiwei Liu, Yunde Jia
Conference Paper Asian Conference on Pattern Recognition, pp. 77-81, 2011

Learning Human Action with an Adaptive Codebook

Yu Kong, Xiaoqin Zhang, Weiming Hu, Yunde Jia
Conference Paper International Conference on Virtual Systems and Multimedia, pp. 13-20, 2010

Compact Visual Codebook for Action Recognition

Qingdi Wei, Xiaoqin Zhang, Yu Kong, Weiming Hu, and Haibin Ling
Conference Paper International Conference on Image Processing (ICIP), pp. 3805-3808, 2010

A Swarm Intelligence Based Searching Strategy for Articulated 3D Human Body Tracking

Xiaoqin Zhang, Weiming Hu, Xiangyang Wang, Yu Kong, Nianhua Xie, Hanzi Wang
Conference Paper IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 45-50, 2010

Learning Group Activity in Soccer Videos from Local Motion

Yu Kong, Weiming Hu, Xiaoqin Zhang, Hanzi Wang, Yunde Jia
Conference Paper Asian Conference on Computer Vision (ACCV), pp. 103-112, 2009, oral

Group Action Recognition Using Space-Time Interest Points

Qingdi Wei, Xiaoqin Zhang, Yu Kong, Weiming Hu, and Haibin Ling
Conference Paper International Symposium on Visual Computing, pp. 757-766, 2009

Group Action Recognition in Soccer Videos

Yu Kong, Xiaoqin Zhang, Qingdi Wei, Weiming Hu, Yunde Jia
Conference Paper International Conference on Pattern Recognition (ICPR), pp. 1-4, 2008

Current Teaching

  • 2019 Spring (Present)

    Advanced Computer Vision [Syllabus]

    Computer vision is an interdisciplinary field that deals with how computers can gain high-level understanding from digital images or videos. It has enabled many successful practical applications, including autonomous driving, medical image analysis, human-computer interaction, and visual surveillance. This course covers popular computer vision topics, including large-scale image classification, art & computer vision, object detection, motion in videos, visual tracking, video understanding, and action recognition. Fundamental mathematical tools of deep neural networks will also be covered, including convolutional neural networks, recurrent neural networks, generative adversarial networks, and auto-encoders. Hands-on exercises and projects using popular deep learning toolboxes will be provided.

  • 2018 Spring (Past)

    Data Visualization [Syllabus]

    Introduction to relevant topics and concepts in visualization, including computer graphics, visual data representation, physical and human vision models, numerical representation of knowledge and concept, animation techniques, pattern analysis, and computational methods. Tools and techniques for practical visualization. Elements of related fields including computer graphics, human perception, computer vision, imaging science, multimedia, human‐computer interaction, computational science, and information theory. Covers examples from a variety of scientific, medical, interactive multimedia, and artistic applications. Hands‐on exercises and projects will also be provided.

Data

BIT-Interaction Dataset

The BIT-Interaction dataset contains 400 videos of human interactions distributed across 8 interaction classes, including handshake, kick, etc.


Citation:

@inproceedings{KongECCV2012,
  author    = {Yu Kong and Yunde Jia and Yun Fu},
  title     = {Learning Human Interaction by Interactive Phrases},
  booktitle = {European Conference on Computer Vision},
  volume    = {7572},
  pages     = {300-313},
  year      = {2012}
}