
Deep Learning, often associated with deep artificial neural networks, is a subset of machine learning that gained traction during the recent Artificial Intelligence hype cycle.
At A*STAR I2R, Deep Learning began with DL 1.0, which is based on training models that require large amounts of labelled data, and hence has limitations when generating useful information from smaller datasets.

Entering DL 2.0, we are looking at adapting Deep Learning into applied AI research that is critical to providing solutions for domains such as healthcare, advanced manufacturing and education. Our team works towards developing enhanced training infrastructure, next-generation algorithms, next-generation hardware and new, untapped enterprise applications.


Unsupervised Deep Learning

Semi-supervised/Unsupervised Deep Learning: Towards 10X fewer labeled samples

Modern deep learning models typically require large amounts of high-quality labelled data to obtain good predictive performance. Collecting, and more importantly, labelling such large datasets is time-consuming and expensive. Moreover, labelling the data may require specialized expert knowledge (e.g. in medicine and specialized industry applications), further increasing the cost of data acquisition. This data collection and annotation burden significantly hampers the widespread applicability and adoption of deep learning.

Instead of training networks on labelled data alone, we aim to leverage large amounts of more readily available unlabelled data to train models. State-of-the-art unsupervised approaches (e.g., Variational Auto-Encoders and Generative Adversarial Networks) are only just starting to bear fruit, and research into methods that fuse unlabelled and labelled data is still in its early stages.

We aim to develop deep learning pipelines that can utilize vast quantities of cheap, unlabelled, and low-quality data combined with a small amount of high-quality, labelled, but expensive data to obtain performance comparable to when 10x the labelled data is used. This will enable models that can learn at scale without the associated labelling cost. We also aim to develop improved unsupervised deep learning models that are able to accurately model the data distribution, thus enabling accurate and scalable anomaly detection.
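As an illustration, one common family of semi-supervised methods is self-training: a model trained on the small labelled set assigns pseudo-labels to its confident predictions on unlabelled data, and is then retrained on the enlarged set. The sketch below is purely illustrative, with toy 2-D data and a nearest-centroid classifier standing in for a deep network; all values and thresholds are made up.

```python
import numpy as np

def fit_centroids(X, y):
    """Class centroids serve as a stand-in for a trained classifier."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_with_confidence(centroids, X):
    """Return predicted class and a confidence score per sample."""
    classes = sorted(centroids)
    dists = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in classes])
    pred = np.array(classes)[dists.argmin(axis=0)]
    # Confidence: margin between the two nearest centroids.
    sorted_d = np.sort(dists, axis=0)
    conf = sorted_d[1] - sorted_d[0]
    return pred, conf

# Small labelled set, larger unlabelled pool (toy 2-D data).
X_lab = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 4.9]])
y_lab = np.array([0, 0, 1, 1])
X_unl = np.array([[0.1, 0.3], [4.8, 5.1], [2.5, 2.5], [5.3, 5.2]])

centroids = fit_centroids(X_lab, y_lab)
pred, conf = predict_with_confidence(centroids, X_unl)

# Keep only confident pseudo-labels, then retrain on the enlarged set.
keep = conf > 1.0
X_aug = np.vstack([X_lab, X_unl[keep]])
y_aug = np.concatenate([y_lab, pred[keep]])
centroids = fit_centroids(X_aug, y_aug)
```

The ambiguous point midway between the two clusters is excluded by the confidence threshold, which is the key to self-training: only pseudo-labels the model is sure about are allowed to grow the training set.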

Incorporate Knowledge Graphs

Deep learning algorithms are trained to learn progressively from data, and deducing the right results requires vast amounts of it. There are numerous publicly available sources of external knowledge, including knowledge bases and graphs (e.g., Microsoft Concept Graph and MIT ConceptNet), as well as knowledge embedded in metadata (e.g., spatiotemporal signals).

Current deep learning systems do not leverage the vast amount of knowledge available in the real world. Such external knowledge provides additional semantic information that is crucial to problems such as object detection in images and videos. The key intuition stems from semantic consistency: object concepts that are semantically close to each other in the real world are more likely to appear together in images or videos, since images and videos are reflections of the real world.

A*STAR I2R’s Deep Learning team is working on two major problems: 

  1. Knowledge-aware deep learning for object detection in images or videos. 
    By integrating external knowledge into deep learning, we can significantly increase detection accuracy, since existing approaches ignore such useful knowledge. 
  2. Large-scale deep learning-based knowledge harvesting from images or videos. 
    While today's knowledge bases are primarily based on text, we are looking at techniques for constructing and refining knowledge bases with multimedia content. The two problems are complementary as they can mutually reinforce each other’s performance. Potentially, knowledge-aware deep learning techniques can also be extended to many other AI tasks, including NLP and speech.   
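One simple way to exploit semantic consistency is to re-score a detector's outputs using relatedness between co-detected concepts: a low-confidence detection that is strongly related to other detected objects gets a boost. The sketch below is a toy illustration, not the team's actual method; the relatedness scores (imagined as if derived from a resource like MIT ConceptNet) and the blending rule are assumptions.

```python
# Toy "knowledge graph": semantic relatedness between object concepts.
# Scores here are made up for illustration.
relatedness = {
    ("dog", "leash"): 0.9,
    ("dog", "surfboard"): 0.1,
    ("person", "leash"): 0.4,
    ("person", "surfboard"): 0.7,
}

def related(a, b):
    return relatedness.get((a, b), relatedness.get((b, a), 0.0))

def rescore(detections, alpha=0.3):
    """Blend detector confidence with semantic consistency: concepts
    related to other detected concepts receive a boost."""
    rescored = {}
    for label, score in detections.items():
        others = [o for o in detections if o != label]
        consistency = max((related(label, o) for o in others), default=0.0)
        rescored[label] = (1 - alpha) * score + alpha * consistency
    return rescored

# Detector output: "leash" and "surfboard" are equally uncertain,
# but "leash" is semantically consistent with the detected "dog".
dets = {"dog": 0.8, "leash": 0.5, "surfboard": 0.5}
out = rescore(dets)
print(out)
```

After re-scoring, "leash" ranks above "surfboard" despite identical raw detector scores, which is exactly the effect semantic consistency is meant to produce.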

From Object Detection (ImageNet) to Event Detection (YouTube8M)

Deemed the catalyst of the AI boom, the ImageNet dataset provided the research community with a large dataset of human-annotated images for developing and testing algorithms. Video datasets, however, still lag behind in such classification. Only recently did Google release YouTube8M, providing the academic community with 8 million annotated videos: an unprecedented scale and diversity of labelled data.
Even with YouTube8M, techniques for assigning tags to multi-minute videos are still very primitive. Datasets like UCF101, HMDB and MIT Moments contain clips only a few seconds long, so tagging them often boils down to an exercise in detecting motion-flow patterns.
In the YouTube8M CVPR 2017 challenge, A*STAR I2R’s Deep Learning team beat out 655 submissions with a multimodal deep learning framework that incorporated vision, audio and text. The team’s knowledge graph paper was the first to incorporate MIT ConceptNet into an object detection framework, and showed significant performance improvement on MS COCO. Recently, they also showed how an end-to-end deep learning system with knowledge graphs improves video classification performance on YouTube8M.

Associating a meaningful tag with five minutes of video data at high accuracy requires at least the following:

  • Understanding of multiple actions/objects/events on a timeline, which would potentially require combining current deep learning techniques with symbolic AI techniques of the past 
  • Incorporation of common sense knowledge 
  • Understanding of causality 
  • Some level of reasoning 
  • Beyond, of course, the obvious slew of current deep learning techniques for finding key spatial/temporal patterns, understanding correlations across modalities, attention mechanisms, etc.

A*STAR I2R’s Deep Learning team aims to address the above with state-of-the-art multimodal approaches to reduce the semantic gap in going from classifying single still images to classifying multiple minutes of video. Research on this topic will be critical to improving the performance of current deep learning systems, and will also help make them less brittle.
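A minimal form of such a multimodal approach is late fusion: each modality (vision, audio, text) produces its own class scores, and the resulting probabilities are combined with fixed weights. The sketch below uses hypothetical scores and weights purely to illustrate the mechanics; real systems learn the fusion rather than hand-picking it.

```python
import numpy as np

def softmax(z):
    """Convert raw scores to a probability distribution."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def late_fusion(modality_logits, weights):
    """Weighted average of per-modality class probabilities:
    one simple form of multimodal fusion."""
    probs = [w * softmax(np.asarray(l)) for l, w in zip(modality_logits, weights)]
    return np.sum(probs, axis=0) / sum(weights)

# Hypothetical per-modality scores for three video-level tags.
vision = [2.0, 0.5, 0.1]   # strong visual evidence for tag 0
audio  = [0.2, 1.5, 0.3]   # audio points to tag 1
text   = [1.0, 1.0, 0.2]   # title/metadata is ambiguous

fused = late_fusion([vision, audio, text], weights=[0.5, 0.3, 0.2])
print(fused.argmax())
```

Here the visually dominant tag wins, but a stronger audio signal (or different weights) could flip the decision, which is why learning the cross-modal combination matters.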

Deep Learning: Speech Processing and Language Understanding

The adoption of Deep Learning in Speech and Natural Language Processing has generated capabilities such as speech recognition and machine translation. In Singapore’s context, this is particularly challenging given the localised language and context. Numerous challenges still need to be addressed, ranging from the requirement for specialized linguistic resources and large amounts of annotated data to the insufficient theoretical grounding of deep learning.

A*STAR I2R’s Deep Learning team will focus on three main areas of concern in advancing deep learning for Speech and Natural Language Processing.

  1. Minimizing required linguistic knowledge in model development

    Specialized linguistic expertise is required for annotating data in spoken language processing and natural language understanding. Annotation tasks include specifying pronunciation variations, transcribing words, identifying topics, summarizing content, and inferring intent. Such linguistic expertise might not be readily available, especially for languages that are under-studied such as Singapore Tamil, which differs from its cousin spoken in India. 

    A deep learning framework could potentially bypass the need for certain specialized linguistic knowledge in constructing pronunciation lexicons. For example, Bayesian probabilistic modeling would require much more explicit linguistic knowledge to construct a graphical model characterizing the articulatory gestures or the pronunciation evolution of historically related languages (e.g., Mandarin and Vietnamese both have roots in Ancient Chinese), whereas a deep learning model could infer such knowledge representation implicitly in the neural network.

  2. Generalizing well with limited annotated data 

    For deep learning to work well, large amounts of human-annotated data are required. Singapore’s unique linguistic landscape and multicultural heritage exemplify this key challenge. For example, Singapore English differs from other varieties of English in its melody (prosodic influence from tonal languages like Mandarin), pronunciation and grammar (a hodgepodge of influences from British English, American English, and Chinese languages), and specific local terminology (e.g., “damn shiok” originates from English and Punjabi, “Toa Payoh” from Hokkien and Malay).

    Constructing parsimonious neural models that require fewer parameters can help slim down the massive amounts of data needed. Transfer learning from a well-resourced language (e.g., American English) to a less-resourced language (e.g., Tagalog) has enjoyed some success in speech processing, but has been less explored in natural language processing. 
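A rough sketch of how such cross-lingual transfer might look, under the simplifying assumption of a two-layer model whose feature encoder is reused and whose output layer is re-initialised for the target language's label set. All names, shapes and sizes below are illustrative, not an actual system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights of a model trained on a well-resourced language.
source_model = {
    "encoder.w": rng.normal(size=(16, 8)),      # shared feature extractor
    "classifier.w": rng.normal(size=(8, 40)),   # source-language label set
}

def transfer(source, n_target_classes):
    """Reuse the encoder; re-initialise only the output layer for the
    target language's label set, then fine-tune on its (small) data."""
    target = {"encoder.w": source["encoder.w"].copy()}
    hidden = source["classifier.w"].shape[0]
    target["classifier.w"] = rng.normal(scale=0.01,
                                        size=(hidden, n_target_classes))
    return target

# Target language has a smaller, different label inventory.
target_model = transfer(source_model, n_target_classes=30)
```

The intuition is that the encoder has already learned generic feature extraction from abundant source-language data, so only the thin task-specific head must be learned from the scarce target-language annotations.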

  3. Exploration of different deep learning approaches for speech and language processing

    When designing neural network architectures for speech and language processing, certain linguistic insights would be injected, and the learned models would then be evaluated for whether they correspond to known linguistic principles. Leveraging and combining other approaches, such as Bayesian probabilistic modeling and knowledge-based frameworks, could also help alleviate the inefficiencies of deep learning. For example, generative adversarial networks, which pit a generative and a discriminative network against each other, have shown promising results in computer vision but still need further investigation before they can be applied to speech and language processing.

Deep Learning on Embedded Devices

Towards 100x smaller models

The physical restrictions of embedded systems limit their processors, memory and storage capacity, resulting in less computing power. Current deep learning models, with their tens of millions of neurons, are infeasible to deploy on mobile and embedded hardware.

Apart from the need for a deep learning accelerator hardware solution, the ability to obtain models that are 100x smaller, with little loss in performance, is critical. It is also crucial for emerging neuromorphic platforms, where the number of spiking neurons is orders of magnitude smaller than in current state-of-the-art deep learning models.

The dominant approach for classifiers and detectors is Convolutional Neural Network (CNN) based descriptors. However, one major drawback of CNN-based descriptors is that uncompressed deep neural network models require hundreds of megabytes of storage, making them inconvenient to deploy on mobile, embedded or neuromorphic hardware.

We are studying the limits of neural network compression to achieve models that are two orders of magnitude smaller in size.
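Two standard building blocks of such compression are magnitude pruning and low-bit quantization. The back-of-the-envelope sketch below, using random weights and a naive sparse storage estimate, illustrates how combining the two can shrink one layer's storage by roughly an order of magnitude; the numbers are illustrative, not measurements of any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(256, 256)).astype(np.float32)  # one dense layer

# 1. Magnitude pruning: drop the 90% smallest-magnitude weights.
threshold = np.quantile(np.abs(weights), 0.9)
pruned = np.where(np.abs(weights) >= threshold, weights, 0.0).astype(np.float32)

# 2. 8-bit linear quantization of the surviving weights.
scale = np.abs(pruned).max() / 127.0
quantized = np.round(pruned / scale).astype(np.int8)

dense_bytes = weights.size * 4                      # float32 storage
# Naive sparse storage: 1 byte per int8 value + 4 bytes per index.
sparse_bytes = int((quantized != 0).sum()) * (1 + 4)
ratio = dense_bytes / sparse_bytes
print(round(ratio, 1))
```

Reaching 100x in practice requires more than this, e.g. retraining to recover accuracy after pruning, weight sharing, and entropy coding of the surviving values.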

Online Deep Learning

In the past decade, there have been tremendous advancements in the development of deep learning architectures and their learning algorithms. They have also been successfully applied across several domains, including, but not limited to, sensor data analytics, healthcare, image analysis, text analysis and speech analysis. Despite this tremendous progress, all these algorithms require the complete data a priori to train the models. The models are trained under the assumption that the process generating the stream of data is stationary. However, most real-world problems are characterized by data arising from non-stationary environments, with the characteristics and dimensionality of the data evolving over time.


Furthermore, determining the optimal model complexity for large scale, streaming data sets is a challenge. Considering the cumbersome process of training a deep neural network, it is inefficient to retrain a model for applications defined by large data sets. Thus, it becomes imperative to develop efficient online deep learning algorithms that can help determine the optimal model complexity for large scale, streaming data sets, while also adapting themselves to the changing demands of non-stationary data.

A*STAR I2R’s deep learning team is working on developing online learning algorithms for deep architectures to address the demands of non-stationary data. These algorithms enable deep neural network architectures that evolve with streaming data, resulting in an optimal architecture.
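The core of any online learning algorithm is an update rule that touches each streaming sample once and never revisits a stored dataset. The toy sketch below, a linear model with least-mean-squares updates on synthetic drifting data, shows such a model tracking a mid-stream change in the data-generating process; it is a minimal illustration of the principle, not the deep architectures described above.

```python
import numpy as np

rng = np.random.default_rng(1)
w = np.zeros(2)  # online linear model: y ≈ w @ x

def sgd_step(w, x, y, lr=0.1):
    """Update from a single streaming sample; no dataset is stored
    and the model is never retrained from scratch."""
    err = w @ x - y
    return w - lr * err * x

# Non-stationary stream: the true relationship changes mid-stream.
true_w = np.array([1.0, -2.0])
for t in range(2000):
    if t == 1000:
        true_w = np.array([-1.0, 3.0])  # concept drift
    x = rng.normal(size=2)
    y = true_w @ x
    w = sgd_step(w, x, y)

print(np.round(w, 2))  # the model has tracked the post-drift parameters
```

A batch-trained model frozen at t = 1000 would keep predicting with the stale parameters; the online learner forgets them automatically because every incoming sample pulls the weights toward the current regime.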