Automatic Speech Recognition System
ASR systems map an input speech signal to the corresponding text. The architecture of an ASR system typically includes a Feature Extractor, an Acoustic Model (Encoder), and a Decoder; in many cases, a Language Model is added to improve prediction accuracy. After features are extracted by the Feature Extractor, they are encoded by the Encoder and then decoded by the Decoder to produce the text. Click here for ASR demonstration.
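As a rough illustration of this pipeline (not a reference implementation; the class name, layer sizes, and token inventory are all invented for the sketch), the PyTorch model below wires the three components together: a mel-spectrogram feature extractor, a recurrent encoder, and a linear decoder head producing per-frame token probabilities suitable for CTC-style decoding.

```python
import torch
import torch.nn as nn
import torchaudio

class TinyASR(nn.Module):
    """Minimal sketch of the classic pipeline:
    feature extractor -> acoustic encoder -> decoder head."""
    def __init__(self, num_tokens=29):  # e.g. 26 letters + space + apostrophe + CTC blank
        super().__init__()
        self.features = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)
        self.encoder = nn.GRU(80, 256, num_layers=2, batch_first=True, bidirectional=True)
        self.decoder = nn.Linear(512, num_tokens)  # per-frame token logits

    def forward(self, waveform):                         # waveform: (batch, samples)
        feats = self.features(waveform).transpose(1, 2)  # (batch, frames, 80)
        enc, _ = self.encoder(feats)                     # (batch, frames, 512)
        return self.decoder(enc).log_softmax(-1)         # feed to CTC loss / greedy decode
```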
Voice cloning is a task in text-to-speech (TTS) technology that allows for replicating a person’s voice to speak a given text. Voice cloning can be applied in various fields, such as creating voice agents for banks and commerce. It can restore the voices of deceased individuals or those with speech impairments. Additionally, it can be used to generate voices for characters in video games, enhancing the gaming experience with more personalized and realistic character interactions.
Voice Cloning System
Although there are several ongoing studies in this field, challenges remain. The generated voice often lacks the naturalness of human speech, and it may not capture the unique characteristics of the reference voice accurately. These limitations highlight the need for further advancements to achieve more realistic and distinctive voice cloning results.
DGSpeech Architecture
This project introduces DGSpeech, which is built on the FastSpeech2 architecture. DGSpeech incorporates:
MixStyle Layer Normalization: perturbs style information by mixing and shuffling style embeddings, improving the robustness and generalization of TTS models across varying styles and domains (see the sketch after this list).
Flow-based Postnet: refines the outputs of the Transformer decoder to produce more precise mel-spectrograms, improving the detail and quality of the generated speech, particularly in expressive contexts.
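A rough PyTorch sketch of the MixStyle idea follows; the class name, the Beta(0.2, 0.2) mixing coefficient, and the conditional scale/shift layout are our assumptions, not DGSpeech's exact implementation. Hidden states are normalized, then scaled and shifted by projections of a style embedding that, during training, is mixed with a randomly shuffled batch-mate.

```python
import torch
import torch.nn as nn

class MixStyleLayerNorm(nn.Module):
    """Layer norm conditioned on a style embedding; during training the
    style is mixed with a shuffled batch-mate to perturb style information."""
    def __init__(self, hidden_dim, style_dim, alpha=0.2):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.scale = nn.Linear(style_dim, hidden_dim)
        self.shift = nn.Linear(style_dim, hidden_dim)
        self.beta = torch.distributions.Beta(alpha, alpha)

    def forward(self, x, style):
        # x: (batch, time, hidden_dim); style: (batch, style_dim)
        if self.training:
            lam = self.beta.sample((style.size(0), 1)).to(style.device)
            perm = torch.randperm(style.size(0), device=style.device)
            style = lam * style + (1 - lam) * style[perm]  # mix and shuffle styles
        x = self.norm(x)
        return x * (1 + self.scale(style)).unsqueeze(1) + self.shift(style).unsqueeze(1)
```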
Click here for Voice Cloning demonstration.
Currently, the demand for direct interaction between businesses and customers is increasing. As science and technology constantly evolve, chatbots have been adopted to meet this need for companies and businesses. Chatbots serve many topics, depending on the scenario built for the bot. For a restaurant booking service, the chatbot is designed to interact with customers as naturally as possible while still resolving their needs and collecting accurate customer information.
There are many frameworks for building chatbots; one of them is the Rasa framework. Rasa models the expected conversational flow, which makes the interaction feel more human. The framework consists of two parts: Rasa NLU, which determines the user's intent, and Rasa Core, which uses that intent to decide the next action the bot should perform according to the scripted stories.
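As an example of this NLU-then-Core loop, the snippet below queries a locally running Rasa server through its standard REST channel. It assumes the booking bot has been trained and the server started with the REST channel enabled in credentials.yml; the URL, sender id, and message are placeholders.

```python
import requests

# Default endpoint of Rasa's built-in REST channel on a local server.
RASA_URL = "http://localhost:5005/webhooks/rest/webhook"

def ask_bot(sender_id: str, message: str) -> list[str]:
    """Send one user message; Rasa NLU extracts the intent, Rasa Core picks
    the next action, and the bot's utterances come back in the response."""
    resp = requests.post(RASA_URL, json={"sender": sender_id, "message": message})
    resp.raise_for_status()
    return [m.get("text", "") for m in resp.json()]

if __name__ == "__main__":
    for reply in ask_bot("demo-user", "I'd like to book a table for two at 7pm"):
        print("Bot:", reply)
```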
Independent intent analysis by Rasa NLU and action selection by Rasa Core
Testing the Rasa chatbot on a Facebook fan page
Emotion recognition is a large area of research built on two subjects: emotional psychology and artificial intelligence. Human emotions can be expressed through speech or through nonverbal cues such as facial variations and tone of voice, and they can be captured by sensors. In 1967, Mehrabian estimated that 55% of emotional meaning is conveyed by the face, 38% by tone of voice, and only 7% by the words themselves. That is why researchers are very interested in this field.
Facial emotion recognition has many applications in a variety of fields:
Traditionally, facial emotion recognition has relied heavily on static images. However, in dynamic contexts such as videos, understanding temporal dynamics becomes crucial for accurate emotion interpretation. One approach to addressing this involves focusing on the most relevant frames within a video, rather than uniformly sampling frames. This is achieved by calculating a distribution for the importance of frames based on an attention mechanism, facilitating more efficient sampling and analysis.
Overall, the model is divided into two stages:
Stage 1: The model generates attention weights that measure the correlation between frames. A new distribution is then computed from this attention matrix, allowing more suitable sampling of frames (a minimal sketch of this sampling step follows the list).
Stage 2: The model takes the newly sampled frames as input and produces the primary prediction results.
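A minimal PyTorch sketch of Stage 1 might look as follows; the module name, the head count, and the use of mean attention weights as frame importance are our assumptions, not the exact published design. Frame features are scored by self-attention, the scores are treated as a categorical distribution, and the most informative frames are drawn from it.

```python
import torch
import torch.nn as nn

class AttentiveFrameSampler(nn.Module):
    """Stage-1 sketch: score frames with self-attention, turn the scores
    into a sampling distribution, and draw the most informative frames."""
    def __init__(self, feat_dim=512, num_samples=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.num_samples = num_samples

    def forward(self, frames):
        # frames: (batch, num_frames, feat_dim) pre-extracted frame features
        _, weights = self.attn(frames, frames, frames)   # (B, T, T) attention matrix
        importance = weights.mean(dim=1)                 # how much each frame is attended to
        dist = torch.distributions.Categorical(probs=importance)
        idx = dist.sample((self.num_samples,)).T         # (B, num_samples)
        idx, _ = idx.sort(dim=1)                         # keep temporal order
        return torch.gather(frames, 1, idx.unsqueeze(-1).expand(-1, -1, frames.size(-1)))
```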
Usage:
This multi-stage approach improves both the accuracy and the efficiency of emotion recognition systems, making them valuable tools across many fields and applications.
ECG Signal
Early detection and prediction of cardiac anomalies play an important role in the diagnosis and treatment of cardiovascular diseases. In medicine, electrocardiography provides valuable information for doctors, letting them determine precisely what is happening with the heart's activity. Nevertheless, electrocardiogram (ECG) classification is a non-trivial challenge due to the particular characteristics of these data as well as the limited reliability of manual data collection. This motivates IASLab's members to study deep learning methods for ECG signal classification, handling data collected from intelligent IoT devices.
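As an illustration of the deep learning approach (the layer sizes and the five-class setup, e.g. AAMI beat classes, are assumptions rather than the lab's exact model), a compact 1D CNN for beat-level ECG classification could look like this:

```python
import torch
import torch.nn as nn

class ECGClassifier(nn.Module):
    """Minimal 1D-CNN sketch for beat-level ECG classification."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.AdaptiveAvgPool1d(1),   # collapse time axis to one vector per beat
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):              # x: (batch, 1, signal_length)
        return self.classifier(self.features(x).squeeze(-1))
```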
In today’s digital age, images have become a universal language of the internet. Image Captioning, a fusion of computer vision and natural language processing, gives machines the ability to perceive and describe our visual world in ways that were once the exclusive domain of human understanding.
Image Captioning
The mechanics of image captioning involve a two-fold process: Visual Recognition and Natural Language Generation.
Image Captioning - Basic Model
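The sketch below shows one minimal version of such a basic model in PyTorch, in the spirit of Show-and-Tell: a CNN performs the visual recognition and an LSTM handles the natural language generation, token by token. All sizes and the teacher-forcing setup are illustrative.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class BasicCaptioner(nn.Module):
    """Encoder-decoder captioning sketch: CNN encoder + LSTM decoder."""
    def __init__(self, vocab_size=10000, embed_dim=256):
        super().__init__()
        backbone = models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop fc layer
        self.project = nn.Linear(512, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, 224, 224); captions: (B, L) token ids (teacher forcing)
        img = self.project(self.encoder(images).flatten(1)).unsqueeze(1)  # (B, 1, D)
        seq = torch.cat([img, self.embed(captions)], dim=1)  # image acts as first "word"
        hidden, _ = self.lstm(seq)
        return self.out(hidden)  # next-token logits at each step
```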
With the explosion of social media platforms such as YouTube, Facebook, and TikTok, along with the vast amount of short-duration videos being created and shared every day, the demand for models that process and automate video-related workflows, particularly in the field of video understanding, is growing rapidly. Video Understanding comprises various core subtasks such as action recognition and classification, video topic classification, object detection and tracking, video caption generation, and more. Among these tasks, dense video captioning has become one of the most prominent and highly regarded topics due to the challenges it presents and its real-world applications.
How does dense video captioning work?
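Most systems follow a localize-then-describe recipe: first propose temporal segments that likely contain events, then generate a caption for each segment. The sketch below illustrates that two-step structure only; every module size, the proposal parameterization, and the greedy decoding loop are simplifications for illustration, not our actual model.

```python
import torch
import torch.nn as nn

class DenseVideoCaptioner(nn.Module):
    """Localize-then-describe sketch: propose event segments, caption each.
    Real systems add proposal ranking, NMS, and beam search."""
    def __init__(self, feat_dim=512, vocab_size=10000, max_len=20):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.proposal_head = nn.Linear(feat_dim, 3)   # (center offset, length, confidence)
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.decoder = nn.GRUCell(feat_dim, feat_dim)
        self.out = nn.Linear(feat_dim, vocab_size)
        self.max_len = max_len

    def forward(self, feats):                          # feats: (B, T, feat_dim)
        enc, _ = self.encoder(feats)
        proposals = self.proposal_head(enc)            # one event hypothesis per step
        best = proposals[..., 2].argmax(dim=1)         # most confident proposal per clip
        h = enc[torch.arange(enc.size(0)), best]       # event context vector
        token = torch.ones(enc.size(0), dtype=torch.long, device=feats.device)  # <bos>=1
        words = []
        for _ in range(self.max_len):                  # greedy caption decoding
            h = self.decoder(self.embed(token), h)
            token = self.out(h).argmax(dim=-1)
            words.append(token)
        return proposals, torch.stack(words, dim=1)    # segments + caption tokens
```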
A dense caption generator has applications in a variety of fields:
Our repository: https://github.com/Tien-Nam-Nguyen/Thesis
Our research team specializes in the exciting field of text-to-image generation. This technology has immense potential across various industries, including e-commerce, advertising, virtual reality, and creative content generation.
Datasets: MS-COCO, ANNA, Multi-modal CelebA-HQ, DeepFashion-MultiModal, …
Key Features and Capabilities:
Use Cases:
Vision is an essential aspect of human life, providing us with the ability to perceive and interact with the world around us. However, visual impairment is a prevalent global issue, affecting a significant proportion of the population. Living with visual impairment poses significant challenges for individuals when it comes to independently navigating their surroundings, both indoors and outdoors.
The objective of this project is to design and deploy an object detection device to assist individuals with visual impairments in their daily lives. This system applies computer vision techniques and image processing to help individuals recognize and locate objects in their surrounding environment. The core components of the system comprise an Intel depth camera, a processor such as a single-board computer/laptop, a microphone, and a headphone. These hardware components work in synergy to enable real-time object detection, obstacle avoidance, and virtual assistant functionalities.
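As a small illustration of how the hardware pieces cooperate (using the real pyrealsense2 and pyttsx3 APIs, but with the object detection stage omitted and the range threshold invented), a depth-based obstacle announcement might look like this:

```python
import pyrealsense2 as rs   # Intel RealSense SDK for the depth camera
import pyttsx3              # offline text-to-speech for audio feedback

def announce_nearest_obstacle(max_range_m=3.0):
    """Minimal sketch: read one depth frame and speak the distance to the
    point at the center of view. Detection/avoidance logic is omitted."""
    pipeline = rs.pipeline()
    cfg = rs.config()
    cfg.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
    pipeline.start(cfg)
    try:
        frames = pipeline.wait_for_frames()
        depth = frames.get_depth_frame()
        dist = depth.get_distance(320, 240)   # distance in meters at image center
        if 0 < dist < max_range_m:
            tts = pyttsx3.init()
            tts.say(f"Obstacle {dist:.1f} meters ahead")
            tts.runAndWait()
    finally:
        pipeline.stop()
```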
The system architecture and modes
Key Features and Capabilities:
Click here to see the video demonstrating the system.
Presentations are commonly used in business, education, and research because they can effectively summarize and clarify large amounts of information using visual aids. With the development of deep learning, we aim to create a model that can produce presentation slides on demand. This solution involves document summarization, image and text retrieval, and slide organization to ensure that important components are presented in a suitable format. Our system is designed to help researchers efficiently create presentations on their respective topics.
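Once the summarization and retrieval stages have produced (title, bullets) pairs, assembling the deck itself is straightforward. The sketch below uses the python-pptx library, with hypothetical content standing in for our pipeline's output:

```python
from pptx import Presentation

def build_deck(sections, path="slides.pptx"):
    """Assemble a deck from (title, bullet list) pairs."""
    prs = Presentation()
    layout = prs.slide_layouts[1]             # built-in "Title and Content" layout
    for title, bullets in sections:
        slide = prs.slides.add_slide(layout)
        slide.shapes.title.text = title
        body = slide.placeholders[1].text_frame
        body.text = bullets[0]                # first bullet fills the existing paragraph
        for line in bullets[1:]:
            body.add_paragraph().text = line
    prs.save(path)

build_deck([
    ("Motivation", ["Slides condense large documents", "Manual authoring is slow"]),
    ("Our Pipeline", ["Summarize the paper", "Retrieve figures and text", "Organize slides"]),
])
```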
Datasets: DOC2PPT, SciDuet, PS5K, …
Key Features and Capabilities:
Use Cases:
Lung Tumor Segmentation
Brain Tumor Segmentation
Nowadays, cancer is one of the most common causes of death. It is the growth of abnormal cells that can invade and destroy other cells and tissues, and it can occur in any organ of the body. However, cancer can often be treated if it is detected early. Therefore, IASLab has conducted research in this domain, focusing on detecting tumors of the lung and brain.
Tumor Segmentation is the process of separating the tumor from a medical image (2D or 3D). It provides useful information for diagnosis, clinical studies, and treatment planning. Using Deep Learning architectures, we obtained promising results on both lung and brain tumor segmentation, which strongly encourages us to expand our work in the Tumor Segmentation domain.
Medical image segmentation, like a digital scalpel, unveils hidden details in scans to improve diagnoses, treatment, and patient care. While recent deep learning methods using Transformers and U-Net are powerful, they can be computationally expensive. Our MCUnet offers a solution, achieving high accuracy with efficient convolution operations. We introduce three key innovations: CRFBNet for improved skip connections, Multi-Head Output for better predictions, and Consistency Guide Loss for robust multi-head training.
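While CRFBNet and the exact Consistency Guide Loss are specific to our work, the sketch below shows one plausible reading of the multi-head objective: each head is trained with a Dice term, plus a consistency term that penalizes disagreement between heads. This is our guess at the formulation for illustration, not the published loss.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for binary segmentation logits vs. masks."""
    pred = torch.sigmoid(pred)
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return 1 - (2 * inter + eps) / (union + eps)

def multi_head_loss(head_logits, target, lam=0.1):
    """Hypothetical multi-head objective: average per-head Dice plus a
    consistency term pulling head predictions toward their mean."""
    seg = torch.stack([dice_loss(h, target) for h in head_logits]).mean()
    probs = torch.stack([torch.sigmoid(h) for h in head_logits])  # (heads, B, 1, H, W)
    consistency = ((probs - probs.mean(dim=0, keepdim=True)) ** 2).mean()
    return seg + lam * consistency
```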
Overall Architecture of MCUnet
Qualitative result
We continually extend our research topics to meet real-life needs; if you find a topic interesting, feel free to reach out.