Computer Science - Multimedia Publications (50)


The 2-D discrete wavelet transform (DWT) lies at the heart of many image-processing algorithms. Until recently, several studies have compared the performance of this transform on various shared-memory parallel architectures, especially graphics processing units (GPUs). All these studies, however, considered only separable calculation schemes.

Dynamic adaptive streaming over HTTP (DASH) has recently been widely deployed on the Internet and adopted in industry. However, it does not impose any adaptation logic for selecting the quality of the video fragments requested by clients, and it performs poorly with respect to a number of desirable properties (efficiency, stability, and fairness) when multiple players compete for a bottleneck link. In this paper, we propose a throughput-friendly DASH (TFDASH) rate control scheme for video streaming with multiple clients over DASH that balances the trade-offs among efficiency, stability, and fairness.

The technique of hiding messages in digital data is called steganography. With improved sequencing techniques, increasing attempts have been made to hide messages in deoxyribonucleic acid (DNA) sequences, which have become a medium for steganography. Many detection schemes have been developed for conventional digital data, but these schemes are not applicable to DNA sequences because of DNA's complex internal structures.

This paper presents an empirical study on applying convolutional neural networks (CNNs) to detecting J-UNIWARD, one of the most secure JPEG steganographic methods. Experiments guiding the architectural design of the CNNs have been conducted on the JPEG-compressed BOSSBase, which contains 10,000 covers of size 512x512. Results have verified that both the pooling method and the depth of the CNNs are critical for performance.

Cross-modal audio-visual perception has been a long-standing topic in psychology and neurology, and various studies have discovered strong correlations in human perception of auditory and visual stimuli. Despite work in computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make the first attempt to solve this cross-modal generation problem by leveraging the power of deep generative adversarial training.

In 360-degree immersive video, a user views only a part of the entire raw video frame, based on her viewing direction. However, today's 360-degree video players always fetch the entire panoramic view regardless of the user's head movement, leading to significant bandwidth waste that could be avoided. In this paper, we propose a novel adaptive streaming scheme for 360-degree videos.

Efficient Nearest Neighbor (NN) search in high-dimensional spaces is a foundation of many multimedia retrieval systems. Because it offers low response times, Product Quantization (PQ) is a popular solution. PQ compresses high-dimensional vectors into short codes using several sub-quantizers, which enables in-RAM storage of large databases.
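
The compression step described in the entry above can be illustrated with a minimal sketch. It is not the method of the listed paper: the helper names (pq_encode, pq_decode) and the toy codebooks are assumptions made for illustration, and real systems learn the codebooks (typically with k-means) rather than hard-coding them.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vector: Vector, codebooks: List[List[Vector]]) -> List[int]:
    """Encode a vector as one centroid index per sub-quantizer.

    codebooks[j] is the (hypothetical, pre-trained) list of centroids
    for the j-th slice of the vector.
    """
    m = len(codebooks)
    if m == 0 or len(vector) % m != 0:
        raise ValueError("vector length must be divisible by the number of sub-quantizers")
    sub_dim = len(vector) // m
    code = []
    for j, codebook in enumerate(codebooks):
        sub_vector = vector[j * sub_dim:(j + 1) * sub_dim]
        # Pick the index of the centroid closest to this slice.
        nearest = min(
            range(len(codebook)),
            key=lambda k: squared_distance(sub_vector, codebook[k]),
        )
        code.append(nearest)
    return code


def pq_decode(code: List[int], codebooks: List[List[Vector]]) -> List[float]:
    """Rebuild an approximation of the vector by concatenating centroids."""
    approx: List[float] = []
    for index, codebook in zip(code, codebooks):
        approx.extend(codebook[index])
    return approx


# Toy codebooks (assumed values): 2 sub-quantizers, 2 centroids each.
TOY_CODEBOOKS = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 1.0), (1.0, 0.0)],
]

code = pq_encode([0.9, 1.1, 0.1, 0.8], TOY_CODEBOOKS)
approximation = pq_decode(code, TOY_CODEBOOKS)
```

Here a four-dimensional vector is stored as two small integers, which is the sense in which PQ produces short codes; the trade-off is that decoding returns only the nearest centroids, not the original values.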

Panoramic video is widely used to build virtual reality (VR) and is expected to be one of the next-generation killer apps. Transmitting panoramic VR videos is a challenging task because of two problems: 1) panoramic VR videos are typically much larger than normal videos, but they must be transmitted over the limited bandwidth of mobile networks; 2) high-resolution and fluent views must be provided to guarantee a superior user experience and avoid side effects such as dizziness and nausea.

Item features play an important role in movie recommender systems, where recommendations can be generated by using explicit or implicit preferences of users on traditional features (attributes) such as tag, genre, and cast. Typically, movie features are human-generated, either editorially (e.g. ...

Music emotion recognition (MER) is usually regarded as a multi-label tagging task, in which each segment of music can inspire specific emotion tags. Most researchers extract acoustic features from music and explore the relations between these features and their corresponding emotion tags. Because the same music segment can inspire inconsistent emotions in different listeners, finding the key acoustic features that actually affect emotions is a challenging task.

This paper addresses a challenging problem: how to generate multi-view cloth images from only a single-view input. To generate realistic-looking images with different views from the input, we propose a new image generation model termed VariGANs that combines the strengths of variational inference and Generative Adversarial Networks (GANs). Our proposed VariGANs model generates the target image in a coarse-to-fine manner instead of in a single pass, which suffers from severe artifacts.

As a prominent research topic in the multimedia area, cross-media retrieval aims to capture the complex correlations among multiple media types. Learning a better shared representation and distance metric for multimedia data is important for improving cross-media retrieval. Motivated by the strong ability of deep neural networks in learning feature representations and comparison functions, we propose the Unified Network for Cross-media Similarity Metric (UNCSM) to associate cross-media shared representation learning with distance metric learning in a unified framework.

In this work, we propose CLass-Enhanced Attentive Response (CLEAR): an approach to visualize and understand the decisions made by deep neural networks (DNNs) given a specific input. CLEAR facilitates the visualization of attentive regions and levels of interest of DNNs during the decision-making process. It also enables the visualization of the most dominant classes associated with these attentive regions of interest.

This notebook paper describes our system for the untrimmed classification task in the ActivityNet challenge 2016. We investigate multiple state-of-the-art approaches for action recognition in long, untrimmed videos. We exploit hand-crafted motion boundary histogram features as well as feature activations from deep networks such as VGG16, GoogLeNet, and C3D.

This paper introduces a blind watermarking scheme based on a convolutional neural network (CNN). We propose an iterative learning framework to secure the robustness of the watermarking. One loop of the learning process consists of the following three stages: watermark embedding, attack simulation, and weight update.

Motivated by emerging vision-based intelligent services, we consider the problem of rate adaptation for high quality and low delay visual information delivery over wireless networks using scalable video coding. Rate adaptation in this setting is inherently challenging due to the interplay between the variability of the wireless channels, the queuing at the network nodes and the frame-based decoding and playback of the video content at the receiver at very short time scales. To address the problem, we propose a low-complexity, model-based rate adaptation algorithm for scalable video streaming systems, building on a novel performance model based on stochastic network calculus.

Noise is often introduced into host audio by common signal processing operations, and it usually changes the high-frequency components of an audio signal. Embedding a watermark by adjusting low-frequency coefficients can therefore improve the robustness of a watermarking scheme. The moving average sequence is a low-frequency feature of an audio signal.

This paper proposes a synchronization code scheme based on moving averages for robust audio watermarking. Two suitable positive integers are chosen as window lengths, and each moving average sequence is computed by sliding the window one sample at a time. The synchronization bits are embedded at the crossing points of the two moving average sequences using quantization index modulation.
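
As a rough illustration of how crossing points of two moving average sequences can be located, the sketch below computes both sequences by sliding one sample at a time and reports where their difference changes sign. The function names and the alignment convention (both windows ending at the same sample) are assumptions for this sketch, not details taken from the listed paper, and the quantization index modulation step is not shown.

```python
from typing import List, Sequence


def moving_average(samples: Sequence[float], window: int) -> List[float]:
    """Moving average computed by sliding the window one sample at a time."""
    if window <= 0 or window > len(samples):
        raise ValueError("window must be between 1 and len(samples)")
    total = sum(samples[:window])
    averages = [total / window]
    for i in range(window, len(samples)):
        # Add the sample entering the window, drop the one leaving it.
        total += samples[i] - samples[i - window]
        averages.append(total / window)
    return averages


def crossing_points(samples: Sequence[float], short_window: int, long_window: int) -> List[int]:
    """Return sample indices at which the two moving averages cross.

    Assumed convention: both windows end at the reported sample index.
    """
    if short_window >= long_window:
        raise ValueError("short_window must be smaller than long_window")
    short_ma = moving_average(samples, short_window)
    long_ma = moving_average(samples, long_window)
    # Shift the short sequence so both averages end at the same sample.
    offset = long_window - short_window
    crossings = []
    previous_sign = 0
    for t in range(len(long_ma)):
        diff = short_ma[t + offset] - long_ma[t]
        sign = (diff > 0) - (diff < 0)
        if sign == 0:
            continue  # the sequences touch; wait for a definite sign
        if previous_sign != 0 and sign != previous_sign:
            crossings.append(t + long_window - 1)
        previous_sign = sign
    return crossings


# Toy signal (assumed values): a single rectangular pulse.
toy_signal = [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0]
toy_crossings = crossing_points(toy_signal, short_window=2, long_window=4)
```

Because the crossing positions depend only on averaged values, they can be located again by a detector that runs the same computation on the received audio, which is what makes them candidates for synchronization positions.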

Steganography involves hiding a secret message or image inside another cover image. Changes are made to the cover image without affecting its visual quality. In contrast to cryptography, steganography provides complete secrecy of the communication.

Multimedia retrieval plays an indispensable role in big data utilization. Past efforts mainly focused on single-media retrieval. However, users' requirements are highly flexible, such as retrieving the relevant audio clips with a single image query.

In this paper, we design a system to perform real-time beat tracking on an audio signal. We use the Onset Strength Signal (OSS) to detect the onsets and estimate the tempos. Then, we form the Cumulative Beat Strength Signal (CBSS) from the OSS and the estimated tempos.

Cross-modal retrieval has become a prominent research topic; it provides a flexible retrieval experience across multimedia data such as image, video, text, and audio. Recently, researchers have explored it with deep neural networks (DNNs), and most existing methods adopt a two-stage learning framework: the first learning stage generates a separate representation for each modality, and the second learning stage obtains a cross-modal common representation. Existing methods, however, have three limitations. In the first learning stage, they model only intra-modality correlation and ignore inter-modality correlation, which can provide complementary context for learning better separate representations. In the second learning stage, they adopt only shallow network structures with single-loss regularization, which ignore the intrinsic relevance of intra-modality and inter-modality correlation and therefore cannot effectively exploit and balance them to improve generalization performance. Finally, only original instances are considered, while the complementary fine-grained clues provided by their patches are ignored.

Copy-move forgery (CMF) is one of the simplest and most effective operations for creating forged images. Recently, techniques based on singular value decomposition (SVD) have been widely used to detect CMF. Some SVD-based approaches detect CMF acceptably, but some CMF detection approaches cannot produce satisfactory detection results.

General human action recognition requires understanding of various visual cues. In this paper, we propose a network architecture that computes and integrates the most important visual cues for action recognition: pose, motion, and the raw images. For the integration, we introduce a Markov chain model which adds cues successively.

Analyzing videos of human actions involves understanding the temporal relationships among video frames. CNNs are the current state-of-the-art methods for action recognition in videos. However, the CNN architectures currently being used have difficulty in capturing these relationships.

Speech enhancement (SE) aims to reduce noise in speech signals. Most SE techniques focus on addressing audio information only. In this work, inspired by multimodal learning, which utilizes data from different modalities, and the recent success of convolutional neural networks (CNNs) in SE, we propose an audio-visual deep CNN (AVDCNN) SE model, which incorporates audio and visual streams into a unified network model.

With the advance of image processing software and editing tools, a digital image can be manipulated easily. Detecting image manipulation is vital because an image can be used as legal evidence, in forensic investigation, and in many other fields. Image forgery detection techniques aim to verify the authenticity of digital images without any prior information about the original image.

In this paper, we present a transfer learning approach for music classification and regression tasks. We propose to use a pretrained convnet feature, a concatenated feature vector using activations of feature maps of multiple layers in a trained convolutional network. We show how this convnet feature can serve as a general-purpose music representation.

This letter addresses a principal weakness of the article published by Li et al. in 2014. The work appears to contain a serious conceptual mistake in the presentation of its theoretical approach.

Virtual reality (VR) video provides an immersive 360-degree viewing experience to a user wearing a head-mounted display: as the user rotates his head, correspondingly different fields-of-view (FoV) of the 360-degree video are rendered for observation. Transmitting the entire 360-degree video in high quality over bandwidth-constrained networks from server to client for real-time playback is challenging. In this paper we propose a multi-stream switching framework for VR video streaming: the server pre-encodes a set of VR video streams covering different view ranges that account for server-client round-trip time (RTT) delay, and during streaming the server transmits and switches streams according to a user's detected head rotation angle.

We report the design, implementation, and deployment of Lepton, a fault-tolerant system that losslessly compresses JPEG images to 77% of their original size on average. Lepton replaces the lowest layer of baseline JPEG compression (a Huffman code) with a parallelized arithmetic code, so that the exact bytes of the original JPEG file can be recovered quickly. Lepton matches the compression efficiency of the best prior work, while decoding more than nine times faster and in a streaming manner.

The application of mobile computing is currently altering patterns of our behavior to a greater degree than perhaps any other invention. In combination with the introduction of Bluetooth Low Energy (BLE) and similar technologies that enable context-awareness, designers today find themselves empowered to build experiences and facilitate interactions with our physical surroundings in ways not possible before. The aim of this thesis is to present a research project, currently underway at the University of Cambridge, which deals with the implementation of a BLE system in a museum environment.

The demand for global video has been burgeoning across industries. With the expansion and improvement of video streaming services, cloud-based video is evolving into a necessary feature of any successful business for reaching internal and external audiences. This paper considers video streaming over distributed systems where the video segments are encoded using an erasure code for better reliability, making it, to the best of our knowledge, the first work to consider video streaming over erasure-coded distributed cloud systems.

The paper presents a novel concept that analyzes and visualizes worldwide fashion trends. Our goal is to reveal cutting-edge fashion trends without displaying an ordinary fashion style. To achieve the fashion-based analysis, we created a new fashion culture database (FCDB), which consists of 76 million geo-tagged images from 16 cosmopolitan cities.

This paper addresses the problem of handling spatial misalignments due to camera-view changes or human-pose variations in person re-identification. We first introduce a boosting-based approach to learn a correspondence structure which indicates the patch-wise matching probabilities between images from a target camera pair. The learned correspondence structure can not only capture the spatial correspondence pattern between cameras but also handle the viewpoint or human-pose variation in individual images.

Dance Dance Revolution (DDR) is a popular rhythm-based video game. Players perform steps on a dance platform in synchronization with music as directed by on-screen step charts. While many step charts are available in standardized packs, users may grow tired of existing charts, or wish to dance to a song for which no chart exists.

The details of a noisy image may be restored by removing the noise with a suitable image de-noising method. In this research, a new image de-noising method based on applying a median filter (MF) in the wavelet domain is proposed and tested. Various types of wavelet transform filters are used in conjunction with the median filter in experiments with the proposed approach, in order to obtain better de-noising results and, consequently, to select the best-suited filter.

Teleradiology enables medical images to be transferred over computer networks for many purposes, including clinical interpretation, diagnosis, and archiving. In telemedicine, medical images can be manipulated while being transferred. In addition, medical information security requirements are specified by legislative rules, and the entities concerned must adhere to them.

Steganography is a collection of methods to hide secret information ("payload") within non-secret information ("container"). Its counterpart, steganalysis, is the practice of determining whether a message contains a hidden payload, and recovering it if possible. The presence of hidden payloads is typically detected by a binary classifier.

Studies show that refining real-world categories into semantic subcategories contributes to better image modeling and classification. Previous image sub-categorization work relying on labeled images and WordNet's hierarchy is not only labor-intensive, but also restricted to classifying images into NOUN subcategories. To tackle these problems, in this work, we exploit general corpus information to automatically select and subsequently classify web images into semantically rich (sub-)categories.

This paper reviews the causes of discomfort in viewing stereoscopic content. These include objective factors, such as misaligned images, as well as subjective factors, such as excessive disparity. Different approaches to the measurement of visual discomfort are also reviewed, in relation to the underlying physiological and psychophysical processes.

The emergence of smart Wi-Fi access points (APs), which are equipped with huge storage space, opens a new research area on how to utilize these resources at the network edge to improve users' quality of experience (QoE) (e.g., a short startup delay and smooth playback).

Motion compensation is a fundamental technology in video coding for removing the temporal redundancy between video frames. To further improve coding efficiency, sub-pel motion compensation has been utilized, which requires interpolation of fractional samples. Video coding standards usually adopt fixed interpolation filters that are derived from signal processing theory.

Streaming video is becoming the predominant type of traffic over the Internet, with reports forecasting that video content will account for 80% of all traffic by 2019. Despite significant investment in the Internet backbone, the main bottleneck remains at the edge servers (e.g. ...

Progress in Multiple Object Tracking (MOT) has historically been limited by the size of the available datasets. We present an efficient framework to annotate trajectories and use it to produce a MOT dataset of unprecedented size. In our novel path supervision, the annotator loosely follows the object with the cursor while watching the video, providing a path annotation for each object in the sequence.

In general, quality of experience (QoE) is subjective and context-dependent, so identifying and calculating the factors that affect QoE is a difficult task. Recently, a lot of effort has been devoted to estimating users' QoE in order to enhance video delivery. In the literature, most QoE-driven optimization schemes that realize trade-offs among different quality metrics have been addressed under the assumption of homogeneous populations. Nevertheless, people's perceptions of a given video quality may differ, which makes QoE optimization harder.

The use of peer-to-peer (P2P) networks for multimedia distribution has spread globally in recent years. This mass popularity is driven primarily by cost-effective distribution of content, which also gives rise to piracy. An end user (buyer/peer) of a P2P content distribution system does not want to reveal his/her identity during a transaction with a content owner (merchant), whereas the merchant does not want the buyer to further distribute the content illegally.

Music auto-tagging is often handled in a similar manner to image classification by regarding the 2D audio spectrogram as image data. However, music auto-tagging differs from image classification in that the tags are highly diverse and have different levels of abstraction. Considering this issue, we propose a convolutional neural network (CNN)-based architecture that embraces multi-level and multi-scaled features.

Recently, the end-to-end approach that learns hierarchical representations from raw data using deep convolutional neural networks has been successfully explored in the image, text and speech domains. This approach has been applied to musical signals as well, but has not yet been fully explored. To this end, we propose sample-level deep convolutional neural networks which learn representations from very small grains of waveforms (e. ...

Sports data analysis is becoming increasingly large-scale, diversified, and shared, but difficulty persists in rapidly accessing the most crucial information. Previous surveys have focused on the methodologies of sports video analysis from the spatiotemporal viewpoint instead of a content-based viewpoint, and few of these studies have considered semantics. This study develops a deeper interpretation of content-aware sports video analysis by examining the insight offered by research into the structure of content under different scenarios.