Xiaobin Hu

Xiaobin Hu (胡晓彬)

I am a research scientist at Tencent, working closely with Dr. Ying Tai and Dr. Chengjie Wang through the ‘腾讯技术大咖’ program. I received the Shanghai Overseas Talents Award (Baiyulan Young Talent Program) in 2023. Before that, I obtained my Ph.D. degree at the School of Computer Science and Engineering, Technische Universität München, Germany, under the joint supervision of Prof. Bjoern Menze and Prof. Kuangyu Shi. During my Ph.D. studies, I also worked as a long-term intern at the Chinese Academy of Sciences with Prof. Wenqi Ren, and completed a half-year internship with Prof. Dengping Fan and Hang Dai at IIAI and MBZUAI in the United Arab Emirates.

Email: xbhunanu [at] gmail.com

My research interests center on generative AI and its applications, with a particular focus on leveraging large-scale vision and language models:

ID-consistent image/video generation: AnyMaker (arXiv24), VTON-HandFit (arXiv24), DiffuMatting (ECCV24)
Image/video perception and understanding: JIF-MMFA (PR24), RLR (ECCV24), ManipVQA (IROS24), HitNet (AAAI23), M-RCNN (Friction22), DIS5K (ECCV22)
High-fidelity image/video restoration: MS-SVAN (TCSVT24), AutoGAN-Synthesizer (MICCAI22), MBL (CVPR 2021), PyNAS (ICCV 2021)
Human-centric image/video editing and generation: TAFB (MM24), RealTalk (arXiv24), Plug-and-Play 3D (TPAMI22), FSR-3D (ECCV 2020 Spotlight)

Email / Google Scholar / GitHub

News

07/2025 – Achieved runner-up in the 2025 ACM Multimedia Challenge (Identity-preserving video generation challenge)
07/2025 – 2 papers, DICE-Talk and StrandDesigner, accepted by ACM MM 2025
06/2025 – 2 papers, OracleFusion and UniCombine, accepted by ICCV 2025
02/2025 – 8 papers accepted by CVPR 2025
07/2024 – 1 paper accepted by MM 2024
07/2024 – 2 papers accepted by ECCV 2024
07/2024 – 1 paper accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
05/2024 – 1 paper accepted by Pattern Recognition (PR)

Selected Publications

	Sonic: Shifting Focus to Global Audio Perception in Portrait Animation Xiaozhong Ji, Xiaobin Hu*, Zhihong Xu, Junwei Zhu, Chuming Lin, Qingdong He, Jiangning Zhang, Donghao Luo, Yi Chen, Qin Lin, Qinglin Lu, Chengjie Wang CVPR, 2025* arXiv / video / bibtex / Code We propose a novel paradigm, dubbed Sonic, that shifts the focus to the exploration of global audio perception.
	VTON-HandFit: Virtual Try-on for Arbitrary Hand Pose Guided by Hand Priors Embedding Yujie Liang, Xiaobin Hu*, Boyuan Jiang, Donghao Luo, Xu Peng, Kai Wu, Chengming Xu, Wenhui Han, Taisong Jin, Chengjie Wang, Rongrong Ji CVPR, 2025* arXiv / video / bibtex / Code Although diffusion-based image virtual try-on has made considerable progress, emerging approaches still struggle to effectively address the issue of hand occlusion (i.e., clothing regions occluded by the hand part), leading to a notable degradation of the try-on performance. To tackle this issue widely existing in real-world scenarios, we propose VTON-HandFit, leveraging the power of hand priors to reconstruct the appearance and structure for hand occlusion cases.
	FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on Boyuan Jiang, Xiaobin Hu*, Donghao Luo, Qingdong He, Chengming Xu, Jinlong Peng, Jiangning Zhang, Chengjie Wang, Yunsheng Wu, Yanwei Fu arXiv, 2025* arXiv / video / bibtex / Code FitDiT is designed for high-fidelity virtual try-on using Diffusion Transformers (DiT).
	Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing Pengcheng Xu, Boyuan Jiang, Xiaobin Hu, Donghao Luo, Qingdong He, Jiangning Zhang, Chengjie Wang, Yunsheng Wu, Charles Ling, Boyu Wang CVPR, 2025 arXiv / video / bibtex / Code The prevailing diffusion inversion performs poorly in flow-based models, and the invariance control cannot reconcile diverse rigid and non-rigid editing tasks. To address these issues, we systematically analyze the inversion and invariance control based on the flow transformer.
	SVFR: A Unified Framework for Generalized Video Face Restoration Zhiyao Wang, Xu Chen, Chengming Xu, Junwei Zhu, Xiaobin Hu, Jiangning Zhang, Chengjie Wang, Yuqi Liu, Yiyi Zhou, Rongrong Ji CVPR, 2025 arXiv / video / bibtex / Code SVFR is a unified framework for face video restoration that supports tasks such as BFR, Colorization, Inpainting, and their combinations within one cohesive system.
	Highly Accurate Dichotomous Image Segmentation Xuebin Qin, Hang Dai, Xiaobin Hu, Deng-Ping Fan, Ling Shao, Luc Van Gool, ECCV, 2022 arXiv / video / bibtex / Code To build the highly accurate Dichotomous Image Segmentation dataset (DIS5K), we first manually collected over 12,000 images from Flickr based on our pre-designed keywords. Then, we obtained 5,470 images of 22 groups and 225 categories from the 12,000 images according to the structural complexities of the objects. Each image is then manually labeled with pixel-wise accuracy using GIMP. The labeled targets in DIS5K mainly focus on the “objects of the images defined by the pre-designed keywords (categories)” regardless of their characteristics (e.g., salient, common, camouflaged, meticulous, etc.). The average per-image labeling time is ∼30 minutes and some images cost up to 10 hours.
	DiffuMatting: Synthesizing Arbitrary Objects with Matting-level Annotation Xiaobin Hu, Xu Peng, Donghao Luo, Xiaozhong Ji, Jinlong Peng, Zhengkai Jiang, Jiangning Zhang, Taisong Jin, Chengjie Wang, Rongrong Ji European Conference on Computer Vision (ECCV), 2024 paper / video / bibtex / code Our DiffuMatting shows several potential applications (e.g., matting-data generator, community-friendly art design and controllable generation).
	Efficiently Exploiting Spatially Variant Knowledge for Video Deblurring Qian Xu, Xiaobin Hu*, Donghao Luo, Ying Tai, Chengjie Wang, Yuntao Qian (* equal contribution) IEEE Transactions on Circuits and Systems for Video Technology, 2024 paper / video / bibtex / code Video deblurring is a challenging task as the blur is often spatially variant. Existing methods mainly focus on building the spatial-temporal correspondence among the frames.
	3D Priors-Guided Diffusion for Blind Face Restoration Xiaobin Lu, Xiaobin Hu*, Jun Luo, zhuben, paulruan, Wenqi Ren (* equal contribution) ACM Multimedia (MM), 2024 paper / video / bibtex / code A customized multi-level feature extraction method is employed to exploit both structural and identity information of 3D facial images, which are then mapped into the noise estimation process.
	Joint-Individual Fusion Structure with Fusion Attention Module for Multi-Modal Skin Cancer Classification Peng Tang, Xintong Yan, Yang Nan, Xiaobin Hu #, Bjoern H Menze, Sebastian Krammer, Tobias Lasser. (corresponding author: #) Pattern Recognition, 2024 paper / video / bibtex / code This paper introduces a novel fusion method that integrates dermatological images (dermoscopy images or clinical images) with patient metadata for skin cancer classification, focusing on enhancing FS and FM components.
	Automated segmentation of the human supraclavicular fat depot via deep neural network in water-fat separated magnetic resonance images Yu Zhao, Chunmeng Tang, Bihao Cui, Arun Somasundaram, Johannes Raspe, Xiaobin Hu #, Christina Holzapfel, Daniela Junker, Hans Hauner, Bjoern Menze, Mingming Wu, Dimitrios Karampinos. (corresponding author # ) Quantitative Imaging in Medicine and Surgery, 2023 paper / video / bibtex / code Human brown adipose tissue (BAT), mostly located in the cervical/supraclavicular region, is a promising target in obesity treatment. Magnetic resonance imaging (MRI) allows for mapping the fat content quantitatively.
	High-resolution Iterative Feedback Network for Camouflaged Object Detection Xiaobin Hu , Shuo Wang, Xuebin Qin, Hang Dai, Wenqi Ren, Donghao Luo, Ying Tai, Ling Shao Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI 23) paper / video / bibtex / code To tackle this challenge, we aim to extract the high-resolution texture details to avoid the detail degradation that causes blurred vision in edges and boundaries.
	AutoGAN-Synthesizer: Neural Architecture Search for Cross-Modality MRI Synthesis Xiaobin Hu , Ruolin Shen, Donghao Luo, Ying Tai, Chengjie Wang, Bjoern Menze MICCAI 2022 paper / video / bibtex / code In this study, we present a novel MRI synthesizer, called AutoGAN-Synthesizer, which automatically discovers generative networks for cross-modality MRI synthesis.
	Face Restoration via Plug-and-Play 3D Facial Priors Xiaobin Hu, Wenqi Ren, Jiaolong Yang, Xiaochun Cao, David Wipf, Bjoern Menze, Xin Tong, Hongbin Zha, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021 paper / video / bibtex / code Existing face restoration algorithms only employ 2D priors without considering high-dimensional (3D) information. The 3D morphable facial priors are the main novelty of this work and are completely different from recent related 2D prior works.
	Face Super-Resolution Guided by 3D Facial Priors Xiaobin Hu, Wenqi Ren, John LaMaster, Xiaochun Cao, Xiaoming Li, Zechao Li, Bjoern Menze, Wei Liu, European Conference on Computer Vision (ECCV), 2020, (Spotlight Presentation) paper / video / bibtex / code In this paper, we propose a novel face super-resolution method that explicitly incorporates 3D facial priors which grasp the sharp facial structures. Our work is the first to explore 3D morphable knowledge based on the fusion of parametric descriptions of face attributes (e.g., identity, facial expression, texture, illumination, and face pose).
	Pyramid Architecture Search for Real-Time Image Deblurring Xiaobin Hu, Wenqi Ren, Kaicheng Yu, Kaihao Zhang, Xiaochun Cao, Wei Liu, Bjoern Menze, International Conference on Computer Vision (ICCV), 2021, Montreal, Canada paper / video / bibtex / code We propose a novel deblurring method, dubbed PyNAS, for automatically designing hyper-parameters including the scales, patches, and standard cell operators. Our primary contribution is a real-time deblurring algorithm (around 58 fps) for 720p images that achieves state-of-the-art deblurring performance on the GoPro and Video Deblurring datasets.
	SRGAT: Single Image Super-Resolution With Graph Attention Network Yanyang Yan, Wenqi Ren, Xiaobin Hu, Kun Li, Haifeng Shen, Xiaochun Cao, IEEE Transactions on Image Processing (TIP), 2021 paper / video / bibtex / code
	Ultra-High-Definition Image Dehazing via Multi-Guided Bilateral Learning Zhuoran Zheng, Wenqi Ren, Xiaochun Cao, Xiaobin Hu, Tao Wang, Fenglong Song, Xiuyi Jia, Computer Vision and Pattern Recognition (CVPR), 2021 paper / video / bibtex / code
	Weakly supervised deep learning for determining the prognostic value of 18F-FDG PET/CT in extranodal natural killer/T cell lymphoma, nasal type Rui Guo, Xiaobin Hu, Haoming Song, Pengpeng Xu, Haoping Xu, Axel Rominger, Xiaozhu Lin, Bjoern Menze, Biao Li, Kuangyu Shi (equal contribution) European Journal of Nuclear Medicine and Molecular Imaging, 2021, Top journal* paper / video / bibtex / code
	Morphological Residual Convolutional Neural Network (MRCNN) for Intelligent Recognition of Wear Particles From Artificial Joints Xiaobin Hu, Jian Song, Zhenhua Liao, Yuhong Liu, Jian Gao, Bjoern Menze, Weiqiang Liu, Friction, 2021, Top journal paper / video / bibtex / code
	Feedback Graph Attention Convolutional Network for MR Images Enhancement by Exploring Self-Similarity Features Xiaobin Hu, Yanyang Yan, Wenqi Ren, Hongwei Li, Amirhossein Bayat, Yu Zhao, Bjoern Menze, Medical Imaging with Deep Learning (MIDL), 2021 paper / video / bibtex / code
	Coarse-to-Fine Adversarial Networks and Zone-Based Uncertainty Analysis for NK/T-Cell Lymphoma Segmentation in CT/PET Images Xiaobin Hu , Rui Guo, Jieneng Chen, Hongwei Li, Diana Waldmannstetter, Yu Zhao, Biao Li, Kuangyu Shi, Bjoern Menze, Journal of Biomedical and Health Informatics, 2020, Top journal paper / video / bibtex / code
	Spatial-Frequency Non-local Convolutional LSTM Network for pRCC Classification Yu Zhao, et al., Xiaobin Hu #, Bjoern Menze. (corresponding author: #) International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2019) paper / video / bibtex / code
	Toward a Brain-Inspired System: Deep Recurrent Reinforcement Learning for a Simulated Self-Driving Agent Jieneng Chen, Jingye Chen, Ruiming Zhang, Xiaobin Hu #. (corresponding author: #) Frontiers in Neurorobotics, 2019 paper / video / bibtex / code