Multimodal Social Conversations
Large language models have given social robots the ability to autonomously engage in open-domain conversations. However, they still lack a fundamental social skill: making use of the multiple modalities that carry social interactions. While previous work has focused on task-oriented interactions that require referencing the environment, or on specific phenomena in social interactions such as dialogue breakdowns, we outline the overall needs of a multimodal system for social conversations with robots. We then argue that vision-language models can process this wide range of visual information in a sufficiently general manner to be used on autonomous social robots. We describe how to adapt them to this setting, identify the technical challenges that remain, and briefly discuss evaluation practices.
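To give a concrete flavour of the idea, here is a minimal sketch (not the system described in the paper) of feeding a robot's camera frame together with the ongoing dialogue into an off-the-shelf vision-language model. The LLaVA checkpoint, prompt format, and respond() helper below are illustrative assumptions only.

# Illustrative sketch: generate a robot reply conditioned on both the camera
# image and the dialogue, using a generic Hugging Face LLaVA checkpoint.
# Model choice, prompt format, and the respond() helper are assumptions for
# illustration, not the approach evaluated in the paper.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint; any chat-capable VLM works
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(MODEL_ID)

def respond(camera_frame: Image.Image, dialogue_history: str, user_utterance: str) -> str:
    # Combine the visual scene and the conversation so far into one prompt,
    # asking the model to answer as a conversation partner, not an image captioner.
    prompt = (
        "USER: <image>\n"
        "You are a social robot having a casual conversation.\n"
        f"Conversation so far:\n{dialogue_history}\n"
        f"The person just said: \"{user_utterance}\"\n"
        "Reply naturally, mentioning what you see only when it is socially relevant.\n"
        "ASSISTANT:"
    )
    inputs = processor(text=prompt, images=camera_frame, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=80)
    # Decode the full sequence and keep only the newly generated reply.
    decoded = processor.decode(output_ids[0], skip_special_tokens=True)
    return decoded.split("ASSISTANT:")[-1].strip()

# Example use: print(respond(Image.open("frame.jpg"), "", "Hi there! Nice weather, isn't it?"))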
If you want to know more about this research, read our paper, which will be presented at the Foundation Models (FoMo-HRI) workshop at the 2025 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)!
Feel free to get in touch if you have any other questions or want to know more! You can use any of the channels at the bottom of this page, or send an email to ruben[dot]janssens[at]ugent[dot]be.
If you use our work in any future research, please use the following citation:
@article{janssens2025towards,
title={Towards Multimodal Social Conversations with Robots: Using Vision-Language Models},
author={Janssens, Ruben and Belpaeme, Tony},
journal={arXiv preprint arXiv:2507.19196},
year={2025}
}