VisualHow: Multimodal Problem Solving

Jinhui Yang; Xianyu Chen; Ming Jiang; Shi Chen; Louis Wang; Qi Zhao

doi:10.1109/CVPR52688.2022.01518

Computer Science and Engineering

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Scopus citation

Abstract

Recent progress in the interdisciplinary studies of computer vision (CV) and natural language processing (NLP) has enabled the development of intelligent systems that can describe what they see and answer questions accordingly. However, despite showing usefulness in performing these vision-language tasks, existing methods still struggle in understanding real-life problems (i.e., how to do something) and suggesting step-by-step guidance to solve them. With an overarching goal of developing intelligent systems to assist humans in various daily activities, we propose VisualHow, a free-form and open-ended research that focuses on understanding a real-life problem and deriving its solution by incorporating key components across multiple modalities. We develop a new dataset with 20,028 real-life problems and 102,933 steps that constitute their solutions, where each step consists of both a visual illustration and a textual description that guide the problem solving. To establish better understanding of problems and solutions, we also provide annotations of multimodal attention that localizes important components across modalities and solution graphs that encapsulate different steps in structured representations. These data and annotations enable a family of new vision-language tasks that solve real-life problems. Through extensive experiments with representative models, we demonstrate their effectiveness on training and testing models for the new tasks, and there is significant scope for improvement by learning effective attention mechanisms. Our dataset and models are available at https://github.com/formidify/VisualHow.
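The dataset organization described in the abstract can be pictured with a minimal sketch. It assumes a hypothetical in-memory layout: the class and field names (`Step`, `Problem`, `image_path`, `caption`, `dependencies`) are illustrative only and are not taken from the released VisualHow code or data format. Only the two counts come from the abstract.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Counts reported in the abstract.
NUM_PROBLEMS = 20_028
NUM_STEPS = 102_933


@dataclass
class Step:
    """One solution step: a visual illustration paired with a textual description."""
    image_path: str  # hypothetical field name
    caption: str     # hypothetical field name


@dataclass
class Problem:
    """A real-life "how to" problem together with the steps that solve it."""
    goal: str
    steps: list[Step] = field(default_factory=list)
    # Hypothetical encoding of a solution graph:
    # (i, j) means step i should be completed before step j.
    dependencies: list[tuple[int, int]] = field(default_factory=list)


def mean_steps_per_problem(num_steps: int, num_problems: int) -> float:
    """Average number of steps per problem."""
    return num_steps / num_problems


if __name__ == "__main__":
    print(f"{mean_steps_per_problem(NUM_STEPS, NUM_PROBLEMS):.2f} steps per problem")
```

With the reported counts, this works out to roughly 5.1 steps per problem on average.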

Original language	English (US)
Title of host publication	Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Publisher	IEEE Computer Society
Pages	15606-15616
Number of pages	11
ISBN (Electronic)	9781665469463
DOIs	https://doi.org/10.1109/CVPR52688.2022.01518
State	Published - 2022
Event	2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 - New Orleans, United States; Duration: Jun 19 2022 → Jun 24 2022

Publication series

Name	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume	2022-June
ISSN (Print)	1063-6919

Conference

Conference	2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Country/Territory	United States
City	New Orleans
Period	6/19/22 → 6/24/22

Bibliographical note

Publisher Copyright:
© 2022 IEEE.

Keywords

Datasets and evaluation
Vision + language

Access

10.1109/CVPR52688.2022.01518

Cite this

Yang, J., Chen, X., Jiang, M., Chen, S., Wang, L., & Zhao, Q. (2022). VisualHow: Multimodal Problem Solving. In Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 (pp. 15606-15616). (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; Vol. 2022-June). IEEE Computer Society. https://doi.org/10.1109/CVPR52688.2022.01518

VisualHow: Multimodal Problem Solving. / Yang, Jinhui; Chen, Xianyu; Jiang, Ming et al.
Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022. IEEE Computer Society, 2022. p. 15606-15616 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; Vol. 2022-June).

Yang, J, Chen, X, Jiang, M, Chen, S, Wang, L & Zhao, Q 2022, VisualHow: Multimodal Problem Solving. in Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2022-June, IEEE Computer Society, pp. 15606-15616, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, United States, 6/19/22. https://doi.org/10.1109/CVPR52688.2022.01518

@inproceedings{0c82a146c33347d1b131cb490821f019,

title = "VisualHow: Multimodal Problem Solving",

keywords = "Datasets and evaluation, Vision + language",

author = "Jinhui Yang and Xianyu Chen and Ming Jiang and Shi Chen and Louis Wang and Qi Zhao",

note = "Publisher Copyright: {\textcopyright} 2022 IEEE.; 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 ; Conference date: 19-06-2022 Through 24-06-2022",

year = "2022",

doi = "10.1109/CVPR52688.2022.01518",

language = "English (US)",

series = "Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition",

publisher = "IEEE Computer Society",

pages = "15606--15616",

booktitle = "Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022",

}

TY - GEN

T1 - VisualHow: Multimodal Problem Solving

T2 - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022

AU - Yang, Jinhui

AU - Chen, Xianyu

AU - Jiang, Ming

AU - Chen, Shi

AU - Wang, Louis

AU - Zhao, Qi

PY - 2022

Y1 - 2022

KW - Datasets and evaluation

KW - Vision + language

UR - http://www.scopus.com/inward/record.url?scp=85141759478&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85141759478&partnerID=8YFLogxK

U2 - 10.1109/CVPR52688.2022.01518

DO - 10.1109/CVPR52688.2022.01518

M3 - Conference contribution

AN - SCOPUS:85141759478

T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

SP - 15606

EP - 15616

BT - Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022

PB - IEEE Computer Society

Y2 - 19 June 2022 through 24 June 2022

ER -
