NameGuess: Column Name Expansion for Tabular Data

Jiani Zhang, Zhengyuan Shen, Balasubramaniam Srinivasan, Shen Wang, Huzefa Rangwala, George Karypis

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution

Abstract

Recent advances in large language models have revolutionized many sectors, including the database industry. One common challenge when dealing with large volumes of tabular data is the pervasive use of abbreviated column names, which can negatively impact performance on various data search, access, and understanding tasks. To address this issue, we introduce a new task, called NAMEGUESS, which frames the expansion of column names (as used in database schemas) as a natural language generation problem. We create a training dataset of 384K abbreviated-expanded column pairs using a new data fabrication method, along with a human-annotated evaluation benchmark that includes 9.2K examples from real-world tables. To tackle the complexities associated with polysemy and ambiguity in NAMEGUESS, we enhance autoregressive language models by conditioning on table content and column header names, yielding a fine-tuned model (with 2.7B parameters) that matches human performance. Furthermore, we conduct a comprehensive analysis (on multiple LLMs) to validate the effectiveness of table content in NAMEGUESS and identify promising future opportunities. Code has been made available at https://github.com/amazon-science/nameguess.

Original language: English (US)
Title of host publication: EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Publisher: Association for Computational Linguistics (ACL)
Pages: 13276-13290
Number of pages: 15
ISBN (Electronic): 9798891760608
State: Published - 2023
Externally published: Yes
Event: 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 - Hybrid, Singapore, Singapore
Duration: Dec 6 2023 to Dec 10 2023

Publication series

Name: EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings

Conference

Conference: 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023
Country/Territory: Singapore
City: Hybrid, Singapore
Period: 12/6/23 to 12/10/23

Bibliographical note

Publisher Copyright:
©2023 Association for Computational Linguistics.
