A Computational Analysis of Vagueness in Revisions of Instructional Texts

Computer Science

A. Debnath and M. Roth

This research by Alok Debnath and Michael Roth examines vagueness in instructional texts from the wikiHowToImprove dataset. By analyzing revisions that clarify vague instructions and developing a neural model to distinguish original from revised versions, the authors demonstrate clear improvements over existing techniques.
Introduction

Instructional texts aim to clearly and concisely describe the actions needed to accomplish tasks. wikiHow provides an extensive set of instructional guides with publicly available revision histories that capture user edits. The wikiHowToImprove dataset compiles these revisions, covering phenomena from typos to clarifications of ambiguity and vagueness. This work focuses on lexical vagueness—lexemes with a single but nonspecific meaning—operationalized via changes to the main verb between original and revised instructions. If the revised version’s main verb is contextually more specific, the original is considered vague. Examples show original verbs like make, go, get being revised to design, visit, purchase, which add specificity.

Identifying vague versus clarified instructions is a first step toward automatic text editing for clarification based on linguistic criteria. The paper creates a dataset of vague and clarified instructions, analyzes them using FrameNet frames, and evaluates neural models on a pairwise ranking task to distinguish original from revised versions, improving over existing baselines.

Literature Review

The study builds on using revision histories as corpora for NLP tasks (e.g., Wikipedia and WikiHow). Prior work categorized edit intentions in Wikipedia (Yang et al., 2016; 2017) and in WikiHow (Anthonio et al., 2020). Traditional computational treatments of vagueness often rely on logical representations (DeVault and Stone, 2004; Tang, 2008), whereas related work has examined context-dependent resolution of vague expressions such as color references (Meo et al., 2014), detection of vague definitions in ontologies (Alexopoulos and Pavlopoulos, 2014) and in privacy policies (Lebanoff and Liu, 2018), and uncertainty in historical texts (Vertan, 2019). For semantic characterization, the paper leverages FrameNet frame relations, relating to research on hypernymy/hyponymy detection using lexicosyntactic patterns and distributional methods (Snow et al., 2004; Shwartz et al., 2016; Roller et al., 2018).

Methodology

Data creation and preprocessing: Starting from the noisy wikiHowToImprove corpus of revision histories, the authors cleaned misspellings using the Enchant Python API, POS-tagged and dependency-parsed sentences with Stanza, and filtered sentence pairs to lengths between 4 and 50 words. They extracted a sub-corpus of instructional sentence pairs in which both original and revised versions met at least one criterion: imperative form (root verb without nominal subject), instructional indicative (nominal subject of root verb is 'you', 'it', or 'one'), or passive form with 'let'. They further filtered to pairs with character edit distance less than 10 to focus on minimal edits, typically verb changes, and to exclude additional edits or spam. This yielded 41,615 sentences.
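The minimal-edit filter described above can be sketched in plain Python. The function and threshold names below are illustrative stand-ins (the paper specifies only the length window of 4–50 words and a character edit distance below 10), and the real pipeline applies the syntactic criteria via Stanza parses before this step:

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def is_minimal_edit(original: str, revised: str,
                    min_len: int = 4, max_len: int = 50,
                    max_dist: int = 10) -> bool:
    """Keep pairs whose versions fall in the length window and whose
    character edit distance stays below max_dist (typically a verb change)."""
    for sent in (original, revised):
        if not (min_len <= len(sent.split()) <= max_len):
            return False
    return levenshtein(original, revised) < max_dist

print(is_minimal_edit("Make a comic in Flash to share online.",
                      "Create a comic in Flash to share online."))
```

Filtering on edit distance this way keeps revisions that change little besides the main verb, while discarding rewrites, added sentences, and spam.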

Verb frame analysis: Using the INCEpTION FrameNet Tools neural parser, they identified frames evoked by the root verbs in original and revised sentences, and used the NLTK FrameNet API to determine frame-to-frame relations. Most edits fell into: Subframe-of (revised frame as a subevent of the original), Inherits-from (revised elaborates the original), and Uses (revised uses/weakly inherits properties of the original). Cases not covered by FrameNet or parser failures were grouped as Other. The analysis confirmed revised verbs are typically more specific than original verbs.
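The grouping of frame-to-frame relations into analysis categories can be sketched as a simple mapping. The paper's pipeline obtains the relation names from the NLTK FrameNet API; the dictionary and function below are hypothetical, with relation names supplied directly:

```python
from typing import Optional

# Map raw FrameNet relation names to the categories used in the analysis.
CATEGORY = {
    "Subframe": "Subframe-of",      # revised frame is a subevent of the original
    "Inheritance": "Inherits-from", # revised frame elaborates the original
    "Using": "Uses",                # revised frame weakly inherits properties
}

def categorize(relation_name: Optional[str]) -> str:
    """Bucket a relation name into an analysis category, defaulting to Other
    when the frame is missing from FrameNet or the parser failed."""
    if relation_name is None:
        return "Other"
    return CATEGORY.get(relation_name, "Other")

print(categorize("Subframe"))  # Subframe-of
print(categorize(None))        # Other
```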

Pairwise ranking experiments: The task is to distinguish original versus revised versions. The proposed neural architecture uses two initial BiLSTM encoders (one per sentence version), a joint BiLSTM layer over concatenated hidden states, and re-encoding BiLSTMs conditioned on the joint representation. Inputs are token embeddings from FastText (300-dimensional) or subword embeddings from BERT. The final classification layer uses self-attention and softmax to predict labels for each version (0=original, 1=revised), trained with cross-entropy. Hyperparameters: LSTM1A/1B/2A/2B hidden size 256, joint LSTMAB hidden size 512, dropout 0.2 (not applied to BiLSTMs or self-attention), batch size 32, learning rate 1e-5, 5 training epochs. Data splits follow wikiHowToImprove partitions: 30,044 training, 6,237 test, and 5,334 validation pairs.
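The final classification step, self-attention pooling over encoder states followed by a softmax, can be illustrated with NumPy. This is a sketch only: the weights below are random stand-ins for learned parameters, and the real model produces the hidden states with the BiLSTM stack described above:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentive_classify(H, w, W_out):
    """Self-attention pooling over hidden states H (T x d), then softmax.

    H: per-token hidden states for one sentence version; w: attention
    query vector (d,); W_out: output projection (d x 2) over the labels
    0=original and 1=revised. All weights are illustrative stand-ins.
    """
    scores = softmax(H @ w)          # (T,) attention weights over time steps
    pooled = scores @ H              # (d,) attention-weighted sum of states
    return softmax(pooled @ W_out)   # (2,) distribution over the two labels

rng = np.random.default_rng(0)
T, d = 12, 512                       # 12 tokens, joint hidden size 512
probs = attentive_classify(rng.normal(size=(T, d)),
                           rng.normal(size=d),
                           rng.normal(size=(d, 2)))
print(probs.shape)                   # (2,)
```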

Key Findings
  • Dataset: 41,615 instructional sentence revisions where the revised main verb is more specific; splits of 30,044 train, 6,237 test, 5,334 validation.
  • Frame relations: Most edits mapped to Subframe-of, Inherits-from, or Uses relations; some fell into Other due to parser limitations or missing frames.
  • Pairwise ranking performance: Baseline BiLSTM-Attention (Anthonio et al., 2020) achieves 64.08% accuracy on the filtered corpus. The proposed model with FastText reaches 71.16% accuracy (+7.08% over baseline). The model with BERT embeddings is the most accurate overall (exact number not specified), outperforming FastText and the baseline.
  • Error patterns by frame relation (validation examples):
      • Usage: 503 errors out of 1,965 pairs (example: Make a comic in Flash → Create a comic in Flash)
      • Inheritance: 352/1,793 (Check the "made in" label → Inspect the "made in" label)
      • Subframe: 137/926 (Let your hair dry → Allow your hair to dry)
      • Other: 160/443 (Next, try to sneak out... → Next, attempt to sneak out...)
  • Distinguishability by relation: Subframe-of revisions are easiest to distinguish; Uses relations are most often confused. BERT performs better than FastText on pairs without clear FrameNet relations (Other).
  • Embedding similarity and confusion: The most commonly confused verb pairs (e.g., allow/permit; choose/decide; create/make) have cosine similarity ≥ 0.8, while average verb-pair similarity is 0.47, indicating that embeddings alone can be insufficient for this task.
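The similarity comparison behind the last finding can be sketched with a standard cosine-similarity computation. The vectors below are toy values for illustration only; the reported figures (≥ 0.8 for confused pairs, 0.47 on average) come from the paper's embeddings:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy stand-ins for embeddings of a frequently confused verb pair.
allow  = np.array([0.90, 0.40, 0.10])
permit = np.array([0.85, 0.45, 0.12])
print(cosine(allow, permit) > 0.8)   # near-synonyms score high
```

When two verbs sit this close in embedding space, a model relying on embeddings alone has little signal to decide which version is the clarified one.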

Discussion

The study shows that revisions in instructional texts often clarify vague instructions by making the main verb more specific, and that a neural pairwise ranking model can learn to distinguish original from revised versions. Incorporating a joint representation of both versions improves over a strong BiLSTM-Attention baseline, and contextualized embeddings (BERT) further help, especially when explicit frame relations are absent. Analysis across FrameNet relations indicates that specificity increases that correspond to subevent relations are more detectable than weaker Uses-type relations. Observed failures on near-synonymous verb substitutions suggest that semantic proximity in distributional space limits discrimination, motivating the integration of additional linguistic and discourse features (e.g., FrameNet properties, sentence position) beyond embeddings.

Conclusion

The paper presents a methodology to extract and analyze clarifications of vague instructions from noisy revision histories, creating a large dataset where revised main verbs are more specific than originals. Using FrameNet, the authors characterize the semantic relations underlying these edits, and demonstrate a pairwise ranking approach that improves over prior baselines, with further gains from contextual embeddings. This work lays groundwork for automated editing beyond grammar and style, toward linguistically informed clarification of instructional text. Future directions include expanding to other linguistic phenomena, leveraging richer FrameNet features (including roles), adding discourse/context indicators, and improving robustness to near-synonymous verb substitutions.

Limitations
  • Data noise and filtering: Source data contains typos, errors, and non-instructional content; despite cleaning and constraints (length, edit distance), residual noise may remain and could affect generalizability.
  • Focus on main verbs: The approach centers on root-verb changes; clarifications involving other parts of the sentence (arguments, modifiers, multiword expressions) may be missed.
  • FrameNet coverage and parsing: Automatic frame identification is imperfect; some verbs are missing from FrameNet or not recognized by the parser, leading to an Other category and reduced analytic granularity. Role information was not used.
  • Modeling limitations: Models struggle with near-synonymous verb pairs due to high embedding similarity, indicating that embeddings alone may be insufficient without additional linguistic features.
  • Reporting: Exact accuracy for the BERT-based model is not provided; the summary states only that it outperforms the FastText variant and the baseline (the FastText variant itself beats the baseline by 7.08%), limiting precise comparison.