Computer Science
Improving Web Element Localization by Using a Large Language Model
M. Nass, E. Alégroth, et al.
The paper addresses the robustness challenge in GUI-based automated regression testing, specifically the problem of reliably localizing web elements across evolving versions of web applications. Traditional multi-locator similarity approaches (e.g., Similo and VON Similo) compare attributes to compute similarity but lack semantic understanding of text and contextual reasoning about web application structures. Minor GUI changes (e.g., button text changing from “Submit” to “Send” or markup changes from input type=button to button) can cause false positives or broken scripts. The authors hypothesize that large language models (LLMs), trained on vast web/text corpora, can complement conventional algorithms with semantic and contextual reasoning to improve element localization. They propose VON Similo LLM, which leverages an LLM to choose the best candidate among the top-ranked elements from VON Similo, aiming to approach human-like selection. The paper contributes: (1) a novel LLM-augmented localization approach, (2) an empirical evaluation of effectiveness and efficiency versus a baseline, and (3) a qualitative analysis of LLM motivations to understand the aspects it uses when comparing elements.
The related work spans postrepair and preventive (resilient locator) strategies for GUI/web testing. Postrepair approaches include WATER and WATERFALL, which repair broken locators using weighted heuristics and version knowledge; COLOR and Erratum, which leverage multiple clues and flexible tree matching; and GUIDE, which uses GUI differencing to suggest repairs. Preventive strategies generate robust locators or diversify signals: algorithms for resilient XPath generation (Montoto; ROBULA/ROBULA+), multi-locator voting (Leotta et al.), and contextual triangulation using neighboring elements (ATA/ATA-QV). Recent NLP/LLM-based methods identify elements from natural-language instructions or generate human-like test actions and inputs (e.g., Kirinuki et al., GPTDroid, QTypist, CrawLabel). Similo and VON Similo combine multiple properties and visual overlap concepts. The proposed VON Similo LLM builds on these by adding LLM-based semantic understanding and context awareness on top of conventional similarity ranking.
The study evaluates VON Similo LLM against VON Similo on effectiveness (correct localizations) and efficiency (execution time), and qualitatively analyzes LLM motivations.
Dataset: 804 oracle pairs of corresponding web elements from old vs. new versions of 48 real-world websites (derived from Alexa Top 50 with two exclusions), where older versions (12–60 months apart) were accessed via the Internet Archive. Elements eligible as oracles supported actions/assertions/synchronization, belonged to core features, and existed on both homepages. Properties were extracted with a Java scraper. The VON concept was applied to merge visually overlapping DOM nodes (overlap ratio > 0.85) into visual web elements, allowing properties to hold multiple values.
Baseline algorithm: VON Similo ranks candidates by a weighted sum over comparisons of 14 properties (e.g., tag, text, xpaths, class, href, alt, location, size/area, shape, visible/neighbor text), returning the top match or a ranked list.
LLM selection: GPT-4 was chosen for its effectiveness and larger context window, despite higher cost and latency than GPT-3.5-turbo.
VON Similo LLM process: (1) use VON Similo to rank all candidates on the new page against the desired (old) properties; (2) take the top 10 candidates; (3) construct an LLM prompt with instructions, the 10 candidates, and the desired target in a JSON-like format; (4) query GPT-4 to return the widget_id of the most similar candidate, optionally requesting motivations.
Prompt engineering: zero-shot vs. one-shot (with a single example). On a 70-case subset where VON Similo failed, one-shot improved located cases from 37 to 41 (52.9% to 58.6%). The full experiment used one-shot prompting and 10 candidates to balance accuracy, prompt size, and cost; this constraint excluded 13 cases where the true element was not in the top 10.
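The rank-then-select process described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the property weights, the exact-match comparison, the prompt wording, and the fallback to the baseline's top match are all hypothetical, and `ask_llm` is a stand-in callable for whatever client would query GPT-4.

```python
import json
from typing import Callable, Optional

# Hypothetical weights: the summary does not list the real VON Similo weights,
# and only 4 of the 14 properties are used here for brevity.
WEIGHTS = {"tag": 1.5, "text": 1.5, "class": 0.5, "href": 0.5}


def property_similarity(a, b) -> float:
    """Toy comparison (an assumption): 1.0 on case-insensitive exact match, else 0.0."""
    if not a or not b:
        return 0.0
    return 1.0 if str(a).strip().lower() == str(b).strip().lower() else 0.0


def score(target: dict, candidate: dict) -> float:
    """Weighted sum of per-property similarities."""
    return sum(
        weight * property_similarity(target.get(prop), candidate.get(prop))
        for prop, weight in WEIGHTS.items()
    )


def rank(target: dict, candidates: list) -> list:
    """Order candidates from most to least similar to the target."""
    return sorted(candidates, key=lambda c: score(target, c), reverse=True)


def build_prompt(target: dict, top: list) -> str:
    """Serialize the candidates and the target as JSON (prompt wording is hypothetical)."""
    return (
        "Select the candidate most similar to the target element. "
        "Answer with its widget_id only.\n"
        f"Candidates: {json.dumps(top)}\n"
        f"Target: {json.dumps(target)}"
    )


def locate(target: dict, candidates: list,
           ask_llm: Callable[[str], str], k: int = 10) -> Optional[dict]:
    """Rank with the baseline, then let the LLM pick among the top k."""
    top = rank(target, candidates)[:k]
    if not top:
        return None
    reply = ask_llm(build_prompt(target, top)).strip()
    by_id = {str(c["widget_id"]): c for c in top}
    # Assumption: fall back to the baseline's top match on an unusable reply.
    return by_id.get(reply, top[0])


# Usage: the old "Submit" input became a "Send" button on the new page.
target = {"tag": "input", "text": "Submit", "class": "btn"}
candidates = [
    {"widget_id": "1", "tag": "button", "text": "Send", "class": "btn"},
    {"widget_id": "2", "tag": "input", "text": "Search", "class": "search"},
]
baseline_choice = rank(target, candidates)[0]             # tag match wins
llm_choice = locate(target, candidates, lambda prompt: "1")  # stubbed LLM reply
```

With these toy weights the baseline prefers the unrelated search input because its tag matches, while the stubbed LLM reply selects the renamed button, which is the kind of case the paper's approach targets.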
Experimental phases: (1) run VON Similo on all 804 to identify failures; (2) for the 70 VON Similo failures, query GPT-4 for top-10 selection plus motivations and qualitatively code motivations into comparison operator, semantic understanding, or context awareness; (3) run VON Similo LLM on all 804, requesting only the widget_id for efficiency, and measure execution times from call to result. A control with randomized candidate order in prompts yielded identical results, ruling out ordering bias. Localization criterion: a candidate is considered located if any of its (possibly multiple) XPaths matches the oracle XPath exactly. Efficiency metric: average time per localization in milliseconds; API cost tracked.
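The localization criterion and the call-to-result timing can be expressed compactly. The sketch below assumes each test case is a `(target, oracle_xpath)` pair and that the hypothetical `locate` callable returns the list of XPaths of the element it selected; neither name comes from the paper.

```python
import time
from statistics import mean
from typing import Callable, Iterable, List, Tuple


def is_located(candidate_xpaths: Iterable[str], oracle_xpath: str) -> bool:
    """True if any XPath of the selected (possibly merged) element equals the oracle XPath."""
    return any(xpath == oracle_xpath for xpath in candidate_xpaths)


def evaluate(cases: List[Tuple[dict, str]],
             locate: Callable[[dict], List[str]]) -> Tuple[int, float]:
    """Return (number of located cases, mean time per localization in milliseconds)."""
    hits = 0
    times_ms = []
    for target, oracle_xpath in cases:
        start = time.perf_counter()
        xpaths = locate(target)  # timed from call to result
        times_ms.append((time.perf_counter() - start) * 1000.0)
        if is_located(xpaths, oracle_xpath):
            hits += 1
    return hits, (mean(times_ms) if times_ms else 0.0)
```

Because a visual web element may merge several overlapping DOM nodes, it can carry several XPaths, which is why the check uses `any` rather than a single comparison.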
- Effectiveness (RQ1): VON Similo located 734/804 (91.3%), failing 70. VON Similo LLM located 764/804 (95.0%), failing 40, a 42.9% reduction in failures (70→40).
- Overlap: 724 cases located by both; 40 located only by VON Similo LLM; 10 located only by VON Similo.
- Efficiency (RQ2): Average time per localization was 29 ms for VON Similo vs. 1934 ms (standard deviation 537 ms) for VON Similo LLM, roughly 67 times slower and thus nearly two orders of magnitude, due to GPT-4 API latency.
- Cost: Total GPT-4 API cost reported as $35.86 for the 804 prompts (~$0.045 per prompt).
- Prompt engineering: On the 70-case subset where VON Similo failed, one-shot prompting improved located cases from 37/70 (52.9%) to 41/70 (58.6%).
- Motivations (RQ3): For VON Similo-incorrect cases, 47% of LLM motivations reflected context awareness, 17% semantic understanding, 36% comparison operator usage. For VON Similo-correct cases, comparison-operator-style motivations increased to about 45.4%. Examples included recognizing menu/footer groupings via neighbor_text and inferring semantic equivalence of labels like “Health & Beauty” vs. “Beauty, Health & Hair.”
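The headline figures above are mutually consistent, which a short recomputation makes easy to check. The script below uses only the counts reported in this summary; the variable names are illustrative.

```python
TOTAL = 804
BOTH, LLM_ONLY, BASELINE_ONLY = 724, 40, 10

baseline_located = BOTH + BASELINE_ONLY        # 734
llm_located = BOTH + LLM_ONLY                  # 764
baseline_failures = TOTAL - baseline_located   # 70
llm_failures = TOTAL - llm_located             # 40

baseline_accuracy = baseline_located / TOTAL
llm_accuracy = llm_located / TOTAL
failure_reduction = (baseline_failures - llm_failures) / baseline_failures
slowdown = 1934 / 29             # mean ms per localization, LLM vs. baseline
cost_per_prompt = 35.86 / TOTAL  # total API cost in USD over 804 prompts

print(f"Baseline accuracy: {baseline_accuracy:.1%}")
print(f"LLM accuracy:      {llm_accuracy:.1%}")
print(f"Failure reduction: {failure_reduction:.1%}")
print(f"Slowdown factor:   {slowdown:.1f}x")
print(f"Cost per prompt:   ${cost_per_prompt:.3f}")
```

The recomputed values match the reported 91.3%, 95.0%, 42.9%, and roughly $0.045 per prompt, and the slowdown works out to about 67x.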
The findings support the hypothesis that augmenting similarity-based localization with LLM reasoning improves effectiveness: GPT-4 often leverages context awareness (layout, grouping patterns, neighbor text) and semantic understanding (paraphrases and synonyms in labels) beyond conventional string/distance comparisons. This reduces false positives and maintenance effort in GUI test automation. However, the LLM approach incurs substantial latency (about 2 s per decision) and runtime cost. An indicative ROI analysis suggests the API cost can be negligible relative to manual maintenance: with an average maintenance effort of 110 minutes per test case and 47 localizations per test case, at $100/hour, maintenance costs (~$183 per test case) dwarf API costs (~$2 per test case), and increased robustness (reducing failures from 70 to 40) could lower maintenance to ~$105. Nonetheless, these estimates vary by context. Limitations include reliance on GPT-4 (cloud, latency, cost, security) and the top-10 constraint that omitted some true targets, potentially underestimating LLM effectiveness. The 10 cases that the baseline found but the LLM missed could not be analyzed in detail because motivations were not collected in the full 804-case run. Overall, LLM-enhanced localization shows promise as a complementary technique, with practicality depending on performance and cost trade-offs that may improve with future models.
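The indicative ROI figures follow from simple arithmetic. The sketch below assumes that maintenance cost scales proportionally with the number of localization failures (70 to 40); that reading is an assumption, although it reproduces the ~$105 estimate. The per-prompt cost is derived from the reported $35.86 total over 804 prompts.

```python
MAINTENANCE_MINUTES_PER_TEST = 110
LOCALIZATIONS_PER_TEST = 47
HOURLY_RATE_USD = 100.0
API_COST_PER_PROMPT_USD = 35.86 / 804  # from the reported total API cost

# Manual maintenance cost per test case: about $183.
maintenance_usd = MAINTENANCE_MINUTES_PER_TEST / 60 * HOURLY_RATE_USD

# API cost per test case, one prompt per localization: about $2.
api_usd = LOCALIZATIONS_PER_TEST * API_COST_PER_PROMPT_USD

# Assumption: maintenance shrinks in proportion to failures (70 -> 40): about $105.
reduced_maintenance_usd = maintenance_usd * 40 / 70

print(f"Maintenance per test case:         ${maintenance_usd:.2f}")
print(f"API cost per test case:            ${api_usd:.2f}")
print(f"Maintenance with fewer failures:   ${reduced_maintenance_usd:.2f}")
```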
VON Similo LLM, which uses an LLM to select the best match among top-ranked candidates from a conventional multi-locator method, improves web element localization accuracy from 91.3% to 95.0% on 804 element pairs across 48 real-world sites, reducing failures by ~43%. The LLM contributes semantic understanding and context awareness (e.g., recognizing UI groupings and paraphrased labels), potentially decreasing manual maintenance and false positives in GUI test automation. The approach is currently limited by API latency and cost, as well as reliance on cloud services. Future work includes exploring LLM-only selection across all page elements (e.g., tournament selection), providing richer modalities such as pixel-level UI representations, employing structured prompting methods (e.g., Chain/Tree of Thought) to enhance reasoning, hybrid decision strategies that invoke the LLM only when baseline confidence is low, and evaluating faster or local LLMs to mitigate latency, cost, and security concerns.
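One of the future directions, invoking the LLM only when the baseline's confidence is low, could look roughly like the following. The paper does not evaluate this strategy, so everything here is hypothetical: the relative-margin confidence measure, the `min_margin` threshold, and the `ask_llm` callable that takes candidate IDs and returns one of them.

```python
from typing import Callable, Optional, Sequence, Tuple


def choose(ranked: Sequence[Tuple[str, float]],
           ask_llm: Callable[[Sequence[str]], str],
           min_margin: float = 0.1,
           k: int = 10) -> Optional[str]:
    """Pick a widget_id from (widget_id, score) pairs sorted by descending score.

    The LLM is consulted only when the top two baseline scores are close.
    """
    if not ranked:
        return None
    best_id, best_score = ranked[0]
    if len(ranked) == 1:
        return best_id
    second_score = ranked[1][1]
    # Relative gap between the best and second-best candidate (hypothetical measure).
    margin = (best_score - second_score) / best_score if best_score > 0 else 0.0
    if margin >= min_margin:
        return best_id  # baseline is confident: skip the slow, paid LLM call
    ids = [widget_id for widget_id, _ in ranked[:k]]
    reply = ask_llm(ids)
    return reply if reply in ids else best_id
```

Under such a scheme the roughly 2 s latency and per-prompt cost would apply only to ambiguous cases, at the risk of missing errors where the baseline is confidently wrong.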
- External validity: Websites drawn from Alexa Top sites; only homepages were used due to Internet Archive constraints, which may not represent all page types. Versions compared were 12–60 months apart; change magnitude may differ from typical industrial release cycles.
- Internal/construct validity: Manual selection of oracle pairs might miss heavily changed but still corresponding elements. The study used GPT-4 specifically; results may differ with other LLMs or later versions.
- Methodological constraint: Limiting LLM input to top-10 candidates excluded 13 true targets from consideration, likely underestimating LLM potential. Increasing candidates would raise prompt size, latency, and cost.
- Efficiency measurement limitations: High GPT-4 latency dominates runtime; a standard deviation was reported only for the LLM variant. API quotas and token accounting influence throughput and cost.
- Analysis limitation: In the full 804-case run, only widget_ids were requested, so no motivations were available to analyze why the LLM missed the 10 cases that the baseline found.
- Practical concerns: Cloud API cost/latency and potential data security/privacy risks; generalizability to other LLMs and execution environments remains to be established.