A lexical network approach to second language development

H. Chen

This study by Heng Chen examines lexical development in second language learning through a complex network approach. It shows how bigram lexical networks reflect L2 proficiency dynamics and how word associations capture information beyond frequency metrics.
Introduction

Interlanguage is a distinct linguistic system formed during L2 acquisition, reflecting influences from both the L1 and target language. Complex Dynamic System Theory (CDST) offers a lens to examine the non-linearity, phase transitions, and self-organization observed in L2 development. Prior work has relied on lexical and syntactic complexity indices (e.g., vocabulary richness, clause and phrase measures) but faces challenges of consistency across measures and limited insight into interactions among linguistic subsystems. Complex network science provides tools to model language as interconnected elements (nodes/links) with properties like small-worldness and scale-freeness. This study proposes a simpler bigram word co-occurrence (lexical) network approach—requiring no dependency parsing or reference corpora—to examine L2 Chinese development and compare it against traditional lexical complexity indices. It addresses three research questions: (1) whether L2 Chinese lexical networks are complex networks; (2) whether lexical network parameters can distinguish proficiency levels; and (3) whether lexical network parameters outperform traditional lexical complexity indices in this task.

Literature Review

CDST has illuminated L2 development dynamics, but measurement consistency and subsystem interactions remain problematic. Traditional indices (vocabulary richness, mean lengths, phrasal/clausal features) highlight dynamic properties but sometimes yield mixed results (e.g., word length and frequency are stable across proficiency; Guiraud varies; chunks/n-grams are strong proficiency predictors). Complex network approaches, widely used in cognitive and applied linguistics (authorship attribution, stylometry, language complexity), model language as a complex adaptive system with small-world and scale-free properties. In acquisition research, syntactic network analyses differentiate proficiency levels and indicate gradual approximation to the target language; however, their reliance on dependency treebanks is resource-intensive and sensitive to annotation schemes. Usage-based perspectives emphasize the centrality of multiword expressions: bigrams capture frequent syntagmatic associations, and approximately half of syntactic dependencies involve adjacent words. Bigrams and TTR-bigram consistently predict proficiency, aligning with chunking-based models of processing. Network models grounded in word associations also map cognitive spreading activation and can rapidly capture relations among many words. Hence, a bigram co-occurrence network can offer a robust, resource-light method to probe lexical development in L2 Chinese.

Methodology

Corpora: Learner data were drawn from the Guangwai-Lancaster Chinese Learner Corpus (GLCLC; ~1.2 million words) and native texts from the Lancaster Corpus of Mandarin Chinese (LCMC). Seven written narrative sub-corpora were compiled: L2 learners with two L1 backgrounds (Thai; Romance languages: French/Spanish) at three proficiency levels (beginner, intermediate, advanced), plus a native Chinese sub-corpus. To control for genre and size effects, all sub-corpora are narrative texts (topics: my life/story/memory trip) with ~10,000 Chinese characters each. Thai/Romance 1, 2, 3 denote beginner, intermediate, advanced, respectively.
Preprocessing and tools: Chinese word segmentation and POS tagging were performed with NLPIR-ICTCLAS. Lexical (bigram co-occurrence) networks were constructed and measured in Pajek. Traditional lexical complexity indices were computed using CTAP.
Network parameters: Seven parameters were computed: network size (n, number of word types/nodes), edges (m, number of distinct bigram types), network density (ND = 2m/[n(n−1)]), degree-based network centralization (NC ≈ (Kmax/n) − ND), clustering coefficient (C; average of local clustering Ci = 2Ei/[ki(ki−1)]), average path length (L, mean shortest path), and cluster/community count (Cl). Small-worldness (S) was quantified following Humphries & Gurney (2008): S = (C/Crnd)/(L/Lrnd), where Crnd and Lrnd are the clustering coefficient and average path length of an ER random network with the same n and mean degree. Networks with S > 1 and scale-free degree distributions were considered complex networks.
Lexical complexity indices: Seven indices were computed: Uber (U) = (log Tokens)²/(log Tokens − log Types); Guiraud = Types/√Tokens; Average Word Length (AWL, syllables/word); Lexical Density (LD, lexical items/total words); Hapax legomena count (HL); Long Word Percentage (LWP, words of ≥3 syllables/total words); Type-Token Ratio of bigrams (TTR-bigram, unique bigrams/total bigrams).
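The network parameters and two of the lexical indices defined above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the Pajek/CTAP pipeline used in the study: the graph is treated as undirected and unweighted, repeated-word bigrams (self-loops) are dropped, the function names are hypothetical, and the token list is an invented, already-segmented example. The random-network baselines Crnd and Lrnd are passed in by the caller rather than simulated.

```python
import math
from collections import deque


def build_bigram_network(tokens):
    """Link every pair of adjacent word types with an undirected, unweighted edge."""
    adj = {w: set() for w in tokens}
    for a, b in zip(tokens, tokens[1:]):
        if a != b:  # assumption: repeated-word bigrams (self-loops) are ignored
            adj[a].add(b)
            adj[b].add(a)
    return adj


def network_density(adj):
    """ND = 2m / [n(n-1)], with n nodes and m undirected edges."""
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) // 2
    return 2 * m / (n * (n - 1))


def clustering_coefficient(adj):
    """Mean of Ci = 2Ei / [ki(ki-1)]; nodes with fewer than two neighbours count as 0."""
    total = 0.0
    for nb in adj.values():
        k = len(nb)
        if k < 2:
            continue
        # Each edge between two neighbours is seen from both ends, hence // 2.
        links = sum(1 for u in nb for v in adj[u] if v in nb) // 2
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over all reachable node pairs (BFS from each node)."""
    dist_sum, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        dist_sum += sum(dist.values())
        pairs += len(dist) - 1
    return dist_sum / pairs


def small_worldness(c, l, c_rnd, l_rnd):
    """S = (C/Crnd) / (L/Lrnd); the random-network baselines are supplied by the caller."""
    return (c / c_rnd) / (l / l_rnd)


def guiraud(tokens):
    """Guiraud = Types / sqrt(Tokens)."""
    return len(set(tokens)) / math.sqrt(len(tokens))


def ttr_bigram(tokens):
    """TTR-bigram = unique bigrams / total bigrams (ordered pairs)."""
    bigrams = list(zip(tokens, tokens[1:]))
    return len(set(bigrams)) / len(bigrams)


# Hypothetical pre-segmented example, not taken from the corpora.
tokens = ["我", "喜欢", "旅行", "我", "喜欢", "中国", "菜"]
net = build_bigram_network(tokens)
print(len(net), network_density(net))  # 5 nodes, density 0.5
print(clustering_coefficient(net), average_path_length(net))
print(guiraud(tokens), ttr_bigram(tokens))
```

On real data the same functions would be applied to the full segmented token stream of each sub-corpus; a graph library would be the more practical choice at that scale.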
Statistical analyses: Hierarchical clustering (SPSS 20, Euclidean distance, average linkage) assessed the ability of network parameters and lexical indices to classify proficiency and native vs learner corpora. Information gain (IG) for each feature quantified predictive utility for proficiency classification: IG(X) = H(Y) − H(Y|X), comparing pre/post-split entropy of proficiency classes by feature values.
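The information gain formula can be illustrated with a small sketch. The summary does not say how continuous feature values were binned, so the bins below are hypothetical; the example also assumes the six learner sub-corpora form three equally sized proficiency classes. Under those assumptions, a feature that separates all three levels yields IG = log2(3) ≈ 1.5849, and a feature that only separates beginners from the rest yields ≈ 0.9183.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy H(Y) in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_bins, labels):
    """IG(X) = H(Y) - H(Y|X), with X given as discrete bins."""
    groups = defaultdict(list)
    for x, y in zip(feature_bins, labels):
        groups[x].append(y)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Six learner sub-corpora: two L1 groups at each of three levels.
levels = ["beginner", "intermediate", "advanced"] * 2

# Hypothetical binnings of a feature (not the study's actual values).
separates_all = ["low", "mid", "high"] * 2         # one bin per level
separates_beginners = ["low", "high", "high"] * 2  # beginners vs. the rest

print(round(information_gain(separates_all, levels), 4))        # 1.585
print(round(information_gain(separates_beginners, levels), 4))  # 0.9183
```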

Key Findings
  • Complex-network characteristics: All seven lexical networks (Thai 1–3, Romance 1–3, Native) displayed small-worldness S > 1 and scale-free degree distributions, thus qualifying as complex networks. S values: Thai 1: 1.96; Thai 2: 3.93; Thai 3: 3.55; Romance 1: 2.93; Romance 2: 3.81; Romance 3: 3.95; Native: 2.76.
  • Proficiency-related patterns in network parameters (Table 2): With increasing proficiency, global network size expands: n (nodes) and m (edges) increase (e.g., Thai 1 n=969, m=3596 → Thai 3 n=1258, m=4358; Romance 1 n=1005, m=3766 → Romance 3 n=1235, m=4327). The number of clusters (Cl) also tends to increase (e.g., Thai 1: 15 → Thai 2/3: 16; Romance 1: 12 → Romance 3: 18). Local connectivity measures decrease: ND declines (Thai 1: 0.0077 → Thai 3: 0.0055; Romance 1: 0.0075 → Romance 3: 0.0057), and C declines (Thai 1: 0.0710 → Thai 3: 0.0566; Romance 1: 0.0662 → Romance 3: 0.0557). Average path length L increases modestly with proficiency and is largest for the native sub-corpus (e.g., Thai 1: L=2.9860 → Thai 3: L=3.1378; Native: L=3.7952). Counting nodes with degree ≥100 showed fewer hubs at higher levels (Thai 1–3: 8, 7, 7; Romance 1–3: 9, 8, 7; Native: 6).
  • Classification by network parameters: Hierarchical clustering on standardized network parameters grouped the sub-corpora cleanly by proficiency level, merging them with Native Chinese only at the final step. Information gain identified n, m, ND, and C as the strongest predictors (IG=1.5849 each), followed by NC and L (0.9182 and 0.7925, respectively), with cluster count (Cl) contributing least.
  • Traditional lexical complexity results (Table 4): Intermediate and advanced learners show higher lexical and bigram diversity and shorter average word length than beginners. Guiraud increases (e.g., Thai 1: 11.52 → Thai 3: 14.78; Romance 1: 12.74 → Romance 3: 15.50), TTR-bigram increases (Thai 1: 0.5809 → Thai 3: 0.7119; Romance 1: 0.6459 → Romance 3: 0.7245), AWL decreases slightly (Thai 1: 1.5253 → Thai 3: 1.5048; Romance 1: 1.5119 → Romance 3: 1.4841), and HL counts rise (e.g., Thai 1: 609 → Thai 2: 705; Romance 1: 518 → Romance 3: 723). LD shows little change across levels. Native Chinese shows much higher U, Guiraud, and TTR-bigram values (U=67.53; Guiraud=26.33; TTR-bigram=0.8893).
  • Classification by lexical complexity indices: Clustering based on the seven traditional indices successfully grouped beginners and natives but failed to separate intermediate vs. advanced learners, which clustered more by L1 background than proficiency. Information gain ranked Guiraud and TTR-bigram highest (1.5849), AWL and LWP moderate (0.9183), and U, LD, HL lowest (0.5849).
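As a quick consistency check on the figures above, the reported densities can be recomputed from the reported n and m using ND = 2m/[n(n−1)]. The sketch below uses only values quoted in the findings; the function and variable names are illustrative.

```python
def network_density(n, m):
    """ND = 2m / [n(n-1)] for an undirected network with n nodes and m edges."""
    return 2 * m / (n * (n - 1))


# (n, m, reported ND) as quoted in the Key Findings.
reported = {
    "Thai 1": (969, 3596, 0.0077),
    "Thai 3": (1258, 4358, 0.0055),
    "Romance 1": (1005, 3766, 0.0075),
    "Romance 3": (1235, 4327, 0.0057),
}

for name, (n, m, nd) in reported.items():
    computed = network_density(n, m)
    # A recomputed value should agree with the reported one to four decimals.
    print(f"{name}: computed {computed:.4f}, reported {nd:.4f}")
```

All four recomputed values round to the reported densities, which shows that the decline in ND follows directly from n growing faster than m at higher proficiency levels.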
Discussion

Findings address all research questions. RQ1: L2 Chinese lexical networks are complex networks—each sub-corpus exhibits small-worldness (S>1) and scale-free degree distributions. RQ2: Lexical network parameters effectively distinguish proficiency levels, producing clear hierarchical clusters by Beginner, Intermediate, Advanced, and Native. The most informative parameters (by information gain) are n, m, ND, and C, indicating that both global network expansion and local connectivity patterns capture developmental differences. RQ3: Compared to traditional lexical complexity metrics, network parameters better differentiate intermediate and advanced learners and approximate native usage, suggesting that word association structures (beyond frequency counts) capture deeper properties of interlanguage. While lexical diversity measures like Guiraud and TTR-bigram remain strong predictors, other traditional indices (e.g., LD, HL, U) are less informative for mid-to-high proficiency distinctions. The results support a CDST view where lexical development reflects reorganization in word association networks: as proficiency increases, the lexicon grows, hubs weaken, local density and clustering decrease given corpus-size constraints, and paths lengthen modestly—indicating a broader, sparser yet more articulated network. The network approach appears less sensitive to learners’ L1 background than traditional indices at higher levels. The study advocates integrating lexical frequency and network features within CDST. It also highlights the promise of multilayer network models to jointly capture interactions among linguistic units and social actors (e.g., alignment/priming in dialogue), potentially extending to human–AI interactions in language learning.

Conclusion

The study introduces a bigram-based lexical network methodology for assessing L2 Chinese development and demonstrates that L2 lexical networks are complex (small-world and scale-free). Network parameters reliably classify proficiency levels and outperform traditional lexical complexity indices for distinguishing intermediate and advanced learners, with network size, edges, density, and clustering coefficient as especially informative features. Traditional diversity measures (Guiraud, TTR-bigram) remain valuable, but frequency-based sophistication and density measures are less discriminative in mid-to-high proficiency. The work recommends combining lexical frequency and network features within CDST for a fuller understanding of lexical development. Future research should expand multi-layer network approaches to integrate linguistic structures with interactional and social dimensions, offering deeper insight into mechanisms such as alignment and the role of AI-supported learning.

Limitations

The authors note that corpus size can significantly influence observed network parameter patterns. Although each sub-corpus was controlled to ~10,000 Chinese characters for comparability, network expansion with proficiency can make local neighborhoods relatively sparse, reducing local density and clustering—an effect likely driven by the limited corpus size analyzed. Additionally, analyses are confined to written narrative texts and bigram co-occurrences; while segmentation was applied, Chinese words are not lemmatized as in inflecting languages, which may affect comparability with lemmatized examples in other languages.
