Interdisciplinary Studies
Place identity: a generative AI's perspective
K. M. Jang, J. Chen, et al.
The paper examines whether generative AI (GenAI) can capture and represent the identity of places—defined as the distinctive properties that differentiate one place from another—across global cities. Drawing from environmental psychology and human geography, place identity is distinguished from abstract space by its meanings, attachments, and unique visual and cultural features. Traditional approaches to measuring place identity rely on qualitative methods (interviews, photo-elicitation, identity markers), which are insightful but time- and resource-intensive and may suffer from small-sample biases. With the rise of user-generated text and images, NLP and computer vision have advanced quantitative assessments of place perceptions. Recent progress in GenAI (LLMs like ChatGPT and text-to-image models like DALL-E2) enables scalable generation of coherent text and realistic images, offering potential for urban applications but raising concerns about reliability, bias, and transparency. This study asks two research questions: (1) How does generative AI illustrate place identity? (2) To what extent can we trust generative models’ place identity results when compared with fact-based descriptions? To address these, the authors propose a framework that uses GenAI to generate city-specific textual and visual outputs for 64 cities worldwide and validates them against Wikipedia texts and Google image search, assessing similarities and distinctiveness across cities.
Foundational theories define place identity as meanings and attachments linked to physical settings, distinguishing it from abstract space. Prior work explores identity of place (distinctive properties) versus identity with place (people’s attachment), emphasizing physical settings, events, and shared meanings. Traditional qualitative studies (interviews, photo-elicitation) and studies of identity markers (towers, signs, establishments) have illuminated place identity but face scalability limits. The advent of user-generated content enabled NLP (sentiment, topics) and CV (street-level imagery analysis) to infer urban perceptions and visual features at scale. GenAI has shown promise across domains (transportation, education, climate literacy, geospatial AI) but raises reliability, bias, and transparency issues. In design and planning, text-to-image tools have been used for rapid ideation; however, many outputs emphasize geometry over contextual, place-specific meanings and may fail to depict composition and locale accurately. Concerns about black-box models, training data coverage, and potential misinformation underscore the need to evaluate GenAI’s ability to represent place identity credibly and contextually.
The study implements a two-step computational framework: (1) generate place identity with GenAI, and (2) validate the outputs against real-world data.
Data generation with GenAI:
- Text (ChatGPT): Prompts used for 64 global cities were (a) “What is the place identity of {city}? Give me in ten bullet points.” (b) “What is the urban identity of {city}? Give me in ten bullet points.” and (c) “What is the place identity of streetscapes in {city}? Give me in ten bullet points.” Responses were constrained to 10 bullet points for comparability.
- Images (DALL-E2): For each city, 20 images (256×256 px) were generated with the prompt “What is the place identity of streetscapes of {city}?”
Real-world reference datasets:
- Wikipedia text: City introductions were collected as a collaborative, fact-based textual reference for each city.
- Google Images: For each city, the top 30 images returned for the query “{city}” were scraped to represent widely recognized visual identity cues.
Text similarity validation:
- Wikipedia introductions were cleaned and tokenized into sentences (NLTK). A sentence-transformer BERT model (MiniLM-based) converted each ChatGPT bullet and each Wikipedia sentence into an embedding. Cosine similarity was computed between each ChatGPT bullet and every Wikipedia sentence, and the maximum similarity per bullet was taken. Word clouds were also generated for a visual topical comparison between ChatGPT and Wikipedia for each city.
Image similarity validation:
- Perceptual similarity between DALL-E2 outputs and Google images was measured using LPIPS (AlexNet features), with similarity S = 1 – LPIPS. For each DALL-E2 image, similarity to all Google images of the same city was computed and the top three matches were examined.
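A minimal sketch of this matching step, assuming the LPIPS distances have already been computed (in the paper, via LPIPS with AlexNet features). The helper name `top_matches` and the distance values below are illustrative, not taken from the paper's code:

```python
def top_matches(lpips_row, k=3):
    """Given LPIPS distances from one DALL-E2 image to all Google
    images of the same city, convert each to similarity S = 1 - LPIPS
    and return the k highest-similarity (index, S) pairs."""
    sims = [(i, 1.0 - d) for i, d in enumerate(lpips_row)]
    sims.sort(key=lambda t: t[1], reverse=True)
    return sims[:k]

# In practice each distance would come from an LPIPS (AlexNet) model
# applied to a DALL-E2/Google image pair; these values are illustrative.
distances = [0.236, 0.625, 0.276, 0.289, 0.584]
best = top_matches(distances)
```

In the study this selection is repeated for every generated image of a city, and the top three matches are inspected qualitatively.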
- Human evaluation: A survey (n=30 respondents) rated the similarity of 64 representative DALL-E2–Google pairs (one per city, each selected as the pair closest to the city’s mean S) on a 7-point Likert scale. Mean ratings were compared to LPIPS-based scores, and the Pearson correlation was assessed.
City-by-city distinctiveness:
- Normalized Chamfer distance (CD) between latent representations of DALL-E2 image sets was computed for every city pair; 1–CD represents similarity. Two hypotheses were tested: (1) same-city images are more similar than different-city images; (2) cities geographically/culturally closer are more similar than distant ones. Cities were sorted by longitude to inspect geographic patterning. Pairwise similarities were also compared with Google image-based similarities.
Additional analyses and prompt engineering:
- Word cloud comparisons explored thematic alignment. Qualitative inspection of exemplars highlighted place-specific versus generic scenes and effects of lighting (day/night) mismatches. Prompt engineering suggestions (POV, pitch, perspective, objects, time-of-day) were proposed and demonstrated to standardize and focus DALL-E2 outputs. A bias probe varied a single keyword (e.g., “white community” vs “black community” in Boston streetscapes) to illustrate potential stereotyping in generative outputs.
- ChatGPT understanding and outputs: ChatGPT’s definition of place identity aligned with identity of place. Across 64 cities, average cosine similarities between ChatGPT and Wikipedia were approximately: place identity 0.59, urban identity 0.58, and street identity 0.56, indicating broadly consistent alignment regardless of prompt type. High-similarity examples included Munich and Busan; low-similarity examples included Rome and Prague, partly attributable to mismatched sentence lengths. Word cloud comparisons showed substantive thematic overlaps (e.g., Seoul’s cultural/meaning aspects; Singapore’s diversity; Barcelona’s ‘Catalan’; Almaty’s ‘Kazakhstan’), indicating ChatGPT captured salient ethnographic and cultural cues.
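The bullet-to-sentence matching behind these scores can be sketched as follows. This is a minimal, self-contained illustration: the toy 3-dimensional vectors stand in for MiniLM sentence-transformer embeddings, and the helper names (`cosine`, `best_match_scores`) are ours, not the paper's:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def best_match_scores(bullet_vecs, sentence_vecs):
    """For each ChatGPT bullet embedding, keep the maximum cosine
    similarity over all Wikipedia sentence embeddings; also return
    the per-city mean of those maxima."""
    per_bullet = [max(cosine(b, s) for s in sentence_vecs) for b in bullet_vecs]
    return per_bullet, sum(per_bullet) / len(per_bullet)

# Toy embeddings; real ones would come from a MiniLM sentence transformer.
bullets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
wiki = [[0.9, 0.1, 0.0], [0.0, 0.8, 0.6], [0.5, 0.5, 0.0]]
scores, city_mean = best_match_scores(bullets, wiki)
```

Taking the maximum per bullet rewards a bullet that matches any one Wikipedia sentence well, rather than penalizing it for the many unrelated sentences in a long introduction.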
- DALL-E2 vs Google image similarity: Mean perceptual similarity S (1–LPIPS) across cities was 0.575 (SD 0.066). High similarity example: Brussels, where DALL-E2 depicted Baroque guildhalls on the Grand-Place (top matches S≈0.764, 0.724, 0.711). Low similarity example: Tokyo, where repetitive, generic streets lacked strong visual cues (matches S≈0.375, 0.386, 0.416). Daytime bias in DALL-E2 versus nighttime prevalence in Google images for some cities likely reduced similarity scores.
- Human survey vs LPIPS: The mean human-rated similarity was 3.406 (SD 0.606) on a 7-point scale. Top human-rated cities: Chicago (4.967), Madrid (4.867), Montreal (4.267). Lowest: Seoul (2.367), Auckland (2.467), Kobe (2.467). The Pearson correlation between LPIPS-based similarity and human ratings was r = 0.229 (p = 0.071): a weak positive relationship, significant only at the 0.1 level, suggesting that LPIPS does not fully capture human perception.
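For reference, the correlation statistic used in this comparison can be computed from scratch as below; the paired values are made-up illustrations, not the study's data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative pairs: (LPIPS-based similarity, mean 7-point human rating)
lpips_sim = [0.58, 0.76, 0.42, 0.61, 0.55]
human = [3.4, 4.9, 2.4, 4.2, 3.1]
r = pearson_r(lpips_sim, human)
```

In practice one would also report the p-value (e.g., via `scipy.stats.pearsonr`), as the paper does for its r = 0.229, p = 0.071 result.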
- City distinctiveness and regional patterns: Pairwise 1–CD showed higher similarity along the diagonal (same-city), supporting hypothesis (1). Cities with the strongest within-city distinctiveness included Abu Dhabi, Amsterdam, Dublin, Cairo, Johannesburg, Brussels, Kyoto, Caracas, Paris, and Dubai. Regionally, Western cities (Americas and Europe) showed clear intraregional visual similarities; the Amsterdam–Brussels pair had the highest cross-city similarity (≈0.70). Non-Western regions showed less consistent intraregional similarity, with the exception of the Abu Dhabi–Dubai pair (≈0.66). Pairwise similarities from DALL-E2 correlated positively with those from Google images (r≈0.21, p<0.001), partially supporting hypothesis (2) and indicating consistency with real-world imagery.
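The Chamfer-distance comparison behind this matrix can be sketched as follows. The 2-d toy "latents" and the omission of the paper's normalization step (which maps CD into [0, 1] across all city pairs before computing 1–CD) are simplifying assumptions:

```python
import math

def chamfer_distance(A, B):
    """Symmetric Chamfer distance between two sets of latent vectors:
    mean nearest-neighbour distance from A to B plus from B to A."""
    def nn(a, S):
        return min(math.dist(a, s) for s in S)
    d_ab = sum(nn(a, B) for a in A) / len(A)
    d_ba = sum(nn(b, A) for b in B) / len(B)
    return d_ab + d_ba

# Toy 2-d "latents" for two cities; identical sets give CD = 0,
# so similarity peaks on the diagonal, matching the same-city hypothesis.
city_x = [(0.0, 0.0), (1.0, 0.0)]
city_y = [(0.1, 0.0), (1.0, 0.2)]
same = chamfer_distance(city_x, city_x)
cross = chamfer_distance(city_x, city_y)
```

Because Chamfer distance compares whole sets rather than individual pairs, it tolerates the fact that the 20 generated images per city depict different scenes.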
- Place-specific vs generic renderings: DALL-E2 captured distinctive architectural and streetscape elements for some cities (e.g., New York fire escapes, Paris Haussmann facades, Singapore shophouses and rain trees) but also produced generic urban scenes lacking landmarks or clear identity (e.g., Sydney), raising concerns about placeless outputs.
- Bias exploration: Changing a single demographic keyword in prompts for Boston streetscapes produced stereotyped differences (well-maintained vs degraded environments), highlighting risks of social bias and stigmatization in GenAI outputs.
Findings indicate that GenAI can illustrate aspects of place identity in both text and images, aligning moderately with fact-based descriptions and widely recognized visuals. ChatGPT captured salient cultural, historical, and ethnographic markers, while DALL-E2 often rendered recognizable architectural and street elements for some cities. However, limitations emerged: image similarities varied widely across cities, LPIPS correlated only weakly with human perception, and DALL-E2 sometimes produced generic, placeless scenes. Pairwise analyses showed GenAI could differentiate cities, especially in Western regions, aligning with Google image-based similarities, but revealed potential placelessness and uneven distinctiveness in many non-Western cities. The results address the research questions by showing how GenAI represents place identity and by quantifying its trustworthiness through text and image similarity measures, human ratings, and city-by-city comparisons.
Practical implications include the use of GenAI for scalable, multimodal place identity exploration, urban design ideation, and comparative analyses. Ethical and methodological considerations are crucial: prompt design strongly affects outcomes; black-box training data and potential circularity with Wikipedia/Google necessitate caution; multilingual and cultural generalizability must be critically examined; and human-in-the-loop evaluation is needed to capture nuanced perception. Opportunities lie in better prompt engineering (POV, perspective, objects, time-of-day), multi-source validation (including social media), more explainable and domain-specialized models, and inclusive evaluations that consider LMICs and marginalized communities.
GenAI models show promise in capturing salient aspects of cities’ place identity through scalable text and image generation. Validations against Wikipedia and Google imagery, plus human assessments and pairwise comparisons, suggest that GenAI can reflect distinctive characteristics and inter-city similarities to a meaningful extent, though not uniformly across contexts. The study pioneers a multimodal framework for evaluating GenAI’s reliability in urban studies, highlighting both potential and pitfalls. Future work should expand datasets and languages, refine prompts and evaluation metrics, integrate additional human and social data sources, develop explainable and domain-tuned models, and systematically address biases and representational fairness. Such advances can support planners and designers in placemaking, city branding, and community engagement, helping shape more imageable cities.
Limitations
- Absence of a definitive ground-truth place identity dataset; reliance on Wikipedia and Google images, which may overlap with GenAI training data (circularity risk).
- English-only prompts and outputs; potential language and cultural biases favoring Western contexts due to data imbalance (e.g., dominance of English Wikipedia).
- DALL-E2 sometimes generates generic, placeless scenes and shows a tendency toward daytime imagery, causing mismatches with real-world nighttime scenes.
- Human survey used a small, convenience sample (n=30) and one representative pair per city; LPIPS–human correlation was weak, indicating perceptual nuances not captured by the metric.
- Image resolution limited to 256×256 px; prompt brevity may limit specificity; landmark omission in some cities reduces identity cues.
- City selection, while global, may not capture within-country diversity; temporal cutoff (pre-September 2021 training) means recent changes are not reflected.
- Demonstrated susceptibility to social stereotyping based on prompt wording (e.g., demographic labels), raising fairness and ethical concerns.