Speakers
Description
Sustainable Development of Urban Regions (SURE), a funding priority of the German Federal Ministry of Education and Research (BMBF), is an application-oriented research initiative that supports ten collaborative research projects focused on urban sustainability. These projects aim to develop practical solutions for more sustainable and resilient cities and regions in Southeast Asia and China. One of the main aims of the SURE Facilitation and Synthesis Project is to research and employ state-of-the-art technologies and to develop digital tools that help uncover dormant knowledge, identify open questions and problems, and outline solutions and trends for future research on sustainable urban development. This research is part of a series of experiments aimed at creating a benchmark methodology for identifying topic clusters and monitoring research trends in the rapidly growing corpus of texts produced under the SURE funding initiative.
Topic modeling and extraction is not a new field in synthesis research, but recent advances in LLMs (Large Language Models) can make such processes more accurate and efficient. Machine-based topic and trend extraction has traditionally focused on variations of LSI (Latent Semantic Indexing) and LDA (Latent Dirichlet Allocation) (Churchill and Singh 2022), or on a simpler keyword evaluation of research papers (Nguyen et al. 2022). During the past decade, however, with the emergence of word-embedding-based methods, LLMs have taken center stage, and continuous improvements in these models make them increasingly accurate for text analysis. In this paper, we explore the application of an LLM-based methodology, first to validate the manual identification of topics by domain experts and second to create a benchmark method for machine-based extraction of topics related to sustainable urban development, based on our case studies. The research uses a corpus of documents from the ten SURE projects, including field reports and grant application documents, and hence provides a window into often-ignored outputs of applied research, such as activity reports and other forms of knowledge transfer. The process builds upon the methodology established by Thompson and Mimno (2020), whereby the tokenized text is embedded with BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al. 2018) to create a vector space. The embeddings are then clustered according to their semantic (cosine) similarity and labeled with domain expert input, which allows a comprehensive view of themes, topics, and sub-topics.
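The clustering step can be illustrated with a minimal sketch. It is not the pipeline used in this research: the three-dimensional vectors below are hypothetical stand-ins for BERT embeddings, the term names are invented for illustration, and spherical k-means with a farthest-point initialization is an assumed choice of clustering algorithm. The sketch only shows how cosine similarity can group embeddings into candidate topic clusters that a domain expert would then label.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def spherical_kmeans(vectors, k, iterations=20):
    """Cluster vectors by cosine similarity (assumed algorithm, for illustration)."""
    unit = [normalize(v) for v in vectors]
    # Deterministic initialization: start from the first vector, then repeatedly
    # add the vector that is least similar to the centroids chosen so far.
    centroids = [unit[0]]
    while len(centroids) < k:
        farthest = min(
            range(len(unit)),
            key=lambda i: max(cosine(unit[i], c) for c in centroids),
        )
        centroids.append(unit[farthest])
    labels = [0] * len(unit)
    for _ in range(iterations):
        # Assign each vector to the centroid with the highest cosine similarity.
        labels = [
            max(range(k), key=lambda j: cosine(u, centroids[j])) for u in unit
        ]
        # Recompute each centroid as the normalized mean of its members.
        updated = []
        for j in range(k):
            members = [u for u, label in zip(unit, labels) if label == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                updated.append(normalize(mean))
            else:
                updated.append(centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Hypothetical toy "embeddings"; real BERT vectors have hundreds of dimensions.
toy_embeddings = {
    "flood": [0.9, 0.1, 0.0],
    "drainage": [0.8, 0.2, 0.1],
    "transit": [0.1, 0.9, 0.0],
    "mobility": [0.0, 0.8, 0.2],
}

terms = list(toy_embeddings)
labels, centroids = spherical_kmeans([toy_embeddings[t] for t in terms], k=2)

# Group the terms by cluster so that an expert could assign a topic label.
clusters = {}
for term, label in zip(terms, labels):
    clusters.setdefault(label, []).append(term)
print(clusters)
```

In this toy setting the water-related and transport-related terms end up in separate clusters. In practice, the number of clusters and the labels themselves would depend on the corpus and on expert judgment.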
The approach is intended as a step towards building a domain-specific language model that synthesizes sustainable urban development practices and approaches, which can further support decision making and knowledge transfer in applied research projects.
References
Churchill, R. and Singh, L. (2022) ‘The Evolution of Topic Modeling’, ACM Computing Surveys, 54(10s), pp. 1–35. doi: 10.1145/3507900.
Devlin, J. et al. (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Available at: https://arxiv.org/pdf/1810.04805.
Nguyen, N.K. et al. (2022) Utilizing Keywords Evolution in Context for Emerging Trend Detection in Scientific Publications. The 11th International Symposium on Information and Communication Technology (SoICT 2022). Hanoi, Vietnam, December 1-3.
Thompson, L. and Mimno, D. (2020) Topic Modeling with Contextualized Word Representation Clusters. Available at: http://arxiv.org/pdf/2010.12626v1.
Keywords | Sustainable Urban Development; Topic Modeling; Large Language Models; Semantic Similarity |
---|---|
Best Congress Paper Award | Yes |