CEMFI Summer School
Data Science for Economics: Mastering Unstructured Data
Instructors
Dates
17-21 August 2026
Hours
9:30 to 13:00 CEST
Format
In person
Practical Classes
There will be optional afternoon sessions (from 15:00 to 17:00) led by a teaching assistant. Exact dates will be announced before the beginning of the course.
Intended for
Academic researchers, policy analysts, data analysts, and consultants who are using, or who wish to use, unstructured data sources such as text, detailed surveys, images, or speech data in their work.
Prerequisites
A basic familiarity with probability and statistics at the advanced undergraduate level. The hands-on classes will require students to work through Python notebooks that will be prepared in advance. Extension problems will involve modifying these notebooks, which requires familiarity with the basics of Python. Since an introductory session on Python will be provided by a teaching assistant, previous programming experience in other languages is sufficient.
Overview
Over the past decade, unstructured data such as text, images, and audio have become increasingly important in economics and related fields. At the same time, the rapid development of large language models (LLMs) and generative AI has fundamentally changed how researchers work with these data. Tools such as ChatGPT and related models now allow economists not only to classify text or measure sentiment, but also to generate text, build forecasts, and adapt models to specific research contexts through fine-tuning. Together, these advances have opened up new possibilities for extracting information, automating research tasks, and developing AI-augmented economic models.
This course equips participants with the conceptual understanding and practical skills needed to work effectively with unstructured data, large language models, and generative AI in economic research. The emphasis is on combining intuitive theoretical insights with hands-on implementation. Participants will gain experience fine-tuning LLMs, building AI-based analytical pipelines, and applying generative models to modern empirical questions in economics.
The course is organized around five main components:
1. Analytical techniques. Students are introduced to key statistical and machine-learning methods used to analyze unstructured data. The course will cover predictive models such as neural networks and random forests. Rather than focusing on formal derivations, the course emphasizes intuition and practical relevance, with particular attention to applications in natural-language processing and generative AI.
2. Large language models and generative AI. This component covers the architecture, training, and use of modern LLMs in economic research. Topics include text embeddings and transformer models (such as BERT, GPT, and LLaMA), fine-tuning and prompt design for domain-specific applications, and the use of generative models to create synthetic data for simulation and forecasting. The course also addresses key challenges, including interpretability, bias, and ethical concerns.
3. Economic applications. We will explore how these tools can be applied to concrete research problems, including topic modeling and sentiment analysis of policy debates and financial markets, fine-tuned LLMs for economic forecasting and macroeconomic analysis, and generative AI for survey design, automation, and synthetic data generation.
4. Hands-on implementation. Through guided coding sessions, participants will apply the methods covered in class to real-world datasets. Using Python and tools such as Hugging Face Transformers, they will build custom NLP pipelines, fine-tune language models on domain-specific corpora (for example, financial reports or policy documents), and integrate AI-based methods into their own research workflows.
5. Data collection and preparation. The final component focuses on the practical challenges of working with unstructured data. Topics include web scraping and API-based data collection, data cleaning and preprocessing pipelines, structuring data for machine learning models, and ethical considerations when working with AI-generated content.
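To give a flavor of the kind of cleaning and preprocessing pipeline described in the final component, here is a minimal sketch in plain Python. The function name `clean_text` and the toy document are illustrative choices, not course material; real pipelines built in class with tools such as Hugging Face Transformers will be more elaborate.

```python
import re

def clean_text(raw: str) -> list[str]:
    """Strip HTML remnants, lowercase, and tokenize a raw document."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)      # remove HTML tags left by scraping
    lowered = no_tags.lower()                   # normalize case
    tokens = re.findall(r"[a-z0-9]+", lowered)  # keep alphanumeric tokens
    return tokens

# Example: a scraped snippet with HTML residue
doc = "<p>Inflation expectations ROSE in Q3.</p>"
print(clean_text(doc))  # ['inflation', 'expectations', 'rose', 'in', 'q3']
```

Steps like stop-word removal, stemming, or subword tokenization would typically follow, depending on the downstream model.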
Topics
- Text analysis, including tf-idf and topic modeling (Latent Dirichlet Allocation, LDA)
- Transformer-based language models, pretraining, and fine-tuning for economic tasks
- Evaluation of AI-based predictions, including accuracy, precision-recall, and interpretability
- Image analysis and classification using convolutional neural networks and transfer learning
- Web scraping and automated extraction of online data
- Speech-to-text methods and sentiment analysis of spoken language
- Generative AI for text generation, synthetic data, and AI-assisted research
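As a small illustration of the tf-idf weighting listed above, the sketch below computes term frequencies scaled by inverse document frequency over a toy corpus, using only the Python standard library. The function name `tfidf` and the example documents are our own, assumed for illustration.

```python
import math
from collections import Counter

def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Compute tf-idf weights for a list of tokenized documents.

    tf  = term count / document length
    idf = log(N / number of documents containing the term)
    """
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        length = len(doc)
        weights.append({
            term: (count / length) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [["inflation", "rose"], ["inflation", "fell"], ["rates", "rose"]]
w = tfidf(docs)
# "inflation" appears in 2 of 3 documents, so its idf is log(3/2);
# "fell" appears in only 1, so it gets the larger weight log(3).
```

Library implementations (for example, scikit-learn's vectorizers) add smoothing and normalization options on top of this basic scheme.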
Christopher Rauh is an ATRAE Distinguished Researcher at the IAE (CSIC) and a Professor of Economics and Data Science at the University of Cambridge. He is also an Affiliated Professor at the Barcelona School of Economics, and a Research Affiliate at CEPR, HCEO, and IZA. His research focuses on designing surveys and analyzing unstructured data using machine learning. He is co-founder and co-director of conflictforecast.org and EconAI and has led multiple projects in collaboration with the FCDO, the German Foreign Office, the OECD, and the IMF. Additionally, he is an Associate Editor of the Economic Journal and has published in a wide range of economics and political science journals, as well as in international media outlets.