Contribution Summary
The paper offers a documented open-source workflow for AI-assisted inductive codebook development and provides initial validation evidence using synthetic qualitative datasets with known underlying themes.
Plain-Language Summary
This paper introduces GATOS, an open-source workflow that combines embeddings, clustering, and generative text models to help researchers build inductive qualitative codebooks for large text datasets. The study validates the workflow on synthetic datasets where the embedded themes are known in advance, allowing the authors to test whether the method recovers the intended patterns.
Research Question
To what extent can open-source generative text models be used in a workflow to approximate steps in thematic analysis in social science research?
Methods
- Generated three synthetic qualitative datasets modeled on teammate feedback, organizational cultures of ethical behavior, and return-to-office perspectives.
- Used the GATOS workflow to summarize text, embed and cluster summary points, generate candidate codes, and organize codes into themes.
- Compared GATOS-generated codes and themes with the known themes and sub-themes used to generate the synthetic datasets.
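The steps above can be illustrated with a toy sketch. It is not the GATOS implementation, and the text is assumed to be already summarized into short points. Bag-of-words counts stand in for an embedding model, a greedy threshold rule stands in for the clustering algorithm, and a most-frequent-terms label stands in for the generative model. All function names, the stopword list, the 0.5 threshold, and the example sentences are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, for this toy example only.
STOPWORDS = {"a", "the", "in", "on", "over", "and", "of", "to", "teammate"}

def embed(text):
    """Toy stand-in for an embedding model: bag-of-words counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(points, threshold=0.5):
    """Greedy clustering: a point joins the first cluster whose first member is similar enough."""
    clusters = []
    for point in points:
        for members in clusters:
            if cosine(embed(point), embed(members[0])) >= threshold:
                members.append(point)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([point])
    return clusters

def propose_code(members, n_terms=2):
    """Stand-in for the generative step: label a cluster by its most frequent terms."""
    counts = Counter()
    for point in members:
        counts.update(embed(point))
    return " ".join(term for term, _ in counts.most_common(n_terms))

def recovered(known_themes, codes, threshold=0.5):
    """Known themes that have at least one sufficiently similar generated code."""
    return [
        theme for theme in known_themes
        if any(cosine(embed(theme), embed(code)) >= threshold for code in codes)
    ]

# Hypothetical summary points, as if written from known sub-themes.
summary_points = [
    "teammate communicates clearly in meetings",
    "teammate communicates clearly over email",
    "teammate misses deadlines often",
    "teammate misses deadlines on shared tasks",
]
# The third theme is deliberately absent from the data to show a miss.
known_themes = ["communicates clearly", "misses deadlines", "shares credit"]

clusters = cluster(summary_points)
codes = [propose_code(members) for members in clusters]
hits = recovered(known_themes, codes)
recovery_rate = len(hits) / len(known_themes)
print(codes, hits, recovery_rate)
```

Because the known themes are fixed in advance, the share of themes with a matching code gives a simple recovery measure. A real run would replace each stand-in with the corresponding model and would likely need a more forgiving matching rule than word overlap.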
Key Findings
- Across the three synthetic datasets, the workflow generated themes that closely matched most of the original sub-themes.
- The workflow produced fewer new codes as it processed more clusters, suggesting that it can reuse existing codes rather than create a new code for every cluster.
- The paper identifies remaining limitations around synthetic validation, redundant codes, abstraction level, scalability, and the need for human-in-the-loop refinement.
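The pattern of fewer new codes over time can be pictured with a minimal sketch of an incremental codebook. This is an illustration under assumptions, not the paper's procedure: this summary does not say how the workflow decides whether a cluster needs a new code, so word-overlap (Jaccard) similarity is used here as a hypothetical stand-in for that decision. The cluster labels, the 0.5 threshold, and the function names are likewise invented for the example.

```python
from itertools import accumulate

def jaccard(a, b):
    """Word-overlap similarity between two short labels."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def build_codebook(cluster_labels, threshold=0.5):
    """Process clusters in order; add a code only when no existing code is similar enough."""
    codebook = []
    new_code_flags = []  # 1 if the cluster produced a new code, else 0
    for label in cluster_labels:
        match = next((code for code in codebook if jaccard(code, label) >= threshold), None)
        if match is None:
            codebook.append(label)
            new_code_flags.append(1)
        else:
            # An existing code already covers this cluster, so reuse it.
            new_code_flags.append(0)
    return codebook, new_code_flags

# Hypothetical cluster labels, in processing order.
cluster_labels = [
    "unclear meeting communication",
    "missed task deadlines",
    "unclear email communication",
    "missed report deadlines",
    "unclear client communication",
    "late task deadlines",
]

codebook, new_code_flags = build_codebook(cluster_labels)
cumulative_codes = list(accumulate(new_code_flags))
print(new_code_flags, cumulative_codes)
```

A cumulative count that flattens as more clusters are processed is the kind of signal the finding describes. A count that keeps rising by one per cluster would instead indicate a new code for every cluster.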
Implications
Researchers working with large qualitative corpora can use workflows like GATOS to make codebook development more scalable while retaining explicit points for human review.
The validation strategy links qualitative method development to controlled synthetic-data testing, where recovered themes can be checked against the themes used to generate the data.
The workflow should be treated as decision support for qualitative researchers, not as a replacement for interpretive judgment.
Abstract
This paper introduces a novel method for developing qualitative codebooks using open-source generative AI and machine learning techniques, enabling more systematic and reproducible inductive thematic analysis.