Thematic analysis with open-source generative AI and machine learning: A new method for inductive qualitative codebook development

Contribution Summary

The paper offers a documented open-source workflow for AI-assisted inductive codebook development and provides initial validation evidence using synthetic qualitative datasets with known underlying themes.


Plain-Language Summary

This paper introduces GATOS, an open-source workflow that combines embeddings, clustering, and generative text models to help researchers build inductive qualitative codebooks for large text datasets. The study validates the workflow on synthetic datasets where the embedded themes are known in advance, allowing the authors to test whether the method recovers the intended patterns.

Research Question

To what extent can open-source generative text models be used in a workflow to approximate steps in thematic analysis in social science research?

Methods

Generated three synthetic qualitative datasets modeled on teammate feedback, organizational cultures of ethical behavior, and return-to-office perspectives.
Used the GATOS workflow to summarize text, embed and cluster summary points, generate candidate codes, and organize codes into themes.
Compared GATOS-generated codes and themes with the known themes and sub-themes used to generate the synthetic datasets.
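
The embed, cluster, and code-generation steps above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the GATOS implementation: the bag-of-words `embed`, the greedy `cluster` routine, the 0.3 similarity threshold, and the `keyword_code` stand-in for the generative model are all hypothetical choices made so the example runs with the standard library alone. The summarization and theme-grouping steps are omitted.

```python
import math
import re
from collections import Counter


def embed(text):
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(points, threshold=0.3):
    # Greedy single-pass clustering (an assumption for this sketch):
    # a point joins the first cluster whose seed vector is similar enough.
    clusters = []  # list of (seed_vector, members)
    for point in points:
        vector = embed(point)
        for seed, members in clusters:
            if cosine(seed, vector) >= threshold:
                members.append(point)
                break
        else:
            clusters.append((vector, [point]))
    return [members for _, members in clusters]


def keyword_code(members, codebook):
    # Hypothetical stand-in for the generative-model call. A real prompt
    # would show the model the cluster and the current codebook; here we
    # simply return the most frequent word longer than four letters.
    words = Counter(w for m in members for w in embed(m).elements() if len(w) > 4)
    return words.most_common(1)[0][0] if words else "uncoded"


def build_codebook(clusters, suggest_code):
    # Walk the clusters in order and add a code only when it is not
    # already in the codebook.
    codebook = []
    for members in clusters:
        code = suggest_code(members, codebook)
        if code not in codebook:
            codebook.append(code)
    return codebook


# Invented summary points, for illustration only.
summary_points = [
    "Teammate communicates clearly in meetings",
    "Teammate communicates deadlines clearly",
    "Office commute takes too long",
    "Long commute to the office is tiring",
]
clusters = cluster(summary_points)
codebook = build_codebook(clusters, keyword_code)
print(len(clusters), codebook)
```

In practice, `embed` would be replaced by a sentence-embedding model and `keyword_code` by a call to an open-source generative model. `build_codebook` is the point at which a researcher could review each suggested code before it is added.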

Key Findings

Across the three synthetic datasets, the workflow generated themes that closely matched most of the original sub-themes.
The workflow produced fewer new codes as it processed more clusters, suggesting that it reuses existing codes rather than creating a new code for every cluster.
The paper identifies remaining limitations around synthetic validation, redundant codes, abstraction level, scalability, and the need for human-in-the-loop refinement.
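
One way to check the declining rate of new codes is to record, for each cluster in processing order, whether it produced a new code, and then compare the share of new codes across consecutive windows. The sketch below assumes that bookkeeping. The function name `new_code_rate`, the window size, and the flags are invented for illustration and are not results from the paper.

```python
def new_code_rate(is_new, window):
    # Share of clusters in each consecutive window that produced a new code.
    rates = []
    for start in range(0, len(is_new), window):
        chunk = is_new[start:start + window]
        rates.append(sum(chunk) / len(chunk))
    return rates


# Invented flags for 12 clusters, in processing order
# (True means the cluster led to a new code being added).
flags = [True, True, True, True, True, False, True, False, False, True, False, False]
rates = new_code_rate(flags, window=4)
print(rates)  # [1.0, 0.5, 0.25]
```

A falling sequence of rates would be consistent with the codebook approaching saturation, although the window size affects how smooth the trend looks.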

Implications

Researchers working with large qualitative corpora can use workflows like GATOS to make codebook development more scalable while retaining explicit points for human review.

The validation strategy creates a bridge between qualitative method development and controlled synthetic-data testing.

The workflow should be treated as decision support for qualitative researchers, not as a replacement for interpretive judgment.

Research Artifacts

Protocol: GATOS workflow. A multi-step process for moving from raw qualitative text to summaries, clusters, candidate codes, and themes.

Abstract

This paper introduces a novel method for developing qualitative codebooks using open-source generative AI and machine learning techniques, enabling more systematic and reproducible inductive thematic analysis.

Related Projects

Using Large Language Models and Generative AI to Scale Qualitative Data Analysis

How can researchers combine qualitative judgment with open-source generative AI to scale thematic analysis without hiding methodological choices?
