Reproducible Research Workflow Guide

This guide outlines best practices for creating reproducible research workflows in the IDEEAS Lab. Following these practices helps ensure that others can understand, verify, and build upon your research.

Core Principles

1. Everything is Documented

- Document your thinking, decisions, and processes
- Write code comments that explain the "why," not just the "what"
- Keep a research log of key decisions and discoveries
- Document your computational environment
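
One way to document the computational environment is to record it from inside Python. The following is a minimal sketch using only the standard library; the function name `describe_environment` and the package list are assumptions made for illustration, not part of an existing lab tool.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment(package_names=()):
    """Return a dict describing the interpreter, OS, and selected packages."""
    packages = {}
    for name in package_names:
        try:
            packages[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            packages[name] = None  # not installed in this environment
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": packages,
    }


# Hypothetical package list; replace with the packages your analysis imports.
snapshot = describe_environment(["pandas", "numpy"])
print(json.dumps(snapshot, indent=2))
```

Saving this output next to your results gives readers a record of the environment that produced them.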

2. Everything is Version Controlled

- Use Git for all code, documentation, and text files
- Make frequent, meaningful commits with clear messages
- Use branches for experimental work
- Tag important milestones and releases

3. Everything is Automated

- Use scripts instead of manual processes
- Create reproducible computational environments
- Automate data processing and analysis pipelines
- Use continuous integration when appropriate

4. Everything is Accessible

- Organize files logically and consistently
- Use clear, descriptive file and variable names
- Provide clear instructions for running your code
- Share code and data when possible and appropriate

Project Structure Template

project-name/
├── README.md                  # Project overview and quick start
├── LICENSE                    # License for code and data
├── .gitignore                 # Files to ignore in version control
├── environment.yml            # Conda environment specification
├── requirements.txt           # Python package requirements
├── pyproject.toml             # Package metadata (needed by `pip install -e .`)
├── Makefile                   # Automation scripts
│
├── data/
│   ├── raw/                   # Original, immutable data
│   ├── interim/               # Intermediate processed data
│   ├── processed/             # Final, analysis-ready data
│   └── external/              # External datasets
│
├── notebooks/
│   ├── exploratory/           # Jupyter notebooks for exploration
│   ├── reports/               # Notebooks that generate reports
│   └── archive/               # Old notebooks for reference
│
├── src/                       # Source code for the project
│   ├── __init__.py
│   ├── data/                  # Scripts to download or generate data
│   ├── features/              # Scripts to turn raw data into features
│   ├── models/                # Scripts to train models and make predictions
│   ├── visualization/         # Scripts to create visualizations
│   └── utils/                 # Utility functions and helpers
│
├── models/                    # Trained and serialized models
├── reports/                   # Generated analysis reports
│   ├── figures/               # Generated graphics and figures
│   └── tables/                # Generated tables
│
├── docs/                      # Documentation
│   ├── data-dictionary.md     # Description of data variables
│   ├── methodology.md         # Detailed methodology
│   └── analysis-plan.md       # Pre-registered analysis plan
│
└── tests/                     # Unit tests for your code
    ├── __init__.py
    ├── test_data.py
    ├── test_features.py
    └── test_models.py
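
The directory layout above can be created by a short script rather than by hand. This is a sketch under stated assumptions: the function name `create_project_skeleton` is hypothetical, and the `.gitkeep` placeholders are added because Git does not track empty directories.

```python
from pathlib import Path

# Directories taken from the template above; adjust to your project.
PROJECT_DIRS = [
    "data/raw", "data/interim", "data/processed", "data/external",
    "notebooks/exploratory", "notebooks/reports", "notebooks/archive",
    "src/data", "src/features", "src/models", "src/visualization", "src/utils",
    "models", "reports/figures", "reports/tables", "docs", "tests",
]


def create_project_skeleton(root):
    """Create the template directories under ``root`` and return the root path."""
    root = Path(root)
    for relative in PROJECT_DIRS:
        directory = root / relative
        directory.mkdir(parents=True, exist_ok=True)
        # Placeholder so the empty directory can be committed.
        (directory / ".gitkeep").touch()
    # Package markers shown in the template.
    (root / "src" / "__init__.py").touch()
    (root / "tests" / "__init__.py").touch()
    return root
```

Calling `create_project_skeleton("project-name")` is safe to repeat, since existing directories and files are left in place.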

Version Control Best Practices

Repository Setup

# Initialize repository
git init
git add README.md
git commit -m "Initial commit: Add README"

# Set up remote repository
# (rename the local branch to main first; older Git versions create "master")
git branch -M main
git remote add origin https://github.com/ideeas-lab/project-name.git
git push -u origin main

Commit Message Guidelines

Use clear, descriptive commit messages:

# Good examples
Add data cleaning script for survey responses
Fix bug in statistical analysis function
Update README with installation instructions

# Poor examples
Update
Fix stuff
Changes
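
These guidelines can be checked automatically, for example from a Git `commit-msg` hook. The sketch below is only a heuristic: the function name `is_descriptive`, the list of vague messages, and the thresholds (at least 3 words, at most 72 characters in the first line) are assumptions chosen for illustration, not a lab rule.

```python
# Hypothetical list of messages considered too vague on their own.
VAGUE_MESSAGES = {"update", "fix stuff", "changes", "wip", "misc"}


def is_descriptive(message, min_words=3, max_subject_length=72):
    """Return True if the first line of ``message`` looks descriptive enough."""
    lines = message.strip().splitlines()
    if not lines:
        return False  # empty message
    subject = lines[0].strip()
    if subject.lower() in VAGUE_MESSAGES:
        return False
    if len(subject) > max_subject_length:
        return False
    return len(subject.split()) >= min_words
```

With these thresholds, the good examples above pass and the poor examples are rejected.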

Branching Strategy

# Create feature branch
git checkout -b feature/data-analysis
# Work on feature
git add .
git commit -m "Add initial data analysis script"
# Push and create pull request
git push origin feature/data-analysis

.gitignore Template

# Data files (add specific exceptions as needed)
data/raw/*
data/interim/*
data/processed/*
!data/raw/.gitkeep
!data/interim/.gitkeep
!data/processed/.gitkeep

# Jupyter Notebook checkpoints
.ipynb_checkpoints/

# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
env/
venv/
.venv/

# IDE
.vscode/
.idea/
*.swp
*.swo

# OS
.DS_Store
Thumbs.db

# Temporary files
*.tmp
*.temp
*~

# Sensitive information
.env
config/secrets.yml

Environment Management

Conda Environment

Create environment.yml:

name: project-name
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.9
  - pandas
  - numpy
  - matplotlib
  - seaborn
  - scikit-learn
  - jupyter
  - pip
  - pip:
    - specific-pip-package==1.0.0

Setup and activation:

# Create environment
conda env create -f environment.yml

# Activate environment
conda activate project-name

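# Update environment file
# (this overwrites it with every installed package at its exact version;
# add --from-history to record only the packages you explicitly requested)
conda env export > environment.yml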

Python Requirements

Create requirements.txt:

pandas==1.3.3
numpy==1.21.2
matplotlib==3.4.3
seaborn==0.11.2
scikit-learn==1.0.1
jupyter==1.0.0
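
Pinned versions only help if the active environment actually matches them. The following sketch compares `name==version` pins against what is installed, using the standard-library `importlib.metadata`. The function names `parse_pins` and `find_mismatches` are hypothetical, and the parser deliberately handles only simple `==` pins, not the full requirements syntax.

```python
from importlib import metadata


def parse_pins(lines):
    """Parse ``name==version`` lines, skipping blanks and comments."""
    pins = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {name: (pinned, installed)} for packages that differ.

    ``installed`` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

To check a project, pass the lines of `requirements.txt` to `parse_pins` and the result to `find_mismatches`; an empty dictionary means every pin is satisfied.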

Code Organization

Function Documentation

def clean_survey_data(df, remove_incomplete=True):
    """
    Clean survey response data by handling missing values and outliers.
    
    Parameters
    ----------
    df : pandas.DataFrame
        Raw survey data with responses
    remove_incomplete : bool, default True
        Whether to remove rows with incomplete responses
        
    Returns
    -------
    pandas.DataFrame
        Cleaned survey data
        
    Examples
    --------
    >>> cleaned_data = clean_survey_data(raw_data, remove_incomplete=False)
    """
    # Implementation here
    pass

Configuration Management

Create config.py:

"""Configuration settings for the project."""

# Data paths
RAW_DATA_PATH = "data/raw/"
PROCESSED_DATA_PATH = "data/processed/"
FIGURES_PATH = "reports/figures/"

# Analysis parameters
RANDOM_SEED = 42
TEST_SIZE = 0.2
N_FOLDS = 5

# Model parameters
MODEL_PARAMS = {
    'random_forest': {
        'n_estimators': 100,
        'random_state': RANDOM_SEED
    }
}
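
A seed defined in the configuration only makes results repeatable if it is applied before any random numbers are drawn. Here is a minimal sketch, assuming a helper named `set_global_seed` (hypothetical); it seeds Python's built-in generator and, when NumPy is installed, NumPy's legacy global generator as well.

```python
import random

RANDOM_SEED = 42  # mirrors the value in config.py above


def set_global_seed(seed=RANDOM_SEED):
    """Seed Python's RNG and, if it is installed, NumPy's legacy global RNG."""
    random.seed(seed)
    try:
        import numpy as np
    except ImportError:
        return  # NumPy is optional for this sketch
    np.random.seed(seed)


# Reseeding reproduces the same sequence of draws.
set_global_seed()
first = [random.random() for _ in range(3)]
set_global_seed()
second = [random.random() for _ in range(3)]
print(first == second)  # prints True
```

Libraries that keep their own random state, such as scikit-learn estimators, still need the seed passed explicitly, as `MODEL_PARAMS` does with `random_state`.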

Logging Setup

import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('analysis.log'),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger(__name__)

Data Management

Data Pipeline Structure

def main():
    """Main data processing pipeline."""
    logger.info("Starting data processing pipeline")
    
    # Load raw data
    raw_data = load_raw_data()
    logger.info(f"Loaded {len(raw_data)} raw records")
    
    # Clean data
    clean_data = clean_survey_data(raw_data)
    logger.info(f"Cleaned data: {len(clean_data)} records remaining")
    
    # Feature engineering
    features = create_features(clean_data)
    logger.info(f"Created {features.shape[1]} features")
    
    # Save processed data
    save_processed_data(features)
    logger.info("Data processing complete")

if __name__ == "__main__":
    main()

Data Validation

def validate_data(df):
    """Validate data quality and structure."""
    assert not df.empty, "DataFrame is empty"
    assert 'participant_id' in df.columns, "Missing participant_id column"
    assert df['participant_id'].nunique() == len(df), "Duplicate participant IDs"
    
    # Check for expected value ranges
    assert df['age'].between(18, 100).all(), "Age values out of expected range"
    
    logger.info("Data validation passed")
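
Content checks like the ones above can be complemented by an integrity check on the raw files themselves, since raw data is meant to stay immutable. The sketch below records SHA-256 checksums and reports files that changed between two snapshots. The function names and the manifest format are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file under ``directory`` (relative POSIX path) to its digest."""
    directory = Path(directory)
    return {
        path.relative_to(directory).as_posix(): sha256_of_file(path)
        for path in sorted(directory.rglob("*"))
        if path.is_file()
    }


def changed_files(old_manifest, new_manifest):
    """Return sorted names that were added, removed, or modified."""
    names = set(old_manifest) | set(new_manifest)
    return sorted(n for n in names if old_manifest.get(n) != new_manifest.get(n))
```

A manifest built once for `data/raw/` and stored in version control can be compared against a fresh one at the start of each pipeline run.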

Analysis Documentation

Analysis Plan Template

Create docs/analysis-plan.md:

# Analysis Plan

## Research Questions
1. Primary research question
2. Secondary research questions

## Hypotheses
- H1: [Specific hypothesis]
- H2: [Specific hypothesis]

## Variables
### Dependent Variables
- Variable 1: [Description, measurement]
- Variable 2: [Description, measurement]

### Independent Variables
- Variable 1: [Description, measurement]
- Variable 2: [Description, measurement]

## Statistical Analysis Plan
### Descriptive Statistics
- [What descriptive analyses will be conducted]

### Inferential Statistics
- [What statistical tests will be used]
- [Multiple comparison corrections]
- [Effect size measures]

### Model Specifications
- [Specific models to be fit]
- [Model assumptions to be checked]

## Sample Size and Power
- [Power analysis results]
- [Minimum detectable effect size]

Results Documentation

from datetime import datetime


def document_results(results, filename):
    """Document analysis results in a structured format."""
    with open(f"reports/{filename}", 'w') as f:
        f.write("# Analysis Results\n\n")
        f.write(f"Analysis conducted on: {datetime.now()}\n\n")
        
        for test_name, result in results.items():
            f.write(f"## {test_name}\n")
            f.write(f"- Test statistic: {result['statistic']:.3f}\n")
            f.write(f"- p-value: {result['p_value']:.3f}\n")
            f.write(f"- Effect size: {result['effect_size']:.3f}\n\n")
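
A results file is easier to trace when it also records which version of the code produced it. The following is a sketch, assuming the analysis runs inside a Git repository; the helper names `current_git_commit` and `provenance_header` are hypothetical and could be called from a function such as `document_results`.

```python
import subprocess


def current_git_commit():
    """Return the current commit hash, or None if it cannot be determined."""
    try:
        completed = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None  # git is missing, or this is not a repository
    return completed.stdout.strip()


def provenance_header(commit):
    """Format a line recording which commit produced the results."""
    return f"Code version (git commit): {commit or 'unknown'}\n\n"
```

Writing `provenance_header(current_git_commit())` near the top of the report ties each set of results to a specific commit.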

Testing and Validation

Unit Testing

Create tests/test_data.py:

import unittest
import pandas as pd
from src.data.clean_data import clean_survey_data  # defined in src/data/clean_data.py

class TestDataCleaning(unittest.TestCase):
    
    def setUp(self):
        """Set up test data."""
        self.sample_data = pd.DataFrame({
            'participant_id': [1, 2, 3, 4],
            'age': [25, 30, None, 35],
            'response': ['A', 'B', 'C', None]
        })
    
    def test_clean_survey_data_removes_missing(self):
        """Test that missing data is handled correctly."""
        result = clean_survey_data(self.sample_data, remove_incomplete=True)
        self.assertEqual(len(result), 2)  # Should remove rows with missing data
    
    def test_clean_survey_data_keeps_missing(self):
        """Test that missing data is kept when specified."""
        result = clean_survey_data(self.sample_data, remove_incomplete=False)
        self.assertEqual(len(result), 4)  # Should keep all rows

if __name__ == '__main__':
    unittest.main()

Run Tests

# Run all tests
python -m pytest tests/

# Run specific test file
python -m pytest tests/test_data.py

# Run with coverage (requires the pytest-cov plugin)
python -m pytest --cov=src tests/

Automation and Reproducibility

Makefile Template

.PHONY: all data features models reports test clean setup

# Default target
all: data features models reports

# Data processing
data:
	python src/data/download_data.py
	python src/data/clean_data.py

# Feature engineering
features: data
	python src/features/build_features.py

# Model training
models: features
	python src/models/train_model.py

# Generate reports
reports: models
	python src/visualization/make_plots.py
	jupyter nbconvert --to html notebooks/reports/final_report.ipynb

# Run tests
test:
	python -m pytest tests/

# Clean generated files
clean:
	rm -rf data/interim/*
	rm -rf data/processed/*
	rm -rf models/*
	rm -rf reports/figures/*

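# Set up environment
# Each recipe line runs in its own shell, so `conda activate` would not
# carry over to the next line; `conda run` executes inside the environment.
setup:
	conda env create -f environment.yml
	conda run -n project-name pip install -e .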

README Template

# Project Name

Brief description of the project and its goals.

## Getting Started

### Prerequisites
- Python 3.9+
- Conda or pip

### Installation
```bash
# Clone repository
git clone https://github.com/ideeas-lab/project-name.git
cd project-name

# Set up environment
conda env create -f environment.yml
conda activate project-name

# Install package in development mode
pip install -e .
```

## Usage

```bash
# Run full analysis pipeline
make all

# Run individual steps
make data
make features
make models
make reports
```

## Project Structure

[Describe the organization of files and directories]

## Data

[Describe the data sources and structure]

## Methods

[Brief overview of methodology]

## Results

[Summary of key findings]

## Contributing

[Guidelines for contributors]

## License

[License information]


---

## Sharing and Publication

### Pre-publication Checklist
- [ ] All code is documented and tested
- [ ] Data is properly documented with data cards
- [ ] Analysis is reproducible from raw data
- [ ] Sensitive information is removed or protected
- [ ] Code repository is clean and organized
- [ ] README provides clear instructions
- [ ] License is specified
- [ ] Dependencies are clearly specified
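
Some of the checklist items above can be verified mechanically before a release. This is a minimal sketch: the function name `missing_release_files` and the list of required files are assumptions based on the project template in this guide, and the check covers only file presence, not content quality.

```python
from pathlib import Path

# Files the project template in this guide expects at the repository root.
REQUIRED_FILES = ["README.md", "LICENSE", "environment.yml", ".gitignore"]


def missing_release_files(project_root, required=REQUIRED_FILES):
    """Return the required files that are absent or empty under ``project_root``."""
    root = Path(project_root)
    missing = []
    for name in required:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

An empty list from `missing_release_files(".")` means the basic files are in place; the remaining items still need a manual review.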

### Data and Code Sharing
- Use appropriate repositories (GitHub, Zenodo, etc.)
- Include DOIs for permanent citation
- Follow journal and funder requirements
- Consider embargo periods if needed
- Provide clear usage guidelines

---

**Remember**: Reproducibility is not just about the final product; it is about creating sustainable practices that make your research more efficient, reliable, and impactful throughout the entire process.