Overview of Python resources for data cleaning and project ideas, including a cookbook for data cleaning, low-code libraries, and a comprehensive framework for profiling. Project ideas span various topics like file I/O, error handling, OOP, and web requests, offering practical applications such as a CLI unit converter, data quality dashboards, and visualization tools.
📖 1. Python Data Cleaning Cookbook
A comprehensive, recipe-style codebase supporting the Python Data Cleaning Cookbook by Packt. It includes practical examples for handling:
- Missing values
- Outliers
- Data profiling
- Reusable functions and classes for pipelines
✅ What to adopt:
- Recipe-based structure — organize code in small, understandable files.
- Use of functions and classes for reusability.
⚙️ 2. sfu-db/dataprep
An open-source, low-code data preparation library:
- Offers modules like `dataprep.clean` for cleaning tasks
- Clean, consistent API design
- Focus on ease-of-use, modularity, and documentation
✅ What to adopt:
- Clear module breakdown (e.g., `.clean`, `.eda`)
- Well-documented functions and consistent naming conventions.
🧹 3. VIDA-NYU/openclean
A full-featured data profiling and cleaning framework:
- Toolkit for profiling and building data cleaning pipelines
- Extensible and modular design, with comprehensive APIs
✅ What to adopt:
- Pipeline architecture—chain modular operations.
- Separation of concerns: profiling vs. cleaning modules.
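The two points above can be illustrated with a minimal, library-agnostic sketch. It does not use openclean's API; the names `profile_missing`, `strip_strings`, `drop_incomplete`, and `run_pipeline` are hypothetical and exist only for this example. Profiling reads the data and reports on it, while cleaning steps each take rows and return new rows, so they can be chained in any order.

```python
from typing import Callable

Row = dict[str, object]
Step = Callable[[list[Row]], list[Row]]


def profile_missing(rows: list[Row]) -> dict[str, int]:
    """Profiling: count missing (None or empty-string) values per column."""
    counts: dict[str, int] = {}
    for row in rows:
        for key, value in row.items():
            counts.setdefault(key, 0)
            if value is None or value == "":
                counts[key] += 1
    return counts


def strip_strings(rows: list[Row]) -> list[Row]:
    """Cleaning step: trim surrounding whitespace from string values."""
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        for row in rows
    ]


def drop_incomplete(rows: list[Row]) -> list[Row]:
    """Cleaning step: drop rows that still contain a missing value."""
    return [
        row for row in rows
        if all(v is not None and v != "" for v in row.values())
    ]


def run_pipeline(rows: list[Row], steps: list[Step]) -> list[Row]:
    """Apply each cleaning step in order; steps never mutate their input."""
    for step in steps:
        rows = step(rows)
    return rows


if __name__ == "__main__":
    data = [
        {"name": " Ada ", "city": "London"},
        {"name": "Bob", "city": ""},
    ]
    print(profile_missing(data))
    print(run_pipeline(data, [strip_strings, drop_incomplete]))
```

Keeping the profiler separate from the steps means it can be run before and after the pipeline to check what the cleaning actually changed.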
Project Ideas by Module
1) Python Basics: Syntax, I/O, Control Flow
- CLI Unit Converter with history log and config file
- Expense Splitter that handles rounding and edge cases
- Text-based Portfolio Rebalancer that outputs trades
- Log Analyzer for server or app logs with simple report
2) Data Structures: Lists, Dicts, Sets, Tuples
- Frequency Analyzer for transactions or words with top-K queries
- Mini LRU Cache implementation with benchmarks
- Contact Book with fuzzy search and CSV import
- Market Basket Analysis toy app with association rules
3) Functions and Modules
- Utility Library for date, currency, and number formatting
- Pipeline Runner that composes steps with simple DAG-like config
- Retry and Backoff Decorators library with tests
- CLI Toolset packaged as installable module
4) File I/O and OS
- Folder Watcher that organizes downloads by rules
- CSV to Parquet Converter with schema inference
- Secure Secrets Manager using OS keyring and .env
- Incremental Backup script with checksum verification
5) Error Handling and Logging
- Robust Downloader with retries, circuit breaker, and structured logs
- Transaction Importer that validates, quarantines bad rows, and reports
- Audit Logger that produces JSON logs and rotates files
- Alert Router that routes errors to email or Teams
6) OOP
- Trade Order Engine simulator with Orders, Fills, Positions
- Task Scheduler with pluggable strategies and observers
- Inventory System with inheritance and composition patterns
- Shapes Library with polymorphic area/price calculations
7) Virtual Envs, Packaging, CLI
- Cookiecutter Template for data projects with Makefile and tox
- Versioned CLI App with semantic versioning and release notes
- Plugin System using entry points for extensibility
- Config-driven App that merges defaults, env, and CLI flags
8) Testing (unittest/pytest)
- Test-Driven Kata: Roman numerals, bowling, or bank account
- Property-Based Tests for CSV parsers or pricing rules
- Golden Master Tests for a refactor of a legacy function
- Mutation Testing demo to improve test quality
9) Standard Library Highlights
- datetime: Trading Calendar calculator with holidays and business days
- pathlib: Project-wide file refactor with safety checks
- functools: Memoized expensive function with cache invalidation
- collections: Deque-based streaming window metrics
10) Datetime and Timezones
- FX Rates Normalizer aligning feeds to UTC with drift checks
- Market Session Classifier labeling trades pre/post market
- SLA Tracker computing response times across timezones
- iCal Generator for recurring study sessions or releases
11) Regular Expressions
- Bank Statement Parser extracting payee, amount, and memo
- Log Redactor masking PII with configurable patterns
- Ticker/ISIN Extractor from messy text
- Markdown Linter fixing common syntax issues
12) JSON, YAML, Configs
- Schema Validator using jsonschema with humanized errors
- Portfolio JSON Diff tool with semantic comparison
- Config Merger with layered overrides and validation
- REST Mock Server serving JSON fixtures
13) Web Requests and APIs
- Price Fetcher with caching and rate limiting
- News Aggregator with deduplication and sentiment tags
- Simple Webhook Receiver that verifies signatures
- API Client SDK wrapper with pagination helpers
14) Data Handling: csv, sqlite3, pandas
- ETL: CSV → SQLite → Pandas report with profiling
- Factor Calculator computing rolling metrics and z-scores
- Data Quality Dashboard showing nulls, uniques, ranges
- Reconciliation Tool comparing two datasets with diffs
15) Visualization: matplotlib, seaborn, plotly
- Market Regime Dashboard with rolling volatility plots
- Outlier Explorer using box, violin, and scatter plots
- KPI Mini-Board with sparklines and targets
- Correlation Heatmap with interactive filtering
16) Concurrency: threading, multiprocessing, asyncio
- Concurrent Web Scraper with bounded concurrency
- Batch Backtester running scenarios in parallel
- Async Price Stream consumer producing rolling metrics
- Worker Pool for CPU-bound simulations
17) Packaging, Publishing, CI
- Publish a pip package of helpers to TestPyPI
- Pre-commit setup for linting and formatting
- GitHub Actions to run tests and build wheels
- Release Automation tagging and generating changelogs
18) OOP + Design Patterns
- Strategy-based Risk Model switchable at runtime
- Observer-based Event Bus for signals and listeners
- Adapter wrapping two quote providers under one interface
- Builder for complex order creation
19) CLI + TUI Apps
- TUI Portfolio Watchlist with color and sorting
- TUI Log Tail with filters and highlights
- TUI Kanban for personal tasks stored in SQLite
- TUI Habit Tracker with streaks and charts
20) Small Web App (FastAPI or Flask)
- “CSV to Insights” app: upload file, compute stats, plots
- “What-if” Calculator for loans or DCA strategies
- Personal Budget Analyzer with categories and trends
- Signal Explorer plotting technical indicators
21) Scheduling and Automation
- Daily Data Pull with backfill logic and success marker
- Report Emailer generating PDFs or HTML and sending
- SLA Monitor that escalates on breaches
- Rolling Backup with retention policy
22) Security and Secrets
- .env + keyring integration demo with rotation script
- Signed Webhook Verifier library
- Simple RBAC for a CLI app
- Hashing and checksum utilities
23) Documentation and Repos
- MkDocs or Sphinx site with API docs and examples
- Example Gallery notebook collection
- CHANGELOG with Keep a Changelog format
- CONTRIBUTING and issue templates
Project 1 — CLI Unit Converter (pyunit)
Below is a repo-ready starter you can copy into a GitHub repository. It includes a detailed README, packaging config, source code, tests, and pre-commit setup.
Repository structure
pyunit/
├─ README.md
├─ pyproject.toml
├─ .pre-commit-config.yaml
├─ src/
│  └─ pyunit/
│     ├─ __init__.py
│     ├─ cli.py
│     ├─ convert.py
│     ├─ config.py
│     └─ history.py
└─ tests/
   ├─ test_convert.py
   ├─ test_cli.py
   └─ data/README.md
Create and activate env
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
Install in editable mode
pip install -U pip
pip install -e .
Run
pyunit --from m --to ft --value 3
pyunit --from C --to F --value 25 --precision 2
See help
pyunit --help
## Configuration
pyunit looks for a config file in this order:
1. `--config /path/to/config.toml` if provided
2. `PYUNIT_CONFIG` env var
3. `~/.config/pyunit/config.toml` (Linux/macOS) or `%APPDATA%\\pyunit\\config.toml` (Windows)
Example `config.toml`:
[defaults]
precision = 3
from = "m"
to = "ft"
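The lookup order above could be implemented roughly as follows. This is a sketch, not the project's actual `config.py`: the function names `resolve_config_path`, `default_config_path`, and `load_defaults` are assumptions, as is the fallback to `~/AppData/Roaming` when `APPDATA` is unset. Reading the file relies on `tomllib`, which is in the standard library from Python 3.11 onward.

```python
import os
import sys
from pathlib import Path
from typing import Mapping, Optional


def default_config_path(env: Mapping[str, str], platform: str = sys.platform) -> Path:
    """Step 3: the per-user default location for the current platform."""
    if platform.startswith("win"):
        # Assumption: fall back to the usual roaming folder if APPDATA is unset.
        base = env.get("APPDATA") or str(Path.home() / "AppData" / "Roaming")
        return Path(base) / "pyunit" / "config.toml"
    return Path.home() / ".config" / "pyunit" / "config.toml"


def resolve_config_path(
    cli_path: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    platform: str = sys.platform,
) -> Path:
    """Return the config path using the order: --config, PYUNIT_CONFIG, default."""
    env = os.environ if env is None else env
    if cli_path:                      # 1. explicit --config flag wins
        return Path(cli_path)
    env_path = env.get("PYUNIT_CONFIG")
    if env_path:                      # 2. then the environment variable
        return Path(env_path)
    return default_config_path(env, platform)  # 3. then the default location


def load_defaults(path: Path) -> dict:
    """Read the [defaults] table; a missing file yields an empty dict."""
    if not path.is_file():
        return {}
    import tomllib  # standard library in Python 3.11+

    with path.open("rb") as fh:
        return tomllib.load(fh).get("defaults", {})
```

Passing `env` and `platform` as parameters keeps the resolution logic testable without touching the real environment.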
## Testing
pip install -e .[dev]
pytest -q
Key tests:
- Numeric stability and round-trip sanity for temperature
- Known reference conversions for length and mass
- CLI argument validation and error messages
## Development
- Formatting and linting are managed via `pre-commit`:
pip install pre-commit
pre-commit install
pre-commit run --all-files
## Roadmap
- Currency conversion via provider interface
- Interactive mode with prompt and history search
- Additional unit systems and custom user-defined units
pyproject.toml
.pre-commit-config.yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.5.7
    hooks:
      - id: ruff
        args: ["--fix"]
  - repo: https://github.com/psf/black
    rev: 24.8.0
    hooks:
      - id: black
        language_version: python3
src/pyunit/__init__.py
__all__ = ["convert", "config", "history"]
src/pyunit/convert.py
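The original code for this file is not included here, so the following is only one possible shape for it, assuming a single `convert(value, from_unit, to_unit)` function. The unit tables and the function signature are assumptions; the only units the README itself shows are `m`, `ft`, `C`, and `F`. Linear units go through a base unit, and temperatures go through Celsius so that common cases such as 25 C to F stay exact.

```python
# Factors to the base unit of each dimension (metres, kilograms).
# These are exact by definition of the international foot, inch, mile, and pound.
LENGTH_TO_M = {"m": 1.0, "km": 1000.0, "cm": 0.01, "ft": 0.3048, "in": 0.0254, "mi": 1609.344}
MASS_TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237, "oz": 0.028349523125}
TEMPERATURE_UNITS = {"C", "F", "K"}


def _to_celsius(value: float, unit: str) -> float:
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0
    return value - 273.15  # unit == "K"


def _from_celsius(value: float, unit: str) -> float:
    if unit == "C":
        return value
    if unit == "F":
        return value * 9.0 / 5.0 + 32.0
    return value + 273.15  # unit == "K"


def convert(value: float, from_unit: str, to_unit: str) -> float:
    """Convert value between two units of the same dimension.

    Raises ValueError for unknown units or mismatched dimensions (e.g. m to kg).
    """
    if from_unit in TEMPERATURE_UNITS and to_unit in TEMPERATURE_UNITS:
        return _from_celsius(_to_celsius(value, from_unit), to_unit)
    for table in (LENGTH_TO_M, MASS_TO_KG):
        if from_unit in table and to_unit in table:
            # Scale into the base unit, then out into the target unit.
            return value * table[from_unit] / table[to_unit]
    raise ValueError(f"cannot convert {from_unit!r} to {to_unit!r}")
```

Rounding to a given precision is left to the caller, so the function returns full floating-point results and stays easy to test with tolerances.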
src/pyunit/config.py
src/pyunit/history.py
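No code for the history module survives in this copy. As a hedged illustration of what a history log might look like, the sketch below appends one JSON object per line to a file and reads back the most recent entries. The JSON Lines format, the field names, and the functions `record` and `recent` are all assumptions made for this example.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record(path: Path, value: float, from_unit: str, to_unit: str, result: float) -> dict:
    """Append one conversion to the history file and return the stored entry."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "value": value,
        "from": from_unit,
        "to": to_unit,
        "result": result,
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier entries; one JSON object per line.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


def recent(path: Path, limit: int = 10) -> list[dict]:
    """Return up to `limit` of the newest entries, oldest first."""
    if limit <= 0 or not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        lines = [line for line in fh if line.strip()]
    return [json.loads(line) for line in lines[-limit:]]
```

An append-only text format keeps writes cheap and makes the log easy to inspect by hand; a larger history would likely call for rotation or SQLite instead.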
src/pyunit/cli.py
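The CLI source is likewise missing here. The sketch below shows how the flags used in the README examples (`--from`, `--to`, `--value`, `--precision`, `--config`) could be wired up with `argparse`. It is an assumption-laden outline: the default precision of 3, the exit code 2 on errors, and the helper names are hypothetical, and the small `convert` function is a stand-in for `pyunit.convert.convert` so the block runs on its own. Merging defaults from the config file is omitted.

```python
import argparse
import sys
from typing import Optional, Sequence


def convert(value: float, from_unit: str, to_unit: str) -> float:
    """Stand-in for pyunit.convert.convert; handles only m and ft."""
    factors = {"m": 1.0, "ft": 0.3048}
    if from_unit not in factors or to_unit not in factors:
        raise ValueError(f"cannot convert {from_unit!r} to {to_unit!r}")
    return value * factors[from_unit] / factors[to_unit]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pyunit", description="Convert a value between units.")
    # `from` is a Python keyword, so store the flag under a different attribute name.
    parser.add_argument("--from", dest="from_unit", required=True, help="source unit, e.g. m")
    parser.add_argument("--to", dest="to_unit", required=True, help="target unit, e.g. ft")
    parser.add_argument("--value", type=float, required=True, help="number to convert")
    parser.add_argument("--precision", type=int, default=3, help="decimal places in the output")
    parser.add_argument("--config", default=None, help="path to a config.toml file")
    return parser


def format_result(result: float, precision: int) -> str:
    """Render the result with a fixed number of decimal places."""
    return f"{result:.{precision}f}"


def main(argv: Optional[Sequence[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    try:
        result = convert(args.value, args.from_unit, args.to_unit)
    except ValueError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 2
    print(f"{format_result(result, args.precision)} {args.to_unit}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Returning an exit code from `main` instead of calling `sys.exit` inside it lets tests call `main([...])` directly and assert on the result.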
tests/test_convert.py
tests/test_cli.py
Tip: Copy each code block to your repo files using the same paths. After creating the repo, run the quick start commands in the README.