Overview of Python resources for data cleaning and project ideas, including a cookbook for data cleaning, low-code libraries, and a comprehensive framework for profiling. Project ideas span various topics like file I/O, error handling, OOP, and web requests, offering practical applications such as a CLI unit converter, data quality dashboards, and visualization tools.
📖 1. Python Data Cleaning Cookbook
A comprehensive, recipe-style codebase supporting the Python Data Cleaning Cookbook by Packt. It includes practical examples for handling:
- Missing values
- Outliers
- Data profiling
- Reusable functions and classes for pipelines
✅ What to adopt:
- Recipe-based structure — organize code in small, understandable files.
- Use of functions and classes for reusability.
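As a rough illustration of the recipe idea, the sketch below keeps each cleaning step in a small, reusable function. It uses only the standard library; the function names (`fill_missing`, `flag_outliers_iqr`) and the 1.5 * IQR rule are assumptions chosen for this example, not code taken from the cookbook's repository.

```python
from statistics import median, quantiles


def fill_missing(values, fill=None):
    """Replace None entries with `fill`, or with the median of the known values."""
    known = [v for v in values if v is not None]
    if fill is None:
        fill = median(known)
    return [fill if v is None else v for v in values]


def flag_outliers_iqr(values, k=1.5):
    """Return one True/False flag per value using the k * IQR rule."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v < low or v > high for v in values]


raw = [10.0, 12.0, None, 11.0, 13.0, 95.0]
clean = fill_missing(raw)          # the None becomes the median, 12.0
flags = flag_outliers_iqr(clean)   # only 95.0 is flagged
```

Each function can live in its own small file and be imported wherever a pipeline needs it, which is the main point of the recipe layout.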
⚙️ 2. sfu-db/dataprep
An open-source, low-code data preparation library:
- Offers modules like `dataprep.clean` for cleaning tasks
- Clean, consistent API design
- Focus on ease-of-use, modularity, and documentation
✅ What to adopt:
- Clear module breakdown (e.g., `.clean`, `.eda`)
- Well-documented functions and consistent naming conventions.
🧹 3. VIDA-NYU/openclean
A full-featured data profiling and cleaning framework:
- Toolkit for profiling and building data cleaning pipelines
- Extensible and modular design, with comprehensive APIs
✅ What to adopt:
- Pipeline architecture—chain modular operations.
- Separation of concerns: profiling vs. cleaning modules.
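The two ideas above can be sketched in plain Python. This is not the openclean API; `profile`, `strip_whitespace`, `drop_missing`, and `pipeline` are hypothetical helpers that only show how a read-only profiling step can sit alongside chainable cleaning steps.

```python
from collections import Counter
from functools import reduce


def profile(rows, column):
    """Profiling: summarize a column without changing the data."""
    values = [row.get(column) for row in rows]
    return {
        "count": len(values),
        "missing": sum(v in (None, "") for v in values),
        "distinct": len(Counter(values)),
    }


def strip_whitespace(rows):
    """Cleaning step: trim every string value."""
    return [{k: v.strip() if isinstance(v, str) else v for k, v in r.items()} for r in rows]


def drop_missing(column):
    """Cleaning step factory: drop rows where `column` is empty."""
    def step(rows):
        return [r for r in rows if r.get(column) not in (None, "")]
    return step


def pipeline(*steps):
    """Chain steps left to right into a single callable."""
    return lambda rows: reduce(lambda acc, step: step(acc), steps, rows)


rows = [{"city": " Lagos "}, {"city": ""}, {"city": "Lagos"}]
before = profile(rows, "city")
clean = pipeline(strip_whitespace, drop_missing("city"))
result = clean(rows)
after = profile(result, "city")
```

Profiling before and after the pipeline gives a cheap check that each cleaning step did what was intended.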
Project Ideas by Module
1) Python Basics: Syntax, I/O, Control Flow
- CLI Unit Converter with history log and config file
- Expense Splitter that handles rounding and edge cases
- Text-based Portfolio Rebalancer that outputs trades
- Log Analyzer for server or app logs with simple report
2) Data Structures: Lists, Dicts, Sets, Tuples
- Frequency Analyzer for transactions or words with top-K queries
- Mini LRU Cache implementation with benchmarks
- Contact Book with fuzzy search and CSV import
- Market Basket Analysis toy app with association rules
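For the Mini LRU Cache idea, one possible starting point is an `OrderedDict` that tracks recency. The class name and method names below are assumptions for this sketch; benchmarks and thread safety are left out.

```python
from collections import OrderedDict


class LRUCache:
    """Least-recently-used cache with a fixed capacity."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry

    def __len__(self):
        return len(self._data)


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # touch "a" so "b" becomes the oldest entry
cache.put("c", 3)  # evicts "b"
```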
3) Functions and Modules
- Utility Library for date, currency, and number formatting
- Pipeline Runner that composes steps with simple DAG-like config
- Retry and Backoff Decorators library with tests
- CLI Toolset packaged as installable module
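The Retry and Backoff Decorators idea could begin with something like the sketch below. The parameter names (`times`, `base_delay`, `factor`, `sleep`) are assumptions for illustration; injecting `sleep` keeps the tests fast because no real waiting is needed.

```python
import functools
import time


def retry(times=3, base_delay=0.1, factor=2.0, exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped function, waiting base_delay * factor**attempt between tries."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(times):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == times - 1:
                        raise  # out of attempts: surface the last error
                    sleep(base_delay * factor ** attempt)
        return wrapper
    return decorator


attempts = []
delays = []


@retry(times=3, base_delay=1, sleep=delays.append)
def flaky():
    """Fail twice, then succeed, to exercise the retry loop."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = flaky()
```

Here `delays` records the waits the decorator would have slept for, so a test can assert on the backoff schedule directly.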
4) File I/O and OS
- Folder Watcher that organizes downloads by rules
- CSV to Parquet Converter with schema inference
- Secure Secrets Manager using OS keyring and .env
- Incremental Backup script with checksum verification
5) Error Handling and Logging
- Robust Downloader with retries, circuit breaker, and structured logs
- Transaction Importer that validates, quarantines bad rows, and reports
- Audit Logger that produces JSON logs and rotates files
- Alert Router that routes errors to email or Teams
6) OOP
- Trade Order Engine simulator with Orders, Fills, Positions
- Task Scheduler with pluggable strategies and observers
- Inventory System with inheritance and composition patterns
- Shapes Library with polymorphic area/price calculations
7) Virtual Envs, Packaging, CLI
- Cookiecutter Template for data projects with Makefile and tox
- Versioned CLI App with semantic versioning and release notes
- Plugin System using entry points for extensibility
- Config-driven App that merges defaults, env, and CLI flags
8) Testing (unittest/pytest)
- Test-Driven Kata: Roman numerals, bowling, or bank account
- Property-Based Tests for CSV parsers or pricing rules
- Golden Master Tests for a refactor of a legacy function
- Mutation Testing demo to improve test quality
9) Standard Library Highlights
- datetime: Trading Calendar calculator with holidays and business days
- pathlib: Project-wide file refactor with safety checks
- functools: Memoized expensive function with cache invalidation
- collections: Deque-based streaming window metrics
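For the deque-based streaming window, a minimal sketch might keep a running total next to a bounded `deque`. `RollingMean` is a hypothetical name, and the window size is assumed to be at least 1.

```python
from collections import deque


class RollingMean:
    """Mean over the last `size` values of a stream (size >= 1)."""

    def __init__(self, size):
        self._window = deque(maxlen=size)
        self._total = 0.0

    def push(self, value):
        if len(self._window) == self._window.maxlen:
            self._total -= self._window[0]  # value about to fall out of the window
        self._window.append(value)
        self._total += value
        return self._total / len(self._window)


rm = RollingMean(3)
means = [rm.push(x) for x in [1, 2, 3, 4, 5]]
```

Keeping a running total makes each update O(1). For very long streams of floats the total can drift slightly, so recomputing it from the window now and then is a reasonable safeguard.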
10) Datetime and Timezones
- FX Rates Normalizer aligning feeds to UTC with drift checks
- Market Session Classifier labeling trades pre/post market
- SLA Tracker computing response times across timezones
- iCal Generator for recurring study sessions or releases
11) Regular Expressions
- Bank Statement Parser extracting payee, amount, and memo
- Log Redactor masking PII with configurable patterns
- Ticker/ISIN Extractor from messy text
- Markdown Linter fixing common syntax issues
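A Log Redactor with configurable patterns could start from a dictionary of compiled regexes, as in this sketch. The two patterns are deliberately naive assumptions (the phone pattern, for instance, will also match other long digit runs such as dates) and would need tuning against real logs.

```python
import re

# Hypothetical default patterns; real deployments would load these from config.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def redact(line, patterns=PATTERNS):
    """Replace every match of each pattern with a <name> placeholder."""
    for name, pattern in patterns.items():
        line = pattern.sub(f"<{name}>", line)
    return line


masked = redact("user ada@example.com called +44 20 7946 0958")
```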
12) JSON, YAML, Configs
- Schema Validator using jsonschema with humanized errors
- Portfolio JSON Diff tool with semantic comparison
- Config Merger with layered overrides and validation
- REST Mock Server serving JSON fixtures
13) Web Requests and APIs
- Price Fetcher with caching and rate limiting
- News Aggregator with deduplication and sentiment tags
- Simple Webhook Receiver that verifies signatures
- API Client SDK wrapper with pagination helpers
14) Data Handling: csv, sqlite3, pandas
- ETL: CSV → SQLite → Pandas report with profiling
- Factor Calculator computing rolling metrics and z-scores
- Data Quality Dashboard showing nulls, uniques, ranges
- Reconciliation Tool comparing two datasets with diffs
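The CSV to SQLite leg of the ETL idea can be tried with the standard library alone. The table name `tx`, the column names, and the sample data are made up for this sketch, and the pandas reporting step is left out.

```python
import csv
import io
import sqlite3

CSV_TEXT = """id,amount,currency
1,10.50,USD
2,,USD
3,7.25,EUR
"""


def load_csv_to_sqlite(text, conn):
    """Parse CSV text and load it into a `tx` table, storing blank amounts as NULL."""
    rows = list(csv.DictReader(io.StringIO(text)))
    conn.execute("CREATE TABLE tx (id INTEGER, amount REAL, currency TEXT)")
    conn.executemany(
        "INSERT INTO tx VALUES (:id, :amount, :currency)",
        [{**r, "amount": r["amount"] or None} for r in rows],
    )
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_csv_to_sqlite(CSV_TEXT, conn)

# Simple profiling queries: null count and per-currency totals.
nulls = conn.execute("SELECT COUNT(*) FROM tx WHERE amount IS NULL").fetchone()[0]
totals = dict(
    conn.execute("SELECT currency, SUM(amount) FROM tx GROUP BY currency ORDER BY currency")
)
```

The same null-count query is the kind of check a Data Quality Dashboard would surface per column.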
15) Visualization: matplotlib, seaborn, plotly
- Market Regime Dashboard with rolling volatility plots
- Outlier Explorer using box, violin, and scatter plots
- KPI Mini-Board with sparklines and targets
- Correlation Heatmap with interactive filtering
16) Concurrency: threading, multiprocessing, asyncio
- Concurrent Web Scraper with bounded concurrency
- Batch Backtester running scenarios in parallel
- Async Price Stream consumer producing rolling metrics
- Worker Pool for CPU-bound simulations
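Bounded concurrency, as in the Concurrent Web Scraper idea, is often done with an `asyncio.Semaphore`. The sketch below fakes the network call with `asyncio.sleep`; `bounded_gather` and `fake_fetch` are hypothetical names.

```python
import asyncio


async def bounded_gather(coro_funcs, limit):
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run(func):
        async with semaphore:
            return await func()

    return await asyncio.gather(*(run(f) for f in coro_funcs))


async def demo():
    active = 0
    peak = 0

    async def fake_fetch(i):
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stands in for a network call
        active -= 1
        return i * i

    jobs = [lambda i=i: fake_fetch(i) for i in range(6)]
    results = await bounded_gather(jobs, limit=2)
    return results, peak


results, peak = asyncio.run(demo())
```

`asyncio.gather` returns results in submission order, so the output lines up with the input list even though the tasks finish at different times.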
17) Packaging, Publishing, CI
- Publish a pip package of helpers to TestPyPI
- Pre-commit setup for linting and formatting
- GitHub Actions to run tests and build wheels
- Release Automation tagging and generating changelogs
18) OOP + Design Patterns
- Strategy-based Risk Model switchable at runtime
- Observer-based Event Bus for signals and listeners
- Adapter wrapping two quote providers under one interface
- Builder for complex order creation
19) CLI + TUI Apps
- TUI Portfolio Watchlist with color and sorting
- TUI Log Tail with filters and highlights
- TUI Kanban for personal tasks stored in SQLite
- TUI Habit Tracker with streaks and charts
20) Small Web App (FastAPI or Flask)
- “CSV to Insights” app: upload file, compute stats, plots
- “What-if” Calculator for loans or DCA strategies
- Personal Budget Analyzer with categories and trends
- Signal Explorer plotting technical indicators
21) Scheduling and Automation
- Daily Data Pull with backfill logic and success marker
- Report Emailer generating PDFs or HTML and sending
- SLA Monitor that escalates on breaches
- Rolling Backup with retention policy
22) Security and Secrets
- .env + keyring integration demo with rotation script
- Signed Webhook Verifier library
- Simple RBAC for a CLI app
- Hashing and checksum utilities
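The Signed Webhook Verifier idea usually comes down to recomputing an HMAC over the raw request body and comparing it in constant time. The sketch assumes a hex-encoded HMAC-SHA256 signature; real providers differ in header names, encodings, and timestamp handling, so treat `sign` and `verify` as illustrative helpers.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Return the hex HMAC-SHA256 of `body` under `secret`."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Compare the expected and received signatures in constant time."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode(), signature.encode())


secret = b"demo-secret"
body = b'{"event": "ping"}'
signature = sign(secret, body)
```

Using `hmac.compare_digest` rather than `==` avoids leaking how many leading characters matched through timing differences.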
23) Documentation and Repos
- MkDocs or Sphinx site with API docs and examples
- Example Gallery notebook collection
- CHANGELOG with Keep a Changelog format
- CONTRIBUTING and issue templates
Project 1 — CLI Unit Converter (pyunit)
Below is a repo-ready starter you can copy into a GitHub repository. It includes a detailed README, packaging config, source code, tests, and pre-commit setup.
Repository structure
pyunit/
├─ README.md
├─ pyproject.toml
├─ .pre-commit-config.yaml
├─ src/
│  └─ pyunit/
│     ├─ __init__.py
│     ├─ cli.py
│     ├─ convert.py
│     ├─ config.py
│     └─ history.py
└─ tests/
   ├─ test_convert.py
   ├─ test_cli.py
   └─ data/
README.md
# pyunit — A simple, testable CLI unit converter
pyunit converts between common units for length, mass, and temperature. It demonstrates solid Python fundamentals: CLI design, control flow, file I/O, configuration, packaging, and testing.
## Features
- Convert length, mass, temperature with clear errors for unsupported pairs
- CLI with `argparse` and helpful `--help`
- Config file for defaults (precision, default from/to units)
- JSON history log for every conversion with timestamps
- Packaged with `pyproject.toml` and console script entry point
- Test suite with `pytest`, including round-trip sanity checks for temperature (`hypothesis` is available in the dev extras for property-based tests)
- Planned: currency via pluggable provider interface (see Roadmap)
## Quick start
Create and activate env
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
Install in editable mode
pip install -U pip
pip install -e .
Run
pyunit --from m --to ft --value 3
pyunit --from C --to F --value 25 --precision 2
See help
pyunit --help
## Configuration
pyunit looks for a config file in this order:
1. `--config /path/to/config.toml` if provided
2. `PYUNIT_CONFIG` env var
3. `~/.config/pyunit/config.toml` (Linux/macOS) or `%APPDATA%\pyunit\config.toml` (Windows)
Example `config.toml`:
[defaults]
precision = 3
from = "m"
to = "ft"
## History log
A JSON Lines file records each conversion:
- Default location: `~/.local/share/pyunit/history.jsonl` (Linux/macOS) or `%APPDATA%\pyunit\history.jsonl` (Windows)
- One record per line with ISO timestamp, from unit, to unit, input value, output value
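Because the log is JSON Lines, it can be read back one record at a time. The reader below is a small sketch, not part of the package; `read_history` is a hypothetical helper, and the field names follow the `Record` dataclass in `history.py`.

```python
import json
import tempfile
from pathlib import Path


def read_history(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demo against a throwaway file rather than the real history location.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "history.jsonl"
    sample.write_text(
        '{"ts": "2024-01-01T00:00:00+00:00", "unit_from": "m", "unit_to": "ft", '
        '"value_in": 3.0, "value_out": 9.84}\n'
        '{"ts": "2024-01-02T00:00:00+00:00", "unit_from": "C", "unit_to": "F", '
        '"value_in": 25.0, "value_out": 77.0}\n',
        encoding="utf-8",
    )
    records = list(read_history(sample))
```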
## Units supported
- Length: m, km, cm, mm, in, ft, yd, mi
- Mass: g, kg, lb, oz
- Temperature: C, F, K
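Unsupported pairs return a clear error. Temperature conversions involve an offset as well as a scale factor (they are affine rather than purely multiplicative), so they use explicit formulas instead of the base-unit ratio used for length and mass.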
## Design
- `convert.py` — conversion registry and algorithms
- `config.py` — config discovery and parsing (TOML)
- `history.py` — append-only JSONL logger
- `cli.py` — argument parsing, I/O, and wiring
## Testing
pip install -e ".[dev]"
pytest -q
Key tests:
- Numeric stability and round-trip sanity for temperature
- Known reference conversions for length and mass
- CLI argument validation and error messages
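The round-trip sanity check can be approximated without extra dependencies by sampling random inputs. This is a stand-in for a real property-based test (the dev extras list `hypothesis` for that purpose); the formulas mirror those in `convert.py`, while `check_round_trip` and its tolerances are assumptions for this sketch.

```python
import math
import random


def c_to_f(c):
    return c * 9.0 / 5.0 + 32.0


def f_to_c(f):
    return (f - 32.0) * 5.0 / 9.0


def check_round_trip(samples=1000, seed=0):
    """Assert that C -> F -> C returns the input for many random temperatures."""
    rng = random.Random(seed)
    for _ in range(samples):
        c = rng.uniform(-273.15, 1000.0)
        back = f_to_c(c_to_f(c))
        assert math.isclose(back, c, rel_tol=1e-9, abs_tol=1e-9), (c, back)
    return samples


checked = check_round_trip()
```

A fixed seed keeps failures reproducible. A property-based framework would additionally shrink a failing input to a minimal case.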
## Development
- Formatting and linting are managed via `pre-commit`
pip install pre-commit
pre-commit install
pre-commit run --all-files
## Roadmap
- Currency conversion via provider interface
- Interactive mode with prompt and history search
- Additional unit systems and custom user-defined units
pyproject.toml
[build-system]
requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"
[project]
name = "pyunit"
version = "0.1.0"
description = "A simple, testable CLI unit converter"
authors = [{ name = "Teslim Adeyanju" }]
requires-python = ">=3.10"
readme = "README.md"
license = { text = "MIT" }
dependencies = [
"tomli; python_version < '3.11'"
]
[project.optional-dependencies]
dev = [
"pytest>=7.0",
"pytest-cov>=4.1",
"hypothesis>=6.0",
"ruff>=0.5.0",
]
[project.scripts]
pyunit = "pyunit.cli:main"
[tool.setuptools]
package-dir = {"" = "src"}
[tool.setuptools.packages.find]
where = ["src"]
[tool.ruff]
line-length = 100
select = ["E", "F", "I"]
.pre-commit-config.yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.5.7
    hooks:
      - id: ruff
        args: ["--fix"]
  - repo: https://github.com/psf/black
    rev: 24.8.0
    hooks:
      - id: black
        language_version: python3
src/pyunit/__init__.py
__all__ = ["convert", "config", "history"]
src/pyunit/convert.py
from __future__ import annotations

from typing import Callable, Dict, Tuple

# --- Length conversions relative to meters ---
_LENGTH_TO_M = {
    "m": 1.0,
    "km": 1000.0,
    "cm": 0.01,
    "mm": 0.001,
    "in": 0.0254,
    "ft": 0.3048,
    "yd": 0.9144,
    "mi": 1609.344,
}

# --- Mass conversions relative to grams ---
_MASS_TO_G = {
    "g": 1.0,
    "kg": 1000.0,
    "lb": 453.59237,
    "oz": 28.349523125,
}


# Temperature conversions use explicit formulas
def _c_to_f(x: float) -> float:
    return x * 9.0 / 5.0 + 32.0


def _f_to_c(x: float) -> float:
    return (x - 32.0) * 5.0 / 9.0


def _c_to_k(x: float) -> float:
    return x + 273.15


def _k_to_c(x: float) -> float:
    return x - 273.15


def _f_to_k(x: float) -> float:
    return _c_to_k(_f_to_c(x))


def _k_to_f(x: float) -> float:
    return _c_to_f(_k_to_c(x))


_TEMP_FUNCS: Dict[Tuple[str, str], Callable[[float], float]] = {
    ("C", "F"): _c_to_f,
    ("F", "C"): _f_to_c,
    ("C", "K"): _c_to_k,
    ("K", "C"): _k_to_c,
    ("F", "K"): _f_to_k,
    ("K", "F"): _k_to_f,
}

_LENGTH_UNITS = set(_LENGTH_TO_M.keys())
_MASS_UNITS = set(_MASS_TO_G.keys())
_TEMP_UNITS = {"C", "F", "K"}


class ConversionError(ValueError):
    pass


def convert(value: float, unit_from: str, unit_to: str) -> float:
    """Convert value from unit_from to unit_to.

    Supports length (m, km, cm, mm, in, ft, yd, mi), mass (g, kg, lb, oz), and
    temperature (C, F, K). Raises ConversionError for unsupported pairs.
    """
    uf, ut = unit_from.strip(), unit_to.strip()

    # Temperature
    if uf in _TEMP_UNITS and ut in _TEMP_UNITS:
        if uf == ut:
            return float(value)
        try:
            return float(_TEMP_FUNCS[(uf, ut)](float(value)))
        except KeyError as exc:
            raise ConversionError(
                f"Unsupported temperature conversion: {uf} -> {ut}"
            ) from exc

    # Length
    if uf in _LENGTH_UNITS and ut in _LENGTH_UNITS:
        # to base (meters) then to target
        meters = float(value) * _LENGTH_TO_M[uf]
        return meters / _LENGTH_TO_M[ut]

    # Mass
    if uf in _MASS_UNITS and ut in _MASS_UNITS:
        grams = float(value) * _MASS_TO_G[uf]
        return grams / _MASS_TO_G[ut]

    raise ConversionError(f"Unsupported conversion: {uf} -> {ut}")
src/pyunit/config.py
from __future__ import annotations

import os
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, Optional

try:
    import tomllib  # Python 3.11+
except ModuleNotFoundError:  # pragma: no cover
    import tomli as tomllib

XDG_CONFIG = Path(os.environ.get("XDG_CONFIG_HOME", Path.home() / ".config"))
APPDATA = Path(os.environ.get("APPDATA", XDG_CONFIG))

# Only include PYUNIT_CONFIG when it is set; Path("") would otherwise become ".".
_ENV_CONFIG = os.environ.get("PYUNIT_CONFIG")
DEFAULT_CONFIG_PATHS = [
    *([Path(_ENV_CONFIG)] if _ENV_CONFIG else []),
    XDG_CONFIG / "pyunit" / "config.toml",
    APPDATA / "pyunit" / "config.toml",
]


@dataclass
class Settings:
    precision: int = 3
    default_from: Optional[str] = None
    default_to: Optional[str] = None

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "Settings":
        d = data.get("defaults", {}) if isinstance(data, dict) else {}
        return cls(
            precision=int(d.get("precision", 3)),
            default_from=d.get("from"),
            default_to=d.get("to"),
        )


def load_settings(explicit_path: str | Path | None = None) -> Settings:
    """Load the first config file found; fall back to built-in defaults."""
    candidates = []
    if explicit_path:
        candidates.append(Path(explicit_path))
    candidates.extend(DEFAULT_CONFIG_PATHS)
    for p in candidates:
        if p.is_file():
            with open(p, "rb") as f:
                return Settings.from_dict(tomllib.load(f))
    return Settings()
src/pyunit/history.py
from __future__ import annotations

import json
import os
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from pathlib import Path

XDG_DATA = Path(os.environ.get("XDG_DATA_HOME", Path.home() / ".local" / "share"))
APPDATA = Path(os.environ.get("APPDATA", XDG_DATA))
DEFAULT_HISTORY = XDG_DATA / "pyunit" / "history.jsonl"
WINDOWS_HISTORY = APPDATA / "pyunit" / "history.jsonl"


@dataclass
class Record:
    ts: str
    unit_from: str
    unit_to: str
    value_in: float
    value_out: float


def _history_path() -> Path:
    # Prefer Windows APPDATA when available
    if os.name == "nt":
        return WINDOWS_HISTORY
    return DEFAULT_HISTORY


def append_record(unit_from: str, unit_to: str, value_in: float, value_out: float) -> None:
    path = _history_path()
    path.parent.mkdir(parents=True, exist_ok=True)
    rec = Record(
        ts=datetime.now(timezone.utc).isoformat(),
        unit_from=unit_from,
        unit_to=unit_to,
        value_in=float(value_in),
        value_out=float(value_out),
    )
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(rec)) + "\n")
src/pyunit/cli.py
from __future__ import annotations

import argparse
import sys

from .config import load_settings
from .convert import ConversionError, convert
from .history import append_record


def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(prog="pyunit", description="CLI unit converter")
    p.add_argument("--from", dest="unit_from", required=False, help="source unit e.g. m, C, kg")
    p.add_argument("--to", dest="unit_to", required=False, help="target unit e.g. ft, F, lb")
    p.add_argument("--value", type=float, required=False, help="numeric value to convert")
    p.add_argument("--precision", type=int, help="decimal places for output")
    p.add_argument("--config", type=str, help="path to config.toml")
    return p


def main(argv: list[str] | None = None) -> int:
    parser = build_parser()
    # argparse falls back to sys.argv[1:] when argv is None, so an explicitly
    # passed empty list is respected rather than silently replaced.
    args = parser.parse_args(argv)
    settings = load_settings(args.config)

    unit_from = args.unit_from or settings.default_from
    unit_to = args.unit_to or settings.default_to
    if unit_from is None or unit_to is None:
        parser.error("--from/--to required (or set defaults in config)")
    if args.value is None:
        parser.error("--value is required")
    precision = args.precision if args.precision is not None else settings.precision

    try:
        out = convert(args.value, unit_from, unit_to)
    except ConversionError as e:
        print(f"Error: {e}", file=sys.stderr)
        return 2

    append_record(unit_from, unit_to, args.value, out)
    print(f"{out:.{precision}f}")
    return 0


if __name__ == "__main__":  # pragma: no cover
    raise SystemExit(main())
tests/test_convert.py
from __future__ import annotations

import pytest

from pyunit.convert import ConversionError, convert


def test_length_reference_cases():
    assert pytest.approx(convert(1, "m", "cm"), rel=1e-12) == 100
    assert pytest.approx(convert(1, "km", "m"), rel=1e-12) == 1000
    assert pytest.approx(convert(1, "ft", "in"), rel=1e-12) == 12


def test_mass_reference_cases():
    assert pytest.approx(convert(1000, "g", "kg"), rel=1e-12) == 1
    assert pytest.approx(convert(16, "oz", "lb"), rel=1e-12) == 1


def test_temperature_round_trip():
    # C -> F -> C round-trip
    c = 37.5
    f = convert(c, "C", "F")
    back = convert(f, "F", "C")
    assert pytest.approx(back, abs=1e-9) == c

    # K -> C -> K round-trip
    k = 250.0
    c2 = convert(k, "K", "C")
    back2 = convert(c2, "C", "K")
    assert pytest.approx(back2, abs=1e-9) == k


def test_unsupported_pair():
    with pytest.raises(ConversionError):
        convert(1, "m", "kg")
tests/test_cli.py
from __future__ import annotations

import os
import subprocess
import sys


def run_cli(args, home):
    """Run `python -m pyunit.cli`; the package must be importable (e.g. `pip install -e .`).

    Config and history lookups are pointed at `home` so the tests never read
    or write the real user files.
    """
    env = {
        **os.environ,
        "XDG_CONFIG_HOME": str(home),
        "XDG_DATA_HOME": str(home),
        "APPDATA": str(home),
    }
    env.pop("PYUNIT_CONFIG", None)
    return subprocess.run(
        [sys.executable, "-m", "pyunit.cli", *args],
        capture_output=True,
        text=True,
        env=env,
    )


def test_cli_basic_success(tmp_path):
    cfg = tmp_path / "config.toml"
    cfg.write_text(
        """
[defaults]
precision = 2
from = "m"
to = "ft"
""".strip()
    )
    res = run_cli(["--value", "3", "--config", str(cfg)], tmp_path)
    assert res.returncode == 0
    assert res.stdout.strip()  # numeric output present


def test_cli_requires_args(tmp_path):
    res = run_cli(["--value", "2"], tmp_path)  # missing from/to and no config
    assert res.returncode != 0
    assert "required" in res.stderr.lower()
Tip: Copy each code block to your repo files using the same paths. After creating the repo, run the quick start commands in the README.