4. Statistical Summaries
- df['series'].describe()
- df['series'].count()
- df['series'].sum()
- df['series'].mean()
- df['series'].median()
- df['series'].mode()
- df['series'].min()
- df['series'].max()
- df['series'].std()
- df['series'].var()
- df['series'].nunique()
- df['series'].unique()
- df['series'].value_counts()
- df['series'].idxmax()
- df['series'].idxmin()
We'll use the "diamonds" dataset from Seaborn, whose numerical columns (such as carat and price) are well suited to statistical analysis.
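As a rough sketch of the setup: assuming seaborn is installed and the data can be downloaded, the full dataset would normally come from sns.load_dataset("diamonds"). The block below instead builds a small hand-made stand-in so that it runs offline. Its values are hypothetical and only mimic three of the diamonds columns.

```python
import pandas as pd

# With seaborn installed and network access, the full dataset can be loaded with:
#   import seaborn as sns
#   df = sns.load_dataset("diamonds")

# Hypothetical stand-in sample (illustrative values, not real records).
df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31, 0.70, 1.01, 2.29],
    "cut": ["Ideal", "Premium", "Good", "Premium", "Ideal", "Ideal", "Ideal", "Premium"],
    "price": [326, 326, 327, 334, 335, 2757, 4500, 18823],
})

print(df.shape)   # (rows, columns)
print(df.head())  # first five rows
```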
1. Summary Statistics
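A minimal sketch of the summary-statistics methods, run on a hypothetical price column (illustrative values, not real diamonds records). On the real data the same calls would be applied to df['price'].

```python
import pandas as pd

# Hypothetical prices for illustration only.
price = pd.Series([326, 326, 327, 334, 335, 2757, 4500, 18823], name="price")

overview = price.describe()  # count, mean, std, min, quartiles, max in one call
n = price.count()            # number of non-NA values
total = price.sum()
mean = price.mean()
median = price.median()
modes = price.mode()         # returned as a Series because ties are possible
lowest, highest = price.min(), price.max()
std = price.std()            # sample standard deviation (ddof=1 by default)
var = price.var()            # sample variance, the square of std

print(overview)
print(f"count={n}, sum={total}, mean={mean}, median={median}")
print(f"mode={modes.tolist()}, min={lowest}, max={highest}")
print(f"std={std:.2f}, var={var:.2f}")
```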
2. Unique Values & Counting
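A short sketch of the counting methods on a hypothetical cut column (the labels are borrowed from the diamonds dataset, but the counts here are made up for illustration).

```python
import pandas as pd

# Hypothetical cut labels for illustration only.
cut = pd.Series(
    ["Ideal", "Premium", "Good", "Premium", "Ideal", "Ideal", "Ideal", "Premium"],
    name="cut",
)

n_unique = cut.nunique()                   # number of distinct values
unique_vals = cut.unique()                 # distinct values, in order of first appearance
counts = cut.value_counts()                # frequency per value, most common first
shares = cut.value_counts(normalize=True)  # same counts as fractions of the total

print(n_unique)
print(list(unique_vals))
print(counts)
print(shares)
```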
3. Index Locations
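A minimal sketch of idxmin() and idxmax() on a hypothetical sample. Both return the index label of the extreme value, so the matching row is fetched with .loc. The filtered frame at the end shows that the label need not equal the row position.

```python
import pandas as pd

# Hypothetical sample for illustration only.
df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31, 0.70, 1.01, 2.29],
    "price": [326, 326, 327, 334, 335, 2757, 4500, 18823],
})

cheapest_label = df["price"].idxmin()  # on ties, the first matching label is returned
priciest_label = df["price"].idxmax()
priciest_row = df.loc[priciest_label]  # .loc looks the row up by label

# After filtering, the original labels are kept, so labels and positions differ.
sub = df[df["price"] > 330]
sub_label = sub["price"].idxmin()      # label of the cheapest remaining row

print(cheapest_label, priciest_label, sub_label)
print(priciest_row)
```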
Summary Table
| Task | Method | Key Notes |
| --- | --- | --- |
| Quick stats overview | describe() | Includes count/mean/std/min/max/quartiles |
| Count non-NA values | count() | Useful for data quality checks |
| Sum of values | sum() | Total aggregation |
| Average value | mean() | Sensitive to outliers |
| Middle value | median() | Robust to outliers |
| Most frequent value | mode() | May return multiple results |
| Minimum value | min() | |
| Maximum value | max() | |
| Standard deviation | std() | Measures spread |
| Variance | var() | Equals std² |
| Number of unique values | nunique() | |
| List unique values | unique() | Returns an array of distinct values (a NumPy array for NumPy-backed dtypes) |
| Frequency counts | value_counts() | Best for categorical data |
| Index of max value | idxmax() | Returns the index label, not the value |
| Index of min value | idxmin() | Returns the index label, not the value |
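Two notes from the table can be checked with a quick sketch on made-up numbers: mode() returns every tied value, and var() is the square of std(). Both std() and var() use the sample formula (ddof=1) unless told otherwise.

```python
import pandas as pd

# Hypothetical values with a tie for the most frequent value.
scores = pd.Series([1, 1, 2, 2, 3])

modes = scores.mode()                # Series holding every tied value, sorted
sample_var = scores.var()            # ddof=1 by default (sample variance)
population_var = scores.var(ddof=0)  # population variance
sample_std = scores.std()            # also ddof=1 by default

print(modes.tolist())
print(sample_var, population_var, sample_std ** 2)
```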
Key Insights from Diamonds Dataset
- Price Distribution:
  - Mean price ($3,933) > Median ($2,401) → right-skewed distribution (a few expensive diamonds pull up the average).
  - The large standard deviation ($3,989), roughly equal to the mean, indicates high price variability.
- Cut Quality:
  - value_counts() shows that "Ideal" is the most common cut (21,551 diamonds).
- Extreme Values:
  - The cheapest diamond is at index 0 ($326).
  - The most expensive diamond is at index 27749 ($18,823).
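The skew reasoning above can be sketched as a small check. The prices here are hypothetical, so the numbers differ from the insights listed, but the same comparison applies to df['price'] on the real data. The names gap_ratio and cv are illustrative, not pandas terms.

```python
import pandas as pd

# Hypothetical prices for illustration only.
price = pd.Series([326, 326, 327, 334, 335, 2757, 4500, 18823], name="price")

mean, median = price.mean(), price.median()
gap_ratio = mean / median   # values well above 1 suggest a long right tail
skewness = price.skew()     # positive for right-skewed data
cv = price.std() / mean     # coefficient of variation: spread relative to the mean

right_skewed = bool(mean > median and skewness > 0)

print(f"mean={mean}, median={median}, mean/median={gap_ratio:.2f}")
print(f"skew={skewness:.2f}, cv={cv:.2f}, right_skewed={right_skewed}")
```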
When to Use What?
- Exploratory Analysis: Start with describe().
- Data Quality: Check count() vs dataset size for missing values.
- Categorical Data: value_counts() + nunique().
- Outlier Detection: Compare mean() vs median().
- Locating Extremes: idxmax() / idxmin() to find records.
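As one example of the data-quality check, the sketch below compares count() with the number of rows to get missing values per column. The frame and the helper name missing_report are hypothetical, introduced only for illustration.

```python
import pandas as pd

# Hypothetical frame with a few missing entries.
df = pd.DataFrame({
    "cut": ["Ideal", None, "Good", "Premium", "Ideal"],
    "price": [326.0, 327.0, None, 334.0, None],
})

def missing_report(frame: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column (rows minus non-NA count)."""
    return len(frame) - frame.count()

missing = missing_report(df)
print(missing)
```

The same result is available from df.isna().sum(); the count() form is shown because it mirrors the comparison described in the list.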