4. Statistical Summaries
Methods covered: df['series'].describe(), df['series'].count(), df['series'].sum(), df['series'].mean(), df['series'].median(), df['series'].mode(), df['series'].min(), df['series'].max(), df['series'].std(), df['series'].var(), df['series'].nunique(), df['series'].unique(), df['series'].value_counts(), df['series'].idxmax(), df['series'].idxmin()
We'll use the "diamonds" dataset from Seaborn, whose numeric columns (carat, depth, table, price, x, y, z) are well suited to statistical summaries.
```python
import pandas as pd
import seaborn as sns

# Load diamonds dataset
df = sns.load_dataset('diamonds')
```
```
       carat        cut color clarity  depth  table  price     x     y     z
0       0.23      Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1       0.21    Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2       0.23       Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3       0.29    Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4       0.31       Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
...      ...        ...   ...     ...    ...    ...    ...   ...   ...   ...
53935   0.72      Ideal     D     SI1   60.8   57.0   2757  5.75  5.76  3.50
53936   0.72       Good     D     SI1   63.1   55.0   2757  5.69  5.75  3.61
53937   0.70  Very Good     D     SI1   62.8   60.0   2757  5.66  5.68  3.56
53938   0.86    Premium     H     SI2   61.0   58.0   2757  6.15  6.12  3.74
53939   0.75      Ideal     D     SI2   62.2   55.0   2757  5.83  5.87  3.64

[53940 rows x 10 columns]
```
1. Summary Statistics
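The summary-statistic methods can be tried on a small sample. The sketch below is illustrative only: the `sample` frame is a hand-copied subset (three columns of the first five rows shown in the preview above), so its numbers will not match the full dataset.

```python
import pandas as pd

# Hand-copied subset of the first five diamonds rows (an assumption made
# so the example runs without downloading the dataset).
sample = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31],
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good"],
    "price": [326, 326, 327, 334, 335],
})
price = sample["price"]

print(price.describe())            # count, mean, std, min, quartiles, max
print(price.count())               # non-NA values: 5
print(price.sum())                 # 1648
print(price.mean())                # 329.6
print(price.median())              # 327.0
print(price.mode().tolist())       # [326]; mode() returns a Series
print(price.min(), price.max())    # 326 335
print(price.std(), price.var())    # sample std and variance (ddof=1)
```

On the full dataset, the same calls would be made on `df['price']`.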
2. Unique Values & Counting
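A minimal sketch of the counting methods on a categorical column follows. It again assumes a small hand-copied `sample` (the `cut` values of the first five rows in the preview), not the full dataset.

```python
import pandas as pd

# Hand-copied "cut" values from the first five diamonds rows (assumption:
# a tiny stand-in for df['cut']).
sample = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good"],
})
cut = sample["cut"]

n_distinct = cut.nunique()     # number of distinct values
distinct = cut.unique()        # distinct values, in order of first appearance
counts = cut.value_counts()    # Series mapping each value to its frequency

print(n_distinct)              # 3
print(list(distinct))          # ['Ideal', 'Premium', 'Good']
print(counts)
```

Passing `normalize=True` to `value_counts()` returns proportions instead of raw counts.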
3. Index Locations
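The following sketch shows how `idxmax()` and `idxmin()` return index labels that can be passed to `.loc` to retrieve the full record. The `sample` frame and the letter labels in `relabeled` are hypothetical, chosen only to make the label-versus-position distinction visible.

```python
import pandas as pd

# Hand-copied subset of the first five diamonds rows (assumption).
sample = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good"],
    "price": [326, 326, 327, 334, 335],
})
price = sample["price"]

top = price.idxmax()   # label of the first maximum
low = price.idxmin()   # label of the first minimum (rows 0 and 1 tie at 326)

print(top, low)              # 4 0
print(sample.loc[top])       # full record of the most expensive row

# With a non-default index, the result is a label rather than a position.
relabeled = pd.Series(price.to_numpy(), index=list("abcde"))
print(relabeled.idxmax())    # e
```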
Summary Table
| Task | Method | Key Notes |
| --- | --- | --- |
| Quick stats overview | describe() | Includes count/mean/std/min/max/quartiles |
| Count non-NA values | count() | Useful for data quality checks |
| Sum of values | sum() | Total aggregation |
| Average value | mean() | Sensitive to outliers |
| Middle value | median() | Robust to outliers |
| Most frequent value | mode() | May return multiple results |
| Minimum value | min() | |
| Maximum value | max() | |
| Standard deviation | std() | Measures spread |
| Variance | var() | Square of the standard deviation (std²) |
| Number of unique values | nunique() | |
| List unique values | unique() | Returns NumPy array |
| Frequency counts | value_counts() | Best for categorical data |
| Index of max value | idxmax() | Returns the index label, not the value |
| Index of min value | idxmin() | Returns the index label, not the value |
Key Insights from Diamonds Dataset
- Price Distribution:
  - Mean price ($3,933) > Median ($2,401) → Right-skewed distribution (a few expensive diamonds pull up the average).
  - The large std ($3,989), roughly equal to the mean, indicates high price variability.
- Cut Quality:
  - value_counts() shows "Ideal" is the most common cut (21,551 diamonds).
- Extreme Values:
  - Cheapest diamond is at index 0 ($326); index 1 has the same price, and idxmin() returns the first occurrence.
  - Most expensive is at index 27749 ($18,823).
When to Use What?
- Exploratory Analysis: Start with describe().
- Data Quality: Check count() vs dataset size for missing values.
- Categorical Data: value_counts() + nunique().
- Outlier Detection: Compare mean() vs median().
- Locating Extremes: idxmax()/idxmin() to find records.
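Two of these checks can be sketched in a few lines. The `toy` frame below is hypothetical (it is not part of the diamonds data) and includes one missing value and one extreme price so that both checks have something to report.

```python
import pandas as pd

# Hypothetical prices: one missing entry and one extreme value.
toy = pd.DataFrame({"price": [326.0, 327.0, float("nan"), 334.0, 18823.0]})

# Data quality: count() skips NaN, so the gap to len() is the number missing.
missing = len(toy) - toy["price"].count()

# Outlier hint: a mean far above the median suggests a right-skewed column.
mean_price = toy["price"].mean()
median_price = toy["price"].median()

print(missing)                     # 1
print(mean_price, median_price)    # 4952.5 330.5
```

A large gap between mean and median is only a hint; inspecting the rows found with idxmax() or idxmin() is a reasonable next step.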