CHAPTER 2: DESCRIPTIVE STATISTICS
Descriptive statistics summarises data to give insight into its distribution and variability. It measures the central tendency and dispersion of the data.
Types of Data
- Categorical data (qualitative data)
- Numerical data (quantitative data)
Measures of centre: The most common measures of central tendency are the mean and the median.
- The mean is sensitive to outliers
- The median is not affected by outliers
- Always analyse data using both the mean and the median
Suppose you’re part of an NBA team trying to negotiate salaries. If you represent the owners, you want to show how much everyone is making and how much you’re spending, so you report the average (mean), which is pulled upward by the few superstar players who earn very high salaries (the outliers in your data).
But if you’re on the side of the players, you want to report the median, because that’s more representative of what the players in the middle are making. Fifty percent of the players make a salary above the median, and 50% make a salary below the median.
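The effect of a few very high salaries on the mean can be sketched in a few lines of Python. The salary figures below are made up purely for illustration (they are not real NBA data), and the variable names are hypothetical.

```python
import statistics

# Hypothetical salaries in millions of dollars (illustrative only, not real data):
# eight typical players and two superstars.
salaries = [2, 2, 3, 3, 4, 4, 5, 5, 40, 45]

mean_salary = statistics.mean(salaries)      # pulled upward by the two superstars
median_salary = statistics.median(salaries)  # middle of the sorted salaries

print(f"Mean salary:   {mean_salary} million")
print(f"Median salary: {median_salary} million")
```

With these made-up numbers the mean (11.3) is almost three times the median (4), which is why the two sides of the negotiation would prefer different summaries.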
Measures of Variability
Variability measures how much the data vary around the average. Variation always exists in a data set, regardless of which characteristic you’re measuring, because not every individual will have exactly the same value for every characteristic you measure. Without a measure of variability you can’t compare two data sets effectively.
Two data sets can have the same mean and median but different variability. For example, the data sets 199, 200, 201 and 0, 200, 400 both have the same average, which is 200, and the same median, which is also 200.
Yet they have very different amounts of variability. The first data set has a very small amount of variability compared to the second.
By far the most commonly used measure of variability is the standard deviation. The standard deviation of a data set, denoted by s, represents the typical distance from any point in the data set to the center. It’s roughly the average distance from the center, and in this case, the center is the average.
Here are some properties that can help you when interpreting a standard deviation:
- The standard deviation can never be a negative number.
- The smallest possible value for the standard deviation is 0 (when every number in the data set is exactly the same).
- Standard deviation is affected by outliers, as it’s based on distance from the mean, which is affected by outliers.
- The standard deviation has the same units as the original data, while variance is in square units.
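As a minimal sketch, the two small data sets from the text (199, 200, 201 and 0, 200, 400) can be compared directly. The sketch assumes the sample standard deviation (dividing by n - 1), which matches the notation s used above; the variable names are illustrative.

```python
import statistics

small_spread = [199, 200, 201]
large_spread = [0, 200, 400]

# Same centre for both data sets.
mean_small, mean_large = statistics.mean(small_spread), statistics.mean(large_spread)
median_small, median_large = statistics.median(small_spread), statistics.median(large_spread)

# Sample standard deviation s (divides by n - 1), in the same units as the data.
s_small = statistics.stdev(small_spread)
s_large = statistics.stdev(large_spread)

# Sample variance is the square of s, so it is in squared units.
var_small = statistics.variance(small_spread)
var_large = statistics.variance(large_spread)

print(f"199, 200, 201: mean={mean_small}, median={median_small}, s={s_small}, variance={var_small}")
print(f"0, 200, 400:   mean={mean_large}, median={median_large}, s={s_large}, variance={var_large}")
```

Both data sets share a mean and median of 200, but s is 1 for the first and 200 for the second, which puts a number on the difference in variability.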
Percentiles
Percentiles show the relative position of a number within the data set. A percentile is the percentage of individuals in the data set who are below where your particular number is located. If your exam score is at the 90th percentile, for example, that means 90% of the people taking the exam with you scored lower than you did (it also means that 10% scored higher than you did).
The median is the 50th percentile, the point in the data where 50% of the data fall below that point and 50% fall above it.
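A minimal sketch of this idea, assuming the simple "percentage of values strictly below" definition given above (software packages use several slightly different percentile definitions). The scores and the helper name percent_below are hypothetical.

```python
def percent_below(data, value):
    """Return the percentage of values in data that are strictly below value."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)

# Hypothetical exam scores for ten students (illustrative only).
scores = [52, 58, 61, 67, 70, 74, 79, 83, 88, 95]

rank_of_95 = percent_below(scores, 95)  # 9 of the 10 scores are lower
rank_of_74 = percent_below(scores, 74)  # 5 of the 10 scores are lower

print(f"A score of 95 is at the {rank_of_95:.0f}th percentile")
print(f"A score of 74 is at the {rank_of_74:.0f}th percentile")
```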
The U.S. Census Bureau usually reports household income in percentiles. The Bureau reports various percentiles for household income, including the 10th, 20th, 50th, 80th, 90th, and 95th. Table 2-1 shows the values of each of these percentiles.
- Looking at these percentiles, you can see that the bottom half of the incomes are closer together than are the top half.
- The difference between the 50th percentile and the 20th percentile is about $24,000, whereas the spread between the 50th percentile and the 80th percentile is more like $41,000.
- And the difference between the 10th and 50th percentiles is only about $31,000, whereas the difference between the 90th and the 50th percentiles is a whopping $74,000.
The Five Number Summary
The five-number summary is a set of five descriptive statistics that provides a concise overview of how data in a dataset is distributed. It helps you understand the:
- Center: Where the middle of the data lies (represented by the median).
- Spread: How much the data points are scattered around the center (represented by the quartiles).
- Range: The overall span of the data (represented by the minimum and maximum values).
Here's a breakdown of the five numbers in the summary:
- Minimum: The smallest value in the dataset.
- First Quartile (Q1): The value below which 25% of the data fall.
- Median: The middle value when the data is ordered from least to greatest. It represents the 50th percentile.
- Third Quartile (Q3): The value below which 75% of the data fall.
- Maximum: The largest value in the dataset.
By analyzing these five numbers, you can gain valuable insights into the characteristics of your data without getting bogged down in every single data point.
import numpy as np

# Exam scores (25 values)
numbers = [43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77, 78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99]

# 1. Minimum (named "minimum" to avoid shadowing the built-in min)
minimum = np.min(numbers)
# 2. First quartile (25th percentile)
percentile_25 = np.percentile(numbers, 25)
# 3. Median (50th percentile)
median = np.median(numbers)
# 4. Third quartile (75th percentile)
percentile_75 = np.percentile(numbers, 75)
# 5. Maximum (named "maximum" to avoid shadowing the built-in max)
maximum = np.max(numbers)

print(f"Minimum: {minimum}")
print(f"Percentile 25: {percentile_25}")
print(f"Median: {median}")
print(f"Percentile 75: {percentile_75}")
print(f"Maximum: {maximum}")

Output:
Minimum: 43
Percentile 25: 68.0
Median: 77.0
Percentile 75: 89.0
Maximum: 99
CHAPTER 3: CHARTS AND GRAPHS
PIE CHART
- This shows the percentage of each individual item in relation to the total
- The sum of all the slices in the pie chart should be 100%
- Because a pie chart is a circle, categories can easily be compared and contrasted to one another.
- A pie chart only shows the percentage in each group, not the number in each group. Always ask for or look for a report of the total size of the data set.
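The relationship between counts, slice percentages, and the total can be sketched as follows. The survey counts are hypothetical and only illustrate why the total should be reported alongside the chart.

```python
# Hypothetical survey counts of how people travel to work (illustrative only).
counts = {"Bus": 60, "Car": 90, "Bicycle": 30, "Walk": 20}

total = sum(counts.values())  # the pie chart itself does not show this number

# Each slice is the category's share of the total, expressed as a percentage.
percentages = {category: 100 * count / total for category, count in counts.items()}

for category, pct in percentages.items():
    print(f"{category}: {pct:.1f}%")
print(f"Sum of slices: {sum(percentages.values()):.1f}% (total respondents: {total})")
```

The same percentages would result from 2,000 respondents or from 20, which is why the size of the data set matters when reading a pie chart.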
HISTOGRAM
- One of the features of a histogram is that it shows the shape of the data (how the data are distributed among the groups). There are three major shapes in a data set: symmetric, skewed right, and skewed left.
- Another insight that can be drawn from a histogram is the variability of the data.
- If the histogram is quite flat, with the bars close to the same height, it may look as though the data have little variability, but the opposite is true.
- A histogram with a big lump in the middle and tails on the sides indicates more data in the middle bars than the outer bars, so the data are actually closer together.
- When the heights of the histogram bars appear flat (uniform), the values are spread out uniformly over many groups, indicating a great deal of variability in the data at one point in time.
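The point about flat versus lumped histograms can be checked with a small sketch. Both data sets below are hypothetical, and text_histogram is an illustrative helper that draws one row of "#" marks per value rather than a real chart.

```python
import statistics
from collections import Counter

# Hypothetical data: same range (1 to 5) and same mean (3), different shapes.
peaked = [1, 2, 2, 3, 3, 3, 3, 4, 4, 5]  # big lump in the middle
flat = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]    # bars all the same height

def text_histogram(data):
    """Return a simple text histogram: one row of '#' per distinct value."""
    counts = Counter(data)
    return "\n".join(f"{value}: {'#' * counts[value]}" for value in sorted(counts))

sd_peaked = statistics.stdev(peaked)
sd_flat = statistics.stdev(flat)

print("Peaked data:")
print(text_histogram(peaked))
print(f"standard deviation = {sd_peaked:.2f}\n")
print("Flat data:")
print(text_histogram(flat))
print(f"standard deviation = {sd_flat:.2f}")
```

The flat data set has the larger standard deviation, in line with the claim that a uniform-looking histogram signals more variability, not less.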
BOXPLOT
- A boxplot is a one-dimensional graph of numerical data based on the five-number summary of descriptive statistics.
- A boxplot can show information about the distribution, variability, and center of a data set.
A boxplot, also called a box-and-whisker plot, is a visualization tool used to understand the distribution of data in a dataset. It provides a quick summary of how the data points are spread out and reveals potential outliers. Here's how a boxplot represents the distribution of data:
Elements of a Boxplot:
- Box: The box represents the middle 50% of the data. The line in the middle of the box is the median, which divides the data into two halves with an equal number of points on either side.
- Whiskers: The lines extending from the box are called whiskers. They cover the lowest 25% and the highest 25% of the data, excluding any outliers. A common convention is to let each whisker reach at most 1.5 times the interquartile range (IQR = Q3 - Q1) beyond the box:
- Upper whisker: Extends from the 75th percentile (Q3) up to the largest data point that is not an outlier.
- Lower whisker: Extends from the 25th percentile (Q1) down to the smallest data point that is not an outlier.
- Outliers: Data points that fall outside the range of the whiskers are considered outliers and are typically depicted as individual points beyond the whiskers.
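The elements above can be computed by hand as a rough sketch. It uses the exam scores from the five-number summary example and assumes the common 1.5 × IQR rule for the whisker limits; the text does not name a specific rule, so treat that choice, the "inclusive" quartile method, and the variable names as assumptions.

```python
import statistics

# Exam scores from the five-number summary example.
scores = [43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77, 78, 79,
          85, 87, 88, 89, 93, 95, 96, 98, 99, 99]

# Quartiles: the edges of the box (Q1, Q3) and the line inside it (median).
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1  # height of the box

# Assumed convention: points more than 1.5 * IQR beyond the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

# Whiskers end at the most extreme points that are not outliers.
inside = [x for x in scores if lower_fence <= x <= upper_fence]
lower_whisker, upper_whisker = min(inside), max(inside)

print(f"Box: {q1} to {q3}, median {median}, IQR {iqr}")
print(f"Fences: {lower_fence} to {upper_fence}")
print(f"Whiskers: {lower_whisker} to {upper_whisker}")
print(f"Outliers: {outliers}")
```

Other conventions exist (for example, whiskers drawn straight to the minimum and maximum), so the fences here should be read as one common choice rather than the only definition.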
Understanding Distribution through a Boxplot:
By looking at a boxplot, you can gain valuable insights about the data distribution:
- Center: The position of the median line indicates the central tendency of the data. A median that sits closer to one end of the box than the other hints that the middle 50% of the data is skewed.
- Spread: The size of the box shows how spread out the middle 50% of the data is. A larger box indicates a wider spread, while a smaller box suggests the data is more concentrated around the median.
- Symmetry: The lengths of the whiskers can tell you if the data is symmetrical or skewed. If the upper and lower whiskers are roughly equal in length, the data is likely symmetrical. If one whisker is significantly longer, the data is skewed in that direction.
- Outliers: The presence of outliers can indicate extreme values in the data that may require further investigation.
A boxplot of the exam scores from the five-number summary example visualizes the distribution of the scores in the data set. Here's a detailed breakdown of what the plot reveals:
Center:
- The median line (horizontal line inside the box) sits at 77. This indicates that half the scores fall below 77 and the other half fall above 77.
Spread:
- The box itself extends from 68 (Q1) to 89 (Q3). This shows that the middle 50% of the scores are concentrated within this range. There's a moderate spread in the data, but it's not extremely wide.
Symmetry:
- The whiskers (the lines extending from the box) differ in length. The upper whisker reaches the maximum of 99, 10 points above the box, while the lower whisker goes down to the minimum of 43, 25 points below the box. The longer lower whisker suggests a mild skew towards lower scores, although the median sits close to the middle of the box.
Outliers:
- Under the common 1.5 × IQR rule, a score is an outlier if it lies below 68 - 1.5 × 21 = 36.5 or above 89 + 1.5 × 21 = 120.5. No score falls outside that range, so the plot shows no outliers and the whiskers end at the minimum (43) and the maximum (99).
Overall, the boxplot suggests:
- The scores are fairly clustered around a central tendency of 77.
- There's a moderate spread in the data, but it's not extremely wide.
- The distribution is roughly symmetrical, with a possible mild skew towards lower scores.
- There are no outliers under the 1.5 × IQR rule, although the lowest score (43) sits well below the rest.
This information can be helpful for understanding the overall performance in the exam and identifying any potential areas of concern, such as a significant number of low scores or a large gap between the typical scores and the extremes.
CHAPTER 4: THE BINOMIAL DISTRIBUTION
Understanding random variables is important to the concept of the binomial distribution. Simply, a random variable is a characteristic, measurement, or count that changes randomly according to some set of probabilities; it is denoted by a capital letter such as X, Y, or Z.
Applying the concept of probability to a random variable, it can be said that “A list of all possible values of a random variable, along with their probabilities is called a probability distribution.” One of the most well-known probability distributions is the binomial. Binomial means “two names” and is associated with situations involving two outcomes: success or failure (hitting a red light or not; developing a side effect or not).
Characteristics of Binomial Distribution
- There are a fixed number of trials (n).
- Each trial has two possible outcomes: success or failure.
- The probability of success (call it p) is the same for each trial.
- The trials are independent, meaning the outcome of one trial doesn’t influence that of any other.
Application
- How many days do I study in a week? (n = 7)
- Possible outcomes: reading on a given day = success and not reading = failure
- Probability of success = p and probability of failure = 1 - p, the same for every day
- Independent trials, as studying on one day does not affect studying on any other day.
QUESTION:
Studies have shown that colour blindness affects about 8% of men. A random sample of 10 men is selected.
SOLUTION
Examining the question based on the four characteristics of the binomial distribution:
- Number of trials (n) = 10
- Possible outcomes: a man being colour blind is a success; otherwise it is a failure
- Probability of success (p): 8%, and probability of failure (q): 92%
- Independent trials: whether one man is colour blind does not affect any other man.
Find the probability that:
- All 10 men are colour blind
- No men are colour blind
- Exactly 2 men are colour blind
- At least 2 men are colour blind
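One way to work these out is with the standard binomial probability formula, P(X = k) = C(n, k) × p^k × (1 - p)^(n - k), which the notes above do not state explicitly. The sketch below applies it with n = 10 and p = 0.08; the function name binomial_pmf is illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial random variable with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.08  # 10 men sampled, 8% chance each is colour blind

p_all = binomial_pmf(10, n, p)       # all 10 men are colour blind
p_none = binomial_pmf(0, n, p)       # no men are colour blind
p_exactly_2 = binomial_pmf(2, n, p)  # exactly 2 men are colour blind
# "At least 2" is the complement of "0 or 1".
p_at_least_2 = 1 - binomial_pmf(0, n, p) - binomial_pmf(1, n, p)

print(f"P(all 10)     = {p_all:.3e}")
print(f"P(none)       = {p_none:.4f}")
print(f"P(exactly 2)  = {p_exactly_2:.4f}")
print(f"P(at least 2) = {p_at_least_2:.4f}")
```

With p = 0.08 these work out to roughly 0.434 for no men, 0.148 for exactly 2, 0.188 for at least 2, and about 1.07e-11 for all 10. Using the complement for "at least 2" avoids summing nine separate terms.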