Unit-1 MCQs
Data Science and Machine Learning
Basic EDA Concepts
1. What is the primary goal of Exploratory Data Analysis (EDA)?
a) Predicting future outcomes
b) Summarizing main characteristics of data
c) Building machine learning models
d) Automating data collection
Answer: b) Summarizing main characteristics of data
2. Which Python library is widely used for EDA?
a) TensorFlow
b) OpenCV
c) Pandas
d) Scikit-learn
Answer: c) Pandas
3. Which of the following plots is most suitable for visualizing the distribution of a numerical variable?
a) Pie chart
b) Bar chart
c) Histogram
d) Line chart
Answer: c) Histogram
4. What is a common way to check for missing values in a Pandas DataFrame?
a) df.describe()
b) df.isnull().sum()
c) df.sort_values()
d) df.corr()
Answer: b) df.isnull().sum()
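The missing-value check can be sketched as follows. The DataFrame and its column names are hypothetical toy data, not part of the original question.

```python
import pandas as pd

# Hypothetical toy data; None marks a missing entry.
df = pd.DataFrame({
    "age": [25, None, 31, 47],
    "city": ["Pune", "Delhi", None, None],
})

# isnull() returns a boolean frame; sum() counts the True values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

Each entry of the result is the number of missing values in that column, which makes it easy to spot columns that need cleaning.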
5. Which measure of central tendency is most resistant to outliers?
a) Mean
b) Median
c) Mode
d) Standard deviation
Answer: b) Median
6. Which visualization tool is best for detecting outliers?
a) Histogram
b) Bar chart
c) Scatter plot
d) Box plot
Answer: d) Box plot
7. Which function provides summary statistics for a Pandas DataFrame?
a) df.head()
b) df.describe()
c) df.shape
d) df.dtypes
Answer: b) df.describe()
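A minimal sketch of what the summary looks like, assuming a hypothetical single-column DataFrame named `df`:

```python
import pandas as pd

# Hypothetical numeric column used only for illustration.
df = pd.DataFrame({"score": [10, 20, 30, 40, 50]})

# describe() reports count, mean, std, min, quartiles and max per numeric column.
summary = df.describe()
print(summary)
```

The `50%` row is the median, so a single call covers several of the measures asked about in this unit.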
8. Which of the following helps in detecting multicollinearity?
a) Heatmap
b) Box plot
c) Scatter plot
d) Histogram
Answer: a) Heatmap
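A heatmap used for this purpose is normally drawn from a correlation matrix. The sketch below builds only the matrix, on hypothetical columns `x`, `y` and `z`; the plotting step is left out.

```python
import pandas as pd

# Hypothetical features: y is exactly 2 * x, so the pair is perfectly collinear.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 3, 4, 1, 2],
})

# Pairwise Pearson correlations; values near +1 or -1 flag collinear pairs.
corr = df.corr()
print(corr.round(2))
```

Passing `corr` to a plotting function such as seaborn's `heatmap` would colour-code the same numbers.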
9. Which of the following can be used to detect skewness in data?
a) Scatter plot
b) Histogram
c) Box plot
d) Line plot
Answer: b) Histogram
10. What does a correlation coefficient of 0 indicate?
a) Perfect negative correlation
b) Perfect positive correlation
c) No correlation
d) Strong correlation
Answer: c) No correlation
Advanced EDA Concepts
11. Which statistical measure quantifies the spread of data?
a) Mean
b) Standard deviation
c) Median
d) Mode
Answer: b) Standard deviation
12. Which Pandas function is used to detect duplicate rows in a dataset?
a) df.dropna()
b) df.duplicated()
c) df.isnull()
d) df.fillna()
Answer: b) df.duplicated()
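As a small illustration on a hypothetical DataFrame, `duplicated()` flags repeated rows and `drop_duplicates()` removes them:

```python
import pandas as pd

# Hypothetical data: the third row repeats the second.
df = pd.DataFrame({"id": [1, 2, 2, 3], "name": ["a", "b", "b", "c"]})

# True marks a row that repeats an earlier row (first occurrence stays False).
dup_mask = df.duplicated()

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = df.drop_duplicates()
print(dup_mask.sum(), "duplicate row(s);", len(deduped), "rows kept")
```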
13. What does a right-skewed histogram indicate?
a) The mean is less than the median
b) The mean is greater than the median
c) The data is evenly distributed
d) The data has no skew
Answer: b) The mean is greater than the median
14. Which type of variable is best represented by a bar chart?
a) Continuous
b) Categorical
c) Numerical
d) Interval
Answer: b) Categorical
15. Which method is used to replace missing values with the median in Pandas?
a) df.fillna(df.mean())
b) df.fillna(df.median())
c) df.dropna()
d) df.replace()
Answer: b) df.fillna(df.median())
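A minimal sketch of median imputation on a hypothetical `income` column. `numeric_only=True` is added here as an assumption so the call also works when the frame contains text columns.

```python
import pandas as pd

# Hypothetical column with one missing value and one extreme value.
df = pd.DataFrame({"income": [30.0, None, 50.0, 1000.0]})

# median() skips missing values; fillna() then fills each column with its median.
filled = df.fillna(df.median(numeric_only=True))
print(filled)
```

The median of the observed values is 50, so the gap is filled with 50 rather than being pulled toward the extreme value as the mean would be.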
16. Which function in Pandas is used to check data types of columns?
a) df.head()
b) df.dtypes
c) df.shape
d) df.describe()
Answer: b) df.dtypes
17. Which visualization technique is best for showing the relationship between two numerical variables?
a) Box plot
b) Histogram
c) Scatter plot
d) Bar chart
Answer: c) Scatter plot
18. Which measure is useful for identifying the dispersion of data?
a) Mean
b) Standard deviation
c) Mode
d) Median
Answer: b) Standard deviation
19. Which of the following is a measure of shape in EDA?
a) Variance
b) Skewness
c) Standard deviation
d) Median
Answer: b) Skewness
20. Which method is used to normalize a dataset?
a) Min-max scaling
b) Standard deviation calculation
c) Calculating the mean
d) Removing duplicates
Answer: a) Min-max scaling
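Min-max scaling maps each value to (x - min) / (max - min). A minimal sketch, where the helper name `min_max_scale` is hypothetical:

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to scale, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([10, 20, 30, 40, 50])
print(scaled)
```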
21. The mean of a dataset is also known as:
a) Median
b) Average
c) Mode
d) Range
Answer: b) Average
22. What is the mean of the numbers 10, 20, 30, 40, and 50?
a) 30
b) 25
c) 35
d) 40
Answer: a) 30
23. If all values in a dataset are increased by 5, how does the mean change?
a) Increases by 5
b) Decreases by 5
c) Remains the same
d) Increases by 10
Answer: a) Increases by 5
24. The median of a dataset is:
a) The most frequently occurring value
b) The middle value when arranged in order
c) The sum of all values divided by the total count
d) The difference between maximum and minimum values
Answer: b) The middle value when arranged in order
25. If a dataset has an even number of values, the median is:
a) The smallest value
b) The average of the two middle values
c) The largest value
d) The mode
Answer: b) The average of the two middle values
26. Which measure of central tendency is most affected by outliers?
a) Mean
b) Median
c) Mode
d) Range
Answer: a) Mean
27. Which measure of central tendency is best for skewed data?
a) Mean
b) Median
c) Mode
d) Variance
Answer: b) Median
28. The mode of a dataset is:
a) The most frequently occurring value
b) The middle value
c) The sum of all values divided by the count
d) The range
Answer: a) The most frequently occurring value
29. A dataset with two modes is called:
a) Unimodal
b) Bimodal
c) Multimodal
d) Non-modal
Answer: b) Bimodal
30. Which measure of central tendency can have more than two values?
a) Mean
b) Median
c) Mode
d) Range
Answer: c) Mode
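The three measures of central tendency covered in questions 21 to 30 can be checked with Python's standard `statistics` module. The datasets below are illustrative; the first is the one from question 22.

```python
import statistics

# Mean of the dataset from question 22.
mean_value = statistics.mean([10, 20, 30, 40, 50])

# Even number of values: the median is the average of the two middle values.
median_even = statistics.median([10, 20, 30, 40])

# One extreme value shifts the mean far more than the median.
with_outlier = [10, 20, 30, 40, 500]
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

# multimode() returns every most-frequent value, so a bimodal dataset gives two.
modes = statistics.multimode([1, 1, 2, 2, 3])

print(mean_value, median_even, mean_outlier, median_outlier, modes)
```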
Measures of Dispersion (Standard Deviation, Variance, Range, IQR)
31. The range of a dataset is calculated as:
a) The sum of all values divided by total count
b) The difference between the highest and lowest values
c) The middle value
d) The most frequently occurring value
Answer: b) The difference between the highest and lowest values
32. Standard deviation measures:
a) The central value of a dataset
b) The dispersion of data points from the mean
c) The most frequent value
d) The correlation between variables
Answer: b) The dispersion of data points from the mean
33. A dataset with a standard deviation of 0 means:
a) The data has high variation
b) All values are the same
c) The data has outliers
d) The mean is zero
Answer: b) All values are the same
34. Variance is:
a) The square root of the standard deviation
b) The square of the standard deviation
c) The sum of all values
d) The difference between the highest and lowest values
Answer: b) The square of the standard deviation
35. If all values in a dataset are increased by 10, how does the standard deviation change?
a) Increases
b) Decreases
c) Remains the same
d) Becomes zero
Answer: c) Remains the same
36. If all values in a dataset are multiplied by 3, the standard deviation:
a) Increases 3 times
b) Remains the same
c) Decreases
d) Becomes zero
Answer: a) Increases 3 times
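Questions 35 and 36 can be verified numerically. This sketch uses the population standard deviation (`pstdev`) on a hypothetical dataset; the same behaviour holds for the sample version.

```python
from statistics import pstdev

# Hypothetical dataset chosen so the population standard deviation is exactly 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]

base = pstdev(data)
shifted = pstdev([v + 10 for v in data])  # adding a constant moves the mean only
scaled = pstdev([v * 3 for v in data])    # multiplying by 3 stretches the spread

print(base, shifted, scaled)
```

Shifting leaves every deviation from the mean unchanged, while scaling multiplies every deviation by the same factor.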
37. Which of the following is NOT a measure of dispersion?
a) Standard deviation
b) Variance
c) Median
d) Range
Answer: c) Median
38. A dataset has a variance of 16. What is its standard deviation?
a) 2
b) 4
c) 8
d) 16
Answer: b) 4
39. A smaller standard deviation indicates:
a) Higher spread in data
b) Lower spread in data
c) More outliers
d) A higher mean
Answer: b) Lower spread in data
40. The interquartile range (IQR) is:
a) Q3 - Q1
b) Q2 - Q1
c) Q3 - Mean
d) Q3 + Q1
Answer: a) Q3 - Q1
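A short sketch of computing the IQR with the standard library. Quartile conventions differ between tools; `method="inclusive"` is an assumption made here, and other conventions can give slightly different quartiles on small samples.

```python
from statistics import quantiles

# Illustrative dataset.
data = [10, 20, 30, 40, 50]

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)
```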
Application-Based Questions
41. Which is more resistant to outliers:
a) Mean
b) Median
c) Standard deviation
d) Variance
Answer: b) Median
42. A higher variance means:
a) Data points are closer to the mean
b) Data points are spread out
c) The dataset is symmetric
d) The mean is zero
Answer: b) Data points are spread out
43. If the standard deviation of a dataset is 5, what is the variance?
a) 5
b) 10
c) 25
d) 50
Answer: c) 25
44. In a normal distribution, approximately 68% of data falls within:
a) 1 standard deviation from the mean
b) 2 standard deviations from the mean
c) 3 standard deviations from the mean
d) No standard deviations
Answer: a) 1 standard deviation from the mean
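The 68% figure can be checked against the standard normal distribution using `statistics.NormalDist` from the standard library. The helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

within_one = share_within(1)
within_two = share_within(2)
within_three = share_within(3)

print(round(within_one, 4), round(within_two, 4), round(within_three, 4))
```

The three values correspond to the familiar 68-95-99.7 rule.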
45. A standard deviation of 0 means:
a) No variation in data
b) Data is highly spread out
c) Data is negatively skewed
d) The mean is zero
Answer: a) No variation in data
46. In a perfectly symmetrical dataset, the mean and median are:
a) Always different
b) Always equal
c) Sometimes equal
d) Unrelated
Answer: b) Always equal
47. Which is not affected by extreme values?
a) Mean
b) Median
c) Standard deviation
d) Variance
Answer: b) Median
48. The standard deviation of {5, 5, 5, 5, 5} is:
a) 0
b) 5
c) 25
d) 10
Answer: a) 0
49. The sum of squared deviations from the mean is used to calculate:
a) Range
b) Variance
c) Median
d) Mode
Answer: b) Variance
50. A lower standard deviation means:
a) Higher consistency in data
b) More variability
c) More outliers
d) A higher range
Answer: a) Higher consistency in data
Skewness MCQs
51. Skewness measures the ______________ of a dataset.
a) Spread
b) Symmetry
c) Central tendency
d) Variability
Answer: b) Symmetry
52. If a distribution has a long right tail, it is called:
a) Positively skewed
b) Negatively skewed
c) Normally distributed
d) Symmetric
Answer: a) Positively skewed
53. In a negatively skewed distribution:
a) The mean is greater than the median
b) The median is greater than the mean
c) The mean and median are equal
d) The data has no outliers
Answer: b) The median is greater than the mean
54. In a perfectly symmetrical distribution, the skewness value is:
a) 1
b) -1
c) 0
d) Undefined
Answer: c) 0
55. Which of the following distributions is most likely to have a skewness value close to zero?
a) A uniform distribution
b) A normal distribution
c) A bimodal distribution
d) An exponential distribution
Answer: b) A normal distribution
56. A left-skewed distribution has:
a) A longer tail on the right
b) A longer tail on the left
c) No tails
d) Equal tails on both sides
Answer: b) A longer tail on the left
57. If the skewness of a dataset is greater than 1, the distribution is:
a) Heavily skewed
b) Symmetric
c) Normally distributed
d) Bimodal
Answer: a) Heavily skewed
58. Which measure is least affected by skewness?
a) Mean
b) Median
c) Mode
d) Variance
Answer: b) Median
59. Which formula is used to calculate skewness?
a) Karl Pearson’s coefficient of skewness
b) Interquartile range formula
c) Central limit theorem
d) Least squares method
Answer: a) Karl Pearson’s coefficient of skewness
60. In a negatively skewed distribution, which measure of central tendency is the largest?
a) Mean
b) Median
c) Mode
d) None of the above
Answer: c) Mode
Kurtosis MCQs
61. Kurtosis measures the ______________ of a dataset.
a) Central tendency
b) Spread
c) Shape of the tails
d) Mean deviation
Answer: c) Shape of the tails
62. A normal distribution has a kurtosis value of:
a) 1
b) 2
c) 3
d) 0
Answer: c) 3
63. If a distribution has kurtosis greater than 3, it is called:
a) Platykurtic
b) Mesokurtic
c) Leptokurtic
d) Symmetric
Answer: c) Leptokurtic
64. A leptokurtic distribution has:
a) Thin tails
b) Fat tails
c) No tails
d) Uniform shape
Answer: b) Fat tails
65. A platykurtic distribution has:
a) Higher peaks and thicker tails
b) Lower peaks and thinner tails
c) Equal tails on both sides
d) A perfect bell shape
Answer: b) Lower peaks and thinner tails
66. Which distribution is considered mesokurtic?
a) Uniform distribution
b) Normal distribution
c) Exponential distribution
d) Poisson distribution
Answer: b) Normal distribution
67. Kurtosis minus 3, which equals 0 for a normal distribution, is known as:
a) Excess kurtosis
b) Standard kurtosis
c) Absolute kurtosis
d) Moderate kurtosis
Answer: a) Excess kurtosis
68. A dataset with kurtosis less than 3 is classified as:
a) Leptokurtic
b) Platykurtic
c) Mesokurtic
d) Skewed
Answer: b) Platykurtic
69. Which of the following is true about a distribution with high kurtosis?
a) It has outliers far from the mean
b) It is always symmetrical
c) It has a flat peak
d) It is negatively skewed
Answer: a) It has outliers far from the mean
70. Which statistic measures whether a dataset has light or heavy tails?
a) Mean
b) Standard deviation
c) Kurtosis
d) Variance
Answer: c) Kurtosis
Combination MCQs (Skewness & Kurtosis)
71. Which of the following does NOT affect skewness and kurtosis?
a) Outliers
b) Data distribution
c) Sample size
d) Mean
Answer: d) Mean
72. A normal distribution has:
a) Skewness of 0 and kurtosis of 3
b) Skewness of 1 and kurtosis of 0
c) Skewness of -1 and kurtosis of 1
d) Skewness of 2 and kurtosis of 4
Answer: a) Skewness of 0 and kurtosis of 3
73. A distribution with high skewness and high kurtosis has:
a) A long tail and many extreme values
b) A short tail and no outliers
c) No skewness
d) Equal probabilities for all values
Answer: a) A long tail and many extreme values
74. What happens if a dataset has a high positive skewness and high kurtosis?
a) The dataset has a long right tail and many extreme values
b) The dataset is perfectly symmetric
c) The dataset is normally distributed
d) The dataset has no outliers
Answer: a) The dataset has a long right tail and many extreme values
75. If a dataset has negative skewness and low kurtosis, the data is:
a) Left-skewed with thin tails
b) Right-skewed with thick tails
c) Normally distributed
d) Symmetric
Answer: a) Left-skewed with thin tails
76. When the mean and median are equal, the skewness is likely to be:
a) Positive
b) Negative
c) Zero
d) Undefined
Answer: c) Zero
77. Which measure helps determine whether a distribution has extreme outliers?
a) Mean
b) Variance
c) Kurtosis
d) Standard deviation
Answer: c) Kurtosis
78. Which of the following distributions would most likely have high kurtosis?
a) A dataset with many extreme outliers
b) A uniform distribution
c) A symmetric, bell-shaped distribution
d) A dataset with no variation
Answer: a) A dataset with many extreme outliers
79. A leptokurtic distribution is more likely to have:
a) Extreme values
b) A flat shape
c) No skewness
d) A symmetrical spread
Answer: a) Extreme values
80. When calculating skewness and kurtosis, which assumption is typically made?
a) The dataset is normally distributed
b) The dataset has equal variance
c) The dataset is unimodal
d) The dataset contains outliers
Answer: a) The dataset is normally distributed
Skewness Numericals
81. Given the following dataset: {10, 15, 20, 25, 80}, what is the mean?
a) 30
b) 20
c) 50
d) 25
Solution:
Mean = (10 + 15 + 20 + 25 + 80) / 5 = 150 / 5 = 30
Answer: a) 30
82. Using the same dataset {10, 15, 20, 25, 80}, what is the median?
a) 20
b) 30
c) 25
d) 35
Solution:
The middle value (when arranged in ascending order) is 20.
Answer: a) 20
83. Using the same dataset, what is the skewness?
a) Positively skewed
b) Negatively skewed
c) Symmetric
d) Cannot be determined
Solution:
The mean (30) is greater than the median (20), indicating a right-skewed (positively skewed) distribution.
Answer: a) Positively skewed
84. A dataset has Mean = 50, Median = 45, Mode = 40. What is the approximate skewness using Karl Pearson’s coefficient?
a) 0.5
b) 1.0
c) -1.0
d) -0.5
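Solution:
Karl Pearson’s coefficient is (Mean - Mode) / σ = (50 - 40) / σ. Since Mean > Median > Mode, the distribution is positively skewed, which rules out options c) and d).
The standard deviation is not given, so the exact value cannot be computed; option b) corresponds to assuming σ = 10, which gives (50 - 40) / 10 = 1.0.
Answer: b) 1.0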
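Karl Pearson's two skewness coefficients can be sketched as below. The function names are hypothetical, and the standard deviation of 10 is an assumption, since the question does not supply one.

```python
def pearson_mode_skewness(mean, mode, std):
    """Pearson's first coefficient: (mean - mode) / standard deviation."""
    return (mean - mode) / std

def pearson_median_skewness(mean, median, std):
    """Pearson's second coefficient: 3 * (mean - median) / standard deviation."""
    return 3 * (mean - median) / std

# ASSUMPTION: the question gives no standard deviation; 10 is used for illustration.
assumed_std = 10.0

first = pearson_mode_skewness(mean=50, mode=40, std=assumed_std)
second = pearson_median_skewness(mean=50, median=45, std=assumed_std)
print(first, second)
```

Both coefficients are positive for this data, which is the part of the answer that does not depend on the assumed standard deviation.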
Kurtosis Numericals
85. The kurtosis of a dataset is 5. What type of kurtosis does it have?
a) Mesokurtic
b) Platykurtic
c) Leptokurtic
d) Cannot be determined
Solution:
Since kurtosis > 3, the distribution has high kurtosis (leptokurtic).
Answer: c) Leptokurtic
86. A dataset has the following values: {5, 10, 15, 20, 25, 30, 100}. What type of kurtosis is likely?
a) Platykurtic
b) Mesokurtic
c) Leptokurtic
d) None
Solution:
The dataset has an extreme outlier (100), leading to higher kurtosis (leptokurtic).
Answer: c) Leptokurtic
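The claim in question 86 can be checked with the moment definition of kurtosis, m4 / m2², using population moments. The function name `moment_kurtosis` is hypothetical, and libraries that report excess kurtosis or apply bias corrections will give different numbers.

```python
def moment_kurtosis(values):
    """Population kurtosis: fourth central moment over the squared second central moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2

k = moment_kurtosis([5, 10, 15, 20, 25, 30, 100])
excess = k - 3  # positive excess kurtosis means heavier tails than a normal distribution

print(round(k, 2), round(excess, 2))
```

The result is above 3, consistent with the leptokurtic answer.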
Combination of Skewness and Kurtosis Numericals
87. If the skewness of a dataset is -1.5 and kurtosis is 1.8, the distribution is:
a) Positively skewed and leptokurtic
b) Negatively skewed and platykurtic
c) Normally distributed
d) Cannot be determined
Solution:
- Negative skewness means it is left-skewed.
- Kurtosis < 3 means it is platykurtic.
Answer: b) Negatively skewed and platykurtic
88. If a dataset has a skewness of 0.2 and kurtosis of 2.9, the shape of the distribution is:
a) Normal
b) Positively skewed and leptokurtic
c) Negatively skewed and platykurtic
d) Positively skewed and platykurtic
Solution:
- Skewness ≈ 0 means it is almost symmetric.
- Kurtosis ≈ 3 means it is mesokurtic (normal distribution).
Answer: a) Normal
89. A box plot is also known as a:
a) Bar chart
b) Whisker plot
c) Histogram
d) Scatter plot
Answer: b) Whisker plot
90. The box in a box plot represents which statistical measure?
a) Mean
b) Range
c) Interquartile Range (IQR)
d) Standard Deviation
Answer: c) Interquartile Range (IQR)
91. In a box plot, the median is represented by:
a) The bottom of the box
b) The top of the box
c) The line inside the box
d) The end of the whiskers
Answer: c) The line inside the box
92. The whiskers in a box plot extend to:
a) The highest and lowest values in the dataset
b) The mean of the dataset
c) The interquartile range
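d) The most extreme data points within 1.5 times the interquartile range (IQR) of the quartiles
Answer: d) The most extreme data points within 1.5 times the interquartile range (IQR) of the quartiles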
93. Which of the following is NOT shown in a box plot?
a) Outliers
b) Median
c) Mean
d) First quartile (Q1)
Answer: c) Mean
94. If a box plot is right-skewed, which of the following is true?
a) The median is closer to Q3
b) The median is closer to Q1
c) The whiskers are equal in length
d) There are no outliers
Answer: b) The median is closer to Q1
95. If the whiskers of a box plot are very long, this suggests that:
a) The data has low variability
b) The data is heavily skewed or has many outliers
c) The dataset follows a normal distribution
d) The dataset has only one unique value
Answer: b) The data is heavily skewed or has many outliers
Box Plot Numerical Questions
96. Given the following five-number summary: {10, 20, 30, 40, 50}, what is the interquartile range (IQR)?
a) 10
b) 20
c) 30
d) 40
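Solution:
In the five-number summary {min, Q1, median, Q3, max}, Q1 = 20 and Q3 = 40, so IQR = Q3 - Q1 = 40 - 20 = 20.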
Answer: b) 20
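97. If Q1 = 25 and Q3 = 75, what are the upper and lower limits for outliers?
a) Lower = -50, Upper = 150
b) Lower = 0, Upper = 100
c) Lower = 50, Upper = 75
d) Lower = 25, Upper = 75
Solution:
IQR = Q3 - Q1 = 75 - 25 = 50
Lower limit = Q1 - 1.5 × IQR = 25 - 75 = -50
Upper limit = Q3 + 1.5 × IQR = 75 + 75 = 150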
Answer: a) Lower = -50, Upper = 150
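The 1.5 × IQR rule can be written as a small helper. The name `iqr_fences` and the parameter `k` are hypothetical.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier limits for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_fences(25, 75)

# Values outside the limits are flagged as potential outliers.
sample = [-60, 0, 40, 120, 200]
outliers = [v for v in sample if v < lower or v > upper]

print(lower, upper, outliers)
```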
98. A dataset has a median of 35, Q1 = 20, and Q3 = 50. Which of the following statements is true?
a) The dataset is left-skewed
b) The dataset is right-skewed
c) The dataset is symmetric
d) Cannot be determined
Solution:
Since median (35) is exactly between Q1 (20) and Q3 (50), the data is symmetrical.
Answer: c) The dataset is symmetric
99. If a box plot shows Q1 = 15 and Q3 = 45, and the upper whisker extends to 70, are outliers likely to be present?
a) No outliers
b) Outliers exist beyond 70
c) The data is symmetric
d) The data follows a normal distribution
Solution:
- IQR = 45 - 15 = 30
- Upper limit = Q3 + 1.5 × IQR = 45 + (1.5 × 30) = 45 + 45 = 90
- Lower limit = Q1 - 1.5 × IQR = 15 - 45 = -30
- Since the max value 70 is within the range (-30, 90), there are no outliers.
Answer: a) No outliers
100. A dataset has the five-number summary: {12, 18, 25, 30, 60}. Which statement is true?
a) The data is right-skewed
b) The data is left-skewed
c) The data is symmetric
d) Cannot be determined
Solution:
- Q1 = 18, Q3 = 30, Median = 25
- The upper whisker (60) is much farther from Q3 (30) than the lower whisker (12) is from Q1 (18).
- This suggests positive (right) skewness.
Answer: a) The data is right-skewed
Pivot Table MCQs
101. A Pivot Table is used for:
a) Data visualization
b) Data summarization
c) Data cleaning
d) Data encryption
Answer: b) Data summarization
102. In a Pivot Table, which of the following fields is used to categorize data?
a) Values
b) Filters
c) Rows and Columns
d) All of the above
Answer: d) All of the above
103. What function is commonly used in a Pivot Table to summarize numerical data?
a) SUM
b) COUNT
c) AVERAGE
d) All of the above
Answer: d) All of the above
104. A Pivot Table can be created in:
a) Microsoft Excel
b) Google Sheets
c) Python (Pandas)
d) All of the above
Answer: d) All of the above
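In Pandas, a pivot table can be built with `pd.pivot_table`. The sales data and column names below are hypothetical.

```python
import pandas as pd

# Hypothetical sales records; North / A appears twice so aggregation is visible.
sales = pd.DataFrame({
    "region": ["North", "North", "North", "South", "South"],
    "product": ["A", "A", "B", "A", "B"],
    "revenue": [100, 20, 150, 200, 50],
})

# Rows are regions, columns are products, and each cell sums the revenue.
pivot = pd.pivot_table(
    sales, values="revenue", index="region", columns="product", aggfunc="sum"
)
print(pivot)
```

Changing `aggfunc` to `"mean"` or `"count"` would summarize the same cells with an average or a count instead.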
Correlation Statistics MCQs
105. The correlation coefficient (r) measures the relationship between:
a) Two categorical variables
b) Two numerical variables
c) One numerical and one categorical variable
d) None of the above
Answer: b) Two numerical variables
106. If the correlation coefficient (r) is -1, the relationship between variables is:
a) Perfectly positive
b) Perfectly negative
c) No correlation
d) Weak correlation
Answer: b) Perfectly negative
107. If two variables have no correlation, the correlation coefficient (r) is:
a) -1
b) 0
c) 1
d) Undefined
Answer: b) 0
108. Which correlation coefficient represents the strongest linear relationship?
a) r = -0.8
b) r = 0.5
c) r = -0.3
d) r = 0.1
Answer: a) r = -0.8
109. If correlation is positive, it means:
a) As one variable increases, the other decreases
b) As one variable increases, the other also increases
c) The variables are not related
d) The data has outliers
Answer: b) As one variable increases, the other also increases
110. Which method is commonly used to calculate correlation?
a) Pearson’s correlation
b) Spearman’s rank correlation
c) Kendall’s tau correlation
d) All of the above
Answer: d) All of the above
Correlation Numericals
111. If the covariance between X and Y is 15 and the standard deviations of X and Y are 3 and 5, what is the Pearson correlation coefficient?
a) 1.5
b) 1.0
c) 0.5
d) 2.5
Solution:
r = Cov(X, Y) / (σX × σY) = 15 / (3 × 5) = 15 / 15 = 1.0
Answer: b) 1.0
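The Pearson formula used in question 111 can be sketched both from summary numbers and from raw data. Both function names are hypothetical, and the raw-data example uses made-up values.

```python
import math

def pearson_from_cov(cov_xy, std_x, std_y):
    """r = Cov(X, Y) / (std_x * std_y)."""
    return cov_xy / (std_x * std_y)

def pearson_from_data(xs, ys):
    """Compute r directly from paired observations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r_summary = pearson_from_cov(15, 3, 5)
r_linear = pearson_from_data([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
r_negative = pearson_from_data([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])

print(r_summary, r_linear, r_negative)
```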
112. If two datasets have a correlation of -0.85, what does it indicate?
a) Strong positive correlation
b) Strong negative correlation
c) No correlation
d) Weak correlation
Answer: b) Strong negative correlation
ANOVA (Analysis of Variance) MCQs
113. The purpose of ANOVA is to compare:
a) Two means
b) Three or more means
c) Standard deviations
d) Medians
Answer: b) Three or more means
114. Which of the following is an assumption of ANOVA?
a) Normality of data
b) Homogeneity of variance
c) Independence of observations
d) All of the above
Answer: d) All of the above
115. What does a low p-value (< 0.05) in an ANOVA test indicate?
a) The groups have similar means
b) At least one group mean is significantly different
c) The test is invalid
d) The data is not normal
Answer: b) At least one group mean is significantly different
116. Which statistic is used in ANOVA to determine significance?
a) t-statistic
b) F-statistic
c) Chi-square
d) z-score
Answer: b) F-statistic
ANOVA Numericals
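117. Given the following sample means (assume equal group sizes):
- Group A: 15
- Group B: 20
- Group C: 30
The grand mean (overall mean) is:
a) 20
b) 21.67
c) 25
d) 30
Solution:
With equal group sizes, grand mean = (15 + 20 + 30) / 3 = 65 / 3 ≈ 21.67
Answer: b) 21.67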
118. If the between-group variance = 50 and within-group variance = 10, what is the F-ratio?
a) 2.5
b) 5.0
c) 10.0
d) 50.0
Solution:
F = between-group variance / within-group variance = 50 / 10 = 5.0
Answer: b) 5.0
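The F-ratio is the between-group mean square divided by the within-group mean square. Below is a minimal sketch of a one-way ANOVA F computed by hand; the function name `one_way_f` and the three groups are hypothetical.

```python
def one_way_f(groups):
    """F = (SS_between / df_between) / (SS_within / df_within) for a one-way layout."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)

    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: values around their own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

f_simple = 50 / 10  # the numbers from question 118
f_from_data = one_way_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])

print(f_simple, f_from_data)
```

A large F only says the group means differ more than within-group noise would suggest; the p-value still has to be read from the F distribution with the matching degrees of freedom.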
119. A one-way ANOVA is used when comparing:
a) One group’s variance
b) Two independent groups
c) Three or more independent groups
d) Paired samples
Answer: c) Three or more independent groups
120. In an ANOVA test, a large F-statistic means:
a) The variances within groups are large
b) The means of the groups are significantly different
c) There is no significant difference
d) The test is not valid
Answer: b) The means of the groups are significantly different