Exploratory Data Analysis on Beers & Breweries Datasets

Data Cleaning#

Merge the two datasets on brewery_id and id to create a joint dataset. Then list the unique columns of the dataframe and classify each one as numeric or categorical:

| Column | Type |
| --- | --- |
| abv | numeric |
| ibu | numeric |
| id | numeric |
| name | categorical |
| style | categorical |
| brewery_id | numeric |
| ounces | numeric |
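The merge and the column classification can be sketched as follows. This is a minimal illustration on toy data, not the original notebook: the rows, the breweries columns (name, city, state), the suffixes and the helper column_kind are all assumptions made for the example.

```python
import pandas as pd

# Toy stand-ins for the two datasets (hypothetical rows and brewery columns).
beers = pd.DataFrame({
    "abv": [0.050, 0.066, None],
    "ibu": [None, 65.0, 20.0],
    "id": [1, 2, 3],
    "name": ["Beer A", "Beer B", "Beer C"],
    "style": ["American Pale Lager", "American IPA", "American IPA"],
    "brewery_id": [11, 10, 10],
    "ounces": [12.0, 12.0, 16.0],
})
breweries = pd.DataFrame({
    "id": [10, 11],
    "name": ["Brewery X", "Brewery Y"],
    "city": ["City X", "City Y"],
    "state": ["AA", "BB"],
})

# Join each beer to its brewery: beers.brewery_id matches breweries.id.
# Both frames have "id" and "name" columns, so suffixes keep them apart.
merged = beers.merge(
    breweries,
    left_on="brewery_id",
    right_on="id",
    how="left",
    suffixes=("_beer", "_brewery"),
)


def column_kind(series):
    """Hypothetical helper: label a column as numeric or categorical by dtype."""
    return "numeric" if pd.api.types.is_numeric_dtype(series) else "categorical"


kinds = {col: column_kind(beers[col]) for col in beers.columns}
print(kinds)
```

A left join keeps every beer even if its brewery is missing from the second table; whether the original analysis used a left or inner join is not stated.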

Descriptive Statistics of Numerical Values#

Numerical Values in Columns#

Determine the total number of entries (including nulls) in each numerical column by finding the length of the column. brewery_id and id are not included.

| Column | Length |
| --- | --- |
| ibu | 2410 |
| abv | 2410 |
| ounces | 2410 |
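A small sketch of the length calculation, using a toy dataframe in place of the real one (the values and the numeric_cols list are assumptions for illustration):

```python
import pandas as pd

# Toy data; None marks a missing measurement.
beers = pd.DataFrame({
    "ibu": [20.0, None, 65.0, None],
    "abv": [0.050, 0.066, None, 0.070],
    "ounces": [12.0, 12.0, 16.0, 12.0],
})
numeric_cols = ["ibu", "abv", "ounces"]

# len() counts every row, including rows where the value is missing.
lengths = {col: len(beers[col]) for col in numeric_cols}
print(lengths)
```

Because len() includes nulls, every column of the same dataframe reports the same length, which matches the identical values in the table above.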

Non-Null Values#

Find the number of non-null values in each numerical column (brewery_id and id not included) by using count().

| Column | Non-Null Values |
| --- | --- |
| ibu | 1405 |
| abv | 2348 |
| ounces | 2410 |

Then, determine the percentage of null values in each numerical column using the following calculation:

((length_of_column - count_of_non_null_values)/length_of_column)*100

| Column | Percentage of Missing Values |
| --- | --- |
| ibu | 41.7% |
| abv | 2.6% |
| ounces | 0% |
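The non-null count and the missing-value percentage can be sketched together. The toy dataframe is an assumption; the last lines re-check the percentages using the length and non-null counts reported in the tables above.

```python
import pandas as pd

# Toy data; None marks a missing measurement.
beers = pd.DataFrame({
    "ibu": [20.0, None, 65.0, None],
    "abv": [0.050, 0.066, None, 0.070],
    "ounces": [12.0, 12.0, 16.0, 12.0],
})
numeric_cols = ["ibu", "abv", "ounces"]

# count() ignores nulls, so it gives the number of non-null values per column.
non_null = beers[numeric_cols].count()

# Missing percentage = (length - non-null count) / length * 100.
pct_missing = (len(beers) - non_null) / len(beers) * 100
print(pct_missing)

# Same formula applied to the counts reported for the full dataset.
ibu_pct = round((2410 - 1405) / 2410 * 100, 1)
abv_pct = round((2410 - 2348) / 2410 * 100, 1)
print(ibu_pct, abv_pct)
```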

Minimum & Maximum Values#

Min and Max values of each numerical column (brewery_id and id not included) are determined using min() and max().

| Column | Min | Max |
| --- | --- | --- |
| ibu | 4.0 | 20.0 |
| abv | 0.001 | 0.128 |
| ounces | 8.4 | 32.0 |
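One way to get both extremes in a single call is agg(); this is a sketch on toy data, and the original may simply have called min() and max() separately on each column.

```python
import pandas as pd

# Toy data; None marks a missing measurement.
beers = pd.DataFrame({
    "ibu": [20.0, None, 65.0, 35.0],
    "abv": [0.050, 0.066, None, 0.070],
    "ounces": [12.0, 12.0, 16.0, 8.4],
})
numeric_cols = ["ibu", "abv", "ounces"]

# min() and max() skip nulls by default; transpose so each column is a row.
extremes = beers[numeric_cols].agg(["min", "max"]).T
print(extremes)
```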

Mean, Median, Mode, & Standard Deviations of Numerical Columns#

Mean, median, mode and standard deviations of each numerical column (brewery_id and id not included) are determined using the following functions: mean(), median(), mode(), std().

| Column | Mean | Median | Mode | Standard Deviation |
| --- | --- | --- | --- | --- |
| ibu | 42.713 | 35.0 | 20.0 | 25.954 |
| abv | 0.0598 | 0.056 | 0.05 | 0.0135 |
| ounces | 13.592 | 12.0 | 12.0 | 2.352 |
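The four statistics can be collected into one summary table as sketched below. The data is a toy example; note that mode() can return several values when there are ties, so this sketch keeps only the first one, which is an assumption about how the original handled ties.

```python
import pandas as pd

# Toy data; None marks a missing measurement.
beers = pd.DataFrame({
    "ibu": [20.0, 20.0, 35.0, 65.0, None],
    "abv": [0.050, 0.050, 0.056, 0.070, 0.080],
    "ounces": [12.0, 12.0, 12.0, 16.0, 16.0],
})
numeric_cols = ["ibu", "abv", "ounces"]

summary = pd.DataFrame({
    "mean": beers[numeric_cols].mean(),
    "median": beers[numeric_cols].median(),
    # mode() returns one row per tied mode; take the first row.
    "mode": beers[numeric_cols].mode().iloc[0],
    # std() is the sample standard deviation (ddof=1) by default.
    "std": beers[numeric_cols].std(),
})
print(summary)
```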

Quantile Statistics#

Quantile statistics are used to determine data spread, skewness and outliers. To determine quantile statistics, split the data into equal-sized groups using cut points (quantiles). For example: beers['ibu'].quantile([0.25,0.5,0.75])

| Column | 0.25 | 0.5 | 0.75 |
| --- | --- | --- | --- |
| ibu | 21.0 | 35.0 | 64.0 |
| abv | 0.050 | 0.056 | 0.067 |
| ounces | 12.0 | 12.0 | 16.0 |
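As an illustration of how quartiles relate to spread and outliers, the sketch below computes the quartiles of a toy ibu series and flags values outside the 1.5 × IQR fences. The 1.5 × IQR rule is a common convention added here for illustration; the original does not say which outlier rule, if any, was applied.

```python
import pandas as pd

# Toy series with one obviously extreme value and one missing value.
ibu = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 200.0, None], name="ibu")

# quantile() skips nulls and interpolates linearly between data points.
quartiles = ibu.quantile([0.25, 0.5, 0.75])
q1, q3 = quartiles.loc[0.25], quartiles.loc[0.75]

# Interquartile range: the spread of the middle 50% of the data.
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values beyond the fences are candidate outliers.
outliers = ibu[(ibu < lower) | (ibu > upper)]
print(quartiles)
print(outliers)
```

A median that sits closer to the first quartile than to the third, as in the ibu row of the table above (21.0, 35.0, 64.0), suggests a right-skewed distribution.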

Frequency Distribution Plots#

For each numerical column, use seaborn and pyplot to create a distribution plot, dropping missing values first.

ibu Distribution#

abv Distribution#

ounces Distribution#

Correlations of Numerical Values#

Correlations between numerical values are determined using Pearson’s correlation coefficient and the corr() function.

| | abv | ibu | ounces |
| --- | --- | --- | --- |
| abv | 1.0 | 0.670 | 0.172 |
| ibu | 0.670 | 1.0 | 0.054 |
| ounces | 0.172 | 0.054 | 1.0 |

This indicates that ibu and abv are moderately positively correlated (r = 0.670), while ibu and ounces are essentially uncorrelated (r = 0.054).
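The correlation matrix can be sketched as follows on toy data (the values are assumptions and will not reproduce the coefficients above). corr() uses Pearson's coefficient by default and computes each pair from the rows where both values are present.

```python
import pandas as pd

# Toy data; None marks a missing measurement.
beers = pd.DataFrame({
    "ibu": [20.0, 35.0, 65.0, None, 80.0],
    "abv": [0.045, 0.050, 0.065, 0.070, 0.090],
    "ounces": [12.0, 12.0, 16.0, 12.0, 16.0],
})

# Pairwise Pearson correlations between the numerical columns.
corr = beers[["abv", "ibu", "ounces"]].corr(method="pearson")
print(corr.round(3))
```

The matrix is symmetric with ones on the diagonal, so only the values above (or below) the diagonal carry information.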

Descriptive Statistics of Categorical Values#

| | name | style |
| --- | --- | --- |
| count | 2410 | 2405 |
| unique | 2305 | 99 |
| top | Nonstop Hef Hop | American IPA |
| freq | 12 | 424 |
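A summary with these four rows (count, unique, top, freq) is what pandas describe() returns for non-numeric columns, so the table was presumably produced that way; the original does not say so explicitly. A sketch on toy data with hypothetical beer names:

```python
import pandas as pd

# Toy categorical data; None marks a missing style.
beers = pd.DataFrame({
    "name": ["Beer A", "Beer B", "Beer A", "Beer C"],
    "style": ["American IPA", "American IPA", None, "Saison"],
})

# For non-numeric columns, describe() reports the non-null count, the number
# of distinct values, the most frequent value (top) and its frequency (freq).
summary = beers[["name", "style"]].describe()
print(summary)
```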