Founded in 2014 by Urška Sršen and Sando Mur, Bellabeat is a technology-driven wellness company that pioneered wearables designed specifically for women. The company has rapidly expanded, establishing a global presence and launching a range of health-focused smart products through online retailers and its e-commerce platform. While Bellabeat utilizes traditional advertising, it primarily invests in digital marketing, including maintaining a strong social media presence and running video and display ads. Despite its success, Bellabeat has the potential to expand its reach and influence in the competitive smart device market.
Urška Sršen, co-founder and Chief Product Officer, believes that a detailed analysis of usage data from competitor smart devices could unlock new growth opportunities for the company. This case study delves into such data to derive actionable insights that can guide Bellabeat’s future marketing strategies and strengthen its market position.
Analyze user engagement and behavior patterns from competitor smart devices to identify trends and insights that inform and optimize Bellabeat’s marketing strategies for the Bellabeat app.
The Fitbit Fitness Tracker Data is a public dataset on Kaggle made available through Mobius. It comprises personal tracker data from ~30 Fitbit users, including output for physical activity, heart rate, and sleep monitoring. The data was collected via a survey on Amazon Mechanical Turk over a two-month period from March 12 to May 12, 2016. The variation in the data reflects the use of different Fitbit models and individual tracking preferences, offering insights into user behaviour across several tables.
The metadata confirms that the dataset is licensed under CC0: Public Domain, effectively placing the work in the public domain and removing all copyright restrictions worldwide. As it is open-sourced, anyone can copy, modify, distribute, and use the dataset, even for commercial purposes, without permission.
Although the dataset is sourced from a verified open-source platform, Kaggle, it presents certain limitations. It includes data from only ~30 Fitbit users and lacks demographic information, which could result in sampling bias, rendering the sample potentially unrepresentative of the broader population. Additionally, the data was collected over a brief two-month period. This limited timeframe may undermine the dataset’s comprehensiveness, potentially making it insufficient for thoroughly addressing research questions or identifying solutions without the support of supplementary datasets.
The dataset comprises 29 CSV files, each containing quantitative data on users’ physical activity, heart rate, and sleep across various Fitbit models. The data is presented in long format, where each row represents a single observation: each column represents one variable (e.g., “Id”, “ActivityDate”), identifiers like Id repeat across rows, and each combination of identifier and date/time identifies a unique observation.
After reviewing our data, we found that Month 3 includes 11 datasets while Month 4 includes 18, totaling 29 data sets. For our analysis, we will focus on the datasets common to both months. The consistent collection of these metrics across both periods allows for a greater number of observations per user, enhancing the reliability of our analysis and providing a more accurate understanding of user engagement and behavior patterns with their Fitbit devices over an extended period.
We then streamlined our dataset by eliminating duplicates that contained similar types of data across different time intervals. For instance, we opted for hourly data over minute data to simplify analysis and better highlight trends over larger intervals. Additionally, we removed the weight data due to insufficient submissions—only 11 users in Month 3 and 8 in Month 4, which do not provide a robust basis for meaningful analysis. We also excluded the METs (Metabolic Equivalent of Task) dataset, considering the intensity dataset already sufficiently represents users’ physical activity levels.
This refinement process identified six key types of datasets, each available for both Month 3 and Month 4, totaling 12 datasets for our analysis: Daily Activity, Hourly Calories, Hourly Intensities, Hourly Steps, Minute Sleep, and Seconds Heart Rate.
The next step of our analysis involves importing, merging and joining the datasets from both Month 3 and Month 4. For each category—Daily Activity, Hourly Calories, Hourly Intensities, Hourly Steps, Minute Sleep, and Seconds Heart Rate—we will merge the data from the two months to create a single comprehensive dataset.
## # Downloading packages -------------------------------------------------------
## - Downloading tidyverse and 50 dependencies from CRAN ... OK
## Successfully downloaded 51 packages in 29 seconds.
##
## # Installing packages --------------------------------------------------------
## - Installing tidyverse and 50 dependencies into the renv project library ... OK
## Successfully installed 51 packages in 7.8 minutes.
##
## Attaching package: 'renv'
## The following objects are masked from 'package:stats':
##
## embed, update
## The following objects are masked from 'package:utils':
##
## history, upgrade
## The following objects are masked from 'package:base':
##
## autoload, load, remove, use
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
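For reference, the messages above imply a setup chunk roughly like the sketch below. This is an assumption rather than the echoed chunk itself: it presumes the renv project library is restored before knitting and that tidyverse and lubridate are attached explicitly.

# Setup sketch implied by the output above (not the echoed chunk itself)
renv::restore(prompt = FALSE)   # install the package versions recorded in renv.lock
library(tidyverse)              # attaches dplyr, readr, tidyr, purrr, stringr, forcats, ggplot2, ...
library(lubridate)              # date-time parsers used later, e.g. mdy() and mdy_hms()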
Month 3
minute_METs_month_3 = read.csv("fitbit_dataset/fitbit_export_2016:03:12-2016:04:11/minuteMETsNarrow_merged.csv")
minute_sleep_month_3 = read.csv("fitbit_dataset/fitbit_export_2016:03:12-2016:04:11/minuteSleep_merged.csv")
minute_steps_month_3 = read.csv("fitbit_dataset/fitbit_export_2016:03:12-2016:04:11/minuteStepsNarrow_merged.csv")
weight_info_month_3 = read.csv("fitbit_dataset/fitbit_export_2016:03:12-2016:04:11/weightLogInfo_merged.csv")
minute_intensities_month_3 = read.csv("fitbit_dataset/fitbit_export_2016:03:12-2016:04:11/minuteIntensitiesNarrow_merged.csv")
minute_calories_month_3 = read.csv("fitbit_dataset/fitbit_export_2016:03:12-2016:04:11/minuteCaloriesNarrow_merged.csv")
hourly_steps_month_3 = read.csv("fitbit_dataset/fitbit_export_2016:03:12-2016:04:11/hourlySteps_merged.csv")
hourly_intensities_month_3 = read.csv("fitbit_dataset/fitbit_export_2016:03:12-2016:04:11/hourlyIntensities_merged.csv")
hourly_calories_month_3 = read.csv("fitbit_dataset/fitbit_export_2016:03:12-2016:04:11/hourlyCalories_merged.csv")
seconds_heart_rate_month_3 = read.csv("fitbit_dataset/fitbit_export_2016:03:12-2016:04:11/heartrate_seconds_merged.csv")
daily_activity_month_3 = read.csv("fitbit_dataset/fitbit_export_2016:03:12-2016:04:11/dailyActivity_merged.csv")
Month 4
daily_activity_month_4 = read.csv("fitbit_dataset/fitbit_export_2016:04:12-2016:05:12/dailyActivity_merged.csv")
daily_calories_month_4 = read.csv("fitbit_dataset/fitbit_export_2016:04:12-2016:05:12/dailyCalories_merged.csv")
daily_intensities_month_4 = read.csv("fitbit_dataset/fitbit_export_2016:04:12-2016:05:12/dailyIntensities_merged.csv")
daily_steps_month_4 = read.csv("fitbit_dataset/fitbit_export_2016:04:12-2016:05:12/dailySteps_merged.csv")
seconds_heart_rate_month_4 = read.csv("fitbit_dataset/fitbit_export_2016:04:12-2016:05:12/heartrate_seconds_merged.csv")
hourly_calories_month_4 = read.csv("fitbit_dataset/fitbit_export_2016:04:12-2016:05:12/hourlyCalories_merged.csv")
hourly_intensities_month_4 = read.csv("fitbit_dataset/fitbit_export_2016:04:12-2016:05:12/hourlyIntensities_merged.csv")
hourly_steps_month_4 = read.csv("fitbit_dataset/fitbit_export_2016:04:12-2016:05:12/hourlySteps_merged.csv")
minute_calories_month_4 = read.csv("fitbit_dataset/fitbit_export_2016:04:12-2016:05:12/minuteCaloriesNarrow_merged.csv")
minute_intensities_month_4 = read.csv("fitbit_dataset/fitbit_export_2016:04:12-2016:05:12/minuteIntensitiesNarrow_merged.csv")
minute_METs_month_4 = read.csv("fitbit_dataset/fitbit_export_2016:04:12-2016:05:12/minuteMETsNarrow_merged.csv")
minute_sleep_month_4 = read.csv("fitbit_dataset/fitbit_export_2016:04:12-2016:05:12/minuteSleep_merged.csv")
minute_steps_month_4 = read.csv("fitbit_dataset/fitbit_export_2016:04:12-2016:05:12/minuteStepsNarrow_merged.csv")
daily_sleep_month_4 = read.csv("fitbit_dataset/fitbit_export_2016:04:12-2016:05:12/sleepDay_merged.csv")
weight_info_month_4 = read.csv("fitbit_dataset/fitbit_export_2016:04:12-2016:05:12/weightLogInfo_merged.csv")
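As an aside, the explicit read.csv() calls above could be replaced with a small helper that reads every CSV in an export folder into a named list. The sketch below uses purrr and is not the approach taken in this analysis; the function name is our own.

# Hypothetical helper: read all CSVs in one Fitbit export folder into a named list of data frames
read_fitbit_folder <- function(folder) {
  files <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)
  purrr::set_names(
    purrr::map(files, read.csv),
    tools::file_path_sans_ext(basename(files))
  )
}
# Example: month_3 <- read_fitbit_folder("fitbit_dataset/fitbit_export_2016:03:12-2016:04:11")
# month_3$dailyActivity_merged would then hold the Month 3 daily activity table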
Here we will take the following actions:

1. Use the distinct() function to identify the number of distinct users in Month 3 and Month 4.
2. Use the intersect() function to identify the distinct users present in both datasets.
3. Use the union_all() function to merge the data from Months 3 and 4, first using the names() function to verify that the columns in both datasets are consistent and can be merged.
4. Use the inner_join() function to join all user data from Months 3 and 4 (prepared in step 3) with only the distinct users present in both datasets (prepared in step 2). This ensures that we include only those users who are present in both months, allowing us to analyze a full two months’ worth of data per user. For further rationale, see Section 3.1. A reusable helper wrapping these four steps is sketched after the Daily Activity example below.

Month 3 has 35 distinct users and Month 4 has 33 distinct users.
distinct_users_daily_activity_month_3 <- distinct(daily_activity_month_3,Id)
distinct_users_daily_activity_month_4 <- distinct(daily_activity_month_4,Id)
knitr::kable(
list(distinct_users_daily_activity_month_3, distinct_users_daily_activity_month_4),
)
head(distinct_users_daily_activity_month_3)
head(distinct_users_daily_activity_month_4)
Month 3 and 4 have 33 unique common users.
unique_common_users_daily_activity = intersect(
distinct_users_daily_activity_month_3,
distinct_users_daily_activity_month_4,
)
head(unique_common_users_daily_activity)
Merging Months 3 and 4.
names(daily_activity_month_3)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
names(daily_activity_month_4)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
names(daily_activity_month_3) == names(daily_activity_month_4)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
daily_activity_with_all_users = union_all(
daily_activity_month_3,
daily_activity_month_4
)
head(daily_activity_with_all_users)
Joining Months 3 and 4.
daily_activity = inner_join(
daily_activity_with_all_users,
unique_common_users_daily_activity,
by = join_by(Id)
)
head(daily_activity)
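The distinct → intersect → union_all → inner_join pattern above is repeated below for each remaining dataset. For reference, the four steps could be wrapped in a reusable helper such as the sketch below; the function name and argument names are our own, and the analysis keeps the explicit per-dataset chunks.

# Sketch: stack two months of data and keep only users present in both months
merge_common_users <- function(month_3_data, month_4_data) {
  # columns must match before stacking the two months
  stopifnot(identical(names(month_3_data), names(month_4_data)))
  common_users <- intersect(
    distinct(month_3_data, Id),
    distinct(month_4_data, Id)
  )
  union_all(month_3_data, month_4_data) |>
    inner_join(common_users, by = join_by(Id))
}
# Equivalent to the Daily Activity steps above:
# daily_activity <- merge_common_users(daily_activity_month_3, daily_activity_month_4)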
Month 3 has 34 distinct users and Month 4 has 33 distinct users.
distinct_users_hourly_calories_month_3 <- distinct(hourly_calories_month_3,Id)
distinct_users_hourly_calories_month_4 <- distinct(hourly_calories_month_4,Id)
knitr::kable(
list(distinct_users_hourly_calories_month_3,distinct_users_hourly_calories_month_4),
)
head(distinct_users_hourly_calories_month_3)
head(distinct_users_hourly_calories_month_4)
Month 3 and 4 have 32 unique common users.
unique_common_users_hourly_calories = intersect(
distinct_users_hourly_calories_month_3,
distinct_users_hourly_calories_month_4,
)
head(unique_common_users_hourly_calories)
Merging Months 3 and 4.
names(hourly_calories_month_3)
## [1] "Id" "ActivityHour" "Calories"
names(hourly_calories_month_4)
## [1] "Id" "ActivityHour" "Calories"
names(hourly_calories_month_3) == names(hourly_calories_month_4)
## [1] TRUE TRUE TRUE
hourly_calories_with_all_users = union_all(
hourly_calories_month_3,
hourly_calories_month_4
)
head(hourly_calories_with_all_users)
Joining Months 3 and 4.
hourly_calories = inner_join(
hourly_calories_with_all_users,
unique_common_users_hourly_calories,
by = join_by(Id)
)
head(hourly_calories)
Month 3 has 34 distinct users and Month 4 has 33 distinct users.
distinct_users_hourly_intensities_month_3 <- distinct(hourly_intensities_month_3,Id)
distinct_users_hourly_intensities_month_4 <- distinct(hourly_intensities_month_4,Id)
knitr::kable(
list(distinct_users_hourly_intensities_month_3,distinct_users_hourly_intensities_month_4),
)
head(distinct_users_hourly_intensities_month_3)
head(distinct_users_hourly_intensities_month_4)
Month 3 and 4 have 32 unique common users.
unique_common_users_hourly_intensities = intersect(
distinct_users_hourly_intensities_month_3,
distinct_users_hourly_intensities_month_4,
)
head(unique_common_users_hourly_intensities)
Merging Months 3 and 4.
names(hourly_intensities_month_3)
## [1] "Id" "ActivityHour" "TotalIntensity" "AverageIntensity"
names(hourly_intensities_month_4)
## [1] "Id" "ActivityHour" "TotalIntensity" "AverageIntensity"
names(hourly_intensities_month_3) == names(hourly_intensities_month_4)
## [1] TRUE TRUE TRUE TRUE
hourly_intensities_with_all_users = union_all(
hourly_intensities_month_3,
hourly_intensities_month_4
)
head(hourly_intensities_with_all_users)
Joining Months 3 and 4.
hourly_intensities = inner_join(
hourly_intensities_with_all_users,
unique_common_users_hourly_intensities,
by = join_by(Id)
)
head(hourly_intensities)
Month 3 has 34 distinct users and Month 4 has 33 distinct users.
distinct_users_hourly_steps_month_3 <- distinct(hourly_steps_month_3,Id)
distinct_users_hourly_steps_month_4 <- distinct(hourly_steps_month_4,Id)
knitr::kable(
list(distinct_users_hourly_steps_month_3,distinct_users_hourly_steps_month_4),
)
head(distinct_users_hourly_steps_month_3)
head(distinct_users_hourly_steps_month_4)
Month 3 and 4 have 32 unique common users.
unique_common_users_hourly_steps = intersect(
distinct_users_hourly_steps_month_3,
distinct_users_hourly_steps_month_4,
)
head(unique_common_users_hourly_steps)
Merging Months 3 and 4.
names(hourly_steps_month_3)
## [1] "Id" "ActivityHour" "StepTotal"
names(hourly_steps_month_4)
## [1] "Id" "ActivityHour" "StepTotal"
names(hourly_steps_month_3) == names(hourly_steps_month_4)
## [1] TRUE TRUE TRUE
hourly_steps_with_all_users = union_all(
hourly_steps_month_3,
hourly_steps_month_4
)
head(hourly_steps_with_all_users)
Joining Months 3 and 4.
hourly_steps = inner_join(
hourly_steps_with_all_users,
unique_common_users_hourly_steps,
by = join_by(Id)
)
head(hourly_steps)
Month 3 has 23 distinct users and Month 4 has 24 distinct users. Although both datasets contain fewer than 30 distinct users, we will proceed with using them due to the high number of observations in each (~180,000). Also, these are the only sleep datasets that maintain consistent time intervals (minutes) across Months 3 and 4, aligning with the analysis criteria outlined in Section 3.1.
distinct_users_minute_sleep_month_3 <- distinct(minute_sleep_month_3,Id)
distinct_users_minute_sleep_month_4 <- distinct(minute_sleep_month_4,Id)
knitr::kable(
list(distinct_users_minute_sleep_month_3,distinct_users_minute_sleep_month_4),
)
head(distinct_users_minute_sleep_month_3)
head(distinct_users_minute_sleep_month_4)
Month 3 and 4 have 22 unique common users.
unique_common_users_minute_sleep = intersect(
distinct_users_minute_sleep_month_3,
distinct_users_minute_sleep_month_4,
)
head(unique_common_users_minute_sleep)
Merging Months 3 and 4.
names(minute_sleep_month_3)
## [1] "Id" "date" "value" "logId"
names(minute_sleep_month_4)
## [1] "Id" "date" "value" "logId"
names(minute_sleep_month_3) == names(minute_sleep_month_4)
## [1] TRUE TRUE TRUE TRUE
minute_sleep_with_all_users = union_all(
minute_sleep_month_3,
minute_sleep_month_4
)
head(minute_sleep_with_all_users)
Joining Months 3 and 4.
minute_sleep = inner_join(
minute_sleep_with_all_users,
unique_common_users_minute_sleep,
by = join_by(Id)
)
head(minute_sleep)
Month 3 has 14 distinct users and Month 4 has 14 distinct users. Although both datasets contain fewer than 30 distinct users, we will proceed using the same logic applied to the Minute_Sleep datasets.
distinct_users_seconds_heart_rate_month_3 <- distinct(seconds_heart_rate_month_3,Id)
distinct_users_seconds_heart_rate_month_4 <- distinct(seconds_heart_rate_month_4,Id)
knitr::kable(
list(distinct_users_seconds_heart_rate_month_3,distinct_users_seconds_heart_rate_month_4),
)
head(distinct_users_seconds_heart_rate_month_3)
head(distinct_users_seconds_heart_rate_month_4)
Month 3 and 4 have 13 unique common users.
unique_common_users_seconds_heart_rate = intersect(
distinct_users_seconds_heart_rate_month_3,
distinct_users_seconds_heart_rate_month_4,
)
head(unique_common_users_seconds_heart_rate)
Merging Months 3 and 4.
names(seconds_heart_rate_month_3)
## [1] "Id" "Time" "Value"
names(seconds_heart_rate_month_4)
## [1] "Id" "Time" "Value"
names(seconds_heart_rate_month_3) == names(seconds_heart_rate_month_4)
## [1] TRUE TRUE TRUE
seconds_heart_rate_with_all_users = union_all(
seconds_heart_rate_month_3,
seconds_heart_rate_month_4
)
head(seconds_heart_rate_with_all_users)
Joining Months 3 and 4.
seconds_heart_rate = inner_join(
seconds_heart_rate_with_all_users,
unique_common_users_seconds_heart_rate,
by = join_by(Id)
)
head(seconds_heart_rate)
In this stage, we will clean our data to ensure it is accurate, consistent, and reliable. We will start by identifying and removing duplicate entries and investigating and handling anomalies. Next, we will handle missing values by identifying and removing them as needed. We will standardize column names to achieve consistency and ensure all date-time fields are correctly parsed. Finally, we will save our cleaned datasets, providing a solid foundation for generating accurate insights in the subsequent analysis phase.
First, we will use the count() function to see how many rows each dataset has. Then, we will use distinct() and count() together to check for and remove any duplicate rows.
0 duplicate rows
daily_activity |> count()
daily_activity |> distinct() |> count()
(daily_activity |> count()) - (daily_activity|> distinct() |> count())
Removed 175 duplicate rows
hourly_calories |> count()
hourly_calories |> distinct() |> count()
(hourly_calories |> count()) - (hourly_calories |> distinct() |> count())
hourly_calories_distinct = hourly_calories |> distinct()
Removed 175 duplicate rows
hourly_intensities |> count()
hourly_intensities |> distinct() |> count()
(hourly_intensities |> count()) - (hourly_intensities|> distinct() |> count())
hourly_intensities_distinct = hourly_intensities |> distinct()
Removed 175 duplicate rows
hourly_steps |> count()
hourly_steps |> distinct() |> count()
(hourly_steps |> count()) - (hourly_steps|> distinct() |> count())
hourly_steps_distinct = hourly_steps |> distinct()
Removed 4,300 duplicate rows
minute_sleep |> count()
minute_sleep |> distinct() |> count()
(minute_sleep |> count()) - (minute_sleep|> distinct() |> count())
minute_sleep_distinct = minute_sleep |> distinct()
Removed 23,424 duplicate rows
seconds_heart_rate |> count()
seconds_heart_rate |> distinct() |> count()
(seconds_heart_rate |> count()) - (seconds_heart_rate|> distinct() |> count())
seconds_heart_rate_distinct = seconds_heart_rate |> distinct()
Given the nature of this dataset, we are taking an additional step to verify the removal of duplicate data by ensuring that the first two columns in each dataset—typically the Id and a date/time variation—are not duplicated. If duplicated, it could lead to multiple records of activity/sleep data for the same user on the same day/hour/minute/second, falsely inflating perceived activity/sleep levels and other metrics derived from these counts. This would skew our analysis, potentially resulting in misleading conclusions about user behaviour and trends.
To prevent this, we will take the following actions:

1. Use the group_by() function to group the data by Id and the date/time variant.
2. Use the count() and filter() functions to ensure that each Id and date/time combination has only one entry, specifically a count of one.

48 rows where the Id and ActivityDate are duplicated
daily_activity |> group_by(Id, ActivityDate) |> count() |> filter(n > 1) |> view()
We will now address the daily_activity dataset, which includes 48 rows with duplicated Id and ActivityDate entries.
First, we will address the SedentaryMinutes column by taking the following action:

- Use the filter() function to identify and remove rows where SedentaryMinutes >= 1440, the total number of minutes in a day. We’re doing this because if a user’s device reports being sedentary for the entire day, it is likely that they did not wear it. Removing these entries is important as we are analyzing users’ activity levels, and data from unworn devices is invalid for our analysis.

132 users whose devices reported that they were sedentary for the entire day
daily_activity |> filter(SedentaryMinutes >= 1440) |> count()
Removing these users
daily_activity_removing_complete_sedentary_mins <- daily_activity |> filter(SedentaryMinutes < 1440)
Next, we will address the TotalSteps column by taking the following action:

- Use the filter() function to identify and remove rows where TotalSteps = 0. Following the same logic as above, we assume this indicates that the user did not wear their device that day, rendering the data invalid.

126 users whose devices reported 0 steps for the entire day
daily_activity |> filter(TotalSteps == 0) |> count ()
Removing these users
daily_activity_removing_0_steps <- daily_activity_removing_complete_sedentary_mins |> filter(TotalSteps > 0)
We’re now left with 19 users who have duplicate entries for the same day, totaling 38 rows with duplicated Id and ActivityDate entries. Upon closer inspection, we observed an inconsistency: within each pair of duplicates, the record with the higher TotalSteps paradoxically also showed the higher SedentaryMinutes. This is a clear data anomaly, as logically, increased physical activity (indicated by a higher step count) should correspond with fewer sedentary minutes, not more. This contradiction suggests a possible error in data recording or processing, which would require further investigation. Typically, at this stage we would consult with the business or data provider to clarify the discrepancy.
However, since those options are not available, we will take the following informed action to address this issue in the data:

- Use the summarise() function to aggregate the duplicate data, selecting the maximum value for all columns except the SedentaryMinutes column, for which we will take the minimum value. This approach ensures that we evaluate users at their most active and least sedentary states. A sketch of this aggregation step follows below.

## `summarise()` has grouped output by 'Id'. You can override using the `.groups`
## argument.
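The aggregation chunk itself is not echoed above (only its grouping message is shown). A minimal sketch consistent with the description, with an assumed name for the result, could look like this:

# Sketch: collapse duplicate Id/ActivityDate rows -- max of every measure except SedentaryMinutes (min)
# (the result name below is assumed; the original object name is not shown)
daily_activity_aggregated <- daily_activity_removing_0_steps |>
  group_by(Id, ActivityDate) |>
  summarise(
    across(-SedentaryMinutes, max),
    SedentaryMinutes = min(SedentaryMinutes)
  ) |>
  ungroup()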
0 duplicate rows
hourly_calories_distinct |> group_by(Id, ActivityHour) |> count() |> filter(n > 1) |> view()
0 duplicate rows
hourly_intensities_distinct |> group_by(Id, ActivityHour) |> count() |> filter(n > 1) |> view()
0 duplicate rows
hourly_steps_distinct |> group_by(Id, ActivityHour) |> count() |> filter(n > 1) |> view()
0 duplicate rows
minute_sleep_distinct |> group_by(Id, date) |> count() |> filter(n > 1) |> view()
0 duplicate rows
seconds_heart_rate_distinct |> group_by(Id, Time) |> count() |> filter(n > 1) |> view()
Next, we will check for null (NA) values.
0 null values
any(is.na(daily_activity_removing_0_steps))
## [1] FALSE
0 null values
any(is.na(hourly_calories_distinct))
## [1] FALSE
0 null values
any(is.na(hourly_intensities_distinct))
## [1] FALSE
0 null values
any(is.na(hourly_steps_distinct))
## [1] FALSE
0 null values
any(is.na(minute_sleep_distinct))
## [1] FALSE
0 null values
any(is.na(seconds_heart_rate_distinct))
## [1] FALSE
Next we will correct the column names.
## `summarise()` has grouped output by 'Id'. You can override using the `.groups`
## argument.
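The rename step for the daily_activity table is not echoed here, though its result (daily_activity_renaming_columns, with an Activity_Date column) is used in the date-time step below. A sketch consistent with that later usage, assuming the aggregated table from the earlier sketch as input, might be:

# Sketch: rename daily_activity columns to match the style used later
# (only the rename confirmed by later code is shown; other columns could follow the same pattern)
daily_activity_renaming_columns <- daily_activity_aggregated |>   # assumed input name from the sketch above
  rename("Activity_Date" = "ActivityDate") |>
  view()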
hourly_calories_renaming_columns <- hourly_calories_distinct |>
rename("Activity_Hour" = "ActivityHour") |>
view()
hourly_intensities_renaming_columns <- hourly_intensities_distinct |>
rename("Activity_Hour" = "ActivityHour") |>
rename("Total_Intensity" = "TotalIntensity") |>
rename("Average_Intensity" = "AverageIntensity") |>
view()
hourly_steps_renaming_columns <- hourly_steps_distinct |>
rename("Activity_Hour" = "ActivityHour") |>
rename("Step_Total" = "StepTotal") |>
view()
minute_sleep_renaming_columns <- minute_sleep_distinct |>
rename("Date" = "date") |>
rename("Value" = "value") |>
rename("Log_Id" = "logId") |>
view()
Column names do not need to be corrected.
Lastly, we will convert date-time data, currently stored as a character datatype, to a date-time datatype. We do this because character data does not support date-time operations, which can lead to errors during our analysis if not corrected. We will do this in two steps:
1. Use the mutate() and mdy()/mdy_hms() functions to create a new column with the parsed date-time data, to verify the data is correctly converted.
2. Use mutate() followed by select(-) to replace the old column with the newly parsed date-time column and remove the old column from the table.

daily_activity_renaming_columns |> mutate(Parsed_Activity_Date = mdy(Activity_Date)) |> view()
d_a_datetime <- daily_activity_renaming_columns |> mutate(Parsed_Activity_Date = mdy(Activity_Date)) |> mutate(Activity_Date = Parsed_Activity_Date) |> select(-Parsed_Activity_Date) |> view()
hourly_calories_renaming_columns |> mutate(Parsed_Activity_Hour = mdy_hms(Activity_Hour)) |> view()
h_c_datetime <- hourly_calories_renaming_columns |> mutate(Parsed_Activity_Hour = mdy_hms(Activity_Hour)) |> mutate(Activity_Hour = Parsed_Activity_Hour) |> select(-Parsed_Activity_Hour) |> view()
hourly_intensities_renaming_columns |> mutate(Parsed_Activity_Hour = mdy_hms(Activity_Hour)) |> view()
h_i_datetime <- hourly_intensities_renaming_columns |> mutate(Parsed_Activity_Hour = mdy_hms(Activity_Hour)) |> mutate(Activity_Hour = Parsed_Activity_Hour) |> select(-Parsed_Activity_Hour) |> view()
hourly_steps_renaming_columns |> mutate(Parsed_Activity_Hour = mdy_hms(Activity_Hour)) |> view()
h_s_datetime <- hourly_steps_renaming_columns |> mutate(Parsed_Activity_Hour = mdy_hms(Activity_Hour)) |> mutate(Activity_Hour = Parsed_Activity_Hour) |> select(-Parsed_Activity_Hour) |> view()
minute_sleep_renaming_columns |> mutate(Parsed_Date = mdy_hms(Date)) |> view()
m_s_datetime <- minute_sleep_renaming_columns |> mutate(Parsed_Date = mdy_hms(Date)) |> mutate(Date = Parsed_Date) |> select(-Parsed_Date)|> view()
seconds_heart_rate_distinct |> mutate(Parsed_Time = mdy_hms(Time)) |> view()
s_hr_datetime <- seconds_heart_rate_distinct |> mutate(Parsed_Time = mdy_hms(Time)) |> mutate(Time = Parsed_Time) |> select(-Parsed_Time)|> view()
We will now save our cleaned datasets. Yippee!
daily_activity_cleaned <- d_a_datetime
hourly_calories_cleaned <- h_c_datetime
hourly_intensities_cleaned <- h_i_datetime
hourly_steps_cleaned <- h_s_datetime
minute_sleep_cleaned <- m_s_datetime
seconds_heart_rate_cleaned <- s_hr_datetime
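The assignments above keep the cleaned data in memory under the *_cleaned names. If the cleaned tables also need to be written to disk for reuse outside this session (not done in the original workflow), readr’s write_csv() is one option:

# Optional: persist the cleaned tables as CSVs (file names are illustrative)
write_csv(daily_activity_cleaned, "daily_activity_cleaned.csv")
write_csv(hourly_calories_cleaned, "hourly_calories_cleaned.csv")
write_csv(hourly_intensities_cleaned, "hourly_intensities_cleaned.csv")
write_csv(hourly_steps_cleaned, "hourly_steps_cleaned.csv")
write_csv(minute_sleep_cleaned, "minute_sleep_cleaned.csv")
write_csv(seconds_heart_rate_cleaned, "seconds_heart_rate_cleaned.csv")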
As noted in Section 2.3, our data had several limitations. One major challenge was the sample size. Although the dataset is described as containing data from 30 Fitbit users, our analysis in Section 3.2 revealed that the number of users who consistently shared their data across the whole two-month study period varied from 13 to 33, depending on the dataset. A sample this small is generally too limited to support statistically robust conclusions. Additionally, the absence of demographic information introduces potential sampling bias, possibly making the sample unrepresentative of the broader population. We were further constrained by the two-month time frame, which limited the dataset’s comprehensiveness.
An unexpected anomaly was identified in the daily_activity dataset, where duplicate Id and Activity_Date entries occurred. In these duplicates, a higher number of total steps corresponded with a higher number of sedentary minutes, which is illogical and likely due to a recording or processing error. Addressing these anomalies was crucial for ensuring our data’s accuracy, consistency, and reliability.
To overcome these limitations in the future, analyzing additional datasets from competitor smart devices, such as the Apple Watch or Oura Ring, could be beneficial. These products are comparable to Bellabeat’s wearable devices, and their user data would provide a more comprehensive view of smart device usage and user behavior, enhancing insights for a future study of this nature.
Product Positioning: Since most participants in the study were lightly active, Bellabeat can launch a campaign showcasing how its app and smart devices help users transition from light to more moderate or high levels of activity. This campaign will feature tailored ads demonstrating the incremental benefits of increased activity, appealing to all customer segments. For moderately and highly active users, the campaign will highlight advanced metrics and activity tracking features of the app that enhance training effectiveness, underscoring Bellabeat’s ability to help customers of all fitness levels achieve their health and fitness goals and “activate their best self.”
Diverse Demographic Targeting: The campaign will also focus on reaching a broader and more diverse audience to expand the customer base, which currently consists predominantly of women. This approach will enrich the collection of varied and diverse customer data, improving insights and compensating for the lack of demographic information identified in the competitor data used in our study.
Weekend Activity Campaigns: Observing that activity peaks on Saturdays and dips on Sundays, Bellabeat can launch weekend promotional campaigns that motivate users to meet step count milestones. These interactive challenges will involve progressing through different levels based on the number of steps taken. Users will need to maintain consistent weekend activity over a period (such as a month) to advance. Each level passed earns points redeemable for Bellabeat products, effectively gamifying weekend activity to enhance overall engagement and help users reach their fitness goals more quickly.
Sleep Quality Improvement Features: With most users failing to meet recommended sleep durations, Bellabeat can introduce or enhance features like sleep analysis and personalized sleep coaching, possibly integrating guided meditations or relaxing sleep sounds to improve sleep quality and overall health.
Heart Rate Monitoring Features: Bellabeat could also enhance or introduce new heart rate monitoring features, focusing on their accuracy and the benefits they offer, especially given their proven correlation with calorie expenditure.
Educational Content via Wellness Tips: Bellabeat can produce short-form content in the form of “Wellness Tips,” available through the app. The content can focus on topics like Heart Rate and Activity, educating users on the benefits of incorporating high-intensity physical activity into their fitness regimens and its impact on calorie burn and sleep habits, leveraging data that shows the correlation between activity levels and calorie burn.
Optimizing Scheduling Times: Content upload times should be optimized to coincide with peak activity hours between 12-2 PM and 5-7 PM. This strategy ensures heightened visibility and engagement, allowing users to derive maximum value from the content.
Incentives for Consistent Usage: Bellabeat can offer incentives to encourage continuous use, particularly targeting the 82.9% of users who show medium to high engagement. By implementing loyalty programs and issuing rewards through the app for consistent usage, Bellabeat can promote long-term engagement.
Thank you for viewing my Google Data Analytics Certificate Case Study!