Matt’s Baby Stats - First off: how do I even open Apple Health data?

Opening Apple Health data

I have exported my Apple Health data using the Health app’s export functionality. We now have a folder of .xml files to deal with. Eventually I’ll switch to using my partner’s tracking data, which will be a far better proxy for the baby’s sleep habits, but she is currently at a baby class so I’ll need to grab it later.

Let’s take a look at export.xml, which I think should contain the sleep data.

Reading the first few characters of the file

library('readr')
health_filepath <- '../../data/apple_health_export/export.xml'
health_file <- substr(read_file(health_filepath), 1, 1000)
print(health_file)

[1] "<!DOCTYPE HealthData [<!ELEMENT HealthData (ExportDate,Me,(Record|Correlation|Workout|ActivitySummary|ClinicalRecord|Audiogram|VisionPrescription))><!ATTLIST HealthData locale CDATA #REQUIRED><!ELEMENT ExportDate EMPTY><!ATTLIST ExportDate value CDATA #REQUIRED><!ELEMENT Me EMPTY><!ATTLIST Me HKCharacteristicTypeIdentifierDateOfBirth CDATA #REQUIRED HKCharacteristicTypeIdentifierBiologicalSex CDATA #REQUIRED HKCharacteristicTypeIdentifierBloodType CDATA #REQUIRED HKCharacteristicTypeIdentifierFitzpatrickSkinType CDATA #REQUIRED HKCharacteristicTypeIdentifierCardioFitnessMedicationsUse CDATA #REQUIRED><!ELEMENT Record ((MetadataEntry|HeartRateVariabilityMetadataList))><!ATTLIST Record type CDATA #REQUIRED unit CDATA #IMPLIED value CDATA #IMPLIED sourceName CDATA #REQUIRED sourceVer"
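
Reading the whole export into memory just to print its first 1000 characters is a lot of work for a file this size. Here is a minimal sketch of a lighter alternative using base R's readChar(). The temporary file and the names demo_path and first_chars are made up for illustration; in practice the path would be health_filepath.

```r
# Write a small stand-in file so the example is self-contained
# (hypothetical content; the real export.xml would be used in practice).
demo_path <- tempfile(fileext = ".xml")
writeLines(strrep("<Record/>", 500), demo_path)

# readChar() reads only the requested number of characters
# rather than loading the whole file first.
first_chars <- readChar(demo_path, nchars = 1000)
nchar(first_chars)
```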

So it looks like Apple use HKCharacteristicTypeIdentifier as a prefix to some data. Perhaps this is just a field name?

Let’s parse the XML file properly and see what we have.

# install.packages("XML") # I need this one to work with the xml files
library("XML")

# other packages that I think I might need
library("dplyr")


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

# parse the xml file
health <- xmlParse(health_filepath)
summary(health)

$nameCounts

                          Record      InstantaneousBeatsPerMinute 
                          505685                           171703 
                   MetadataEntry HeartRateVariabilityMetadataList 
                          144203                             2511 
                 ActivitySummary                     WorkoutEvent 
                             410                              172 
               WorkoutStatistics                          Workout 
                             143                               38 
                   FileReference                     WorkoutRoute 
                              30                               30 
                      ExportDate                       HealthData 
                               1                                1 
                              Me 
                               1 

$numNodes
[1] 824928

Hmm, not very helpful?
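
The node counts say how many Record elements there are, but not what kinds of records they are. One way to get a bit further, sketched here on a tiny made-up document rather than the real export, is to pull the type attribute out of every Record with an XPath query and tabulate it. The names demo_xml, demo_doc, record_types and type_counts are illustrative only.

```r
library(XML)

# A tiny made-up document standing in for export.xml.
demo_xml <- '<HealthData>
  <Record type="HKQuantityTypeIdentifierHeartRate" value="77"/>
  <Record type="HKQuantityTypeIdentifierHeartRate" value="76"/>
  <Record type="HKQuantityTypeIdentifierHeight" value="6.33333"/>
</HealthData>'
demo_doc <- xmlParse(demo_xml, asText = TRUE)

# Pull the type attribute from every Record node, then tabulate it.
record_types <- xpathSApply(demo_doc, "//Record", xmlGetAttr, "type")
type_counts <- table(record_types)
type_counts
```

On the real data the same two lines would run against health instead of demo_doc, although with half a million records it may take a while.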

Instead of spending more time trying to parse the XML file myself, let’s turn to Google. This site seems to have some explanations of what’s going on with the Apple Watch data. It looks like HKQuantityTypeIdentifier is indeed just a prefix for a field name. There should be a field called HKCategoryTypeIdentifierSleepAnalysis, which sounds like what I am after!

Exploring the data

This blog gives a snippet that should convert the entire XML export into a dataframe. This has saved a ton of time!

# https://www.johngoldin.com/blog/apple-health-export/2022-07-notes-apple-health-export/index.html
health_df <- XML:::xmlAttrsToDataFrame(health["//Record"], stringsAsFactors = FALSE) |>
        as_tibble() |> mutate(value = as.numeric(value)) |>
        select(-device)

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `value = as.numeric(value)`.
Caused by warning:
! NAs introduced by coercion

I don’t really understand what’s going on here yet so let’s try to pull it apart bit by bit.

According to the snippet, the information I should be interested in is contained in health['//Record']. Inspecting this with typeof(health['//Record']), I can see that it’s a list, so let’s just grab the first element:

health['//Record'][[1]] # remember that R is 1-indexed, unlike Python!

<Record type="HKQuantityTypeIdentifierHeight" sourceName="Health" sourceVersion="15.6.1" unit="ft" creationDate="2022-08-22 21:27:24 +0100" startDate="2022-08-22 21:27:24 +0100" endDate="2022-08-22 21:27:24 +0100" value="6.33333"/>

Ok, now we’re talking. It looks like those HKQuantityTypeIdentifiers are stored in a field called type. We also seem to have some metadata about where this data came from: this entry seems to be from the Apple Health app. We have a version number too (perhaps ‘15.6.1’ is the version of the Health app, or maybe of iOS?).

We have a few datetime fields that I’ll need to dig into, and then a value ‘6.33333’ and a unit ‘ft’. It looks like this entry must have been when I entered my height when I first got my iPhone.

Let’s take a look at a few more:

health['//Record'][42:44]

[[1]]
<Record type="HKQuantityTypeIdentifierHeartRate" sourceName="Matthew’s Apple Watch" sourceVersion="9.3.1" device="&lt;&lt;HKDevice: 0x280934230&gt;, name:Apple Watch, manufacturer:Apple Inc., model:Watch, hardware:Watch6,11, software:9.3.1&gt;" unit="count/min" creationDate="2023-03-20 13:59:56 +0100" startDate="2023-03-20 13:59:55 +0100" endDate="2023-03-20 13:59:55 +0100" value="77">
  <MetadataEntry key="HKMetadataKeyHeartRateMotionContext" value="0"/>
</Record> 

[[2]]
<Record type="HKQuantityTypeIdentifierHeartRate" sourceName="Matthew’s Apple Watch" sourceVersion="9.3.1" device="&lt;&lt;HKDevice: 0x280934230&gt;, name:Apple Watch, manufacturer:Apple Inc., model:Watch, hardware:Watch6,11, software:9.3.1&gt;" unit="count/min" creationDate="2023-03-20 14:00:01 +0100" startDate="2023-03-20 13:59:57 +0100" endDate="2023-03-20 13:59:57 +0100" value="77">
  <MetadataEntry key="HKMetadataKeyHeartRateMotionContext" value="0"/>
</Record> 

[[3]]
<Record type="HKQuantityTypeIdentifierHeartRate" sourceName="Matthew’s Apple Watch" sourceVersion="9.3.1" device="&lt;&lt;HKDevice: 0x280934230&gt;, name:Apple Watch, manufacturer:Apple Inc., model:Watch, hardware:Watch6,11, software:9.3.1&gt;" unit="count/min" creationDate="2023-03-20 14:00:06 +0100" startDate="2023-03-20 14:00:03 +0100" endDate="2023-03-20 14:00:03 +0100" value="76">
  <MetadataEntry key="HKMetadataKeyHeartRateMotionContext" value="0"/>
</Record>

Three records of my heart rate, by the looks of things. These three entries are clearly from my Apple Watch, which supports my guess that the earlier sourceName corresponded to the iPhone Health app.

The rest of the earlier snippet just looks like it’s converting these entries into a dataframe using XML:::xmlAttrsToDataFrame(). The ::: suggests this is an internal function that the XML package does not export, which would explain why I can’t easily find documentation for it, but I will search harder later. In particular I will want to see whether it has more arguments than just stringsAsFactors.
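
To get a feel for what a conversion like this involves, here is a hand-rolled sketch of the same idea: take each Record node's attributes and stack them into a tibble. This is not the internals of xmlAttrsToDataFrame(), just an approximation built from functions the XML package does export (getNodeSet() and xmlAttrs()), run on a made-up two-record document. The names demo_xml, demo_doc, record_nodes and demo_df are illustrative.

```r
library(XML)
library(dplyr)

# A tiny made-up document; the second record has an extra attribute.
demo_xml <- '<HealthData>
  <Record type="HKQuantityTypeIdentifierHeight" unit="ft" value="6.33333"/>
  <Record type="HKQuantityTypeIdentifierHeartRate" unit="count/min" value="77" device="watch"/>
</HealthData>'
demo_doc <- xmlParse(demo_xml, asText = TRUE)

# Turn each Record node's attributes into a named list, then stack the
# lists into one tibble; attributes missing from a record become NA.
record_nodes <- getNodeSet(demo_doc, "//Record")
demo_df <- record_nodes |>
  lapply(function(node) as.list(xmlAttrs(node))) |>
  bind_rows()
demo_df
```

As for the arguments question, calling args(XML:::xmlAttrsToDataFrame) should print the function's argument list even without documentation.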

Taking a look at the dataframe we have made:

summary(health_df)

     type            sourceName        sourceVersion          unit          
 Length:505685      Length:505685      Length:505685      Length:505685     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 creationDate        startDate           endDate              value         
 Length:505685      Length:505685      Length:505685      Min.   :   0.000  
 Class :character   Class :character   Class :character   1st Qu.:   0.365  
 Mode  :character   Mode  :character   Mode  :character   Median :   1.266  
                                                          Mean   :  27.002  
                                                          3rd Qu.:  35.696  
                                                          Max.   :1290.000  
                                                          NA's   :17576

dim(health_df)

[1] 505685      8

Okay, so we have about half a million rows and eight columns. Let’s take a look inside:

health_df %>% 
  tail(10) %>% 
  glimpse

Rows: 10
Columns: 8
$ type          <chr> "HKQuantityTypeIdentifierHeartRateVariabilitySDNN", "HKQ…
$ sourceName    <chr> "Matthew’s Apple Watch", "Matthew’s Apple Watch", "Matth…
$ sourceVersion <chr> "10.0.1", "10.0.1", "10.0.1", "10.0.1", "10.0.1", "10.0.…
$ unit          <chr> "ms", "ms", "ms", "ms", "ms", "ms", "ms", "ms", "ms", "m…
$ creationDate  <chr> "2023-10-25 15:37:04 +0100", "2023-10-25 17:44:53 +0100"…
$ startDate     <chr> "2023-10-25 15:36:03 +0100", "2023-10-25 17:43:52 +0100"…
$ endDate       <chr> "2023-10-25 15:37:02 +0100", "2023-10-25 17:44:45 +0100"…
$ value         <dbl> 27.4134, 52.6730, 20.0241, 38.7226, 22.8867, 39.6330, 58…

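We’ve already cast our values to numeric, which is great, although the coercion warning and the 17576 NAs suggest that some records hold non-numeric values, so I should check that the sleep data isn’t among them. Looks like we have some metadata (sourceName, sourceVersion) about how the data was collected. I should take a quick glance at these fields but I can probably ignore them.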
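
One way to find out which record types are affected by the numeric cast is to keep the original text next to the converted number and count the rows that failed to convert. The sketch below uses a made-up two-row tibble; in particular, the text value in the sleep row is an assumption about what such a record might contain, not something taken from my export. The names demo_raw, value_num and na_by_type are illustrative.

```r
library(dplyr)

# Made-up rows: one numeric value and one text value, before any casting.
demo_raw <- tibble(
  type  = c("HKQuantityTypeIdentifierHeartRate",
            "HKCategoryTypeIdentifierSleepAnalysis"),
  value = c("77", "HKCategoryValueSleepAnalysisInBed")  # second value is assumed
)

# Keep the original text alongside the numeric cast, then count
# which record types had a value that failed to convert.
na_by_type <- demo_raw |>
  mutate(value_num = suppressWarnings(as.numeric(value))) |>
  filter(is.na(value_num), !is.na(value)) |>
  count(type)
na_by_type
```

On the real data this would mean dropping the mutate(value = as.numeric(value)) step from the conversion snippet, so that the original strings are still available to inspect.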

The type field looks to say which kind of health data each row holds, so I’ll want to filter by that. unit is probably nothing I need to think about, but once I’ve found my sleep data I should check that I can safely ignore it.
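
A minimal sketch of that filter-then-check step, on a made-up tibble with the same column names as health_df. The NA units for the sleep rows are an assumption for the example, and demo_df, sleep_df and sleep_units are illustrative names.

```r
library(dplyr)

# Made-up rows using the same column names as health_df.
demo_df <- tibble(
  type = c("HKCategoryTypeIdentifierSleepAnalysis",
           "HKQuantityTypeIdentifierHeartRate",
           "HKCategoryTypeIdentifierSleepAnalysis"),
  unit = c(NA, "count/min", NA)
)

# Keep only the sleep records.
sleep_df <- demo_df |>
  filter(type == "HKCategoryTypeIdentifierSleepAnalysis")

# If this holds a single value (or only NA), unit can probably be ignored.
sleep_units <- sleep_df |> distinct(unit) |> pull(unit)
sleep_units
```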

Then we have creationDate, startDate and endDate. I will need to cast these to datetimes (and work out how to work with datetimes in R!) and then work out how they relate to the value field for the sleep data that I am interested in. I assume that the relationship between the datetime fields and value will differ depending on type, so there’s not going to be a one-size-fits-all approach here.
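
As a first stab at the datetime casting, base R's as.POSIXct() can parse strings in this layout, including the +0100 offset, if given an explicit format. This is a sketch on two hard-coded strings in the same layout as the export; the choice of UTC as the display time zone is an assumption, and the variable names are illustrative.

```r
# Two timestamps in the same layout as the export's date attributes.
start_chr <- "2023-03-20 13:59:55 +0100"
end_chr   <- "2023-03-20 14:00:06 +0100"

# %z reads the +0100 offset; tz = "UTC" sets the zone used for display.
date_format <- "%Y-%m-%d %H:%M:%S %z"
start_dt <- as.POSIXct(start_chr, format = date_format, tz = "UTC")
end_dt   <- as.POSIXct(end_chr, format = date_format, tz = "UTC")

# Elapsed time between the two timestamps, in seconds.
elapsed <- difftime(end_dt, start_dt, units = "secs")
elapsed
```

Applied to the dataframe, the same call could sit inside a mutate() over creationDate, startDate and endDate.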

In the next post I will start to dig into these fields!