Thetazero Pubs

SYNOPSIS

This is an analysis of the dataset in repdata_data_StormData.csv, downloaded from NOAA as per the assignment instructions. The goal is to determine which types of events are most harmful to population health and which types have the greatest economic consequences.


The names of event types are not standardized in this data set, and given more time I would try to normalize them (for instance, the different labels for flash floods, wind events, etc. would be combined). However, given the overwhelming dominance of the top damaging event types, I don't feel that the analysis is fatally compromised without doing this.

Overall, tornadoes cause the most fatalities, injuries, and property damage, and hail causes the most crop damage.

DATA PROCESSING

Load in the data file and look at the beginning of it to determine what variables will be relevant to the questions.

download.file("http://d396qusza40orc.cloudfront.net/repdata/data/StormData.csv.bz2", destfile = "repdata_data_StormData.csv.bz2")
unzipped <- bzfile("repdata_data_StormData.csv.bz2")
stormdata <- read.csv(unzipped)
head(stormdata)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

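It appears that for question 1, the FATALITIES and INJURIES variables will be most relevant, and for question 2, the PROPDMG and CROPDMG variables will be most relevant. The EVTYPE variable will be relevant for both questions. Note that PROPDMG and CROPDMG are stored alongside magnitude codes in PROPDMGEXP and CROPDMGEXP (for example, K in the rows above); the analysis below sums the raw PROPDMG and CROPDMG values without applying those codes, so its damage totals are sums of raw values rather than dollar amounts. While other variables dealing with locations and dates might be interesting to examine, they are not necessary for the current analysis.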

## subsetting data to relevant variables
stormdatasmall <- stormdata[, c("EVTYPE","FATALITIES","INJURIES","PROPDMG","CROPDMG")] 
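The head() output above shows PROPDMGEXP values such as K next to PROPDMG. The NOAA documentation describes these as magnitude codes (K for thousands, M for millions, B for billions). The following is a minimal sketch of how the codes could be turned into dollar amounts before summing. The helper name exp_multiplier, the PROPDMG_USD column, and the toy rows are hypothetical, and treating every other code as a multiplier of 1 is an assumption rather than documented behavior.

```r
# Hypothetical helper: map an exponent code to a numeric multiplier.
# Only K, M and B (either case) are handled; any other code, including
# blanks and NA, falls back to 1. That fallback is an assumption.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(as.character(code))])
  mult[is.na(mult)] <- 1
  mult
}

# Toy rows standing in for stormdata (values invented for illustration)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL"),
  PROPDMG    = c(25, 1.5, 2.5),
  PROPDMGEXP = c("K", "B", ""),
  stringsAsFactors = FALSE
)

# Dollar value = raw value times the multiplier for its code
toy$PROPDMG_USD <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
toy
```

Because a single B-coded record outweighs many K-coded ones, applying the multipliers may change which event types rank highest, so it would be worth re-running the aggregation on the converted column.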

Calculate the sum of each damage measure for each type of event.


stormsums <- aggregate(. ~ EVTYPE, data = stormdatasmall, FUN = sum)
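To make the formula in the aggregate() call concrete, here is a small sketch on invented data. The toy and toysums names and their values are hypothetical. The "." on the left-hand side stands for every column other than EVTYPE, and each of those columns is summed within each event type. Note also that the formula method of aggregate() drops rows containing NA in any column by default.

```r
# Toy data standing in for stormdatasmall (values invented for illustration)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "TORNADO", "HAIL"),
  FATALITIES = c(2, 0, 3, 0),
  INJURIES   = c(10, 1, 5, 0)
)

# One output row per EVTYPE, with every other column summed within that type
toysums <- aggregate(. ~ EVTYPE, data = toy, FUN = sum)
toysums
```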

Break out the event types with the highest values for each of the relevant variables. NOTE: to determine what qualified as "many" or "high" in the code below, I sorted by the relevant variable and treated the top 10-12 values as "many" or "high". This cutoff was fairly arbitrary and mainly intended to make the plots more legible, but it works because the top damaging event types are so much higher than everything else.

fatals <- stormsums[which(stormsums$FATALITIES > 0),] ## Get the observations where there were actually fatalities
bigfatals <- fatals[which(fatals$FATALITIES > 200),] ## Select the observations where there were many fatalities

injuries <- stormsums[which(stormsums$INJURIES > 0),] ## Get the observations where there were actually injuries
biginjuries <- injuries[which(injuries$INJURIES > 1200),] ## Select the observations where there were many injuries

property <- stormsums[which(stormsums$PROPDMG > 0),] ## Get the observations where there was property damage
bigproperty <- property[which(property$PROPDMG > 100000),] ## Select the observations where property damage was high
bigproperty$PROPDMG <- bigproperty$PROPDMG/1000 ## Divide by 1000 for clearer plotting (raw PROPDMG sums; PROPDMGEXP codes are not applied)

crop <- stormsums[which(stormsums$CROPDMG > 0),] ## Get the observations where there was crop damage
bigcrops <- crop[which(crop$CROPDMG > 10000),] ## Select the observations where crop damage was high
bigcrops$CROPDMG <- bigcrops$CROPDMG/1000 ## Divide by 1000 for clearer plotting (raw CROPDMG sums; CROPDMGEXP codes are not applied)
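The cutoffs above (200, 1200, 100000, 10000) were chosen by eye to leave roughly the top 10-12 event types. A sketch of an alternative that selects the top n rows directly, so no threshold has to be picked by hand, is shown below. The helper name top_events and the toy values are hypothetical.

```r
# Hypothetical helper: return the n rows of df with the largest values in `col`
top_events <- function(df, col, n = 10) {
  ranked <- df[order(df[[col]], decreasing = TRUE), ]
  head(ranked, n)
}

# Toy data standing in for stormsums (values invented for illustration)
toysums <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "FLOOD", "HAIL"),
  FATALITIES = c(50, 20, 10, 0),
  stringsAsFactors = FALSE
)

top_events(toysums, "FATALITIES", n = 2)
```

On the real data this might be called as top_events(stormsums, "FATALITIES", n = 10), which also returns the rows already sorted for plotting.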

RESULTS

Plot the top event types for each relevant variable.

library(ggplot2)
## qplot() is deprecated in recent ggplot2 releases; ggplot() + geom_col() draws the bars directly.
## fill colors the bars themselves (color would only set their outline).
p1 <- ggplot(bigfatals, aes(x = EVTYPE, y = FATALITIES)) +
  geom_col(fill = "red") +
  labs(title = "High Fatality Events", x = "Type of Event", y = "Number of Fatalities") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

p2 <- ggplot(biginjuries, aes(x = EVTYPE, y = INJURIES)) +
  geom_col(fill = "blue") +
  labs(title = "High Injury Events", x = "Type of Event", y = "Number of Injuries") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

## PROPDMG and CROPDMG are raw sums divided by 1000; the exponent codes are not applied
p3 <- ggplot(bigproperty, aes(x = EVTYPE, y = PROPDMG)) +
  geom_col(fill = "brown") +
  labs(title = "High Property Damage Events", x = "Type of Event", y = "Summed PROPDMG (thousands)") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

p4 <- ggplot(bigcrops, aes(x = EVTYPE, y = CROPDMG)) +
  geom_col(fill = "green") +
  labs(title = "High Crop Damage Events", x = "Type of Event", y = "Summed CROPDMG (thousands)") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

library(grid)
vp <- viewport(layout = grid.layout(1,2))
pushViewport(vp)
print(p1, vp = viewport(layout.pos.row = 1, layout.pos.col=1))
print(p2, vp = viewport(layout.pos.row = 1, layout.pos.col=2))

grid.newpage()
vp2 <- viewport(layout = grid.layout(1,2))
pushViewport(vp2)
print(p3, vp = viewport(layout.pos.row = 1, layout.pos.col=1))
print(p4, vp = viewport(layout.pos.row = 1, layout.pos.col=2))

Clearly, tornadoes cause by far the most fatalities and the most injuries among the event types reported, as well as the most property damage. Hail is the most damaging type of event to crops. As stated in the synopsis above, there are some issues with the way event types are labeled, probably due to idiosyncrasies of the person entering the data, or to changes in nomenclature over time. Cleaning up this data (if time allowed) would make the detailed picture more accurate, but the general results would be similar. Even if, for example, the various types of flood (FLASHFLOOD, FLOOD, etc.) were combined, they wouldn't unseat tornadoes and hail as the champions of destruction.
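The label clean-up mentioned above could start from something like the following sketch. The helper name normalize_evtype, the two patterns, and the sample labels are hypothetical and only illustrate the idea; a real clean-up would need a longer, reviewed mapping.

```r
# Hypothetical helper: collapse a few label variants into one canonical name.
# The patterns are illustrative assumptions, not an official NOAA mapping.
normalize_evtype <- function(x) {
  x <- toupper(trimws(as.character(x)))  # unify case, drop outer whitespace
  x <- gsub("\\s+", " ", x)              # collapse repeated whitespace
  x[grepl("FLOOD", x)] <- "FLOOD"        # FLASH FLOOD, FLOODING, ... become FLOOD
  x[grepl("WIND|WND", x)] <- "WIND"      # TSTM WIND, HIGH WINDS, ... become WIND
  x
}

# Invented sample labels in the style of EVTYPE
labels <- c("Flash Flood", "FLOOD", " FLASH FLOODING", "TSTM WIND", "High Winds", "HAIL")
table(normalize_evtype(labels))
```

Applying such a function to EVTYPE before the aggregate() step and re-running the sums would be one way to check whether combined categories change the ranking.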

Copyright © 2016 thetazero.com All Rights Reserved.