R Analysis
The sample data was analysed in RStudio version 0.99.489 using the randomForest package, version 4.6-12, to determine which variables were most important for predicting the solar variable (derived from QSSolar).
The source code is shown below. An explanation follows.
# load the raw data
# note: randomForest does not allow NA
# na.strings = c("9999", "notset")
raw = read.csv (file="RawData.csv", header = TRUE, sep = ",", quote = "\"", dec = ".",
                fileEncoding = "UTF-8", encoding="UTF-8", stringsAsFactors = TRUE)
meta = read.csv (file="MetaData.csv", header = TRUE, sep = ",", quote = "\"", dec = ".",
                 fileEncoding = "UTF-8", encoding="UTF-8", stringsAsFactors = FALSE)

# eliminate rows where data is not available
raw = raw[complete.cases(raw),]

# eliminate columns on type of water heating
raw = subset (raw, select = -WaterHeatingType)

# eliminate rows where water heater is solar
# raw = raw[as.character(raw$WaterHeatingType) != "solar",]

# eliminate columns where importance MeanDecreaseAccuracy is small
raw = subset (raw, select = -QSSmart)
raw = subset (raw, select = -HasSmartMeter)
raw = subset (raw, select = -QSLamps)
raw = subset (raw, select = -QSGender)
raw = subset (raw, select = -SentInvitesAccepted)
raw = subset (raw, select = -InvitesAccepted)
raw = subset (raw, select = -QSAge)
raw = subset (raw, select = -NumMeters)
raw = subset (raw, select = -SavingTipStatusCount_1)
raw = subset (raw, select = -SavingTipStatusCount_3)
raw = subset (raw, select = -InvitesSent)
raw = subset (raw, select = -QSEcoEnergy)
raw = subset (raw, select = -QuizAnswered)
raw = subset (raw, select = -DaysLoggedIn)
raw = subset (raw, select = -LotteriesParticipated)
raw = subset (raw, select = -NumAppliancesEntered)
raw = subset (raw, select = -HouseholdType)
raw = subset (raw, select = -SavingTipStatusCount_2)
# raw = subset (raw, select = -HouseholdMembers)
# raw = subset (raw, select = -HadAudit)
raw = subset (raw, select = -MobilePhoneEmpty)
raw = subset (raw, select = -NumDevices)
# raw = subset (raw, select = -QSKnowMeter)
raw = subset (raw, select = -DaysReadingEntered)
raw = subset (raw, select = -ReadingCount)
raw = subset (raw, select = -WeeksReadingEntered)
raw = subset (raw, select = -QSInterest)
raw = subset (raw, select = -QuizAnsweredCorrectly)
raw = subset (raw, select = -NumWeeksMember)
raw = subset (raw, select = -NumDaysMember)
raw = subset (raw, select = -NumVisit)
raw = subset (raw, select = -WeeksLoggedIn)
raw = subset (raw, select = -Points)
# raw = subset (raw, select = -LivingArea)
# raw = subset (raw, select = -MainHeatingType)

# split the variable of interest out of the data.frame, change to factors for classification
solar = factor (ifelse (raw["QSSolar"] == 28, TRUE, FALSE))

# make a new data.frame without the solar variable
data = subset (raw, select = -QSSolar)

# load the analysis package
library (randomForest)

# run the analysis
forest = randomForest (x = data, y = solar, importance = TRUE, proximity = TRUE, do.trace = TRUE)

# show some results
plot (forest, log="y")
MDSplot (forest, solar)
imp = importance (forest)
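The script ends by storing the importance matrix in imp. As a minimal sketch of how that matrix might be used to choose the next column to drop, the example below ranks predictors by MeanDecreaseAccuracy. It runs on R's built-in iris data rather than the project data, which is an assumption made purely for illustration; the names predictors, response, ranked and weakest are hypothetical and do not appear in the original script.

```r
library(randomForest)

set.seed(42)  # make the illustrative run repeatable

# Stand-in data (assumption): iris replaces RawData.csv, which is not reproduced here
predictors <- iris[, 1:4]
response <- iris$Species

forest <- randomForest(x = predictors, y = response, importance = TRUE)

# importance() returns a matrix with one row per predictor; for classification
# with importance = TRUE it includes a MeanDecreaseAccuracy column
imp <- importance(forest)

# Sort ascending so the least important predictor comes first
ranked <- imp[order(imp[, "MeanDecreaseAccuracy"]), , drop = FALSE]
weakest <- rownames(ranked)[1]

print(ranked[, "MeanDecreaseAccuracy"])
cat("Candidate for removal:", weakest, "\n")
```

The randomForest package also provides varImpPlot (forest) for a graphical view of the same measures. Because the forest is built from random samples, the ranking of closely matched variables can change between runs.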
Steps
The data set was loaded into R using the built-in CSV reader (read.csv).
Because randomForest does not accept NA values, only complete cases (rows without missing values) were retained.
The water heating type (WaterHeatingType) includes information about solar heaters, so it was removed from the analysis.
Columns were eliminated step by step, starting with those whose importance (MeanDecreaseAccuracy) was small, to establish a minimal set of variables.
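The loading, filtering and column-removal steps above can be sketched on a small inline table. This is only an illustration under stated assumptions: the rows are invented, and the sketch assumes that the values 9999 and notset named in the script's comment are meant to mark missing data. The column names come from the script; csv_text, complete, drop_cols and reduced are hypothetical names.

```r
# Hypothetical inline data standing in for RawData.csv (values are invented)
csv_text <- "QSSolar,LivingArea,HouseholdMembers,QSGender
28,120,4,notset
12,9999,2,1
28,85,3,2
12,60,1,1"

# Treat the sentinel values as NA while reading (assumption, see lead-in)
raw <- read.csv(text = csv_text, header = TRUE,
                na.strings = c("9999", "notset"),
                stringsAsFactors = TRUE)

# Keep only rows without any missing value
complete <- raw[complete.cases(raw), ]

# Drop several columns by name in one step instead of one subset() call each
drop_cols <- c("QSGender")
reduced <- complete[, !(names(complete) %in% drop_cols), drop = FALSE]

# Derive the classification response as in the script: QSSolar == 28 means solar
solar <- factor(reduced$QSSolar == 28)

print(reduced)
print(solar)
```

In this toy table, dropping QSGender before filtering would keep the first row, since its only missing value is in that column. Whether the order of these two steps matters for the real data is not stated in the source.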
project/hotornot/r.1460199653.txt.gz · Last modified: 2016/04/09 13:00 by derrickoswald