Differences

This shows you the differences between two versions of the page.

--- project:hotornot:r [2016/04/09 12:55] – created derrickoswald
+++ project:hotornot:r [2016/04/10 21:03] (current) – finish description derrickoswald
@@ Line 1: / Line 1: @@
 ===R Analysis===
-The sample data was analysed with the [[http://rstudio.com/|RStudio]] Version 0.99.489 by using the [[https://cran.r-project.org/web/packages/randomForest/index.html|randomForest]] package version 4.6-12 to determine which variables were most important.
+The sample data was analysed with [[http://rstudio.com/|RStudio]] Version 0.99.489 by using the [[https://cran.r-project.org/web/packages/randomForest/index.html|randomForest]] package version 4.6-12 to determine which variables were most important.
-The source code is show below. An explaination follows.
+The source code is shown below. An explanation follows.
 <code>
@@ Line 72: / Line 72: @@
 plot (forest, log="y")
 MDSplot(forest,solar)
+print (forest)
 imp = importance (forest)
 </code>
-==Steps==
+===Program Steps===
-The data set was loaded into R using the built in CSV reader.
+The data set and metadata were loaded into R using the built-in CSV reader.
-Since randomForest does not operate when there are NA values, only complete cases were retained.
+Since randomForest does not operate when there are NA values (missing data), only complete cases were retained.
+The water heating system type column included information about solar heaters, which would give away the answer, so it was removed from the analysis.
+Using trial and error, columns were eliminated step by step based on the importance vector to establish the minimal set.
+Nearly all columns were eliminated. The columns that were retained are those whose subset statement is commented out in the code, e.g. HouseholdMembers.
+The target vector (QSSolar column) was extracted and removed from the training set. It was converted from its encoded value (28 or 29) into a two-level factor (a factor is a categorical symbol in an R data set) in order to trigger the classification algorithm. Otherwise the randomForest algorithm attempts to do a regression analysis.
+The randomForest library was loaded and the analysis run. A couple of plots were generated to aid the analysis, and the error matrix was printed out.
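The preparation steps above can be sketched in base R as follows. This is a minimal illustration, not the page's actual program: the toy data frame and its values are invented, and the choice of 28 as the "has solar" code is an assumption, since the description does not say which code is which.

```r
# Toy stand-in for the survey data; every value here is invented.
survey <- data.frame(
  QSSolar          = c(28, 29, 29, NA, 29, 28),
  MainHeatingType  = c(3, 1, 1, 2, NA, 3),
  HouseholdMembers = c(4, 2, 1, 3, 2, 5)
)

# randomForest does not operate on NA values, so keep only complete rows.
complete <- survey[complete.cases(survey), ]

# Extract the target and convert it to a two-level factor.
# ASSUMPTION: code 28 means "has solar"; the source does not state the mapping.
target <- factor(complete$QSSolar == 28, levels = c(FALSE, TRUE))

# Remove the target column from the predictors.
predictors <- subset(complete, select = -QSSolar)

# A factor target triggers classification; a numeric one would trigger regression.
print(is.factor(target))
print(nrow(predictors))
```

With the target held in a factor, a call such as randomForest(predictors, target) would run as a classifier rather than a regression.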
+===Analysis===
+The error rate was consistently below 10%, but this was due in large part to the overwhelming majority of the data being non-solar customers. In other words, with so many customers not having solar power, always guessing FALSE already gives a low error rate.
+The importance vector was extracted from the results, and by sorting it in descending order on MeanDecreaseAccuracy the relative usefulness of each column could be determined. Columns that fell to the bottom of this list contributed little to the accuracy of the result.
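The ranking step can be sketched as below. The matrix is a hand-made stand-in shaped like the result of importance() and its numbers are invented, as is the column name SomeOtherColumn. Note that randomForest reports MeanDecreaseAccuracy only when the forest is built with importance=TRUE, which is assumed here.

```r
# Invented importance values; not results from the actual analysis.
imp <- matrix(
  c(3.1, 0.4, 41.2,    # MeanDecreaseAccuracy
    6.3, 0.9, 55.0),   # MeanDecreaseGini
  ncol = 2,
  dimnames = list(
    c("HouseholdMembers", "SomeOtherColumn", "MainHeatingType"),
    c("MeanDecreaseAccuracy", "MeanDecreaseGini")
  )
)

# Sort rows so the most useful columns come first.
ranked <- imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), , drop = FALSE]

# Columns at the bottom of this list are candidates for elimination.
print(rownames(ranked))
```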
+It was found that just the MainHeatingType column was an accurate predictor of the QSSolar column.
+For the purposes of demonstration, additional columns corresponding to questions that could be asked in a survey were included in the machine learning demonstration program.
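The effect of class imbalance on the error rate can be illustrated with a baseline that ignores every column and always answers FALSE. The 93/7 split below is invented for illustration; the description only says that the large majority of customers are non-solar.

```r
# Invented labels: 93 non-solar and 7 solar customers.
actual <- factor(c(rep(FALSE, 93), rep(TRUE, 7)), levels = c(FALSE, TRUE))

# A trivial "classifier" that always predicts FALSE.
always_false <- factor(rep(FALSE, length(actual)), levels = c(FALSE, TRUE))

# Error matrix: rows are the true class, columns the predicted class.
confusion <- table(actual = actual, predicted = always_false)

# Overall error rate of the trivial baseline.
baseline_error <- mean(always_false != actual)

# Share of solar customers the baseline actually finds.
solar_recall <- confusion["TRUE", "TRUE"] / sum(confusion["TRUE", ])

print(confusion)
print(baseline_error)
print(solar_recall)
```

In this sketch the baseline has a low overall error while finding no solar customers at all, so the per-class rows of the error matrix are more informative than the overall error rate alone.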