Differences

This shows you the differences between two versions of the page.

--- project:hotornot:r [2016/04/09 13:00] – derrickoswald
+++ project:hotornot:r [2016/04/10 21:03] (current) – finish description derrickoswald
@@ Line 1: / Line 1: @@
 ===R Analysis===
-The sample data was analysed with the [[http://rstudio.com/|RStudio]] Version 0.99.489 by using the [[https://cran.r-project.org/web/packages/randomForest/index.html|randomForest]] package version 4.6-12 to determine which variables were most important.
+The sample data was analysed with [[http://rstudio.com/|RStudio]] Version 0.99.489 by using the [[https://cran.r-project.org/web/packages/randomForest/index.html|randomForest]] package version 4.6-12 to determine which variables were most important.
-The source code is show below. An explaination follows.
+The source code is shown below. An explanation follows.
 <code>
@@ Line 72: / Line 72: @@
 plot (forest, log="y")
 MDSplot(forest,solar)
+print (forest)
 imp = importance (forest)
 </code>
-==Steps==
+===Program Steps===
-The data set was loaded into R using the built in CSV reader.
+The data set and metadata were loaded into R using the built-in CSV reader.
-Since randomForest does not operate when there are NA values, only complete cases were retained.
+Since randomForest does not operate when there are NA values (missing data), only complete cases were retained.
 The type of water heating system included information about solar heaters, so it was removed from analysis.
-Columns were eliminated step by step based on the importance vector to establish the minimal set.
+Using trial and error, columns were eliminated step by step based on the importance vector to establish the minimal set.
+Nearly all columns were eliminated. The columns retained have the subset statement commented out, e.g. HouseholdMembers.
+The target vector (QSSolar column) was extracted and removed from the training set. It was converted from its encoded value (28 or 29) into a two-level factor (a factor is R's type for categorical values) in order to trigger the classification algorithm. Otherwise the randomForest algorithm attempts a regression analysis.
+The randomForest library was loaded and the analysis run. A couple of plots were generated to aid the analysis, and the error matrix was printed out.
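The program steps above can be sketched roughly as follows. This is a minimal illustration and not the page's actual program: the tiny data frame is made up, only the column names QSSolar, MainHeatingType and HouseholdMembers and the variable names `solar` and `forest` come from the page, and treating code 28 as "has solar" is an assumption, since the page does not say which code means what. The randomForest call is skipped if the package is not installed.

```r
# Hypothetical miniature stand-in for the survey data (values are invented).
survey <- data.frame(
  QSSolar          = c(28, 29, 29, 28, 29, NA, 29, 29),
  MainHeatingType  = c(3, 1, 1, 3, 2, 1, NA, 2),
  HouseholdMembers = c(4, 2, 3, 5, 1, 2, 3, 2)
)

# randomForest does not operate on NA values, so keep complete cases only.
survey <- survey[complete.cases(survey), ]

# Extract the target vector and remove it from the training set.
# Assumption: 28 encodes "has solar"; the page does not state the mapping.
solar <- factor(survey$QSSolar == 28, levels = c(FALSE, TRUE))
train <- subset(survey, select = -QSSolar)

# A factor response triggers classification; a numeric response would
# make randomForest run a regression instead.
if (requireNamespace("randomForest", quietly = TRUE)) {
  forest <- randomForest::randomForest(x = train, y = solar, importance = TRUE)
  print(forest)  # out-of-bag error estimate and confusion matrix
}
```

Passing the response as a factor is the only thing that switches the algorithm from regression to classification here, which is why the conversion happens before the call.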
+===Analysis===
+The error rate was consistently below 10%, but this was due in large part to the overwhelming majority of the data being non-solar customers. In other words, with so many customers not having solar power, always guessing FALSE is already right most of the time.
+The importance vector was extracted from the results, and sorting it in descending order of MeanDecreaseAccuracy showed the relative usefulness of each column. Columns that fell to the bottom of this list contributed little to the accuracy of the result.
+It was found that the MainHeatingType column alone was an accurate predictor of the QSSolar column.
+For the purposes of demonstration, additional columns corresponding to questions that could be asked in a survey were included in the machine learning demonstration program.
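Two points from the analysis can be illustrated in base R: how a lopsided class balance makes a low error rate easy to reach, and how the importance matrix can be ranked. All numbers below are invented for illustration. The 95/5 split, the importance values and the row name SomeOtherColumn are hypothetical; only the column names MeanDecreaseAccuracy and MeanDecreaseGini match what `importance()` returns for a classification forest.

```r
# Hypothetical class balance: 95 non-solar and 5 solar customers.
solar <- factor(c(rep(FALSE, 95), rep(TRUE, 5)), levels = c(FALSE, TRUE))

# Error rate of a "model" that always guesses the majority class.
baseline_error <- 1 - max(table(solar)) / length(solar)
print(baseline_error)

# Hypothetical matrix shaped like the result of importance(forest).
# matrix() fills column by column, so the first three values are
# MeanDecreaseAccuracy and the last three are MeanDecreaseGini.
imp <- matrix(
  c(2.1, 30.5, 0.4,
    1.3, 12.8, 0.2),
  ncol = 2,
  dimnames = list(
    c("HouseholdMembers", "MainHeatingType", "SomeOtherColumn"),
    c("MeanDecreaseAccuracy", "MeanDecreaseGini")
  )
)

# Sort descending on MeanDecreaseAccuracy; rows at the bottom add little.
ranked <- imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), , drop = FALSE]
print(rownames(ranked))
```

Comparing the forest's error rate against `baseline_error` is one simple way to judge whether the model does better than always guessing FALSE.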