- Make jumbo data frame (with the composite variables unlisted) for the test dataset
- Fix the column names of the tr.Unlisted and te.Unlisted datasets
- Accidentally omitted a column from the test dataset (MassWeightedSD); get that back!
- When finished with NAs, make sure the classes are all correct (factor, integer, numeric, etc.)
- Calculate the true value for Kdp instead of the incorrect all-0 current Kdp
- For every variable that has missing values, create a separate column that has factors for each type of missing value in case the type of missing value is predictive. Then replace the current codes (-99901.0, -99900.0, etc.) with R's NA so that R doesn't treat -99901.0 as the number it looks like
- Explore: each day, explore a different variable; start off with TimeToEnd, then DistanceToRadar, and so on. Keep notes and observations in Notes.md
- Create new variable noting the number of hours for each Id using TimeToEnd. If TimeToEnd for a particular Id is 59 20 58 40 30, then numHours is 2
- Create new variable to note the first measurement of each Id
- Create new variable to note not just the first measurement of each Id, but the nth measurement. So if Id 1 has 10 measurements, then this new variable would be 1:10.
- Create function to create PDF and/or CDF from 0 to 69 mm
- Create function to "grade" a set of guesses (CRPS)
- Create the ability to condense tr.Unlisted to tr (so each Id has only one row)
- Create new dataset that's collapsed like tr and te where there is one measurement per Id but instead of RR1, HydrometeorType, and Zdr, I have RR1.mean, RR1.median, RR1.sd, HydrometeorType.mode (since it's categorical), Zdr.mean, Zdr.median, and Zdr.sd. And so on.
- Write a C++ program to read in the data and calculate features per Id; too slow in R.
- Implement more functions in C++ program; especially mode() for the variable HydrometeorType since that is a factor, not numerical.
- When modeling the 69th mm as binomial, instead of mm > 68.5 and < 69.5, it should be strictly greater than 68.5.
- Extract variables in R that are relevant, e.g. Extract.range doesn't make sense
- Someone in forum said to say everything after 10 mm is always 0; seems wrong, but maybe try it?
- Either remove ar2 from Parsimony.R or fix the cross-validation option. Currently it returns the error rate from cv.glm(), which needs to be minimized, while the function ar2 returns R^2, which is to be maximized.
- Go through the predictive variables of train() (*.mean, *.range, *.diffMean, hydroMode) one by one and see which is predictive of Expected (using cv.glm on the training dataset). Maybe do so only for 1 through 10 mm and mark the rest as 1.
- Use these variables, as long as there isn't a lot of multicollinearity, on the testing dataset
- Try using kNN to find similar observations and take a weighted average of their Expected values
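
A rough sketch of two of the per-Id features mentioned above, written for the planned C++ program. The function names `numHours` and `modeOf` are hypothetical, and the rule used for numHours (TimeToEnd counts down within an hour, so any increase starts a new hour) is an assumption inferred from the 59 20 58 40 30 example, not something confirmed by the data documentation. Category values are assumed to arrive as integer codes.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper: number of hours covered by one Id's TimeToEnd readings.
// Assumption: TimeToEnd counts down within an hour, so any increase from one
// reading to the next marks the start of a new hour.
int numHours(const std::vector<double>& timeToEnd) {
    if (timeToEnd.empty()) return 0;
    int hours = 1;
    for (std::size_t i = 1; i < timeToEnd.size(); ++i) {
        if (timeToEnd[i] > timeToEnd[i - 1]) ++hours;
    }
    return hours;
}

// Hypothetical helper: most frequent category code among one Id's readings.
// Ties go to the smallest code because std::map iterates in key order and
// only a strictly larger count replaces the current best.
int modeOf(const std::vector<int>& codes) {
    assert(!codes.empty());
    std::map<int, int> counts;
    for (int c : codes) ++counts[c];
    int best = codes.front();
    int bestCount = 0;
    for (const auto& kv : counts) {
        if (kv.second > bestCount) {
            best = kv.first;
            bestCount = kv.second;
        }
    }
    return best;
}
```

Calling `numHours` on the readings 59 20 58 40 30 gives 2, matching the example in the list. How ties in `modeOf` should really be broken is an open choice; smallest code is just the simplest option.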
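
A minimal sketch of the CDF and grading functions from the list, assuming CRPS is defined as the mean squared difference between the predicted CDF and a step function at the actual value, averaged over the 70 thresholds 0 through 69 mm and over all Ids. That definition should be checked against the competition's evaluation page before relying on it. The names `stepCdf`, `crps`, and `kThresholds` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of thresholds: 0 mm through 69 mm inclusive.
const std::size_t kThresholds = 70;

// Hypothetical helper: CDF that puts all probability mass at guessMm,
// so P(y <= z) is 0 for z < guessMm and 1 otherwise.
std::vector<double> stepCdf(double guessMm) {
    std::vector<double> cdf(kThresholds, 0.0);
    for (std::size_t z = 0; z < kThresholds; ++z) {
        if (static_cast<double>(z) >= guessMm) cdf[z] = 1.0;
    }
    return cdf;
}

// Grade a set of guesses. cdfs[n][z] is the predicted P(y <= z) for Id n,
// and actualMm[n] is the observed amount for that Id.
// Assumed definition: mean over Ids and thresholds of (cdf - step)^2.
double crps(const std::vector<std::vector<double>>& cdfs,
            const std::vector<double>& actualMm) {
    assert(!cdfs.empty());
    assert(cdfs.size() == actualMm.size());
    double total = 0.0;
    for (std::size_t n = 0; n < cdfs.size(); ++n) {
        assert(cdfs[n].size() == kThresholds);
        for (std::size_t z = 0; z < kThresholds; ++z) {
            // Step function: 1 once the threshold reaches the actual value.
            double step = (static_cast<double>(z) >= actualMm[n]) ? 1.0 : 0.0;
            double diff = cdfs[n][z] - step;
            total += diff * diff;
        }
    }
    return total / static_cast<double>(kThresholds * cdfs.size());
}
```

Under this definition a perfect step guess scores 0, and guessing 0 mm when the actual value is 1 mm misses at exactly one threshold, for a score of 1/70. Lower is better.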