copyright ©2019, tjmenzie@ncsu.edu
Here's the new work (in red)
Implement new classes `Abcd` and `Sym`.
- Don't worry about `SymAny` or `SymLike` (for this week, anyway).

Also, modify `Tbl` so columns can be `Num`s or `Sym`s.
`Abcd` is a class to report test results for a classifier. It knows how to report accuracy, false alarms, precision,
recall, the f-measure, and the g-measure.
- For notes on these measures, see [eval101](eval101.md).
- For sample implementation, see Abcd.
- The code is a little tricky since false alarm, precision, recall, f, g are only defined for binary classifiers. If dealing
with more than two classes, you have to divide those into N binary problems:
- classA and notClassA,
- classB and notClassB,
- etc
- Also, there's one interesting catch. The first time we see a new class, that means that in all times before now
we did not see that class. So the `a` counter has to be updated to the count up to now (see Abcd for details).
To test your code, try something like this:
function _abcd(f,i,j) {
  Abcd(i)
  for(j=1; j<=6; j++) Abcd1(i, "yes", "yes")
  for(j=1; j<=2; j++) Abcd1(i, "no", "no")
  for(j=1; j<=5; j++) Abcd1(i, "maybe", "maybe")
  Abcd1(i, "maybe", "no")
  AbcdReport(i)
}
Should print something like this:
db | rx | num | a | b | c | d | acc | pre | pd | pf | f | g | class
---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |-----
data | rx | 14 | 11 | | 1 | 2 | 0.93 | 0.67 | 1.00 | 0.08 | 0.80 | 0.96 | no
data | rx | 14 | 8 | | | 6 | 0.93 | 1.00 | 1.00 | 0.00 | 1.00 | 1.00 | yes
data | rx | 14 | 8 | 1 | | 5 | 0.93 | 1.00 | 0.83 | 0.00 | 0.91 | 0.91 | maybe
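
The bookkeeping described above (one binary problem per class, plus the catch-up of the `a` counter when a class first appears) can be sketched as follows. This is a minimal Python sketch, not the course's `Abcd` code: the class layout, the `add` and `report` method names, and the `demo` helper are all assumptions made for illustration. It also assumes that `acc` is overall accuracy (all correct predictions over all predictions), which is consistent with the table above showing 0.93 for every class, and that the first argument is the actual class and the second the predicted one.

```python
from collections import defaultdict


class Abcd:
    """Hypothetical sketch of the a, b, c, d counters (not the course's code)."""

    def __init__(self):
        self.known = set()          # class labels seen so far
        self.a = defaultdict(int)   # true negatives, per class
        self.b = defaultdict(int)   # false negatives (misses)
        self.c = defaultdict(int)   # false positives (false alarms)
        self.d = defaultdict(int)   # true positives (hits)
        self.yes = self.no = 0      # overall right / wrong predictions

    def add(self, want, got):
        """Record one result: `want` is the actual class, `got` the prediction."""
        for x in (want, got):
            if x not in self.known:
                self.known.add(x)
                # The catch: every earlier result was a true negative for x.
                self.a[x] = self.yes + self.no
        if want == got:
            self.yes += 1
        else:
            self.no += 1
        # Treat each known class x as its own binary problem: x versus not-x.
        for x in self.known:
            if want == x:
                if got == x:
                    self.d[x] += 1
                else:
                    self.b[x] += 1
            elif got == x:
                self.c[x] += 1
            else:
                self.a[x] += 1

    def report(self):
        """Return {class: metrics}, with each ratio rounded to two decimals."""
        def div(x, y):
            return x / y if y else 0.0

        out = {}
        for x in sorted(self.known):
            a, b, c, d = self.a[x], self.b[x], self.c[x], self.d[x]
            pd = div(d, b + d)     # recall
            pf = div(c, a + c)     # false alarm rate
            pre = div(d, c + d)    # precision
            out[x] = dict(
                a=a, b=b, c=c, d=d,
                acc=round(div(self.yes, self.yes + self.no), 2),
                pre=round(pre, 2), pd=round(pd, 2), pf=round(pf, 2),
                f=round(div(2 * pre * pd, pre + pd), 2),
                g=round(div(2 * pd * (1 - pf), pd + 1 - pf), 2),
            )
        return out


def demo():
    """Replay the _abcd test as (actual, predicted, repeats) triples."""
    abcd = Abcd()
    for want, got, times in [("yes", "yes", 6), ("no", "no", 2),
                             ("maybe", "maybe", 5), ("maybe", "no", 1)]:
        for _ in range(times):
            abcd.add(want, got)
    return abcd.report()
```

Calling `demo()["no"]` should give `a=11, c=1, d=2` and the same rounded ratios as the `no` row in the table.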
`Sym` is like `Num`, but it keeps a count `cnt` of the symbols seen in a column.
- The most frequently seen symbol is the `mode`.
- `Sym` can report the entropy of the symbols in a column. Given $n$ entries in a column with symbols occurring at frequency $f_1, f_2, \ldots$ then the entropy $e$ of a column is:
  - $p_i = f_i / n$
  - $e = -\sum_i p_i \log_2(p_i)$
- For code to calculate this, see `SymEnt` in sym.fun.
- Entropy is like standard deviation, but for symbols:
  - the lower the standard deviation, the less variety in the numbers
  - the lower the entropy, the less variety in the symbols
To test your code:
- If you throw the letters aaaabbc into your `Sym` then the entropy should be around 1.38.
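
The entropy check above can be reproduced with a small Python sketch. The field names `n`, `cnt`, `mode` and `most` follow the text and the sample output on this page; the `add` and `ent` method names are assumptions for illustration and are not taken from `sym.fun`.

```python
import math
from collections import Counter


class Sym:
    """Hypothetical sketch of a summary for a column of symbols."""

    def __init__(self):
        self.n = 0            # how many symbols have been added
        self.cnt = Counter()  # symbol -> frequency
        self.mode = None      # most frequently seen symbol so far
        self.most = 0         # frequency of the mode

    def add(self, x):
        """Count one symbol and update the mode if x now leads."""
        self.n += 1
        self.cnt[x] += 1
        if self.cnt[x] > self.most:
            self.most, self.mode = self.cnt[x], x
        return x

    def ent(self):
        """Entropy e = -sum(p * log2(p)), where p = f / n for each symbol."""
        return -sum(f / self.n * math.log2(f / self.n)
                    for f in self.cnt.values())


s = Sym()
for letter in "aaaabbc":
    s.add(letter)
print(s.mode, round(s.ent(), 2))  # a 1.38
```

Here the frequencies are 4, 2 and 1 out of 7, so the three terms are roughly 0.46, 0.52 and 0.40, which sum to about 1.38.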
Last week you built a table class that only knew about numeric columns.
Now you need to code up a little language for row1 of your data to identify column types.
That language uses the following symbols (in regular expression format):
SKIPCOL = "\\?"
NUMCOL = "[<>\\$]"
GOALCOL = "[<>!]"
LESS = "<"
That is:
- Any column name containing "?" will be skipped over.
- Any column name containing "<" or ">" or "$" will be called a `Num`.
  - And all other columns will be `Sym`s.
- Any column name containing "<" or ">" or "!" will be called a goal.
  - "<" and ">" are "less" or "more" goals; i.e. things we want to minimize or maximize.
  - "!" marks `Sym` columns that are classes that we want to predict.
  - Anything that does not use "<>!" is an `xs` column.
- All columns will be given a weight of `w=1`.
  - Unless that column name includes "<", in which case `w=-1`.
For example:
outlook, ?$temp, <humid, wind, !play
rainy, 68, 80, FALSE, yes # comments
sunny, 85, 85, FALSE, no
sunny, 80, 90, TRUE, no
overcast, 83, 86, FALSE, yes
rainy, 70, 96, FALSE, yes
rainy, 65, 70, TRUE, no
overcast, 64, 65, TRUE, yes
sunny, 72, 95, FALSE, no
sunny, 69, 70, FALSE, yes
rainy, 75, 80, FALSE, yes
sunny, 75, 70, TRUE, yes
overcast, 72, 90, TRUE, yes
overcast, 81, 75, FALSE, yes
rainy, 71, 91, TRUE, no
- Goals are columns 3,5.
- Xs are columns 1,4.
- Column 1 and Column 4 and Column 5 are `Sym`s.
- Column 3 is a `Num`.
- Column 2 will be ignored.
- Column 3 is a goal to be minimized and will have a weight of -1.
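
The column rules can be sketched in Python as below, using the four patterns from the text and the header row of the weather example. The `classify` function, the dictionary layout, and the choice to store only non-default weights in `w` are assumptions for illustration (the key names loosely follow the `my.goals, my.xs, my.nums, my.syms` lists this page asks for).

```python
import re

# Patterns copied from the text; everything else here is an assumption.
SKIPCOL = r"\?"
NUMCOL = r"[<>\$]"
GOALCOL = r"[<>!]"
LESS = r"<"


def classify(names):
    """Group 1-based column indexes by role, given the column names."""
    my = {"goals": [], "xs": [], "nums": [], "syms": [],
          "xnums": [], "xsyms": [], "w": {}, "class": None}
    for col, name in enumerate(names, start=1):
        if re.search(SKIPCOL, name):
            continue                        # "?" columns are ignored entirely
        is_num = bool(re.search(NUMCOL, name))
        is_goal = bool(re.search(GOALCOL, name))
        my["nums" if is_num else "syms"].append(col)
        if is_goal:
            my["goals"].append(col)
        else:
            my["xs"].append(col)
            my["xnums" if is_num else "xsyms"].append(col)
        if "!" in name:
            my["class"] = col               # the column we want to predict
        if re.search(LESS, name):
            my["w"][col] = -1               # minimize; other weights default to 1
    return my


names = [s.strip() for s in "outlook, ?$temp, <humid, wind, !play".split(",")]
my = classify(names)
print(my["goals"], my["xs"], my["nums"], my["syms"], my["w"])
# [3, 5] [1, 4] [3] [1, 4, 5] {3: -1}
```

Note that the skip test runs first, so `?$temp` is dropped even though its name also matches the `Num` pattern.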
To test your code:
- Read the weather data shown above.
- For the first line, create a `cols` list of `Num`s and `Sym`s.
- Also create lists `my.goals, my.xs, my.nums, my.syms` etc storing indexes of the columns.
- Print `cols` and `my`. In a language where arrays are indexed 1...max, that looks like the following.
  - Your results should be nearly the same OR you can find a bug in my code that printed the following.
t.cols
| 1
| | add: Sym1
| | cnt
| | | overcast: 4
| | | rainy: 5
| | | sunny: 5
| | col: 1
| | mode: sunny
| | most: 5
| | n: 14
| | oid: 2
| | txt: outlook
| 3
| | add: Num1
| | col: 3
| | hi: 96
| | lo: 65
| | m2: 1375.21
| | mu: 81.6429
| | n: 14
| | oid: 3
| | sd: 10.2852
| | txt: <humid
| 4
| | add: Sym1
| | cnt
| | | FALSE: 8
| | | TRUE: 6
| | col: 4
| | mode: FALSE
| | most: 8
| | n: 14
| | oid: 4
| | txt: wind
| 5
| | add: Sym1
| | cnt
| | | no: 5
| | | yes: 9
| | col: 5
| | mode: yes
| | most: 9
| | n: 14
| | oid: 5
| | txt: !play
t.my
| class: 5
| goals
| | 3
| | 5
| nums
| | 3
| syms
| | 1
| | 4
| | 5
| w
| | 3: -1
| xnums
| xs
| | 1
| | 4
| xsyms
| | 1
| | 4
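
To cross-check the `t.cols` numbers above, here is a minimal Python sketch that reads the weather data and builds one summary per kept column. It is not the course's `Tbl` code: the `read_table` function and the class layouts are assumptions for illustration. The `m2`, `mu` and `sd` fields in the printout suggest a running (Welford-style) update for `Num`, so that is what is assumed here.

```python
import math
import re
from collections import Counter

WEATHER = """\
outlook, ?$temp, <humid, wind, !play
rainy, 68, 80, FALSE, yes # comments
sunny, 85, 85, FALSE, no
sunny, 80, 90, TRUE, no
overcast, 83, 86, FALSE, yes
rainy, 70, 96, FALSE, yes
rainy, 65, 70, TRUE, no
overcast, 64, 65, TRUE, yes
sunny, 72, 95, FALSE, no
sunny, 69, 70, FALSE, yes
rainy, 75, 80, FALSE, yes
sunny, 75, 70, TRUE, yes
overcast, 72, 90, TRUE, yes
overcast, 81, 75, FALSE, yes
rainy, 71, 91, TRUE, no
"""


class Num:
    """Running mean and standard deviation (Welford-style update, assumed)."""

    def __init__(self, txt, col):
        self.txt, self.col = txt, col
        self.n, self.mu, self.m2, self.sd = 0, 0.0, 0.0, 0.0
        self.lo, self.hi = math.inf, -math.inf

    def add(self, x):
        x = float(x)
        self.n += 1
        self.lo, self.hi = min(self.lo, x), max(self.hi, x)
        delta = x - self.mu
        self.mu += delta / self.n
        self.m2 += delta * (x - self.mu)
        self.sd = math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


class Sym:
    """Symbol counts plus the most frequent symbol seen so far."""

    def __init__(self, txt, col):
        self.txt, self.col = txt, col
        self.n, self.cnt, self.mode, self.most = 0, Counter(), None, 0

    def add(self, x):
        self.n += 1
        self.cnt[x] += 1
        if self.cnt[x] > self.most:
            self.most, self.mode = self.cnt[x], x


def read_table(text):
    """Return {1-based column index: Num or Sym} for every non-skipped column."""
    cols = None
    for line in text.splitlines():
        line = re.sub(r"#.*", "", line).strip()    # drop comments and padding
        if not line:
            continue
        cells = [cell.strip() for cell in line.split(",")]
        if cols is None:                           # first row names the columns
            cols = {}
            for col, name in enumerate(cells, start=1):
                if "?" not in name:
                    kind = Num if re.search(r"[<>$]", name) else Sym
                    cols[col] = kind(name, col)
        else:
            for col, summary in cols.items():
                summary.add(cells[col - 1])
    return cols or {}


cols = read_table(WEATHER)
humid = cols[3]
print(round(humid.mu, 4), round(humid.sd, 4), cols[1].mode, cols[1].most)
```

With this data, `humid.mu` and `humid.sd` should agree with the `mu: 81.6429` and `sd: 10.2852` shown for column 3. The mode of column 1 depends on how ties are broken: `rainy` and `sunny` both occur 5 times, and updating the mode only on a strictly greater count yields `sunny`, as in the printout.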