Phase recommend #348
Conversation
	stopWordsProfile = new StopWordsProfiler(spark, args);
}

public void process(Dataset<Row> data) throws ZinggClientException {
What is the use of the process method?
There may be some data profiling in this class in addition to that in other, more specific classes, if any. This function has been there since things were combined. In DataDocumenter, calling only this function is enough.
try {
	data = PipeUtil.read(spark, false, false, args.getData());
	LOG.info("Read input data : " + data.count());
Please don’t print counts at info level - it is a performance overhead.
Why do we need the second try/catch? Why not use the outer catch, since PipeUtil already prints that it was not able to read a pipe?
count() removed.
Outer catch removed; the exception, if any, will be propagated up the call stack.
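A minimal sketch of how the simplified read path might look after removing the outer catch; the surrounding method name and context here are assumptions, not the actual PR code:

// Sketch only: method name and context are assumed.
public void execute() throws ZinggClientException {
	// No nested try/catch: if PipeUtil cannot read a pipe it already logs the
	// failure and throws, so the exception propagates up the call stack.
	Dataset<Row> data = PipeUtil.read(spark, false, false, args.getData());
	process(data);
}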
Add --column and generate only for that column.
Removed a Reflection call.
Rename all classes to recommender and the package to recommend, instead of profiler.
@@ -108,6 +108,7 @@ public class Arguments implements Serializable {
	boolean showConcise = false;
	float stopWordsCutoff = 0.1f;
	long blockSize = 100L;
	String column = "";
Why empty?
The other option is "null".
Yes, if it doesn’t exist it is null.
Set it uninitialized, which is nothing but null.
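A small sketch of the agreed-upon default, leaving the field uninitialized so an unset --column is simply null; the accessor names are assumptions:

public class Arguments implements Serializable {
	// Left uninitialized: an unset --column stays null instead of "".
	String column;

	public String getColumn() {
		return column;
	}

	public void setColumn(String column) {
		this.column = column;
	}
}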
@@ -0,0 +1,25 @@
package zingg.profiler;
What value is this class adding?
Removed the class
@@ -336,6 +336,7 @@ public static Pipe getStopWordsPipe(Arguments args, String fileName) {
	p.setFormat(Format.CSV);
	p.setProp(FilePipe.HEADER, "true");
Please add a JUnit test for this method.
Added a test case.
import zingg.util.PipeUtil;

public class StopWordsProfiler extends ProfilerBase {
	protected static String name = "zingg.StopWordsProfiler";
why do you need the name?
Removed.
LOG.info("Please provide '--column <columnName>' option at command line to generate stop words for that column."); | ||
} | ||
} else { | ||
LOG.info("No Stop Words document generated"); |
update log message to "No stopwords generated" or "No stopword recommendations generated"
Updated to "No stopwords generated"
import zingg.util.PipeUtil;

public class DataProfiler extends ProfilerBase {
	protected static String name = "zingg.DataProfiler";
are we using name anywhere?
I don't see a need for this class?
DataProfiler and DataColProfiler were the two classes. Removed DataColProfiler and moved the applicable code into DataProfiler.
private Dataset<Row> findStopWords(Dataset<Row> data, String fieldName) { | ||
LOG.debug("Field: " + fieldName); | ||
if(!data.isEmpty()) { | ||
data = data.select(split(data.col(fieldName), "\\s+").as("split")); |
Column names created by Zingg are defined centrally in ColNames - all of them have a z_ prefix.
Made them local consts only. They are not reusable column names.
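A sketch of what the "local consts" could look like: the intermediate column names used only inside findStopWords declared as private constants rather than as entries in the shared ColNames class. The constant names here are assumptions:

// Hypothetical local constants; these column names are internal to the
// stop-word computation and are not reused elsewhere.
private static final String COL_SPLIT = "split";
private static final String COL_WORD = "word";
private static final String COL_COUNT = "count";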
data = data.select(split(data.col(fieldName), "\\s+").as("split"));
data = data.select(explode(data.col("split")).as("word"));
data = data.filter(data.col("word").notEqual(""));
data = data.groupBy("word").count().orderBy(desc("count"));
Can we filter the data based on the cutoff and then orderBy? It may be faster.
Stop words are now chosen whose count is greaterThan(sum(col("count")) * cutoff); no more orderBy().
The findStopWords() function still needs review, as the result size may be unpredictable; it largely depends on how many words are in the dataset.
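A minimal sketch of the cutoff-based selection described above, using standard Spark SQL functions; the accessor args.getStopWordsCutoff() and the exact column names are assumptions:

import static org.apache.spark.sql.functions.*;

private Dataset<Row> findStopWords(Dataset<Row> data, String fieldName) {
	if (!data.isEmpty()) {
		data = data.select(split(data.col(fieldName), "\\s+").as("split"));
		data = data.select(explode(data.col("split")).as("word"));
		data = data.filter(data.col("word").notEqual(""));
		data = data.groupBy("word").count();
		// Total number of word occurrences across the column.
		long total = data.agg(sum("count")).first().getLong(0);
		// Keep only words whose frequency exceeds cutoff * total; no orderBy needed.
		data = data.filter(col("count").gt(total * args.getStopWordsCutoff()));
	}
	return data;
}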
@@ -18,7 +18,7 @@
<tr>
	<th class="border-right border-white" style="width: 160px;" >Cluster</th>
	<#list 3 ..< numColumns as entityIndex>
-	<th class="border-right border-white" > <a href="docs/${columns[entityIndex]}.html"> ${columns[entityIndex]!} </a></th>
+	<th class="border-right border-white" > <a href="${columns[entityIndex]}.html"> ${columns[entityIndex]!} </a></th>
do we still need the links? what are we showing here when we click?
@@ -0,0 +1,90 @@
package zingg.profiler;
The main test here should not be reading/writing files but whether we are generating the right stop words. Just build a dataset in memory and use that as your dataset for testing?
Removed such tests; added one wherein the dataset is created programmatically.
@@ -92,6 +92,10 @@ else if (args.getJobId() != -1) {
	String j = options.get(ClientOptions.SHOW_CONCISE).value;
	args.setShowConcise(Boolean.valueOf(j));
}
if (options.get(ClientOptions.COLUMN) != null) {
can we have a test here to see this is getting set correctly?
Yes. Added the test case.
added comments, please check
Moved StopWordsXXX from Documenter to Recommender/Profiler (ab0f519 to 906609d)
}

public void createStopWordsDocuments(Dataset<Row> data) throws ZinggClientException {
	if (!data.isEmpty()) {
The tests for data being empty, and for the column being blank or invalid, should move to the stop words class - also have JUnits for those cases.
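A hedged sketch of the guard checks the reviewer asks to move into the stop words class; the explicit column parameter, method placement, and log wording are assumptions:

// Sketch: guards inside the stop words recommender (names and signature assumed).
public void createStopWordsDocuments(Dataset<Row> data, String column) throws ZinggClientException {
	if (data == null || data.isEmpty()) {
		LOG.info("No stopwords generated");
		return;
	}
	if (column == null || column.isEmpty()) {
		LOG.info("Please provide '--column <columnName>' option at command line to generate stop words for that column.");
		return;
	}
	if (!java.util.Arrays.asList(data.columns()).contains(column)) {
		LOG.warn("Invalid column name: " + column);
		return;
	}
	// ... proceed to compute and write stop word recommendations ...
}

The corresponding JUnit cases would invoke this with an empty dataset, a blank column, and a non-existent column, and assert that no recommendations are written.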
import org.junit.jupiter.api.Test;

public class TestClient {
	public static final Log LOG = LogFactory.getLog(TestClient.class);

	@Test
	public void testValidPhase() {
is this testing valid or invalid phase?
@Test
public void testSetColumnOptionThroughBuildAndSetArguments() {
	Arguments arguments = new Arguments();
	String[] args = {ClientOptions.CONF, "configFile", ClientOptions.PHASE, "train", ClientOptions.COLUMN, "columnName", ClientOptions.SHOW_CONCISE, "true", ClientOptions.LICENSE, "licenseFile"};
why do you need showConcise?
@@ -0,0 +1,63 @@
package zingg.recommender;
Remove this class and move the stopwords functionality to that class.
}

/* creates a dataframe for given words and their frequency */
public Dataset<Row> createDFWithGivenStopWords() {
	Map<String, Integer> map = Stream.of(new Object[][] {
why not simply use map.put?
The data creation here is leading to a lot of extra loops and object creation. Use of a hash when
I feel a simpler logic would be:
- define a class word, count (the, 44)
- define a list of words; just add all the words one by one
- define wordDistribution - int[][] where a row is one of NO_OF_RECORDS records and each column represents a word
- fill wordDistribution per word in the list of words
- iterate over the wordDist array and join the strings by looking up the col index (see the sketch below)
Also define structType once and reuse?
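A rough sketch of the reviewer's suggested approach (all names and numbers here are hypothetical): a fixed word list, a per-record distribution array, rows built by joining words via the column index, and a single reusable StructType:

// Hypothetical sketch of building the test dataset without extra object churn.
String[] words = {"the", "and", "of", "apple", "tree"};   // list of words
int NO_OF_RECORDS = 10;

// Row = record, column = word; fill so each column sums to that word's total count.
int[][] wordDistribution = new int[NO_OF_RECORDS][words.length];
// ... fill wordDistribution, e.g. via randomDistributionList(NO_OF_RECORDS, count) per word ...

// Define the StructType once and reuse it for every test dataframe.
StructType schema = new StructType(new StructField[] {
	DataTypes.createStructField("field1", DataTypes.StringType, true)
});

List<Row> rows = new ArrayList<>();
for (int r = 0; r < NO_OF_RECORDS; r++) {
	StringBuilder line = new StringBuilder();
	for (int w = 0; w < words.length; w++) {
		for (int k = 0; k < wordDistribution[r][w]; k++) {
			line.append(words[w]).append(" ");
		}
	}
	rows.add(RowFactory.create(line.toString().trim()));
}
Dataset<Row> dataset = spark.createDataFrame(rows, schema);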
}

/* Breaks 'n' into 'm' random numbers such that sum(arr[m]) = n */
int[] randomDistributionList(int m, int n) {
	int arr[] = new int[m];
Won't this be m-1?
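A hedged sketch of one way to write randomDistributionList so that all m slots are filled and the values still sum to n (addressing the m-1 concern); this is not the PR's actual implementation:

/* Breaks 'n' into 'm' random non-negative numbers whose sum is exactly n. */
int[] randomDistributionList(int m, int n) {
	int[] arr = new int[m];
	java.util.Random random = new java.util.Random();
	int remaining = n;
	// Give the first m-1 slots a random share of what is left...
	for (int i = 0; i < m - 1; i++) {
		arr[i] = random.nextInt(remaining + 1);
		remaining -= arr[i];
	}
	// ...and assign the remainder to the last slot so the total stays n.
	arr[m - 1] = remaining;
	return arr;
}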
args.setStopWordsCutoff(0.1f);
Dataset<Row> stopWords = recommender.findStopWords(dataset, COL_STOPWORDS);
stopWords.show();
show() is not a test. The test has to say which words made it to stop words and which did not.
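A sketch of the assertion-style check the reviewer is asking for, assuming the hypothetical example words from the earlier sketch; collect the recommended words and verify which made the cut:

// Collect the recommended stop words instead of just calling show().
List<String> recommended = stopWords.select("word")
		.as(Encoders.STRING())
		.collectAsList();

// Frequent words should be recommended as stop words...
assertTrue(recommended.contains("the"));
assertTrue(recommended.contains("and"));
// ...while rare words should not be.
assertFalse(recommended.contains("apple"));
assertFalse(recommended.contains("tree"));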
New phase "recommend" added
Moved StopWordsXXX from Documenter to Recommender/Profiler
fixed relative location of col documents in model.flth