Phase recommend #348
Conversation
	stopWordsProfile = new StopWordsProfiler(spark, args);
}

public void process(Dataset<Row> data) throws ZinggClientException {
What is the use of the process method?
There may be some data profiling in this class in addition to that in other, more specific classes, if any. This function has been there since things were combined. In DataDocumenter, calling only this function is enough.
try {
	data = PipeUtil.read(spark, false, false, args.getData());
	LOG.info("Read input data : " + data.count());
Please don’t print counts at info level - it is a performance overhead.
Why do we need the second try/catch? Why not use the outer catch, since PipeUtil already prints that it was not able to read a pipe?
count() removed.
Outer catch removed; the exception, if any, will be propagated up the call stack.
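A minimal sketch of how the simplified read path might look after removing the outer catch; the surrounding method name and context here are assumptions, not the actual PR code:

// Sketch only: method name and context are assumed.
public void execute() throws ZinggClientException {
	// No nested try/catch: if PipeUtil cannot read a pipe it already logs the
	// failure and throws, so the exception propagates up the call stack.
	Dataset<Row> data = PipeUtil.read(spark, false, false, args.getData());
	process(data);
}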
Add --column and generate only for that column.
Removed a Reflection call.
Rename all classes to recommender and the package to recommend, instead of profiler.
@@ -108,6 +108,7 @@ public class Arguments implements Serializable {
	boolean showConcise = false;
	float stopWordsCutoff = 0.1f;
	long blockSize = 100L;
	String column = "";
Why empty?
The other option is "null".
Yes, if it doesn’t exist it is null.
Set it uninitialized, which is nothing but null.
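A small sketch of the agreed-upon default, leaving the field uninitialized so an unset --column is simply null; the accessor names are assumptions:

public class Arguments implements Serializable {
	// Left uninitialized: an unset --column stays null instead of "".
	String column;

	public String getColumn() {
		return column;
	}

	public void setColumn(String column) {
		this.column = column;
	}
}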
@@ -0,0 +1,25 @@
package zingg.profiler;
What value is this class adding?
Removed the class
@@ -336,6 +336,7 @@ public static Pipe getStopWordsPipe(Arguments args, String fileName) {
	p.setFormat(Format.CSV);
	p.setProp(FilePipe.HEADER, "true");
Please add a JUnit test for this method.
Added a test case.
import zingg.util.PipeUtil;

public class StopWordsProfiler extends ProfilerBase {
	protected static String name = "zingg.StopWordsProfiler";
why do you need the name?
Removed.
LOG.info("Please provide '--column <columnName>' option at command line to generate stop words for that column."); | ||
} | ||
} else { | ||
LOG.info("No Stop Words document generated"); |
update log message to "No stopwords generated" or "No stopword recommendations generated"
Updated to "No stopwords generated"
import zingg.util.PipeUtil;

public class DataProfiler extends ProfilerBase {
	protected static String name = "zingg.DataProfiler";
are we using name anywhere?
I don't see a need for this class?
DataProfiler and DataColProfiler were the two classes. Removed DataColProfiler and moved the applicable code into DataProfiler.
private Dataset<Row> findStopWords(Dataset<Row> data, String fieldName) { | ||
LOG.debug("Field: " + fieldName); | ||
if(!data.isEmpty()) { | ||
data = data.select(split(data.col(fieldName), "\\s+").as("split")); |
Column names created by Zingg are defined centrally in ColNames - all of them have a z_ prefix.
Made them local consts only. They are not reusable column names.
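A sketch of what the "local consts" could look like: the intermediate column names used only inside findStopWords declared as private constants rather than as entries in the shared ColNames class. The constant names here are assumptions:

// Hypothetical local constants; these column names are internal to the
// stop-word computation and are not reused elsewhere.
private static final String COL_SPLIT = "split";
private static final String COL_WORD = "word";
private static final String COL_COUNT = "count";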
data = data.select(split(data.col(fieldName), "\\s+").as("split"));
data = data.select(explode(data.col("split")).as("word"));
data = data.filter(data.col("word").notEqual(""));
data = data.groupBy("word").count().orderBy(desc("count"));
Can we filter the data based on the cutoff and then orderBy? It may be faster.
Stop words are now chosen whose count is greaterThan(sum(col("count")) * cutoff); no more orderBy().
The findStopWords() function still needs review, as the result size may be unpredictable; it largely depends on how many words are in the dataset.
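A minimal sketch of the cutoff-based selection described above, using standard Spark SQL functions; the accessor args.getStopWordsCutoff() and the exact column names are assumptions:

import static org.apache.spark.sql.functions.*;

private Dataset<Row> findStopWords(Dataset<Row> data, String fieldName) {
	if (!data.isEmpty()) {
		data = data.select(split(data.col(fieldName), "\\s+").as("split"));
		data = data.select(explode(data.col("split")).as("word"));
		data = data.filter(data.col("word").notEqual(""));
		data = data.groupBy("word").count();
		// Total number of word occurrences across the column.
		long total = data.agg(sum("count")).first().getLong(0);
		// Keep only words whose frequency exceeds cutoff * total; no orderBy needed.
		data = data.filter(col("count").gt(total * args.getStopWordsCutoff()));
	}
	return data;
}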
@@ -18,7 +18,7 @@
<tr>
	<th class="border-right border-white" style="width: 160px;" >Cluster</th>
	<#list 3 ..< numColumns as entityIndex>
-	<th class="border-right border-white" > <a href="docs/${columns[entityIndex]}.html"> ${columns[entityIndex]!} </a></th>
+	<th class="border-right border-white" > <a href="${columns[entityIndex]}.html"> ${columns[entityIndex]!} </a></th>
do we still need the links? what are we showing here when we click?
@@ -0,0 +1,90 @@
package zingg.profiler;
The main test here should not be reading/writing files but whether we are generating the right stop words. Just build a dataset in memory and use that as your dataset for testing?
Removed such tests; added one wherein the dataset is created programmatically.
@@ -92,6 +92,10 @@ else if (args.getJobId() != -1) {
	String j = options.get(ClientOptions.SHOW_CONCISE).value;
	args.setShowConcise(Boolean.valueOf(j));
}
if (options.get(ClientOptions.COLUMN) != null) {
can we have a test here to see this is getting set correctly?
Yes. Added the test case.
added comments, please check
Moved StopWordsXXX from Documenter to Recommender/Profiler (ab0f519 to 906609d)
}

public void createStopWordsDocuments(Dataset<Row> data) throws ZinggClientException {
	if (!data.isEmpty()) {
The tests for data being empty, and for the column being blank or invalid, should move to the stop words class - also have JUnits for those cases.
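A hedged sketch of the guard checks the reviewer asks to move into the stop words class; the explicit column parameter, method placement, and log wording are assumptions:

// Sketch: guards inside the stop words recommender (names and signature assumed).
public void createStopWordsDocuments(Dataset<Row> data, String column) throws ZinggClientException {
	if (data == null || data.isEmpty()) {
		LOG.info("No stopwords generated");
		return;
	}
	if (column == null || column.isEmpty()) {
		LOG.info("Please provide '--column <columnName>' option at command line to generate stop words for that column.");
		return;
	}
	if (!java.util.Arrays.asList(data.columns()).contains(column)) {
		LOG.warn("Invalid column name: " + column);
		return;
	}
	// ... proceed to compute and write stop word recommendations ...
}

The corresponding JUnit cases would invoke this with an empty dataset, a blank column, and a non-existent column, and assert that no recommendations are written.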
import org.junit.jupiter.api.Test;

public class TestClient {
	public static final Log LOG = LogFactory.getLog(TestClient.class);

	@Test
	public void testValidPhase() {
is this testing valid or invalid phase?
@Test
public void testSetColumnOptionThroughBuildAndSetArguments() {
	Arguments arguments = new Arguments();
	String[] args = {ClientOptions.CONF, "configFile", ClientOptions.PHASE, "train", ClientOptions.COLUMN, "columnName", ClientOptions.SHOW_CONCISE, "true", ClientOptions.LICENSE, "licenseFile"};
why do you need showConcise?
@@ -0,0 +1,63 @@
package zingg.recommender;
Remove this class and move the stopwords functionality to that class.
}

/* creates a dataframe for given words and their frequency */
public Dataset<Row> createDFWithGivenStopWords() {
	Map<String, Integer> map = Stream.of(new Object[][] {
why not simply use map.put?
The data creation here is leading to a lot of extra loops and object creation. Use of a hash when
I feel a simpler logic would be:
- define a class word, count (the, 44)
- define a list of words; just add all the words one by one
- define wordDistribution - int[][] where a row is one of NO_OF_RECORDS records and each column represents a word
- fill wordDistribution per word in the list of words
- iterate over the wordDist array and join the strings by looking up the col index (see the sketch below)
Also define structType once and reuse?
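A rough sketch of the reviewer's suggested approach (all names and numbers here are hypothetical): a fixed word list, a per-record distribution array, rows built by joining words via the column index, and a single reusable StructType:

// Hypothetical sketch of building the test dataset without extra object churn.
String[] words = {"the", "and", "of", "apple", "tree"};   // list of words
int NO_OF_RECORDS = 10;

// Row = record, column = word; fill so each column sums to that word's total count.
int[][] wordDistribution = new int[NO_OF_RECORDS][words.length];
// ... fill wordDistribution, e.g. via randomDistributionList(NO_OF_RECORDS, count) per word ...

// Define the StructType once and reuse it for every test dataframe.
StructType schema = new StructType(new StructField[] {
	DataTypes.createStructField("field1", DataTypes.StringType, true)
});

List<Row> rows = new ArrayList<>();
for (int r = 0; r < NO_OF_RECORDS; r++) {
	StringBuilder line = new StringBuilder();
	for (int w = 0; w < words.length; w++) {
		for (int k = 0; k < wordDistribution[r][w]; k++) {
			line.append(words[w]).append(" ");
		}
	}
	rows.add(RowFactory.create(line.toString().trim()));
}
Dataset<Row> dataset = spark.createDataFrame(rows, schema);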
}

/* Breaks 'n' into 'm' random numbers such that sum(arr[m]) = n */
int[] randomDistributionList(int m, int n) {
	int arr[] = new int[m];
Won't this be m-1?
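A hedged sketch of one way to write randomDistributionList so that all m slots are filled and the values still sum to n (addressing the m-1 concern); this is not the PR's actual implementation:

/* Breaks 'n' into 'm' random non-negative numbers whose sum is exactly n. */
int[] randomDistributionList(int m, int n) {
	int[] arr = new int[m];
	java.util.Random random = new java.util.Random();
	int remaining = n;
	// Give the first m-1 slots a random share of what is left...
	for (int i = 0; i < m - 1; i++) {
		arr[i] = random.nextInt(remaining + 1);
		remaining -= arr[i];
	}
	// ...and assign the remainder to the last slot so the total stays n.
	arr[m - 1] = remaining;
	return arr;
}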
args.setStopWordsCutoff(0.1f);
Dataset<Row> stopWords = recommender.findStopWords(dataset, COL_STOPWORDS);
stopWords.show();
show() is not a test. The test has to say which words made it to stop words and which did not.
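A sketch of the assertion-style check the reviewer is asking for, assuming the hypothetical example words from the earlier sketch; collect the recommended words and verify which made the cut:

// Collect the recommended stop words instead of just calling show().
List<String> recommended = stopWords.select("word")
		.as(Encoders.STRING())
		.collectAsList();

// Frequent words should be recommended as stop words...
assertTrue(recommended.contains("the"));
assertTrue(recommended.contains("and"));
// ...while rare words should not be.
assertFalse(recommended.contains("apple"));
assertFalse(recommended.contains("tree"));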
New phase "recommend" added
Moved StopWordsXXX from Documenter to Recommender/Profiler
fixed relative location of col documents in model.flth