Phase recommend #348

Closed
wants to merge 7 commits into from

navinrathore
Contributor

New phase "recommend" added
Moved StopWordsXXX from Documenter to Recommender/Profiler
Fixed relative location of column documents in model.ftlh

stopWordsProfile = new StopWordsProfiler(spark, args);
}

public void process(Dataset<Row> data) throws ZinggClientException {
Member

What is the use of process method?

Contributor Author

@navinrathore Jun 21, 2022

Some data profiling may happen in this class itself, in addition to whatever the more specific classes do. This function dates from when the two were combined. In DataDocumenter, calling this function alone is enough.


try {
data = PipeUtil.read(spark, false, false, args.getData());
LOG.info("Read input data : " + data.count());
Member

Please don’t log counts at info level, as count() is a performance overhead.

Member

Why do we need the second try-catch? Why not use the outer catch, since PipeUtil already logs that it was not able to read a pipe?

Contributor Author

@navinrathore Jun 21, 2022

count() removed.
Outer catch removed. Any exception will now propagate up the call stack.

core/src/main/java/zingg/profiler/ProfilerBase.java (outdated, resolved)
core/src/main/java/zingg/profiler/StopWordsProfiler.java (outdated, resolved)
core/src/main/java/zingg/profiler/StopWordsProfiler.java (outdated, resolved)
core/src/main/java/zingg/profiler/StopWordsProfiler.java (outdated, resolved)
core/src/main/java/zingg/profiler/DataProfiler.java (outdated, resolved)
@sonalgoyal
Member

Add a --column option and generate stop words only for that column.

@navinrathore
Contributor Author

Removed a reflection call.

Member

@sonalgoyal left a comment

Rename all classes to recommender, and the package to recommend, instead of profiler.

@@ -108,6 +108,7 @@ public class Arguments implements Serializable {
boolean showConcise = false;
float stopWordsCutoff = 0.1f;
long blockSize = 100L;
String column = "";
Member

Why empty?

Contributor Author

The other option is null.

Member

Yes, if it doesn’t exist it is null.

Contributor Author

Left it uninitialized, which means it is null.
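
For reference, a Java object field declared without an initializer defaults to null, so "was --column supplied?" becomes a null check. A minimal sketch, assuming a hypothetical `ArgsSketch` class standing in for `Arguments` (the `hasColumn` helper is illustrative and not part of this PR):

```java
public class ArgsSketch {
    // Hypothetical stand-in for the real Arguments class.
    // No initializer: an object field defaults to null in Java.
    private String column;

    public String getColumn() {
        return column;
    }

    public void setColumn(String column) {
        this.column = column;
    }

    // Illustrative helper: true only when a non-blank column was supplied.
    public boolean hasColumn() {
        return column != null && !column.trim().isEmpty();
    }
}
```

With a null default, callers can tell "option not given" apart from "option given as an empty string", which an empty-string default would hide.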

@@ -0,0 +1,25 @@
package zingg.profiler;
Member

What value is this class adding?

Contributor Author

Removed the class

@@ -336,6 +336,7 @@ public static Pipe getStopWordsPipe(Arguments args, String fileName) {
p.setFormat(Format.CSV);
p.setProp(FilePipe.HEADER, "true");
Member

please add junit for this method

Contributor Author

Added testcase

import zingg.util.PipeUtil;

public class StopWordsProfiler extends ProfilerBase {
protected static String name = "zingg.StopWordsProfiler";
Member

why do you need the name?

Contributor Author

Removed.

LOG.info("Please provide '--column <columnName>' option at command line to generate stop words for that column.");
}
} else {
LOG.info("No Stop Words document generated");
Member

@sonalgoyal Jun 21, 2022

update log message to "No stopwords generated" or "No stopword recommendations generated"

Contributor Author

Updated to "No stopwords generated"

import zingg.util.PipeUtil;

public class DataProfiler extends ProfilerBase {
protected static String name = "zingg.DataProfiler";
Member

are we using name anywhere?

Member

I don't see a need for this class.

Contributor Author

There were two classes, DataProfiler and DataColProfiler. Removed DataColProfiler and moved the applicable parts into DataProfiler.

private Dataset<Row> findStopWords(Dataset<Row> data, String fieldName) {
LOG.debug("Field: " + fieldName);
if(!data.isEmpty()) {
data = data.select(split(data.col(fieldName), "\\s+").as("split"));
Member

column names created by Zingg are defined centrally in ColNames - all of them have z_ prefix

Contributor Author

Made them local consts only. They are not reusable column names.

data = data.select(split(data.col(fieldName), "\\s+").as("split"));
data = data.select(explode(data.col("split")).as("word"));
data = data.filter(data.col("word").notEqual(""));
data = data.groupBy("word").count().orderBy(desc("count"));
Member

can we filter the data based on the cutoff and then orderby? it may be faster

Contributor Author

Stop words are now the words whose count is greater than sum(col(count)) * cutoff.
No more orderBy().
findStopWords() still needs review, as the result size may be unpredictable. It depends largely on how many words there are in the dataset.
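
The cutoff rule described above can be sketched in plain Java, without Spark, to make the selection logic concrete. This is only an illustration under stated assumptions: `StopWordsCutoffSketch` and its method signature are hypothetical, and the actual change operates on a `Dataset<Row>` rather than on in-memory strings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class StopWordsCutoffSketch {
    // Returns the words whose count exceeds cutoff * (total number of words).
    // Hypothetical plain-Java analogue of the Spark logic discussed in this thread.
    public static Set<String> findStopWords(Iterable<String> values, double cutoff) {
        Map<String, Long> counts = new HashMap<>();
        long total = 0;
        for (String value : values) {
            if (value == null) {
                continue;
            }
            for (String word : value.split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // mirrors the notEqual("") filter in the diff
                }
                counts.merge(word, 1L, Long::sum);
                total++;
            }
        }
        double threshold = total * cutoff;
        Set<String> stopWords = new TreeSet<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            if (entry.getValue() > threshold) {
                stopWords.add(entry.getKey());
            }
        }
        return stopWords;
    }
}
```

Because every selected word must account for more than `cutoff` of all words, at most `1 / cutoff` words can pass the filter, which bounds the result size for any fixed cutoff.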

@@ -18,7 +18,7 @@
<tr>
<th class="border-right border-white" style="width: 160px;" >Cluster</th>
<#list 3 ..< numColumns as entityIndex>
- <th class="border-right border-white" > <a href="docs/${columns[entityIndex]}.html"> ${columns[entityIndex]!} </a></th>
+ <th class="border-right border-white" > <a href="${columns[entityIndex]}.html"> ${columns[entityIndex]!} </a></th>
Member

do we still need the links? what are we showing here when we click?

@@ -0,0 +1,90 @@
package zingg.profiler;
Member

The main test here should not be about reading/writing files but about whether we are generating the right stop words. Just build a dataset in memory and use that for testing?

Contributor Author

Removed those tests. Added one in which the dataset is created programmatically.

@@ -92,6 +92,10 @@ else if (args.getJobId() != -1) {
String j = options.get(ClientOptions.SHOW_CONCISE).value;
args.setShowConcise(Boolean.valueOf(j));
}
if (options.get(ClientOptions.COLUMN)!= null) {
Member

can we have a test here to see this is getting set correctly?

Contributor Author

Yes. Added the testcase.

Member

@sonalgoyal left a comment

added comments, please check

@navinrathore
Contributor Author

navinrathore commented Jun 27, 2022

}

public void createStopWordsDocuments(Dataset<Row> data) throws ZinggClientException {
if (!data.isEmpty()) {
Member

The checks for data being empty and for the column being blank or invalid should move to the stop words class. Also add JUnit tests for those cases.

import org.junit.jupiter.api.Test;

public class TestClient {
public static final Log LOG = LogFactory.getLog(TestClient.class);

@Test
public void testValidPhase() {
Member

is this testing valid or invalid phase?

@Test
public void testSetColumnOptionThroughBuildAndSetArguments() {
Arguments arguments = new Arguments();
String[] args = {ClientOptions.CONF, "configFile", ClientOptions.PHASE, "train", ClientOptions.COLUMN, "columnName", ClientOptions.SHOW_CONCISE, "true", ClientOptions.LICENSE, "licenseFile"};
Member

why do you need showConcise?

@@ -0,0 +1,63 @@
package zingg.recommender;
Member

remove this class and move functionality for stopwords to that class.

}
/* creates a dataframe for given words and their frequency*/
public Dataset<Row> createDFWithGivenStopWords() {
Map<String, Integer> map = Stream.of(new Object[][] {
Member

why not simply use map.put?

Member

The data creation here is leading to a lot of extra loops and object creation. Use of a hash when

I feel a simpler logic would be

  • define a class word, count (the, 44)
  • define a list of words. just add all the words one by one.
  • define wordDistribution - int[][] row is NO_OF_RECORDS, each column represents a word
  • fill wordDistribution per word in the list of words
  • iterate over the wordDist array and join the strings by looking up the col index.

Also define structType once and reuse?
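
The steps listed above might look roughly like the following in plain Java. This is a sketch under assumptions, not the test code from this PR: `WordDataSketch`, `WordCount`, `buildRecords`, the seed parameter, and the value of `NO_OF_RECORDS` are all hypothetical, and the Spark `StructType` and `Dataset<Row>` construction is left out so the block stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.StringJoiner;

public class WordDataSketch {
    // A word and the total number of times it should appear, e.g. (the, 44).
    public static final class WordCount {
        final String word;
        final int count;

        public WordCount(String word, int count) {
            this.word = word;
            this.count = count;
        }
    }

    public static final int NO_OF_RECORDS = 5; // illustrative size

    // Builds NO_OF_RECORDS strings in which each word occurs exactly `count` times overall.
    public static List<String> buildRecords(List<WordCount> words, long seed) {
        Random random = new Random(seed);
        // wordDistribution[r][w] = occurrences of word w in record r
        int[][] wordDistribution = new int[NO_OF_RECORDS][words.size()];
        for (int w = 0; w < words.size(); w++) {
            for (int i = 0; i < words.get(w).count; i++) {
                wordDistribution[random.nextInt(NO_OF_RECORDS)][w]++;
            }
        }
        // Join the words of each row by looking them up via the column index.
        List<String> records = new ArrayList<>();
        for (int[] row : wordDistribution) {
            StringJoiner joiner = new StringJoiner(" ");
            for (int w = 0; w < row.length; w++) {
                for (int i = 0; i < row[w]; i++) {
                    joiner.add(words.get(w).word);
                }
            }
            records.add(joiner.toString());
        }
        return records;
    }
}
```

A fixed seed keeps the generated records reproducible, so a test can assert exact stop word results instead of inspecting output by eye.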

}
/* Breaks 'n' into 'm' random numbers such that sum(arr[m]) = n */
int[] randomDistributionList(int m, int n) {
int arr[] = new int[m];
Member

wont this be m-1?
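
One reading of the question is that an array of length m has valid indices 0 to m-1, so the comment's `arr[m]` is off by one. A minimal sketch of a split that respects this, assuming a hypothetical `RandomDistributionSketch` class and an explicit `Random` parameter (neither is from the PR, whose method body is not shown here):

```java
import java.util.Random;

public class RandomDistributionSketch {
    // Splits n into m non-negative random parts whose sum is exactly n.
    public static int[] randomDistributionList(int m, int n, Random random) {
        if (m <= 0 || n < 0) {
            throw new IllegalArgumentException("m must be positive and n non-negative");
        }
        int[] arr = new int[m]; // valid indices are 0 .. m-1
        int remaining = n;
        for (int i = 0; i < m - 1; i++) {
            arr[i] = random.nextInt(remaining + 1); // value in 0 .. remaining
            remaining -= arr[i];
        }
        arr[m - 1] = remaining; // the last slot absorbs whatever is left
        return arr;
    }
}
```

This split is not uniform, since earlier slots tend to receive larger shares, but it guarantees the sum, which is the property the comment in the diff states.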


args.setStopWordsCutoff(0.1f);
Dataset<Row> stopWords = recommender.findStopWords(dataset, COL_STOPWORDS);
stopWords.show();
Member

show is not a test. test has to say which words made it to stopwords and which did not.

@sonalgoyal sonalgoyal closed this Jul 16, 2022
@navinrathore navinrathore deleted the PhaseRecommend branch July 28, 2022 16:00
2 participants