blockSize - a new config parameter for max size of the block #272

Merged on May 23, 2022 (6 commits)

Conversation

navinrathore
Contributor

No description provided.

@@ -622,7 +623,15 @@ public boolean getShowConcise() {
public void setShowConcise(boolean showConcise) {
this.showConcise = showConcise;
}


public long getblockSize() {
Member

Camel casing is missing: this should be getBlockSize, and the same applies to the setter.

Member

This will break the JSON parsing. Please also test one end-to-end case with 120k records and set the block size in args. Use debug logs to verify.

Contributor Author

Changed to Camel Case.
Ran with examples/febrl120k/config.json. Output attached. Appropriate block size was selected.
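
For reference, the accessor naming the review asks for can be sketched as below. This is a minimal illustration, not the project's actual class: `ArgumentsSketch` is a hypothetical name, and the default of 100 is an assumption that mirrors the cap hard-coded before this change. Many bean-style JSON binders derive the property name from the accessor names, which is why the conventional `getBlockSize`/`setBlockSize` pair is the safer choice.

```java
// Hypothetical stand-in for the config class; only the accessor naming matters here.
public class ArgumentsSketch {
    // Assumed default: mirrors the cap of 100 that was hard-coded before this change.
    private long blockSize = 100L;

    // Conventional bean accessors for a property named "blockSize".
    public long getBlockSize() {
        return blockSize;
    }

    public void setBlockSize(long blockSize) {
        this.blockSize = blockSize;
    }

    public static void main(String[] args) {
        ArgumentsSketch sketch = new ArgumentsSketch();
        sketch.setBlockSize(25L);
        System.out.println(sketch.getBlockSize()); // prints 25
    }
}
```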


public static long getMaxBlockSize(long totalCount) {
public static final long MIN_SIZE = 8L;
public static long getMaxBlockSize(long totalCount, long blockSizeFromConfig) {
long maxSize = 8;
Member

Shouldn't this be set to the min size var?

Contributor Author

Used MIN_SIZE

LOG.debug("**Block size found **" + maxSize);
if (maxSize > 100) maxSize = 100;
LOG.debug("**Block size found **");
if (maxSize > blockSizeFromConfig) maxSize = blockSizeFromConfig;
if (maxSize <= 8) maxSize = 8;
Member

use the defined constant

Contributor Author

used MIN_SIZE

@navinrathore
Contributor Author

Output:
examples/febrl120k/config.json

A. ==================================================
Phase: findTrainingData
blockSize(Config): 150

 2022-05-23 12:38:27,107 [main] INFO  zingg.util.Heuristics - **Block size **12 and total count was 12123
 2022-05-23 12:38:27,121 [main] INFO  zingg.util.Heuristics - Heuristics suggest 12

B ==================================================
Phase: findTrainingData
blockSize(Config): 10

 2022-05-23 13:56:30,606 [main] WARN  zingg.TrainingDataFinder - Read training samples 37 neg 43
 2022-05-23 13:56:31,093 [main] INFO  zingg.TrainingDataFinder - Preprocessing DS for stopWords
 2022-05-23 13:56:50,121 [main] INFO  zingg.util.Heuristics - **Block size **10 and total count was 12035
 2022-05-23 13:56:50,121 [main] INFO  zingg.util.Heuristics - Heuristics suggest 10
 2022-05-23 13:56:50,121 [main] INFO  zingg.util.BlockingTreeUtil - Learning indexing rules for block size 10
 2022-05-23 13:57:00,114 [main] WARN  org.apache.spark.sql.execution.CacheManager - Asked to cache already cached
 
 C ==================================================
Phase: findTrainingData
blockSize(Config): 5
 
  2022-05-23 13:59:39,839 [main] INFO  zingg.TrainingDataFinder - Preprocessing DS for stopWords
 2022-05-23 13:59:58,061 [main] INFO  zingg.util.Heuristics - **Block size **8 and total count was 11993
 2022-05-23 13:59:58,061 [main] INFO  zingg.util.Heuristics - Heuristics suggest 8
 2022-05-23 13:59:58,061 [main] INFO  zingg.util.BlockingTreeUtil - Learning indexing rules for block size 8
 
D ==================================================
Phase: Train
blockSize(Config): 5

 2022-05-23 14:03:44,657 [main] INFO  zingg.util.Heuristics - **Block size **8 and total count was 30406
 2022-05-23 14:03:44,658 [main] INFO  zingg.util.Heuristics - Heuristics suggest 8
 2022-05-23 14:03:44,658 [main] INFO  zingg.util.BlockingTreeUtil - Learning indexing rules for block size 8
 2022-05-23 14:03:59,095 [main] WARN  zingg.util.PipeUtil - Writing output Pipe [name=null, format=PARQUET, preprocessors=null, props={location=models/101/model/block/zingg.block}, schema=null]
 2022-05-23 14:03:59,095 [main] WARN  zingg.util.PipeUtil - Writing file

E ======================================================
Phase: Train
blockSize(Config): 50

ingType,true))]
 2022-05-23 14:08:30,628 [main] INFO  zingg.util.Heuristics - **Block size **30 and total count was 30186
 2022-05-23 14:08:30,629 [main] INFO  zingg.util.Heuristics - Heuristics suggest 30
 2022-05-23 14:08:30,629 [main] INFO  zingg.util.BlockingTreeUtil - Learning indexing rules for block size 30
 2022-05-23 14:08:43,103 [main] WARN  zingg.util.PipeUtil - Writing output Pipe [name=null, format=PARQUET, preprocessors=null, props={location=models/101/model/block/zingg.block}, schema=null]
 2022-05-23 14:08:43,104 [main] WARN  zingg.util.PipeUtil - Writing file
F ======================================================
Phase: Train
blockSize(Config): 25

 2022-05-23 14:11:17,582 [main] INFO  zingg.util.Heuristics - **Block size **25 and total count was 30061
 2022-05-23 14:11:17,582 [main] INFO  zingg.util.Heuristics - Heuristics suggest 25
 2022-05-23 14:11:17,583 [main] INFO  zingg.util.BlockingTreeUtil - Learning indexing rules for block size 25
====================================================== 
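
The logged block sizes in runs A to F are consistent with the rule discussed in the review: take 0.1% of the total count, cap it at the configured blockSize, and never go below MIN_SIZE. The sketch below is a standalone reconstruction under that assumption, not the merged code; the class name `BlockSizeSketch` is hypothetical and logging is omitted.

```java
// Hypothetical reconstruction of the heuristic from the diff fragments in this review.
public class BlockSizeSketch {
    // Lower bound on the block size, named MIN_SIZE in the review.
    public static final long MIN_SIZE = 8L;

    public static long getMaxBlockSize(long totalCount, long blockSizeFromConfig) {
        // Start from 0.1% of the record count, truncated toward zero.
        long maxSize = (long) (0.001 * totalCount);
        // The configured blockSize acts as an upper bound.
        if (maxSize > blockSizeFromConfig) maxSize = blockSizeFromConfig;
        // The lower bound is applied last, so it wins over a smaller configured value.
        if (maxSize <= MIN_SIZE) maxSize = MIN_SIZE;
        return maxSize;
    }

    public static void main(String[] args) {
        System.out.println(getMaxBlockSize(12123L, 150L)); // run A: prints 12
        System.out.println(getMaxBlockSize(12035L, 10L));  // run B: prints 10
        System.out.println(getMaxBlockSize(11993L, 5L));   // run C: prints 8
    }
}
```

Because the lower bound is applied after the cap, a configured blockSize below 8 (runs C and D) still yields 8.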

long maxSize = 8;
/*if (totalCount > 100 && totalCount < 500){
maxSize = totalCount / 5;
}
else {*/
maxSize = (long) (0.001 * totalCount);
LOG.debug("**Block size found **" + maxSize);
if (maxSize > 100) maxSize = 100;
LOG.debug("**Block size found **");
Member

Please print the max size here.

Contributor Author

Restored

@sonalgoyal sonalgoyal merged commit b4d6fee into zinggAI:main May 23, 2022
@navinrathore navinrathore deleted the BlockingTreeSize259 branch June 1, 2022 04:21