1868 statistics with missing counts and datasets missing properties #1873

Merged

Contributor

@AdrianOlosutean commented Aug 5, 2021

Closes #1868

To be discussed:

  • if the current implementation of getLatestDatasets could replace summaries
  • the endpoint for getting datasets that are missing a specified property
  • if the statistics method should use queries instead
  • optimization methods for the statistics endpoint

To do:

  • return just name and version for missing datasets

Contributor

@dk1844 left a comment

  • code reviewed
  • pulled
  • built
  • ran just the integTest StatisticsIntegrationSuite

There are a few comments below that may deserve some addressing, but I like it overall 👍

@@ -217,6 +215,15 @@ class DatasetService @Autowired()(datasetMongoRepository: DatasetMongoRepository
}
}

def getLatestVersionsWithMissingProperty(missingProperty: Option[String]): Future[Seq[Dataset]] = {
val missingFilter = missingProperty match {
case Some(missingProp) => Filters.not(Filters.exists(s"properties.${missingProp}"))
Contributor

@dk1844 Aug 5, 2021

Wouldn't it be useful to be able to fetch this list with multiple missing properties (or all) at once?

I think we could create such a filter using Filters.and(...).

Contributor Author

Yes, I was thinking about this too, but what we need is the number of datasets missing each property. Combining all of them with Filters.and would instead give the number of datasets that are missing all of these properties (that is, datasets which don't have any of them).

Collaborator

I also think such a method would be useful, and it should be doable with a combination of and and or. This one on the other hand is useful elsewhere too, isn't it?
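
To make the distinction in this thread concrete, here is a minimal in-memory sketch of the two ways of combining "property is missing" conditions. It does not touch MongoDB; the `Ds` case class, the sample data and the helper names are all hypothetical stand-ins, with `forall` playing the role of `Filters.and` and `exists` the role of `Filters.or`.

```scala
object MissingPropertyFilters {
  // Hypothetical stand-in for a dataset; the real entity lives in MongoDB.
  final case class Ds(name: String, properties: Map[String, String])

  // Counterpart of and(not(exists(p1)), not(exists(p2)), ...):
  // true only if the dataset lacks every listed property.
  def missingAll(ds: Ds, props: Seq[String]): Boolean =
    props.forall(p => !ds.properties.contains(p))

  // Counterpart of or(not(exists(p1)), not(exists(p2)), ...):
  // true if the dataset lacks at least one listed property.
  def missingAny(ds: Ds, props: Seq[String]): Boolean =
    props.exists(p => !ds.properties.contains(p))

  // Made-up sample data for illustration.
  val datasets: Seq[Ds] = Seq(
    Ds("a", Map("owner" -> "x", "tier" -> "gold")),
    Ds("b", Map("owner" -> "y")),
    Ds("c", Map.empty)
  )
}
```

With the sample data, asking for `owner` and `tier` selects only `c` under `missingAll`, but both `b` and `c` under `missingAny`. Neither result is the per-property count the statistics endpoint needs.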

Collaborator

The braces are redundant.
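
As a small illustration of the point about braces: in a Scala `s"..."` interpolator a plain identifier can follow `$` directly, and braces are only needed for expressions. The value of `missingProp` below is made up.

```scala
object Interpolation {
  val missingProp = "owner" // hypothetical property name

  // Both forms build the same field path.
  val withBraces: String = s"properties.${missingProp}"
  val withoutBraces: String = s"properties.$missingProp"
}
```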

props: Seq[PropertyDefinition] <- propertyDefService.getLatestVersions()
propertiesWithMissingCounts: Seq[Future[PropertyDefinitionDto]] =
props.map((propertyDef: PropertyDefinition) =>
datasetService.getLatestVersionsWithMissingProperty(Some(propertyDef.name))
Contributor

@dk1844 Aug 5, 2021

Re: "useful to be able to fetch this list with multiple missing properties (or all) at once?"

I ask because here the service (and thus the Mongo query, too) must be called as many times as there are properties. If the method that looks up missing properties in the latest version could take multiple property names, I think it would be more efficient.

Contributor Author

I agree that something on the DB side should be possible to implement; I need to think about it.
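
One possible shape of the idea, sketched in memory rather than as a Mongo query: fetch the latest versions once and derive the missing count of every property in a single traversal. The `Ds` type and `missingCounts` are hypothetical; a DB-side aggregation could play the same role, but nothing in this PR confirms that design.

```scala
object MissingCounts {
  // Hypothetical stand-in for the latest version of a dataset.
  final case class Ds(name: String, properties: Map[String, String])

  // Walks the datasets once and increments the counter of every
  // property that the current dataset does not define.
  def missingCounts(propertyNames: Seq[String], latest: Seq[Ds]): Map[String, Int] = {
    val names = propertyNames.distinct
    val zero: Map[String, Int] = names.map(n => n -> 0).toMap
    latest.foldLeft(zero) { (acc, ds) =>
      names.filterNot(p => ds.properties.contains(p)).foldLeft(acc) { (m, p) =>
        m.updated(p, m(p) + 1)
      }
    }
  }
}
```

Compared with one service call per property, the dataset list is read once no matter how many property definitions exist.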

Contributor

@Zejnilovic left a comment

Instead of adding to the review, I posted the previous comment. My bad.

@@ -44,6 +46,12 @@ class DatasetController @Autowired()(datasetService: DatasetService)

import scala.concurrent.ExecutionContext.Implicits.global

@GetMapping(Array("/latest"))
@ResponseStatus(HttpStatus.OK)
def getAll(@RequestParam(value = "missing_property", required = false) missingProperty: Optional[String]): CompletableFuture[Seq[Dataset]] = {
Contributor

This should not be a full Dataset. We do not have any pagination, and on prod environments the response will be too large. Could it be just the Dataset name + version? Nothing else really matters.

Contributor Author

I agree, but even with only the dataset name and version we will probably need pagination or something similar.
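
A rough sketch of both suggestions from this thread, under stated assumptions: `FullDataset`, `NameAndVersion` and the `offset`/`limit` parameters are hypothetical and only illustrate returning a slim projection in pages. They are not the types or the API this PR ended up with.

```scala
object DatasetPaging {
  // Hypothetical full entity and the slim reference returned to clients.
  final case class FullDataset(name: String, version: Int, description: String, schemaName: String)
  final case class NameAndVersion(name: String, version: Int)

  // Sorts by name for a stable order, cuts out one page,
  // and keeps only name and version of each dataset.
  def page(all: Seq[FullDataset], offset: Int, limit: Int): Seq[NameAndVersion] =
    all.sortBy(_.name)
      .slice(offset, offset + limit)
      .map(d => NameAndVersion(d.name, d.version))
}
```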

Contributor

@Zejnilovic left a comment

Ran the API and it seems to work. I checked the numbers as well, but please remember I passed my math course on the second try 👍

Not approving yet, as I would like to discuss the comments.

Zejnilovic previously approved these changes Aug 24, 2021

Contributor

@Zejnilovic left a comment

Generally, I am happy. I went over the APIs

  • dataset/latest
  • dataset/latest?missing_property=true
  • /statistics/properties/missing/

But I am just curious about MenasReference. Maybe the collection should be "Dataset"?

@@ -44,6 +46,13 @@ class DatasetController @Autowired()(datasetService: DatasetService)

import scala.concurrent.ExecutionContext.Implicits.global

@GetMapping(Array("/latest"))
@ResponseStatus(HttpStatus.OK)
def getAll(@RequestParam(value = "missing_property", required = false) missingProperty: Optional[String]): CompletableFuture[Seq[MenasReference]] = {
Contributor

This is the first time I have ever heard of MenasReference 👀
Do we know what that collection field is? It is always empty.

Contributor Author

@AdrianOlosutean Aug 24, 2021

I decided to reuse an existing class, so MenasReference made sense because I saw it used in UsedIn, although sometimes it's used for other purposes too. It would probably make sense to set collection to "Dataset".

Collaborator

Naming suggestions:

Suggested change
def getAll(@RequestParam(value = "missing_property", required = false) missingProperty: Optional[String]): CompletableFuture[Seq[MenasReference]] = {
def getLatestVersions(@RequestParam(value = "missing_property", required = false) propertyName: Optional[String]): CompletableFuture[Seq[MenasReference]] = {

Collaborator

Question: Is it required to use Java's Optional? Is it not possible to directly use Option[String]?

Collaborator

Line too long

Collaborator

I decided to reuse an existing class, so MenasReference made sense because I saw it used in UsedIn, although sometimes it's used for other purposes too. It would probably make sense to set collection to "Dataset".

Seems the class is used with Mongo predominantly. Its naming is not very enlightening though. I would not use it here, I don't see it used in any other controller. I would use VersionedSummary as it's used in VersionedModelController.

Contributor Author

@AdrianOlosutean Aug 26, 2021

Question: Is it required to use Java's Optional? Is it not possible to directly use Option[String]?

Yes, there are other examples in the code. I initially tried with Scala Option.

I don't have a clear opinion about VersionedSummary vs MenasReference. I just found MenasReference to be closer to other use cases, and it has the field name.
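
Since the controller signature takes a Java `Optional[String]` while the service method takes a Scala `Option[String]`, some conversion is needed at the boundary. A minimal sketch of an explicit conversion follows; the `toOption` helper is hypothetical and not necessarily what the project uses. On Scala 2.13 and later, `scala.jdk.OptionConverters` offers an equivalent.

```scala
import java.util.Optional

object OptionalInterop {
  // Explicit conversion that does not depend on the Scala version.
  def toOption[A](opt: Optional[A]): Option[A] =
    if (opt.isPresent) Some(opt.get) else None
}
```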

@AdrianOlosutean AdrianOlosutean force-pushed the feature/1868-missing-properties-missing-datasets-api branch from 3547f02 to a97d33a Compare August 25, 2021 14:48
@benedeki benedeki added the PR:reviewing Only for PR - PR is being reviewed by somebody; blocks merging label Aug 25, 2021
@@ -42,6 +41,10 @@ class PropertyDefinitionService @Autowired()(propertyDefMongoRepository: Propert
}
}

def getDistinctCount(): Future[Int] = {
Collaborator

Question: Why name it like a getter method? Wouldn't distinctCount be enough?

Contributor Author

In LandingPageController.landingPageInfo, the service methods have a get prefix, e.g. runsService.getCount() and runsService.getTodaysRunsStatistics().

Collaborator

Fair enough. Our naming here is a mess.

@@ -217,6 +215,15 @@ class DatasetService @Autowired()(datasetMongoRepository: DatasetMongoRepository
}
}

def getLatestVersionsWithMissingProperty(missingProperty: Option[String]): Future[Seq[Dataset]] = {
Collaborator

Naming suggestion:

Suggested change
def getLatestVersionsWithMissingProperty(missingProperty: Option[String]): Future[Seq[Dataset]] = {
def getLatestVersions(missingProperty: Option[String]): Future[Seq[Dataset]] = {

This name makes the function more general in its usage; eventually more filters can be added too.
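
A hedged in-memory sketch of what the more general name suggests: one `getLatestVersions` where the missing-property restriction is just an optional filter. The `Ds` type is hypothetical, and applying the filter after picking the latest version per name is an assumption about the intended semantics.

```scala
object LatestVersions {
  // Hypothetical stand-in for a versioned dataset.
  final case class Ds(name: String, version: Int, properties: Map[String, String])

  // Picks the highest version per name, then optionally keeps only
  // those datasets that do not define the given property.
  def getLatestVersions(all: Seq[Ds], missingProperty: Option[String]): Seq[Ds] = {
    val latest = all.groupBy(_.name).values.map(_.maxBy(_.version)).toSeq.sortBy(_.name)
    missingProperty match {
      case Some(p) => latest.filterNot(_.properties.contains(p))
      case None    => latest
    }
  }
}
```

Further optional filters could be added as extra parameters in the same way without renaming the method.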

@@ -75,6 +75,26 @@ abstract class VersionedMongoRepository[C <: VersionedModel](mongoDb: MongoDatab
collection.aggregate[VersionedSummary](pipeline).toFuture()
}

def getLatestVersionsWithMissingProp(missingProperty: Option[String]): Future[Seq[C]] = {
Collaborator

Why this function? First, it's the same as DatasetService.getLatestVersionsWithMissingProperty. Second, it's only used in a test.

Contributor Author

This is clearly a mistake, thanks for pointing it out.

@benedeki benedeki removed the PR:reviewing Only for PR - PR is being reviewed by somebody; blocks merging label Aug 26, 2021
@AdrianOlosutean AdrianOlosutean force-pushed the feature/1868-missing-properties-missing-datasets-api branch from 2ee2963 to 56ff5af Compare August 26, 2021 17:26
@AdrianOlosutean AdrianOlosutean force-pushed the feature/1868-missing-properties-missing-datasets-api branch from 56ff5af to 27bf2c3 Compare August 26, 2021 17:35
sonarcloud bot commented Aug 30, 2021

Kudos, SonarCloud Quality Gate passed!

  • 0 Bugs (rating A)
  • 0 Vulnerabilities (rating A)
  • 0 Security Hotspots (rating A)
  • 2 Code Smells (rating A)
  • No coverage information
  • 0.0% duplication

Collaborator

@benedeki left a comment

Technically with the new fields in landing-page_statistics_v1 collection this is not backward compatible. But it's self-correcting so IMHO good enough. (And I think originally it was done the same way)

import za.co.absa.enceladus.menas.models.LandingPageInformation
import za.co.absa.enceladus.menas.repositories.DatasetMongoRepository
import za.co.absa.enceladus.menas.repositories.LandingPageStatisticsMongoRepository
import za.co.absa.enceladus.menas.repositories.MappingTableMongoRepository
import za.co.absa.enceladus.menas.repositories.SchemaMongoRepository
import za.co.absa.enceladus.menas.services.RunService
import za.co.absa.enceladus.menas.services.{PropertyDefinitionService, RunService, StatisticsService}
Collaborator

Minor: PropertyDefinitionService is not used

@benedeki benedeki added the PR:no testing needed Only for PR - PR doesn't need to be tested by a tester (person) label Aug 31, 2021
@benedeki
Collaborator

API will be tested as part of #1880

@AdrianOlosutean AdrianOlosutean merged commit c96f779 into develop Aug 31, 2021
@AdrianOlosutean AdrianOlosutean deleted the feature/1868-missing-properties-missing-datasets-api branch August 31, 2021 09:52
dk1844 added a commit that referenced this pull request Nov 4, 2021
* Update for next development version 2.24.0-SNAPSHOT

* Suppress download noise in license check

* Suppress compiler warning of obsolete Java (#1892)

* 1868 statistics with missing counts and datasets missing properties (#1873)

* 1868 statistics with missing counts and datasets missing properties

* 1843 Summary page for properties (#1880)

* 1843 Home page with properties,  side panel with missing counts and summary page for properties with tab containing datasets missing that particular property

* Feature/1603 mapping table filtering general (#1879)

* #1603 serde tests for CR and MT DataFrameFilters
(mongo-bson-based serde tests for CR and MT DataFrameFilters, mongo-bson-based serde tests extended for CR with a blank mappingTableFilter)

* #1909 Increase the limit of columns shown in menas column selection

* 1903 Add validation for complex default values in mapping tables on import

* Project config and management updates (#1908)

Project config and management updates
* poc issue template
* CODEOWNERS update
* developers update
* Badges to README.md

* 1881 HyperConformance enceladus_info_version from payload  (#1896)

1881 HyperConformance enceladus_info_version from payload

* #1887 defaultTimestampTimeZone can be source type specific (#1899)

#1887 defaultTimestampTimeZone can be source type specific
* `DefaultsByFormat` extends the `Defaults` trait, being able to read defaults from configuration files
* `DefaultsByFormat` offers further granularity by first checking the format specific setting only then taking the global one
* Basic `GlobalDefaults` are not configuration dependent anymore
* Standardization now uses `DefaultsByFormat` for its defaults, where rawFormat is used for format parameter
* Switched to configuration path to be `enceladus.defaultTimestampTimeZone.default` and `enceladus.defaultTimestampTimeZone.[rawFormat]` respectively
* `defaultTimestampTimeZone` is still supported/read as an obsolete fallback
Co-authored-by: Daniel K <dk1844@gmail.com>

* #1887 defaultTimestampTimeZone can be source type specific (#1916)

#1887 defaultTimestampTimeZone can be source type specific
* rename of the configuration prefix from `enceladus.` to `standardization.`

* #172 Save original timezone information in metadata file (#1900)

* Upgrade of Atum to 3.6.0
* Writing the default time zones for timestamps and dates into _INFO file

* #1894 `HadoopFsPersistenceFactory` - adding Spline S3 write support (#1912)

* #1894 Spline S3 support via custom persistence factory `HadoopFsPersistenceFactory`.
Co-authored-by: David Benedeki <14905969+benedeki@users.noreply.github.com>

* Update versions for release v2.24.0

* Update for next development version 2.25.0-SNAPSHOT

* #1926 Add executor extra java opts to helper scripts

* #1931 Add switch for running kinit in helper scripts

* #1882 Update Cobrix dependency to v.2.3.0

* #1882 Remove explicit "collapse_root" since it is the default since Cobrix 2.3.0

* #1882 Update Cobrix to 2.4.1 and update Cobol test suite for ASCII files.

* #1882 Bump up Cobrix version to 2.4.2.

* #1927 Spline _LINEAGE and Atum _INFO files permission alignment (#1934)

* #1927 - testing setup: set both spline _LINEAGE and atum _INFO to hdfs file permissions 733 -> the result on EMR HDFS was 711 (due to 022 umask there) -> evidence of working

* #1927 - cleanup of test settings of 733 fs permissions

* #1927 Atum final version 3.7.0 used instead of the snapshot (same code)

* #1927 comment change

* #1927 - default 644 FS permissions for both _INFO and _LINEAGE files.

* 1937 limit output file size (#1941)

* 1937 limit output file size

* 1937 limit output file size

* 1937 renamings + constants

* 1937 more conditions

* 1937 rename params

* 1937 feedback + script params

* 1937 more feedback

* 1937 final feedback

* #1951: Windows Helper scripts - add missing features
* `ADDITIONAL_JVM_EXECUTOR_CONF`
* Kerberos configuration
* Trust store configuration
* kinit execution option
* `--min-processing-block-size` & `--max-processing-block-size`
* logo improvement

* * --min-processing-block-size -> --min-processing-partition-size
* --max-processing-block-size -> --max-processing-partition-size

* #1869: SparkJobs working with LoadBalanced Menas (#1935)

* `menas.rest.retryCount` - configuration, how many times an url should be retried if failing with retry-able error implemented
* `menas.rest.availability.setup` - configuration, how the url list should be handled
* _Standardization_, _Conformance_ and _HyperConformance_ changed to provide retry count and availability setup to Dao, read from configuration
* `ConfigReader` enhanced and unified to read configurations more easily and universally
* Mockito upgraded to 1.16.42

Co-authored-by: Daniel K <dk1844@gmail.com>

* Feature/1863 mapping table filtering (#1929)

* #1863 mapping cr & mt filter successfully reuses the same fragment (both using the same named model)
 - todo reuse validation, reuse manipulation methods

* #1863 FilterEdit.js allows reusing filterEdit TreeTable logic between mCR and MT editings

* #1863 mCT editing validation enabled (commons from FilterEdit.js)

* #1863 mCT datatype hinting enabled (commons from DataTypeUtils.js)

* #1863 mCR/MT edit dialog default width=950px, some cleanup
* #1863 bugfixes: directly creating MT with filter (fix on accepting the field), UI fix for MT filter model initialization

* #1863 npm audit fix

* #1863 bugfix: adding new mCR (when no edit MCR dialog has been opened yet) did not work - fixed

* #1863 selecting mapping column from MT schema works (for all schema levels) for edit. TODO = Schema type support

 #1863 mCR - schema-based columns suggested for filter, value types filled in silently during submit, too.

* #1863 bugfix: empty MT - schema may be empty

* #1863 bugfix: removing a filter left a null node - cleanup was needed (otherwise view would fail)
logging cleanup

* #1863 select list item now shows valueType as additionalText, cleanup

* #1863 nonEmptyAndNonNullFilled - map->filter bug fixed.

* #1863 typo for null filter

Co-authored-by: David Benedeki <14905969+benedeki@users.noreply.github.com>

* Update versions for release v2.25.0

* [merge] build fix

* [merge] npm audit fix

* [merge] npm audit fix

* [merge] buildfix (menas->rest_api packaging fix)

* [merge] review updates

Co-authored-by: David Benedeki <benedeki@volny.cz>
Co-authored-by: Saša Zejnilović <zejnils@gmail.com>
Co-authored-by: David Benedeki <14905969+benedeki@users.noreply.github.com>
Co-authored-by: Adrian Olosutean <adi.olosutean@gmail.com>
Co-authored-by: Ruslan Iushchenko <yruslan@gmail.com>
Labels
PR:no testing needed Only for PR - PR doesn't need to be tested by a tester (person)
Development

Successfully merging this pull request may close these issues.

Property Endpoint with Missing counts and Datasets by missing property API
4 participants