Added latest changes

canimus · Mar 10, 2024 · 1819b91
1 parent c473780

Showing 4 changed files with 66 additions and 31 deletions.
diff --git a/paper/paper.bib b/paper/paper.bib
@@ -57,6 +57,25 @@ @article{10.14778/3229863.3229867
 numpages = {14}
 }
 
+
+@inproceedings{10.1145/3529190.3529222,
+author = {Pleimling, Xavier and Shah, Vedant and Lourentzou, Ismini},
+title = {[Data] Quality Lies In The Eyes Of The Beholder},
+year = {2022},
+isbn = {9781450396318},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+url = {https://doi.org/10.1145/3529190.3529222},
+doi = {10.1145/3529190.3529222},
+abstract = {As large-scale machine learning models become more prevalent in assistive and pervasive technologies, the research community has started examining limitations and challenges that arise from training data, e.g., fairness, bias, and interpretability issues. To this end, data-centric approaches are increasingly prevailing over time, showing that high-quality data is a critical component in many applications. Several studies explore methods to define and improve data quality, however, no uniform definition exists. In this work, we present an empirical analysis of the multifaceted problem of evaluating data quality. Our work aims at identifying data quality challenges that are most commonly observed by data users and practitioners. Inspired by the need for generally applicable methods, we select a representative set of quality indicators, that covers a broad spectrum of issues, and investigate the utility of these indicators on a broad range of datasets through inter-annotator agreement analysis. Our work provides insights and presents open challenges in designing improved data life cycles.},
+booktitle = {Proceedings of the 15th International Conference on PErvasive Technologies Related to Assistive Environments},
+pages = {118–124},
+numpages = {7},
+keywords = {data annotation, data quality, data quality metrics, data utility, datasets, duplicate data, incomplete data, inconsistent data, incorrect data, user survey},
+location = {Corfu, Greece},
+series = {PETRA '22}
+}
+
 @article{Pearson:2017,
   	url = {http://adsabs.harvard.edu/abs/2017arXiv170304627P},
   	Archiveprefix = {arXiv},

diff --git a/paper/paper.jats b/paper/paper.jats
@@ -158,6 +158,15 @@ a Creative Commons Attribution 4.0 International License (CC BY
 </sec>
 <sec id="checks">
   <title>Checks</title>
+  <p>In <monospace>cuallee</monospace>, checks serve as the fundamental
+  concept. These checks are implemented by <bold>rules</bold>, which
+  specify <italic>quality predicates</italic>. These predicates, when
+  aggregated, form the criteria used to evaluate the quality of a
+  dataset. Efforts to establish a universal quality metric
+  (<xref alt="Pleimling et al., 2022" rid="ref-10.1145U002F3529190.3529222" ref-type="bibr">Pleimling
+  et al., 2022</xref>) typically involve using statistics and combining
+  dimensions to derive a single reference value that encapsulates
+  overall quality attributes.</p>
   <table-wrap>
     <table>
       <colgroup>
@@ -477,35 +486,28 @@ a Creative Commons Attribution 4.0 International License (CC BY
           <monospace>dataframe</monospace> input for validation</td>
           <td><italic>agnostic</italic></td>
         </tr>
-      </tbody>
-    </table>
-  </table-wrap>
-</sec>
-<sec id="controls">
-  <title>Controls</title>
-  <p>This are the controls</p>
-  <table-wrap>
-    <table>
-      <thead>
         <tr>
-          <th>Check</th>
-          <th>Description</th>
-          <th>DataType</th>
+          <td><monospace>iso.iso_4217</monospace></td>
+          <td>currency compliant <monospace>ccy</monospace></td>
+          <td><italic>string</italic></td>
         </tr>
-      </thead>
-      <tbody>
         <tr>
-          <td><monospace>completeness</monospace></td>
-          <td>Zero <monospace>nulls</monospace></td>
+          <td><monospace>iso.iso_3166</monospace></td>
+          <td>country compliant <monospace>country</monospace></td>
+          <td><italic>string</italic></td>
+        </tr>
+        <tr>
+          <td><monospace>Control.completeness</monospace></td>
+          <td>Zero <monospace>nulls</monospace> all columns</td>
           <td><italic>agnostic</italic></td>
         </tr>
         <tr>
-          <td><monospace>percentage_fill</monospace></td>
+          <td><monospace>Control.percentage_fill</monospace></td>
           <td><monospace>% rows</monospace> not empty</td>
           <td><italic>agnostic</italic></td>
         </tr>
         <tr>
-          <td><monospace>percentage_empty</monospace></td>
+          <td><monospace>Control.percentage_empty</monospace></td>
           <td><monospace>% rows</monospace> empty</td>
           <td><italic>agnostic</italic></td>
         </tr>
@@ -603,6 +605,25 @@ a Creative Commons Attribution 4.0 International License (CC BY
       <lpage>1794</lpage>
     </element-citation>
   </ref>
+  <ref id="ref-10.1145U002F3529190.3529222">
+    <element-citation publication-type="paper-conference">
+      <person-group person-group-type="author">
+        <name><surname>Pleimling</surname><given-names>Xavier</given-names></name>
+        <name><surname>Shah</surname><given-names>Vedant</given-names></name>
+        <name><surname>Lourentzou</surname><given-names>Ismini</given-names></name>
+      </person-group>
+      <article-title>[Data] quality lies in the eyes of the beholder</article-title>
+      <source>Proceedings of the 15th international conference on PErvasive technologies related to assistive environments</source>
+      <publisher-name>Association for Computing Machinery</publisher-name>
+      <publisher-loc>New York, NY, USA</publisher-loc>
+      <year iso-8601-date="2022">2022</year>
+      <isbn>9781450396318</isbn>
+      <uri>https://doi.org/10.1145/3529190.3529222</uri>
+      <pub-id pub-id-type="doi">10.1145/3529190.3529222</pub-id>
+      <fpage>118</fpage>
+      <lpage>124</lpage>
+    </element-citation>
+  </ref>
 </ref-list>
 </back>
 </article>
diff --git a/paper/paper.md b/paper/paper.md
@@ -55,8 +55,9 @@ One last argument in favor of using a quality tool is the need to integrate qual
 `cuallee`  employs a heuristic-based approach to define quality rules for each dataset. This prevents the inadvertent duplication of quality predicates, thus reducing the likelihood of human error in defining rules with identical predicates. Several studies have been conducted on the efficiency of these rules, including auto-validation [@10.1145/3580305.3599776] and auto-definition using profilers.
 
 
-# Checks
 
+# Checks
+In `cuallee`, checks serve as the fundamental concept. These checks are implemented by __rules__, which specify _quality predicates_. These predicates, when aggregated, form the criteria used to evaluate the quality of a dataset. Efforts to establish a universal quality metric [@10.1145/3529190.3529222] typically involve using statistics and combining dimensions to derive a single reference value that encapsulates overall quality attributes. 
 
 Check | Description | DataType
  ------- | ----------- | ----
@@ -114,16 +115,10 @@ Check | Description | DataType
 `has_workflow` | Adjacency matrix validation on `3-column` graph, based on `group`, `event`, `order` columns.  | _agnostic_
 `satisfies` | An open `SQL expression` builder to construct custom checks | _agnostic_
 `validate` | The ultimate transformation of a check with a `dataframe` input for validation | _agnostic_
-
-
-# Controls
-This are the controls 
-
-
-Check | Description | DataType
- ------- | ----------- | ----
-`completeness` | Zero `nulls` | _agnostic_
-`percentage_fill` | `% rows` not empty | _agnostic_
-`percentage_empty` | `% rows` empty | _agnostic_
+`iso.iso_4217` | currency compliant `ccy` | _string_
+`iso.iso_3166` | country compliant `country` | _string_
+`Control.completeness` | Zero `nulls` all columns| _agnostic_
+`Control.percentage_fill` | `% rows` not empty | _agnostic_
+`Control.percentage_empty` | `% rows` empty | _agnostic_
 
 # References
diff --git a/paper/paper.pdf b/paper/paper.pdf
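
Note on the Checks paragraph added to `paper/paper.md` in this commit: it describes checks as implemented by rules, each specifying a quality predicate, with the aggregated predicates forming the criteria for a dataset. The sketch below illustrates that relationship in plain Python. It is not the `cuallee` API: the `Rule` and `Check` classes, the rule names, and the sample rows are all hypothetical, and taking the minimum pass rate as a single reference value is only one assumed way to combine results.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass
class Rule:
    """Hypothetical rule: a named quality predicate applied to one column."""
    name: str
    column: str
    predicate: Callable[[Any], bool]


@dataclass
class Check:
    """Hypothetical check: a collection of rules evaluated together."""
    rules: List[Rule] = field(default_factory=list)

    def add(self, rule: Rule) -> "Check":
        self.rules.append(rule)
        return self

    def validate(self, rows: List[Row]) -> Dict[str, float]:
        """Return, per rule, the fraction of rows satisfying its predicate."""
        total = len(rows)
        results: Dict[str, float] = {}
        for rule in self.rules:
            passed = sum(1 for row in rows if rule.predicate(row.get(rule.column)))
            results[rule.name] = passed / total if total else 1.0
        return results


# Illustrative data only; not taken from the paper.
rows = [
    {"id": 1, "ccy": "EUR"},
    {"id": 2, "ccy": None},
    {"id": 3, "ccy": "USD"},
    {"id": 4, "ccy": "XXY"},
]

check = (
    Check()
    .add(Rule("id_is_complete", "id", lambda v: v is not None))
    .add(Rule("ccy_is_known", "ccy", lambda v: v in {"EUR", "USD"}))
)

pass_rates = check.validate(rows)

# Assumed aggregation: the weakest rule determines the overall value.
overall = min(pass_rates.values())
print(pass_rates, overall)
```

Each rule here reports a pass rate rather than a bare pass/fail, so the per-rule results can be combined afterwards in whatever way a given dataset calls for.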