Added latest changes
canimus committed Mar 10, 2024
1 parent c473780 commit 1819b91
Showing 4 changed files with 66 additions and 31 deletions.
19 changes: 19 additions & 0 deletions paper/paper.bib
@@ -57,6 +57,25 @@ @article{10.14778/3229863.3229867
numpages = {14}
}


@inproceedings{10.1145/3529190.3529222,
author = {Pleimling, Xavier and Shah, Vedant and Lourentzou, Ismini},
title = {[Data] Quality Lies In The Eyes Of The Beholder},
year = {2022},
isbn = {9781450396318},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3529190.3529222},
doi = {10.1145/3529190.3529222},
abstract = {As large-scale machine learning models become more prevalent in assistive and pervasive technologies, the research community has started examining limitations and challenges that arise from training data, e.g., fairness, bias, and interpretability issues. To this end, data-centric approaches are increasingly prevailing over time, showing that high-quality data is a critical component in many applications. Several studies explore methods to define and improve data quality, however, no uniform definition exists. In this work, we present an empirical analysis of the multifaceted problem of evaluating data quality. Our work aims at identifying data quality challenges that are most commonly observed by data users and practitioners. Inspired by the need for generally applicable methods, we select a representative set of quality indicators, that covers a broad spectrum of issues, and investigate the utility of these indicators on a broad range of datasets through inter-annotator agreement analysis. Our work provides insights and presents open challenges in designing improved data life cycles.},
booktitle = {Proceedings of the 15th International Conference on PErvasive Technologies Related to Assistive Environments},
pages = {118–124},
numpages = {7},
keywords = {data annotation, data quality, data quality metrics, data utility, datasets, duplicate data, incomplete data, inconsistent data, incorrect data, user survey},
location = {Corfu, Greece},
series = {PETRA '22}
}

@article{Pearson:2017,
url = {http://adsabs.harvard.edu/abs/2017arXiv170304627P},
Archiveprefix = {arXiv},
59 changes: 40 additions & 19 deletions paper/paper.jats
@@ -158,6 +158,15 @@ a Creative Commons Attribution 4.0 International License (CC BY
</sec>
<sec id="checks">
<title>Checks</title>
<p>In <monospace>cuallee</monospace>, checks serve as the fundamental
concept. These checks are implemented by <bold>rules</bold>, which
specify <italic>quality predicates</italic>. These predicates, when
aggregated, form the criteria used to evaluate the quality of a
dataset. Efforts to establish a universal quality metric
(<xref alt="Pleimling et al., 2022" rid="ref-10.1145U002F3529190.3529222" ref-type="bibr">Pleimling
et al., 2022</xref>) typically involve using statistics and combining
dimensions to derive a single reference value that encapsulates
overall quality attributes.</p>
<table-wrap>
<table>
<colgroup>
@@ -477,35 +486,28 @@ a Creative Commons Attribution 4.0 International License (CC BY
<monospace>dataframe</monospace> input for validation</td>
<td><italic>agnostic</italic></td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="controls">
<title>Controls</title>
<p>This are the controls</p>
<table-wrap>
<table>
<thead>
<tr>
<th>Check</th>
<th>Description</th>
<th>DataType</th>
<td><monospace>iso.iso_4217</monospace></td>
<td>currency compliant <monospace>ccy</monospace></td>
<td><italic>string</italic></td>
</tr>
</thead>
<tbody>
<tr>
<td><monospace>completeness</monospace></td>
<td>Zero <monospace>nulls</monospace></td>
<td><monospace>iso.iso_3166</monospace></td>
<td>country compliant <monospace>country</monospace></td>
<td><italic>string</italic></td>
</tr>
<tr>
<td><monospace>Control.completeness</monospace></td>
<td>Zero <monospace>nulls</monospace> all columns</td>
<td><italic>agnostic</italic></td>
</tr>
<tr>
<td><monospace>percentage_fill</monospace></td>
<td><monospace>Control.percentage_fill</monospace></td>
<td><monospace>% rows</monospace> not empty</td>
<td><italic>agnostic</italic></td>
</tr>
<tr>
<td><monospace>percentage_empty</monospace></td>
<td><monospace>Control.percentage_empty</monospace></td>
<td><monospace>% rows</monospace> empty</td>
<td><italic>agnostic</italic></td>
</tr>
@@ -603,6 +605,25 @@ a Creative Commons Attribution 4.0 International License (CC BY
<lpage>1794</lpage>
</element-citation>
</ref>
<ref id="ref-10.1145U002F3529190.3529222">
<element-citation publication-type="paper-conference">
<person-group person-group-type="author">
<name><surname>Pleimling</surname><given-names>Xavier</given-names></name>
<name><surname>Shah</surname><given-names>Vedant</given-names></name>
<name><surname>Lourentzou</surname><given-names>Ismini</given-names></name>
</person-group>
<article-title>[Data] quality lies in the eyes of the beholder</article-title>
<source>Proceedings of the 15th international conference on PErvasive technologies related to assistive environments</source>
<publisher-name>Association for Computing Machinery</publisher-name>
<publisher-loc>New York, NY, USA</publisher-loc>
<year iso-8601-date="2022">2022</year>
<isbn>9781450396318</isbn>
<uri>https://doi.org/10.1145/3529190.3529222</uri>
<pub-id pub-id-type="doi">10.1145/3529190.3529222</pub-id>
<fpage>118</fpage>
<lpage>124</lpage>
</element-citation>
</ref>
</ref-list>
</back>
</article>
19 changes: 7 additions & 12 deletions paper/paper.md
@@ -55,8 +55,9 @@ One last argument in favor of using a quality tool is the need to integrate qual
`cuallee` employs a heuristic-based approach to define quality rules for each dataset. This prevents the inadvertent duplication of quality predicates, thus reducing the likelihood of human error in defining rules with identical predicates. Several studies have been conducted on the efficiency of these rules, including auto-validation [@10.1145/3580305.3599776] and auto-definition using profilers.


# Checks

# Checks
In `cuallee`, checks serve as the fundamental concept. These checks are implemented by __rules__, which specify _quality predicates_. These predicates, when aggregated, form the criteria used to evaluate the quality of a dataset. Efforts to establish a universal quality metric [@10.1145/3529190.3529222] typically involve using statistics and combining dimensions to derive a single reference value that encapsulates overall quality attributes.

Check | Description | DataType
------- | ----------- | ----
@@ -114,16 +115,10 @@ Check | Description | DataType
`has_workflow` | Adjacency matrix validation on `3-column` graph, based on `group`, `event`, `order` columns. | _agnostic_
`satisfies` | An open `SQL expression` builder to construct custom checks | _agnostic_
`validate` | The ultimate transformation of a check with a `dataframe` input for validation | _agnostic_


# Controls
This are the controls


Check | Description | DataType
------- | ----------- | ----
`completeness` | Zero `nulls` | _agnostic_
`percentage_fill` | `% rows` not empty | _agnostic_
`percentage_empty` | `% rows` empty | _agnostic_
`iso.iso_4217` | currency compliant `ccy` | _string_
`iso.iso_3166` | country compliant `country` | _string_
`Control.completeness` | Zero `nulls` all columns| _agnostic_
`Control.percentage_fill` | `% rows` not empty | _agnostic_
`Control.percentage_empty` | `% rows` empty | _agnostic_

# References
Binary file modified paper/paper.pdf
Binary file not shown.
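
The new "Checks" paragraph in `paper/paper.md` describes checks as aggregates of rules, each rule specifying a quality predicate. A minimal sketch of that idea follows, in plain Python with no dependency on `cuallee`. The names `Rule`, `SketchCheck`, `add` and the `is_complete` helper are hypothetical and chosen for illustration only; they are not `cuallee`'s actual classes or signatures. The sketch also skips rules with an identical predicate, mirroring the duplication concern mentioned in the paper text.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass(frozen=True)
class Rule:
    """One quality predicate bound to a column (hypothetical type)."""
    name: str
    column: str
    predicate: Callable[[Any], bool]


@dataclass
class SketchCheck:
    """Aggregates rules; an illustrative stand-in, not cuallee's Check."""
    rules: List[Rule] = field(default_factory=list)

    def add(self, rule: Rule) -> "SketchCheck":
        # Ignore a rule whose (name, column) pair is already registered.
        already = any(
            r.name == rule.name and r.column == rule.column for r in self.rules
        )
        if not already:
            self.rules.append(rule)
        return self

    def validate(self, rows: List[Row]) -> Dict[str, float]:
        # Pass rate per rule: share of rows whose value satisfies the predicate.
        total = len(rows)
        report = {}
        for r in self.rules:
            passed = sum(1 for row in rows if r.predicate(row.get(r.column)))
            report[f"{r.name}({r.column})"] = passed / total if total else 1.0
        return report


def is_complete(column: str) -> Rule:
    # Assumption for this sketch: a value counts as present when it is not None.
    return Rule("is_complete", column, lambda value: value is not None)


rows = [
    {"id": 1, "ccy": "EUR"},
    {"id": 2, "ccy": None},
    {"id": None, "ccy": "USD"},
    {"id": 4, "ccy": "USD"},
]

check = (
    SketchCheck()
    .add(is_complete("id"))
    .add(is_complete("ccy"))
    .add(is_complete("id"))  # duplicate predicate, skipped
)
result = check.validate(rows)
print(result)
```

Keeping each rule as a small named predicate is what makes the aggregate report readable: every entry maps back to one rule on one column, rather than to a single blended score.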
