Issue 120 compatibility with rdt issue72 #121

Merged 47 commits on Nov 6, 2019
1ffdfd5
Removed old PositiveNumberTransformes usage
JDTheRipperPC Oct 4, 2019
36d109b
change rdt hypertransformer usage
JDTheRipperPC Oct 4, 2019
8293bc9
Adapt to the new RDT API
csala Oct 7, 2019
6ad8e1d
Add example notebooks
csala Oct 7, 2019
35535d1
WIP
JDTheRipperPC Oct 7, 2019
c9555f0
Rename DataNavigator to Metadata and isolate metadata inside it
csala Oct 8, 2019
756ce7a
Reduce data loading and remove tables from modeler
csala Oct 8, 2019
d64e1b8
new tests for test_sdv and changes on test_modeler
JDTheRipperPC Oct 11, 2019
b768490
added more tests
JDTheRipperPC Oct 14, 2019
e166c4e
added unittests and fixed linting
JDTheRipperPC Oct 16, 2019
d134271
added save and load SDV intance tests
JDTheRipperPC Oct 17, 2019
802512e
fix docstring and tests
JDTheRipperPC Oct 17, 2019
0037b5d
added metadata test
JDTheRipperPC Oct 17, 2019
7e4151a
added save and load tests on test_modeler
JDTheRipperPC Oct 17, 2019
e7f219a
WIP, added tests, one test is marked as skip
JDTheRipperPC Oct 17, 2019
f7df984
Fix tests.
JDTheRipperPC Oct 21, 2019
8148718
fix test_sampler
JDTheRipperPC Oct 21, 2019
27ef996
Remove distribution args and fix primary key generation
csala Oct 22, 2019
de965a3
Add airbnb example
csala Oct 22, 2019
96493e2
Added mountain and olympic game datasets
JDTheRipperPC Oct 25, 2019
cca88e0
CPA usage refactoring. Skip id columns on model
csala Oct 25, 2019
3c8f706
Remove data from tests folder
csala Oct 25, 2019
91276bf
Reorganize examples
csala Oct 25, 2019
f9f7c71
Remove unused argument
csala Oct 25, 2019
030aefe
Add Quickstart
csala Oct 28, 2019
3e94b96
fixed sdv and metadata test failure, metadata tests wip
JDTheRipperPC Oct 28, 2019
5639a66
Merge branch 'issue_120_compatibility_with_rdt_issue72_fixes' of http…
JDTheRipperPC Oct 28, 2019
014cb96
fixed modeler tests
JDTheRipperPC Oct 29, 2019
75961b0
fixed sampler tests
JDTheRipperPC Oct 29, 2019
3aa3bc1
fix linting and py35 tests
JDTheRipperPC Oct 29, 2019
a8e1bd3
fix isort
JDTheRipperPC Oct 29, 2019
446b4bb
added docstrings
JDTheRipperPC Oct 29, 2019
a15ac9f
docstrings
JDTheRipperPC Oct 30, 2019
a89153c
updated README.md
JDTheRipperPC Oct 30, 2019
107ec44
added tests and sdv demo
JDTheRipperPC Oct 30, 2019
79ccd26
Merge branch 'issue_120_compatibility_with_rdt_issue72_fixes' into is…
JDTheRipperPC Oct 30, 2019
7d75faa
fit tests py35
JDTheRipperPC Oct 30, 2019
d99ebe1
Update docstrings.
pvk-developer Oct 31, 2019
af49713
Isort.
pvk-developer Oct 31, 2019
13cc674
Fix docstrings, tests and examples. Minor code improvements
csala Nov 4, 2019
6864e46
Fix py35 test
csala Nov 4, 2019
d6aba39
Update readme.
pvk-developer Nov 4, 2019
c552367
Update README.md
pvk-developer Nov 4, 2019
f001a34
Improve README
csala Nov 6, 2019
110d9e7
Fix json formatting
csala Nov 6, 2019
75ba5a7
Add clarifications about the data format
csala Nov 6, 2019
d1376b4
Cleanup examples
csala Nov 6, 2019
510 changes: 162 additions & 348 deletions README.md

Large diffs are not rendered by default.

393 changes: 393 additions & 0 deletions examples/1. Quickstart - Single Table - In Memory.ipynb
@@ -0,0 +1,393 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"data = pd.DataFrame({\n",
" 'index': [1, 2, 3, 4, 5, 6, 7, 8],\n",
" 'integer': [1, None, 1, 2, 1, 2, 3, 2],\n",
" 'float': [0.1, None, 0.1, 0.2, 0.1, 0.2, 0.3, 0.1],\n",
" 'categorical': ['a', 'b', 'a', 'b', 'a', None, 'c', None],\n",
" 'bool': [False, True, False, True, False, False, False, None],\n",
" 'nullable': [1, None, 3, None, 5, None, 7, None],\n",
" 'datetime': [\n",
" '2010-01-01', '2010-02-01', '2010-01-01', '2010-02-01',\n",
" '2010-01-01', '2010-02-01', '2010-03-01', None\n",
" ]\n",
"})\n",
"data['datetime'] = pd.to_datetime(data['datetime'])\n",
"\n",
"tables = {\n",
" 'data': data\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>index</th>\n",
" <th>integer</th>\n",
" <th>float</th>\n",
" <th>categorical</th>\n",
" <th>bool</th>\n",
" <th>nullable</th>\n",
" <th>datetime</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>1.0</td>\n",
" <td>0.1</td>\n",
" <td>a</td>\n",
" <td>False</td>\n",
" <td>1.0</td>\n",
" <td>2010-01-01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>b</td>\n",
" <td>True</td>\n",
" <td>NaN</td>\n",
" <td>2010-02-01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1.0</td>\n",
" <td>0.1</td>\n",
" <td>a</td>\n",
" <td>False</td>\n",
" <td>3.0</td>\n",
" <td>2010-01-01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>2.0</td>\n",
" <td>0.2</td>\n",
" <td>b</td>\n",
" <td>True</td>\n",
" <td>NaN</td>\n",
" <td>2010-02-01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>1.0</td>\n",
" <td>0.1</td>\n",
" <td>a</td>\n",
" <td>False</td>\n",
" <td>5.0</td>\n",
" <td>2010-01-01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>2.0</td>\n",
" <td>0.2</td>\n",
" <td>None</td>\n",
" <td>False</td>\n",
" <td>NaN</td>\n",
" <td>2010-02-01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>3.0</td>\n",
" <td>0.3</td>\n",
" <td>c</td>\n",
" <td>False</td>\n",
" <td>7.0</td>\n",
" <td>2010-03-01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>8</td>\n",
" <td>2.0</td>\n",
" <td>0.1</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>NaN</td>\n",
" <td>NaT</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" index integer float categorical bool nullable datetime\n",
"0 1 1.0 0.1 a False 1.0 2010-01-01\n",
"1 2 NaN NaN b True NaN 2010-02-01\n",
"2 3 1.0 0.1 a False 3.0 2010-01-01\n",
"3 4 2.0 0.2 b True NaN 2010-02-01\n",
"4 5 1.0 0.1 a False 5.0 2010-01-01\n",
"5 6 2.0 0.2 None False NaN 2010-02-01\n",
"6 7 3.0 0.3 c False 7.0 2010-03-01\n",
"7 8 2.0 0.1 None None NaN NaT"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"metadata = {\n",
" \"tables\": [\n",
" {\n",
" \"fields\": [\n",
" {\n",
" \"name\": \"index\",\n",
" \"type\": \"id\"\n",
" },\n",
" {\n",
" \"name\": \"integer\",\n",
" \"type\": \"numerical\",\n",
" \"subtype\": \"integer\",\n",
" },\n",
" {\n",
" \"name\": \"float\",\n",
" \"type\": \"numerical\",\n",
" \"subtype\": \"float\",\n",
" },\n",
" {\n",
" \"name\": \"categorical\",\n",
" \"type\": \"categorical\",\n",
" \"pii\": False,\n",
" \"pii_category\": \"email\"\n",
" },\n",
" {\n",
" \"name\": \"bool\",\n",
" \"type\": \"boolean\",\n",
" },\n",
" {\n",
" \"name\": \"nullable\",\n",
" \"type\": \"numerical\",\n",
" \"subtype\": \"float\",\n",
" },\n",
" {\n",
" \"name\": \"datetime\",\n",
" \"type\": \"datetime\",\n",
" \"format\": \"%Y-%m-%d\"\n",
" },\n",
" ],\n",
" \"name\": \"data\",\n",
" \"primary_key\": \"index\"\n",
" }\n",
" ]\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2019-11-03 16:07:31,541 - INFO - modeler - Modeling data\n",
"2019-11-03 16:07:31,542 - INFO - metadata - Loading transformer NumericalTransformer for field integer\n",
"2019-11-03 16:07:31,543 - INFO - metadata - Loading transformer NumericalTransformer for field float\n",
"2019-11-03 16:07:31,543 - INFO - metadata - Loading transformer CategoricalTransformer for field categorical\n",
"2019-11-03 16:07:31,543 - INFO - metadata - Loading transformer BooleanTransformer for field bool\n",
"2019-11-03 16:07:31,544 - INFO - metadata - Loading transformer NumericalTransformer for field nullable\n",
"2019-11-03 16:07:31,544 - INFO - metadata - Loading transformer DatetimeTransformer for field datetime\n",
"2019-11-03 16:07:31,594 - INFO - modeler - Modeling Complete\n"
]
}
],
"source": [
"from sdv import SDV\n",
"\n",
"sdv = SDV()\n",
"sdv.fit(metadata, tables=tables)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>index</th>\n",
" <th>integer</th>\n",
" <th>float</th>\n",
" <th>categorical</th>\n",
" <th>bool</th>\n",
" <th>nullable</th>\n",
" <th>datetime</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0.155202</td>\n",
" <td>a</td>\n",
" <td>False</td>\n",
" <td>5.632725</td>\n",
" <td>2010-01-14 15:20:28.968422912</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>0.148088</td>\n",
" <td>b</td>\n",
" <td>True</td>\n",
" <td>4.338519</td>\n",
" <td>2010-01-23 02:27:17.721717760</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>0.201357</td>\n",
" <td>a</td>\n",
" <td>True</td>\n",
" <td>3.055583</td>\n",
" <td>2010-01-27 13:49:01.067935232</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>0.192696</td>\n",
" <td>b</td>\n",
" <td>False</td>\n",
" <td>3.399388</td>\n",
" <td>2010-01-26 18:17:43.376063232</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>0.106991</td>\n",
" <td>a</td>\n",
" <td>False</td>\n",
" <td>3.495486</td>\n",
" <td>2010-01-09 14:46:37.969550592</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" index integer float categorical bool nullable \\\n",
"0 0 1 0.155202 a False 5.632725 \n",
"1 1 2 0.148088 b True 4.338519 \n",
"2 2 2 0.201357 a True 3.055583 \n",
"3 3 2 0.192696 b False 3.399388 \n",
"4 4 1 0.106991 a False 3.495486 \n",
"\n",
" datetime \n",
"0 2010-01-14 15:20:28.968422912 \n",
"1 2010-01-23 02:27:17.721717760 \n",
"2 2010-01-27 13:49:01.067935232 \n",
"3 2010-01-26 18:17:43.376063232 \n",
"4 2010-01-09 14:46:37.969550592 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"samples = sdv.sample_all()\n",
"samples['data']"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
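
The metadata dict in the notebook above has to line up with the DataFrame columns by hand: each field name must match a column, and `primary_key` must name one of the fields. A minimal sketch of a sanity check for that layout follows. `check_table_metadata` and `KNOWN_TYPES` are hypothetical names that are not part of SDV, and the set of types is only what appears in this notebook, not an authoritative list of what SDV accepts.

```python
# Hypothetical helper, not part of SDV: checks one entry of metadata["tables"]
# against a list of column names.

# Assumption: only the field types that appear in the notebook above.
KNOWN_TYPES = {"id", "numerical", "categorical", "boolean", "datetime"}


def check_table_metadata(table_meta, columns):
    """Return a list of human-readable problems found in one table entry."""
    problems = []
    field_names = [field["name"] for field in table_meta["fields"]]

    # Every field should declare a type we recognise.
    for field in table_meta["fields"]:
        if field["type"] not in KNOWN_TYPES:
            problems.append(
                "unknown type %r for field %r" % (field["type"], field["name"])
            )

    # Field names and data columns should match exactly.
    for name in sorted(set(columns) - set(field_names)):
        problems.append("column %r has no metadata field" % name)
    for name in sorted(set(field_names) - set(columns)):
        problems.append("field %r has no data column" % name)

    # The primary key, when given, must be one of the declared fields.
    primary_key = table_meta.get("primary_key")
    if primary_key is not None and primary_key not in field_names:
        problems.append("primary key %r is not a declared field" % primary_key)

    return problems


# A trimmed copy of the table entry from the notebook.
table_meta = {
    "name": "data",
    "primary_key": "index",
    "fields": [
        {"name": "index", "type": "id"},
        {"name": "integer", "type": "numerical", "subtype": "integer"},
        {"name": "datetime", "type": "datetime", "format": "%Y-%m-%d"},
    ],
}
columns = ["index", "integer", "datetime"]

print(check_table_metadata(table_meta, columns))
```

With a pandas DataFrame, `list(data.columns)` would supply the `columns` argument. Running a check like this before `sdv.fit` is only a convenience for catching mismatched names early.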