Initial commit, version 0.1

kiwicom · Jul 29, 2019 · 6409395 · 6409395
commit 6409395
Showing 14 changed files with 1,201 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,203 @@
+#### joe made this: http://goel.io/joe
+
+#####=== Linux ===#####
+*~
+
+# KDE directory preferences
+.directory
+
+# Linux trash folder which might appear on any partition or disk
+.Trash-*
+
+#####=== OSX ===#####
+.DS_Store
+.AppleDouble
+.LSOverride
+
+# Icon must end with two \r
+Icon
+
+
+# Thumbnails
+._*
+
+# Files that might appear in the root of a volume
+.DocumentRevisions-V100
+.fseventsd
+.Spotlight-V100
+.TemporaryItems
+.Trashes
+.VolumeIcon.icns
+
+# Directories potentially created on remote AFP share
+.AppleDB
+.AppleDesktop
+Network Trash Folder
+Temporary Items
+.apdisk
+
+#####=== Vim ===#####
+[._]*.s[a-w][a-z]
+[._]s[a-w][a-z]
+*.un~
+Session.vim
+.netrwhist
+*~
+
+#####=== Python ===#####
+
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+env/
+venv/
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.coverage
+.cache
+nosetests.xml
+coverage.xml
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+#####=== JetBrains ===#####
+# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm
+
+*.iml
+
+## Directory-based project format:
+.idea/
+# if you remove the above rule, at least ignore the following:
+
+# User-specific stuff:
+# .idea/workspace.xml
+# .idea/tasks.xml
+# .idea/dictionaries
+
+# Sensitive or high-churn files:
+# .idea/dataSources.ids
+# .idea/dataSources.xml
+# .idea/sqlDataSources.xml
+# .idea/dynamic.xml
+# .idea/uiDesigner.xml
+
+# Gradle:
+# .idea/gradle.xml
+# .idea/libraries
+
+# Mongo Explorer plugin:
+# .idea/mongoSettings.xml
+
+## File-based project format:
+*.ipr
+*.iws
+
+## Plugin-specific files:
+
+# IntelliJ
+out/
+
+# mpeltonen/sbt-idea plugin
+.idea_modules/
+
+# JIRA plugin
+atlassian-ide-plugin.xml
+
+# Crashlytics plugin (for Android Studio and IntelliJ)
+com_crashlytics_export_strings.xml
+crashlytics.properties
+crashlytics-build.properties
+
+#####=== Vim ===#####
+[._]*.s[a-w][a-z]
+[._]s[a-w][a-z]
+*.un~
+Session.vim
+.netrwhist
+*~
+
+#####=== VirtualEnv ===#####
+# Virtualenv
+# http://iamzed.com/2009/05/07/a-primer-on-virtualenv/
+.Python
+[Bb]in
+[Ii]nclude
+[Ll]ib
+[Ss]cripts
+pyvenv.cfg
+pip-selfcheck.json
+
+#####=== OSX ===#####
+.DS_Store
+.AppleDouble
+.LSOverride
+
+# Icon must end with two \r
+Icon
+
+
+# Thumbnails
+._*
+
+# Files that might appear on external disk
+.Spotlight-V100
+.Trashes
+
+####=== Airflow ===####
+logs
+.env
+.env2
+.env-*
+airflow-webserver.pid
+airflow.db
+
+####=== coala ===####
+.orig
+
+####=== pytest ===####
+.pytest_cache
+
+####=== Docker ===####
+docker-compose.override.yml
diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
@@ -0,0 +1,19 @@
+stages:
+  - build
+  - test
+
+black:
+  stage: build
+  image: python:3.6
+  before_script:
+    - pip install black
+  script:
+    - black --check . --diff
+
+pytest:
+  stage: test
+  image: python:3.6
+  before_script:
+    - pip install -r requirements-test.txt
+  script:
+    - pytest
diff --git a/README.md b/README.md
@@ -0,0 +1,92 @@
+# pg2avro
+
+Postgres to Avro generator.
+
+## Features
+
+- Generate Avro schema from column definition.
+- Generate row data in a format consumable for Avro serialization.
+
+# Usage
+
+## Generating schema
+
+Method: `pg2avro.get_avro_schema`
+
+```
+get_avro_schema(
+    "mytable", 
+    "public", 
+    [
+        # Dictionary mode
+        {
+            "name": "column_name_1",
+            "type": "int2",
+            "nullable": False,
+        },
+        # SQLAlchemy mode
+        SqlAlchemyColumn(ARRAY(TEXT), name="column_name_2"),
+        ...
+    ]
+)
+
+```
+
+Schema generator needs the following information:
+- table name
+- namespace (`schema` in SQL, `dataset` in BigQuery, etc.)
+- columns - iterable of columns, each element with:
+    - name
+    - type - `_` prefix is used to indicate array types
+    - nullable (optional, `True` assumed if not provided)
+- column mapping - optional `ColumnMapping` object with column mappings (see below for more info).
+
+Column data can be passed in multiple formats.
+
+### Supported column formats
+
+- Dictionary with required keys and data
+- SQLAlchemy Column object
+- Any object with compatible attributes and required data
+- Dictionary or object with the required data under different attribute/key names, supplied together with a `ColumnMapping`.
+
+Note: the `ColumnMapping` mode supports **generating schema from raw Postgres data**, e.g. `udt_name` can be mapped as the column type.
+```
+columns = [
+    CustomColumn(name="column_name", udt_name="int2", is_nullable=False),
+]
+
+get_avro_schema(
+    table_name,
+    namespace,
+    columns,
+    ColumnMapping(name="name", type="udt_name", nullable="is_nullable"),
+)
+```
+
+## Generating rows data
+
+Method: `pg2avro.get_avro_row_dict`
+
+This method takes a single row and the schema, and returns the row as a dictionary matching that schema.
+
+### Supported row formats
+
+- Dictionary with keys corresponding to schema field names
+- Object with attributes corresponding to schema field names (works the same as a dictionary with corresponding keys)
+- Tuple with data in the same order as fields specified in schema
+
+```
+columns = [
+    {"name": "name", "type": "varchar", "nullable": False},
+    {"name": "number", "type": "float4", "nullable": False},
+]
+schema = get_avro_schema(table_name, namespace, columns)
+rows = [
+    {"name": "John", "number": 1.0},
+    RowObject(name="Jack", number=2.0),
+    ("Jim", 3.0),
+]
+data = [get_avro_row_dict(row, schema) for row in rows]
+
+```
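
Review note: the README states three conventions that are easy to misread: a `_` prefix marks an array type, `nullable` defaults to `True`, and rows may be dictionaries, objects or tuples. The sketch below illustrates them in isolation. It is not pg2avro's implementation. The type table, `avro_field` and `row_to_dict` are hypothetical names introduced here, and the Postgres-to-Avro mapping is an assumed subset.

```python
from types import SimpleNamespace

# Assumed mapping from Postgres udt_name to Avro primitive types
# (illustrative subset, not taken from pg2avro).
PG_TO_AVRO = {
    "int2": "int",
    "int4": "int",
    "int8": "long",
    "float4": "float",
    "float8": "double",
    "bool": "boolean",
    "text": "string",
    "varchar": "string",
}


def avro_field(name, pg_type, nullable=True):
    """Build one Avro field dict (hypothetical helper)."""
    is_array = pg_type.startswith("_")  # "_" prefix marks an array type
    base = PG_TO_AVRO[pg_type[1:] if is_array else pg_type]
    avro_type = {"type": "array", "items": base} if is_array else base
    if nullable:
        # Avro expresses nullability as a union with "null".
        avro_type = ["null", avro_type]
    return {"name": name, "type": avro_type}


def row_to_dict(row, field_names):
    """Normalise a dict, tuple or attribute object to a plain dict (hypothetical helper)."""
    if isinstance(row, dict):
        return {f: row[f] for f in field_names}
    if isinstance(row, tuple):
        # Tuples are matched to fields by position.
        return dict(zip(field_names, row))
    return {f: getattr(row, f) for f in field_names}


fields = ["name", "number"]
rows = [
    {"name": "John", "number": 1.0},
    SimpleNamespace(name="Jack", number=2.0),
    ("Jim", 3.0),
]
data = [row_to_dict(row, fields) for row in rows]
```

Whether pg2avro also makes array items nullable, or how it orders the union members, is not stated in the README, so the sketch takes the simplest reading.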