Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Export Hooks: Glue Catalog exporter #6653

Merged
merged 16 commits into from
Oct 3, 2023
Merged

Export Hooks: Glue Catalog exporter #6653

merged 16 commits into from
Oct 3, 2023

Conversation

Isan-Rivkin
Copy link
Contributor

@Isan-Rivkin Isan-Rivkin commented Sep 26, 2023

Closes #6575

TODO - will create additional PR and merge into current:

  • Add system test with Glue / AWS
  • Add documentation

Usage

Pre-requisite

Write some table data to lakeFS either parquet or CSV format, snippet of parquet added:

data =[("James","","Smith","36636","M",3000),
              ("Michael","Rose","","40288","M",4000),
              ("Robert","","Williams","42114","M",4000),
              ("Maria","Anne","Jones","39192","F",4000),
              ("Jen","Mary","Brown","","F",-1)]

columns=["firstname","middlename","lastname","dob","gender","salary"]

df=spark.createDataFrame(data,columns)

Define table in _lakefs_tables/employees.yaml

name: employees
type: hive
path: <path/to/parquet/files>
partition_columns: ['gender', 'salary']
schema:
  type: struct
  fields:
    - name: gender
      type: string
      nullable: false
      metadata: {}
    - name: salary
      type: integer
      nullable: false
      metadata: {}
    - name: firstname
      type: string
      nullable: false
      metadata: {}
    - name: middlename
      type: string
      nullable: false
      metadata: {}
    - name: lastname
      type: string
      nullable: false
      metadata: {}
    - name: dob
      type: string
      nullable: false
      metadata: {}

Create _lakefs_action/employee_exporter.yaml

name: glue catalog exporter
on:
  post-create-branch:
    branches: ["main*"]
  post-commit:
    branches: ["main*"]
hooks:
  # symlink_exporter placegolder - can be separated to create symlink file
  - id: symlink_exporter
  - id: glue_exporter
    type: lua
    properties:
      script_path: scripts/glue_export.lua
      args:
        # Export configuration
        aws:
          aws_access_key_id: ''
          aws_secret_access_key: ''
          aws_region: ''
        table_source: '_lakefs_tables/employees.yaml'
        catalog:
          db_name: "<glue-database-name>"
          # csv input example, can be passed from here or from script file
          # table_input:
          #   StorageDescriptor: 
          #     InputFormat: "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat"
          #     # InputFormat: "org.apache.hadoop.mapred.TextInputFormat"
          #     OutputFormat: "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
          #     SerdeInfo:
          #         SerializationLibrary: "org.apache.hadoop.hive.serde2.OpenCSVSerde"
          #         Parameters:
          #             separatorChar: ","
          #   Parameters: 
          #     classification: "csv"

Create export hook in scripts/glue_export.lua

local aws = require("aws")
local exporter = require("lakefs/catalogexport/glue_exporter")

-- config 
local table_path = args.table_source
local db = args.catalog.db_name
local options = {
    debug=true,
}

-- init glue client 
local glue = aws.glue_client(args.aws.aws_access_key_id, args.aws.aws_secret_access_key, args.aws.aws_region)

-- configure parquet
local table_input = {
    StorageDescriptor={
        InputFormat= "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
        OutputFormat= "org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat",
        SerdeInfo={
            SerializationLibrary= "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
        },
    },
    Parameters={
        classification= "parquet",
        "parquet.compression"= "SNAPPY",
    }
}

-- export glue table 
local result = exporter.export_glue(glue, db, table_path, table_input, action, options)
print("success exporting table to glue location", result.create_table_input.StorageDescriptor.Location)

@Isan-Rivkin Isan-Rivkin added the include-changelog PR description should be included in next release changelog label Sep 26, 2023
@Isan-Rivkin Isan-Rivkin self-assigned this Sep 26, 2023
@Isan-Rivkin Isan-Rivkin changed the title Export Hooks: Glue Catalog exporter WIP: Export Hooks: Glue Catalog exporter Sep 26, 2023
@Isan-Rivkin Isan-Rivkin changed the title WIP: Export Hooks: Glue Catalog exporter Export Hooks: Glue Catalog exporter Oct 1, 2023
@Isan-Rivkin Isan-Rivkin requested a review from nopcoder October 1, 2023 16:18
@Isan-Rivkin
Copy link
Contributor Author

@nopcoder please notice that PR description added usage (TODO's will be done after this PR code approved)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name the file aws.go as we do for other packages - name the main library as the package.

Comment on lines +66 to +67
config.WithRegion(c.Region),
config.WithCredentialsProvider(credentials.NewStaticCredentialsProvider(c.AccessKeyID, c.SecretAccessKey, "")),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The credentioals are required, but the region can be empty - verify that LoadDefaultConfig with empty region works. I think you should not pass it to the options in order to use the default configured.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified, works :)

Comment on lines 70 to 73
table.remove(page, 1)
else
-- remove hidden files from the entry set
table.remove(page, 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we are doing the same - so you can just move table.remove out of the both if scope.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines +72 to +73
-- create a standard AWS Glue table input (i.e not Apache Iceberg), add input values to base input and configure the rest
local function build_glue_create_table_input(base_input, descriptor, symlink_location, columns, partitions, action_info,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you see any real usage for this one in the near future? if not inline it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I see usage incase user want's to use opts.override_create_table_input and change certain things in build_glue_create_table_input output.
Actually I added now export for this function as well outside of the module.

Comment on lines 77 to 78
local repo_id = action_info.repository_id
local commit_id = action_info.commit_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not using repo and commit id

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

-- create table
local json_input = json.marshal(table_input)
if opts.debug then
print("Creating Glue Table - input: ", json_input)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unless you concat the strings

Suggested change
print("Creating Glue Table - input: ", json_input)
print("Creating Glue Table - input:", json_input)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines 115 to 117
return {
create_table_input=table_input
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not returning just table_input?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To leave room for other variables and make it a "result" without breaking function signature.

For example, in a more complex scenario where export will do version updates to the same table we will want the new latest version things like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to table_input

@@ -1,6 +1,18 @@
local url = require("net/url")
local DEFAULT_SHORT_DIGEST_LEN=6

local function clone_table(original_table)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest use the deepcopy (and call it deepcopy as it does deep copy for any type) from the formal lua wiki:

function deepcopy(orig)
    local orig_type = type(orig)
    local copy
    if orig_type == 'table' then
        copy = {}
        for orig_key, orig_value in next, orig, nil do
            copy[deepcopy(orig_key)] = deepcopy(orig_value)
        end
        setmetatable(copy, deepcopy(getmetatable(orig)))
    else -- number, string, boolean, etc
        copy = orig
    end
    return copy
end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woha took me a while to compile that in my mind, but better for sure - updated thanks.

Comment on lines 109 to 115
// parse table input JSON
var tableInput types.TableInput
err := json.Unmarshal([]byte(tableInputJSON), &tableInput)
if err != nil {
lua.Errorf(l, err.Error())
panic("unreachable")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move parse after the you finish the params extract - I think it will be readable and better structure of the function implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines 220 to 222
retArgs := util.DeepPush(l, itemMap)
l.PushBoolean(true)
return retArgs + 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strange that DeepPush needs to return a number - it expected to already return 1.
Same for this function getTable return - we must return 2 as we like the boolean to be on position 2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, changed 👍

@Isan-Rivkin Isan-Rivkin requested a review from nopcoder October 2, 2023 11:41
@Isan-Rivkin
Copy link
Contributor Author

Isan-Rivkin commented Oct 2, 2023

@nopcoder Thanks! Please re-review 🙏

Copy link
Contributor

@nopcoder nopcoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@Isan-Rivkin Isan-Rivkin merged commit 7855a18 into master Oct 3, 2023
@Isan-Rivkin Isan-Rivkin deleted the 6575-glue-exporter branch October 3, 2023 06:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
include-changelog PR description should be included in next release changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Export Hooks: Glue Catalog exporter
2 participants