This repository was archived by the owner on Jul 9, 2022. It is now read-only.

XD-3534 Add HDFS Dataset sink module #58

Closed
wants to merge 7 commits

Conversation

@trisberg trisberg (Contributor) commented Nov 3, 2015

  • Ported existing Spring XD hdfs-dataset sink module

This resolves #105.

</parent>

<properties>
<start-class>org.springframework.cloud.stream.module.hdfs.dataset.DatasetSinkApplication</start-class>
Contributor

Looks like it should be this one: org.springframework.cloud.stream.module.dataset.sink...

Contributor Author

Indeed, I forgot to change this after changing the package name. Fixing.

Object payload = message.getPayload();
if (payload instanceof Collection<?>) {
    Collection<?> payloads = (Collection<?>) payload;
    if (logger.isDebugEnabled()) {
Contributor

Others have nitpicked to me that with slf4j, logger.isDebugEnabled() is no longer needed if you use parameterized messages in the log statement. Though if you don't change the debug call, then the guard is still needed. Nitpicking... :)

Contributor Author

ok, changing the logging
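The pattern the reviewer is pointing at can be sketched as follows. This is a minimal, self-contained illustration: the `format`/`debug` methods below are a stand-in stub for slf4j's parameterized `{}` messages, not the real org.slf4j.Logger API, and the payload handling mirrors the snippet quoted above.

```java
import java.util.Collection;
import java.util.List;

public class LoggingSketch {

    /** Minimal slf4j-like "{}" substitution, just enough to show the idea. */
    static String format(String template, Object arg) {
        return template.replaceFirst("\\{\\}", String.valueOf(arg));
    }

    /** Stand-in for logger.debug(template, arg): the argument is only
     *  formatted inside the call, so no explicit guard is needed. */
    static void debug(String template, Object arg) {
        System.out.println(format(template, arg));
    }

    public static void main(String[] args) {
        Object payload = List.of("a", "b");
        if (payload instanceof Collection<?>) {
            Collection<?> payloads = (Collection<?>) payload;
            // Old style: guard needed because the message string is built eagerly.
            // if (logger.isDebugEnabled()) {
            //     logger.debug("Writing a collection of " + payloads.size() + " POJOs");
            // }
            // Parameterized style: formatting is deferred, so the guard is redundant.
            debug("Writing a collection of {} POJOs", payloads.size());
        }
    }
}
```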

@jvalkeal jvalkeal (Contributor) commented Nov 5, 2015

I have some issues with idleTimeout and fsUri.

I'm running a local hdfs://localhost:8020. I simply built the module locally and registered it with:

module register --coordinates org.springframework.cloud.stream.module:hdfs-dataset-sink:jar:exec:1.0.0.BUILD-SNAPSHOT --name hdfs-dataset-sink --type sink

idleTimeout defaults to 0, and it seems that every single write creates a new dataset, so I'm getting a new avro file every 2 seconds. To be honest, I don't remember if the XD dataset sink behaves like this. I'd expect idleTimeout=0 to disable idle flushing, simply falling back to batchSize for deciding when a new avro file is created.

stream create --name "ticktock" --definition "time --fixedDelay=2|hdfs-dataset" --deploy
drwxr-xr-x   - root supergroup          0 2015-11-05 07:42 /tmp/hdfs-dataset-sink/data
drwxr-xr-x   - root supergroup          0 2015-11-05 07:43 /tmp/hdfs-dataset-sink/data/string
drwxr-xr-x   - root supergroup          0 2015-11-05 07:42 /tmp/hdfs-dataset-sink/data/string/.metadata
-rw-r--r--   3 root supergroup        173 2015-11-05 07:42 /tmp/hdfs-dataset-sink/data/string/.metadata/descriptor.properties
-rw-r--r--   3 root supergroup          8 2015-11-05 07:42 /tmp/hdfs-dataset-sink/data/string/.metadata/schema.avsc
drwxr-xr-x   - root supergroup          0 2015-11-05 07:42 /tmp/hdfs-dataset-sink/data/string/.metadata/schemas
-rw-r--r--   3 root supergroup          8 2015-11-05 07:42 /tmp/hdfs-dataset-sink/data/string/.metadata/schemas/1.avsc
-rw-r--r--   3 root supergroup        105 2015-11-05 07:42 /tmp/hdfs-dataset-sink/data/string/0142625d-86f4-4034-be0a-a223ca78fdba.avro
-rw-r--r--   3 root supergroup        105 2015-11-05 07:42 /tmp/hdfs-dataset-sink/data/string/040ce3e4-46c0-4602-a8a7-f8c2f48fb6ea.avro
-rw-r--r--   3 root supergroup        105 2015-11-05 07:42 /tmp/hdfs-dataset-sink/data/string/041535af-c176-438f-8744-ad53c15a9c00.avro
-rw-r--r--   3 root supergroup        116 2015-11-05 07:42 /tmp/hdfs-dataset-sink/data/string/1c38c57d-ac86-4140-819d-0f648f9dc84c.avro
-rw-r--r--   3 root supergroup        105 2015-11-05 07:42 /tmp/hdfs-dataset-sink/data/string/294c5fa0-ebb5-403d-96aa-2a76cf0f01a5.avro
-rw-r--r--   3 root supergroup        105 2015-11-05 07:42 /tmp/hdfs-dataset-sink/data/string/2c0dbc08-e98e-4286-bc37-67d67017e038.avro

I then tried a 20 sec idleTimeout, but it waits for the default batchSize of 10000, so I didn't get any writes. That part is expected, and I didn't want to wait for the batch to reach its size.

stream create --name "ticktock" --definition "time --fixedDelay=2|hdfs-dataset --idleTimeout=20000" --deploy

With batchSize=10 I get 3 avro files per minute, which is expected when idleTimeout is more than 2 sec.

stream create --name "ticktock" --definition "time --fixedDelay=2|hdfs-dataset --idleTimeout=20000 --batchSize=10" --deploy

I don't see fsUri used anywhere in DatasetSinkConfiguration. It seems to use the auto-configured Configuration, which defaults to hdfs://localhost:8020. So I expected stream writes to fail when given a bogus fsUri, but files were still written under the custom location.

stream create --name "ticktock" --definition "time --fixedDelay=2|hdfs-dataset --fsUri=hdfs://localhost:12345 --directory=/tmp/hdfs-dataset-sink-custom" --deploy
root@neo:/usr/local/hadoops# hadoop/bin/hdfs dfs -ls /tmp
Found 2 items
drwxrwxrwx   - root supergroup          0 2015-09-02 08:36 /tmp/hadoop-yarn
drwxr-xr-x   - root supergroup          0 2015-11-05 08:11 /tmp/hdfs-dataset-sink-custom
root@neo:/usr/local/hadoops# hadoop/bin/hdfs dfs -ls -R /tmp/hdfs-dataset-sink-custom
drwxr-xr-x   - root supergroup          0 2015-11-05 08:11 /tmp/hdfs-dataset-sink-custom/data
drwxr-xr-x   - root supergroup          0 2015-11-05 08:11 /tmp/hdfs-dataset-sink-custom/data/string
drwxr-xr-x   - root supergroup          0 2015-11-05 08:11 /tmp/hdfs-dataset-sink-custom/data/string/.metadata
-rw-r--r--   3 root supergroup        180 2015-11-05 08:11 /tmp/hdfs-dataset-sink-custom/data/string/.metadata/descriptor.properties
-rw-r--r--   3 root supergroup          8 2015-11-05 08:11 /tmp/hdfs-dataset-sink-custom/data/string/.metadata/schema.avsc
drwxr-xr-x   - root supergroup          0 2015-11-05 08:11 /tmp/hdfs-dataset-sink-custom/data/string/.metadata/schemas
-rw-r--r--   3 root supergroup          8 2015-11-05 08:11 /tmp/hdfs-dataset-sink-custom/data/string/.metadata/schemas/1.avsc
-rw-r--r--   3 root supergroup        110 2015-11-05 08:11 /tmp/hdfs-dataset-sink-custom/data/string/0274cc54-572b-4809-b11f-ab8e18ceafa6.avro
-rw-r--r--   3 root supergroup        105 2015-11-05 08:11 /tmp/hdfs-dataset-sink-custom/data/string/033fc459-21ff-4763-84a0-2b48f8e27980.avro
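The idleTimeout semantics expected in the comment above can be captured in a small decision helper. This is only a sketch of the expected behavior with hypothetical names, not the module's actual API: idleTimeout=0 disables idle-based flushing entirely, so only a full batch triggers a new Avro file.

```java
public class RollPolicySketch {

    /**
     * Decide whether the writer should flush and start a new Avro file.
     * idleTimeout <= 0 disables idle-based flushing, so only a full batch
     * triggers a roll. This encodes the behavior the comment expects, not
     * necessarily what the ported code currently does.
     */
    static boolean shouldRoll(long idleMillis, long idleTimeout, int batched, int batchSize) {
        if (batched >= batchSize) {
            return true;                  // batch is full: always roll
        }
        if (idleTimeout <= 0) {
            return false;                 // idle flushing disabled
        }
        return idleMillis >= idleTimeout; // roll once the idle window elapses
    }

    public static void main(String[] args) {
        // time --fixedDelay=2 with defaults (batchSize=10000, idleTimeout=0):
        // a 2 s gap between writes should NOT produce a new file per write.
        System.out.println(shouldRoll(2000, 0, 1, 10000));
        // idleTimeout=20000 --batchSize=10: roll once 10 messages are queued.
        System.out.println(shouldRoll(2000, 20000, 10, 10));
    }
}
```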

@jvalkeal jvalkeal (Contributor) commented Nov 6, 2015

Yes, I'm seeing the same java.io.IOException: Filesystem closed error. Trying to figure out where it's coming from, in case it's not caused by the shutdown hook.

@jvalkeal jvalkeal (Contributor) commented Nov 6, 2015

Interesting: it seems that I don't get the Filesystem closed error if I do

stream create --name "ticktock" --definition "time|hdfs-dataset --batchSize=20" --deploy

and wait for the first file to be written.

If I do:

stream create --name "ticktock" --definition "time|hdfs-dataset" --deploy

there are no files written (if I don't wait for the full default batch size) and I get the error.

It's worth checking whether we only get this error when no data has been flushed: no interaction with Kite would mean there are no existing open Hadoop FileSystems.

@jvalkeal jvalkeal (Contributor) commented Nov 6, 2015

Damn, I take it back. This error doesn't happen on every stream undeploy.

@jvalkeal jvalkeal (Contributor) commented Nov 6, 2015

Right, it has to be the FileSystem autoclose, which is e.g. called from the shutdown hook. I don't see errors if I disable autoclose with:

stream create --name "ticktock" --definition "time|hdfs-dataset --spring.hadoop.config.fs.automatic.close=false" --deploy
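Instead of passing the flag in every stream definition, the same setting can live in the module's configuration. A minimal sketch, assuming Spring for Apache Hadoop's behavior of forwarding `spring.hadoop.config.*` entries into the Hadoop Configuration (where `fs.automatic.close` controls the shutdown-hook close):

```properties
# Forwarded to Hadoop Configuration: keep Hadoop's shutdown hook from
# closing the FileSystem before the sink has flushed.
spring.hadoop.config.fs.automatic.close=false
```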

- Ported existing Spring XD hdfs-dataset sink module
- Addressing review comments/suggestions
- Addressing some deployment issues
- Fixing tests after we removed the autoconfigured FsShell
- Adding creation of separate configuration for `--fsUri` option
- Adding bean to control FileSystem close, rather than rely on shutdown hook registered by Hadoop
- Remove unused configuration
@trisberg trisberg force-pushed the XD-3534-hdfs-dataset branch from 8f5f0b5 to 152ef7f on November 17, 2015 16:07
@trisberg trisberg (Contributor Author)

Rebased

@jvalkeal jvalkeal (Contributor) commented Dec 4, 2015

Squashed and merged per 3181e73.

@jvalkeal jvalkeal closed this Dec 4, 2015