feat: Max Compute Sink #52

Open · wants to merge 121 commits into base: main
Conversation

ekawinataa

No description provided.

ekawinataa changed the title from "Feat max compute sink" to "feat: Max Compute Sink Phase 1" on Oct 29, 2024

@Override
public Map<String, String> convert(Method method, String s) {
    Map<String, String> settings = new HashMap<>();
    if (Objects.isNull(s) || StringUtils.isEmpty(s.trim())) {
        return settings;
    }

    String[] pairs = s.split(CONFIG_SEPARATOR);
    for (String pair : pairs) {
        ....
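The elided parsing loop could be completed along the lines of the sketch below. This is only an illustration: the actual values of `CONFIG_SEPARATOR` and the key-value separator are not shown in this excerpt, so the `,` and `=` used here are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the full key-value parsing; CONFIG_SEPARATOR ("," here) and the
// "=" key-value separator are assumptions, not the PR's actual constants.
public class OdpsSettingsParser {
    private static final String CONFIG_SEPARATOR = ",";
    private static final String KEY_VALUE_SEPARATOR = "=";

    public static Map<String, String> convert(String s) {
        Map<String, String> settings = new HashMap<>();
        if (s == null || s.trim().isEmpty()) {
            return settings;
        }
        for (String pair : s.split(CONFIG_SEPARATOR)) {
            // limit=2 keeps values that themselves contain the separator intact
            String[] kv = pair.split(KEY_VALUE_SEPARATOR, 2);
            if (kv.length == 2) {
                settings.put(kv[0].trim(), kv[1].trim());
            }
        }
        return settings;
    }

    public static void main(String[] args) {
        Map<String, String> settings = convert("odps.sql.type.flag=2, odps.task.mode=strict");
        System.out.println(settings.size());
    }
}
```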

import java.util.List;

@AllArgsConstructor
@NoArgsConstructor
Reviewer:

Is the no-args constructor required?

Author (ekawinataa):

Some other dependent classes require MaxComputeClient to be mocked, which requires a NoArgsConstructor. I admit there is no actual usage outside of the tests; do you have any suggestions?
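On the mocking question: Mockito (via Objenesis) instantiates mocks without invoking any constructor, so `Mockito.mock(MaxComputeClient.class)` works even when the class has only a required-args constructor. The JDK-only sketch below illustrates the same idea with a hand-rolled test double; the class and method names are illustrative stand-ins, not the PR's actual API.

```java
// Illustrative only: a subclass test double shows that code depending on a
// client class can be tested without giving that class a no-args constructor.
// Mockito.mock(MaxComputeClient.class) achieves the same, since Mockito
// creates mock instances without calling any constructor.
class MaxComputeClient {
    private final String projectId; // stand-in field; the real class wraps an Odps client

    MaxComputeClient(String projectId) {
        this.projectId = projectId;
    }

    String describeProject() {
        return "project:" + projectId;
    }
}

class StubMaxComputeClient extends MaxComputeClient {
    StubMaxComputeClient() {
        super("test-project"); // the stub satisfies the required-args constructor itself
    }

    @Override
    String describeProject() {
        return "stubbed";
    }
}

public class NoArgsConstructorNotNeeded {
    public static void main(String[] args) {
        MaxComputeClient client = new StubMaxComputeClient();
        System.out.println(client.describeProject());
    }
}
```

Either way, the production class can drop @NoArgsConstructor without losing testability.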

.getSchema();
}

public void upsertTable(TableSchema tableSchema) throws OdpsException {
Reviewer:

upsert means "update or insert the records", and what we are doing here is "create or update the table DDL", so we should name it createOrUpdateTable.

Author (ekawinataa):

The BQ client uses the same upsert terminology for updating and creating tables. Wdyt, should we maintain naming consistency for this sink as well?

private final Instrumentation instrumentation;
private final MaxComputeMetrics maxComputeMetrics;

public void upsertTable(TableSchema tableSchema) throws OdpsException {
Reviewer:

Same here:

upsert means "update or insert the records", and what we are doing here is "create or update the table DDL", so we should name it createOrUpdateTable.

Author (ekawinataa):

The BQ client uses the same upsert terminology for updating and creating tables. Wdyt, should we maintain naming consistency for this sink as well?

Comment on lines +46 to +47
new TableTunnel.FlushOption()
.timeout(super.getMaxComputeSinkConfig().getMaxComputeRecordPackFlushTimeoutMs()));
Reviewer:

This instance can be created in this class's constructor and the reference passed in.

.build(cacheLoader);
}

public static StreamingSessionManager nonParititonedStreamingSessionManager(TableTunnel tableTunnel, MaxComputeSinkConfig maxComputeSinkConfig) {
Reviewer:

Usage will look like StreamingSessionManager.nonParititonedStreamingSessionManager, which is repetitive.

I suggest createNonPartitioned, newNonPartitioned, or newNonPartitionedInstance.

}, maxComputeSinkConfig);
}

public static StreamingSessionManager partitionedStreamingSessionManager(TableTunnel tableTunnel, MaxComputeSinkConfig maxComputeSinkConfig) {
Reviewer:

Usage will look like StreamingSessionManager.partitionedStreamingSessionManager, which is repetitive.

I suggest createPartitioned, newPartitioned, or newPartitionedInstance.


public final class StreamingSessionManager {

private final LoadingCache<String, TableTunnel.StreamUploadSession> sessionCache;
Reviewer:

What is the context behind using this LoadingCache?

Reviewer:

As there would mostly be only one entry.

private static final String NON_PARTITIONED = "non-partitioned";

Author (ekawinataa):

Add docs for this, covering the reasoning for using a cache and the key used.

Author (ekawinataa):

Context: a session object can be reused multiple times for streaming inserts, and by design each session is assigned to a specific partition. There are chances of the insertion process spanning several partitions, e.g. change of day, event replay, etc.
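To make the cache reasoning concrete, here is a JDK-only stand-in for what Guava's LoadingCache with maximumSize does in this design: one session per partition key, built on first miss, reused on later inserts, and evicted LRU once the cap is exceeded. The String value is a placeholder for TableTunnel.StreamUploadSession; names are illustrative, not the PR's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// JDK stand-in for Guava's bounded LoadingCache: load-on-miss plus LRU
// eviction. The String value is a placeholder for StreamUploadSession.
public class SessionCacheSketch {
    private final int maxSessions;
    private final Map<String, String> sessions;

    public SessionCacheSketch(int maxSessions) {
        this.maxSessions = maxSessions;
        // accessOrder=true makes iteration order LRU, so the eldest entry
        // evicted below is the least recently used session
        this.sessions = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > SessionCacheSketch.this.maxSessions;
            }
        };
    }

    public String getSession(String partitionSpec) {
        // mirrors CacheLoader.load: build the session only on a cache miss
        return sessions.computeIfAbsent(partitionSpec, k -> "session-for-" + k);
    }

    public int size() {
        return sessions.size();
    }

    public static void main(String[] args) {
        SessionCacheSketch cache = new SessionCacheSketch(2);
        cache.getSession("ds=2024-11-25");
        cache.getSession("ds=2024-11-26"); // e.g. change of day creates a second session
        cache.getSession("ds=2024-11-25"); // reused; no new session is built
        System.out.println(cache.size());
    }
}
```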

import java.util.List;
import java.util.stream.Collectors;

public interface PayloadConverter {
Reviewer:

Awesome 👏

Add Javadoc for the interface, and also document the expectations of the canConvert and convertSingular methods.

values.add(null);
return;
}
Object mappedInnerValue = payloadConverters.stream()
Author (ekawinataa):

This one we can optimize using a map lookup later.
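The map-lookup optimisation mentioned here could look like the sketch below: converters keyed by field type once at startup, then O(1) dispatch per value, instead of streaming over the converter list for every record. FieldType and the lambdas are illustrative stand-ins for the PR's protobuf descriptors and PayloadConverter implementations.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch of a map-based converter lookup; FieldType and the
// converter lambdas stand in for the PR's actual classes.
public class ConverterLookupSketch {
    public enum FieldType { STRING, INT64, DOUBLE }

    // built once; replaces a per-record stream().filter(canConvert) scan
    private static final Map<FieldType, Function<Object, Object>> CONVERTERS =
            new EnumMap<>(FieldType.class);
    static {
        CONVERTERS.put(FieldType.STRING, Object::toString);
        CONVERTERS.put(FieldType.INT64, v -> Long.valueOf(v.toString()));
        CONVERTERS.put(FieldType.DOUBLE, v -> Double.valueOf(v.toString()));
    }

    public static Object convert(FieldType type, Object value) {
        Function<Object, Object> converter = CONVERTERS.get(type);
        if (converter == null) {
            throw new IllegalArgumentException("Unsupported type: " + type);
        }
        return converter.apply(value);
    }

    public static void main(String[] args) {
        System.out.println(convert(FieldType.INT64, "42"));
    }
}
```

A side benefit: the map makes the converter-per-type assignment explicit and unambiguous.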

import static com.gotocompany.depot.maxcompute.util.TypeInfoUtils.isStructArrayType;
import static com.gotocompany.depot.maxcompute.util.TypeInfoUtils.isStructType;

public class SchemaDifferenceUtils {
Author (ekawinataa):

Add Javadoc, and mark as deprecated until Ali's team provides the proper table update.

Reviewer:

Also, we need to verify with them whether it will also remove a column if the new schema doesn't have it, or what the expected behavior is.

PROTO_TYPE_MAP.put(Descriptors.FieldDescriptor.Type.STRING, TypeInfoFactory.STRING);
PROTO_TYPE_MAP.put(Descriptors.FieldDescriptor.Type.ENUM, TypeInfoFactory.STRING);
PROTO_TYPE_MAP.put(Descriptors.FieldDescriptor.Type.DOUBLE, TypeInfoFactory.DOUBLE);
PROTO_TYPE_MAP.put(Descriptors.FieldDescriptor.Type.FLOAT, TypeInfoFactory.FLOAT);
Author (ekawinataa):

float -> Infinity/NaN: check whether these are supported in MC or not (edited)
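If MaxCompute turns out to reject non-finite floats, a guard like this sketch could validate values before insertion. Whether MC accepts NaN/Infinity is exactly the open question above; this is a hypothetical guard, not a confirmed requirement.

```java
// Hypothetical guard for the NaN/Infinity concern; only useful if MaxCompute
// is confirmed to reject non-finite FLOAT values.
public class FloatGuardSketch {
    public static float requireFinite(float value) {
        if (Float.isNaN(value) || Float.isInfinite(value)) {
            throw new IllegalArgumentException("Non-finite float not supported: " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(requireFinite(3.5f));
    }
}
```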

This is used for timestamp auto-partitioning feature where the partition column coexists with the original column.

* Example value: `column1`
* Type: `optional`
Author (ekawinataa), Nov 26, 2024:

Add the default value __partition_value, plus a link to their docs for the auto-partitioning feature.


* Example value: `7`
* Type: `required`
* Default value: `1`
Author (ekawinataa):

Change to 2; consider the case of DAY-truncated partitioning.


@Override
public void insert(List<RecordWrapper> recordWrappers) throws TunnelException, IOException {
Map<String, List<RecordWrapper>> partitionSpecRecordWrapperMap = recordWrappers.stream()
Author (ekawinataa), Nov 26, 2024:

Go with a single iteration; outside the iteration we can track the RecordPack using a map with the partition spec as the key.
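The single-iteration grouping suggested here can be sketched as follows. RecordWrapper is reduced to a stand-in holding only the partition spec; the same pattern extends to tracking a RecordPack per key instead of a List, as the comment suggests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of one-pass grouping keyed by partition spec; RecordWrapper is a
// simplified stand-in for the PR's class.
public class PartitionGroupingSketch {
    public static class RecordWrapper {
        final String partitionSpec;

        public RecordWrapper(String partitionSpec) {
            this.partitionSpec = partitionSpec;
        }
    }

    public static Map<String, List<RecordWrapper>> groupByPartition(List<RecordWrapper> records) {
        Map<String, List<RecordWrapper>> groups = new HashMap<>();
        for (RecordWrapper record : records) {
            // one iteration total: the per-partition list is created on first
            // sight of its key, then appended to in place
            groups.computeIfAbsent(record.partitionSpec, k -> new ArrayList<>()).add(record);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<RecordWrapper> records = Arrays.asList(
                new RecordWrapper("ds=2024-11-26"),
                new RecordWrapper("ds=2024-11-27"),
                new RecordWrapper("ds=2024-11-26"));
        System.out.println(groupByPartition(records).get("ds=2024-11-26").size());
    }
}
```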

Comment on lines +15 to +20
private StreamingSessionManager(CacheLoader<String, TableTunnel.StreamUploadSession> cacheLoader,
MaxComputeSinkConfig maxComputeSinkConfig) {
sessionCache = CacheBuilder.newBuilder()
.maximumSize(maxComputeSinkConfig.getStreamingInsertMaximumSessionCount())
.build(cacheLoader);
}
Reviewer:

I think the complexity of instantiating this class is split across two places, this constructor and the static method. How about this, where the logic lives in the static method and the constructor stays simple:

    private StreamingSessionManager(LoadingCache<String, TableTunnel.StreamUploadSession> sessionCache) {
        this.sessionCache = sessionCache;
    }

    public static StreamingSessionManager nonParititoned(TableTunnel tableTunnel, MaxComputeSinkConfig maxComputeSinkConfig) {
        CacheLoader<String, TableTunnel.StreamUploadSession> cacheLoader = new CacheLoader<String, TableTunnel.StreamUploadSession>() {
            @Override
            public TableTunnel.StreamUploadSession load(String sessionId) throws TunnelException {
                return tableTunnel.buildStreamUploadSession(
                                maxComputeSinkConfig.getMaxComputeProjectId(),
                                maxComputeSinkConfig.getMaxComputeTableName())
                        .allowSchemaMismatch(false)
                        .build();
            }
        };
        return new StreamingSessionManager(
                CacheBuilder.newBuilder()
                        .maximumSize(maxComputeSinkConfig.getStreamingInsertMaximumSessionCount())
                        .build(cacheLoader));
    }

Reviewer:

There is a typo in nonParititoned: it should be nonPartitioned.


private static final Map<Descriptors.FieldDescriptor.Type, TypeInfo> PROTO_TYPE_MAP;
static {
PROTO_TYPE_MAP = new HashMap<>();
Reviewer:

Additionally, to make it concise and readable, and to address immutability, please apply the changes below:

import static com.google.protobuf.Descriptors.FieldDescriptor.Type.*;

public class PrimitiveTypeInfoConverter implements TypeInfoConverter {

    private static final Map<Descriptors.FieldDescriptor.Type, TypeInfo> PROTO_TYPE_MAP;

    static {
        PROTO_TYPE_MAP = ImmutableMap.<Descriptors.FieldDescriptor.Type, TypeInfo>builder()
                .put(BYTES, TypeInfoFactory.BINARY)
                .put(STRING, TypeInfoFactory.STRING)
                .put(ENUM, TypeInfoFactory.STRING)
                .put(DOUBLE, TypeInfoFactory.DOUBLE)
                .put(FLOAT, TypeInfoFactory.FLOAT)
                .put(BOOL, TypeInfoFactory.BOOLEAN)
                .put(INT64, TypeInfoFactory.BIGINT)
                .put(UINT64, TypeInfoFactory.BIGINT)
                .put(INT32, TypeInfoFactory.INT)
                .put(UINT32, TypeInfoFactory.INT)
                .put(FIXED64, TypeInfoFactory.BIGINT)
                .put(FIXED32, TypeInfoFactory.INT)
                .put(SFIXED32, TypeInfoFactory.INT)
                .put(SFIXED64, TypeInfoFactory.BIGINT)
                .put(SINT32, TypeInfoFactory.INT)
                .put(SINT64, TypeInfoFactory.BIGINT)
                .build();
    }

I think this is more easily readable.

Comment on lines +37 to +43
public TypeInfo convert(Descriptors.FieldDescriptor fieldDescriptor) {
return typeInfoCache.computeIfAbsent(fieldDescriptor.getFullName(), key -> typeInfoConverters.stream()
.filter(converter -> converter.canConvert(fieldDescriptor))
.findFirst()
.map(converter -> converter.convert(fieldDescriptor))
.orElseThrow(() -> new IllegalArgumentException("Unsupported type: " + fieldDescriptor.getType())));
}
Reviewer:

Another reason for using a map: say two converter implementations can convert one particular fieldDescriptor. We know logically that might happen, but code- and architecture-wise we are not enforcing that boundary.

To have an exact mapping of which implementation is responsible for which fieldDescriptor, I think we need clear hard-coded mapping rules. That will also help identify future issues, i.e. which implementation is causing a certain behavior.

maxComputeClient.insert(recordWrappers.getValidRecords());
} catch (IOException | TunnelException e) {
log.error("Error while inserting records to MaxCompute: ", e);
mapInsertionError(recordWrappers.getValidRecords(), sinkResponse, new ErrorInfo(e, ErrorType.SINK_RETRYABLE_ERROR));
Reviewer:

Lines 41 and 46: mapInsertionError does not convey the intent of the call clearly. How about using the existing methods in sinkResponse, or adding a method to sinkResponse?

            if (response.hasErrors()) {
                Map<Long, ErrorInfo> errorInfoMap = BigQueryResponseParser.getErrorsFromBQResponse(records.getValidRecords(), response, bigQueryMetrics, instrumentation);
                errorInfoMap.forEach(sinkResponse::addErrors);
                errorHandler.handle(response.getInsertErrors(), records.getValidRecords());
            }

Reviewer:

Also, what is the difference between this catch block and the catch block on line 44? Will DEFAULT_ERROR not be retried, and where is that written in the Javadoc? I think it should be documented in the Sink interface (you can defer this documentation to phase 2, as it is mostly an enhancement of existing behavior and not specific to this MR).

Author (ekawinataa), Nov 28, 2024:

Retry configuration lives on the Firehose side (which errors to retry and which not), not in depot. As of now we have identified that IOException might occur from a schema mismatch and TunnelException from intermittent network issues, so we map those explicitly to SINK_RETRYABLE_ERROR.

Other uncategorized exceptions are mapped as DEFAULT_ERROR.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConverterOrchestrator {
Reviewer:

This feels too generic for this repo.

How about ProtoBufTypeToMaxComputeTypeConverter?

Comment on lines +18 to +19
List<String> fieldNames = Arrays.asList(SECONDS, NANOS);
List<TypeInfo> typeInfos = Arrays.asList(TypeInfoFactory.BIGINT, TypeInfoFactory.INT);
Reviewer:

    private static final List<String> fieldNames = Arrays.asList(SECONDS, NANOS);
    private static final List<TypeInfo> typeInfos = Arrays.asList(TypeInfoFactory.BIGINT, TypeInfoFactory.INT);

import java.util.stream.Collectors;

@RequiredArgsConstructor
public class MaxComputeSchemaHelper {
Reviewer:

Take this out of the helper sub-package.

This class is really a MaxComputeSchemaBuilder with a single method, buildMaxComputeSchema, which can then be renamed to build.

Comment on lines +12 to +14
private static final Pattern VALID_TABLE_NAME_REGEX = Pattern.compile("^[A-Za-z][A-Za-z0-9_]{0,127}$");
private static final int MAX_COLUMNS_PER_TABLE = 1200;
private static final int MAX_PARTITION_KEYS_PER_TABLE = 6;
Reviewer:

These should be configurable, and the values you've set can be kept as defaults if not overridden.
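One way to honor "configurable with the current values as defaults": in the PR these would presumably become MaxComputeSinkConfig entries, but the system-property sketch below illustrates the default-with-override shape with the JDK alone. The property names are invented for illustration.

```java
// Hypothetical default-with-override shape for the hard-coded limits; in the
// PR itself these would live in MaxComputeSinkConfig rather than system
// properties, and the property names here are invented.
public class TableValidatorLimits {
    static int intProp(String key, int defaultValue) {
        String value = System.getProperty(key);
        return value == null ? defaultValue : Integer.parseInt(value);
    }

    // the current hard-coded values become the defaults when nothing overrides them
    static final int MAX_COLUMNS_PER_TABLE =
            intProp("maxcompute.table.validator.max.columns", 1200);
    static final int MAX_PARTITION_KEYS_PER_TABLE =
            intProp("maxcompute.table.validator.max.partition.keys", 6);

    public static void main(String[] args) {
        System.out.println(MAX_COLUMNS_PER_TABLE + " " + MAX_PARTITION_KEYS_PER_TABLE);
    }
}
```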

private static final String BOOLEAN = "boolean";

static {
METADATA_TYPE_MAP = new HashMap<>();
Reviewer:

As suggested in previous comments, use the Google Guava collection builders to make this concise and readable.

}
};

public abstract boolean shouldFilter(Object object);
Reviewer:

The intent of this class is a bit ambiguous. Let's discuss this over a call.

Either way, shouldFilter is not defining the intent clearly either. Let's discuss.


public class SinkConfigUtils {

public static String getProtoSchemaClassName(SinkConfig sinkConfig) {
Reviewer:

This is used in only one place, MaxComputeSinkFactory; we can move it there.


Map<String, String> settings = converter.convert(null, odpsGlobalSettings);

Assertions.assertEquals(2, settings.size());
Reviewer:

Add static imports for the assertXYZ methods.

5 participants