[PLUGIN-632] - Added: ServiceAccountType as JSON, unit tests, validations #13

Merged
merged 1 commit into from
May 20, 2021

Conversation

flakrimjusufi
Contributor

Make Google drive/sheet plugins production ready
JIRA Ticket: https://cdap.atlassian.net/browse/PLUGIN-632

@jasir99

jasir99 commented Apr 30, 2021

lgtm

Contributor

@rmstar rmstar left a comment

Please test source and sink plugins in the following scenarios:

  1. In CDF with file path set to auto-detect.
  2. In CDF, specify file path
  3. In CDF, specify Service Account JSON
  4. In CDAP, specify file path
  5. In CDAP, specify Service Account JSON
  6. Test with service account type, file path/JSON set as macro param.

@@ -18,6 +18,9 @@

import com.github.rholder.retry.RetryException;
import com.google.api.services.drive.model.File;
import com.google.gson.JsonObject;
Contributor

Why are these imports added?

Contributor Author

These imports were left behind after we did some additional tests. Thanks for letting me know. I removed them.

@@ -27,7 +27,7 @@
* Util class for building pipeline schema.
*/
public class SchemaBuilder {
- public static final String SCHEMA_ROOT_RECORD_NAME = "FileFromFolder";
+ public static final String SCHEMA_ROOT_RECORD_NAME = "etlSchemaBody";
Contributor

why was this change needed?

Contributor Author

The schema was not being updated in the Google Drive Source, and we initially thought SCHEMA_ROOT_RECORD_NAME was the cause. We then found that the issue was related to the widget, not to SCHEMA_ROOT_RECORD_NAME, so the updated PR reverts it to its initial name.

@@ -98,7 +105,7 @@ protected Credential getCredentials() throws IOException {
}

protected List<String> getRequiredScopes() {
- return Collections.singletonList(DriveScopes.DRIVE_READONLY);
+ return Collections.singletonList(DriveScopes.DRIVE);
Contributor

why did we need to make this change?

Contributor Author

This was also part of the testing we did. We were getting this error message when testing the pipeline:

404 Not Found { "code" : 404, "errors" : [ { "domain" : "global", "location" : "fileId", "locationType" : "parameter", "message" : "File not found: 1ZqAr0q6gITvjqvfOJqkSIxd-usTNhvJy.", "reason" : "notFound" } ], "message" : "File not found: 1ZqAr0q6gITvjqvfOJqkSIxd-usTNhvJy." }. Please check the system logs for more details.

...and the Google Drive API documentation for the 404 error type states:

To fix this error:

1. Inform the user they don't have read access to the file or that the file doesn't exist.
2. Instruct the user to ask the owner for permission to the file.

...we changed the SCOPE from DRIVE_READONLY to DRIVE.

Afterwards, we verified that we had been on the wrong track when testing the pipelines and that DRIVE_READONLY was enough for read/write, so we changed the scope back to its initial value.

In the last commit, we reverted the scope to DRIVE_READONLY.

Contributor

With SCOPE set to DRIVE does everything work correctly?

Contributor Author

Actually, everything also works correctly with SCOPE set to DRIVE_READONLY, which is why we reverted the code.

- private boolean validateAccountFilePath(FailureCollector collector) {
- if (!containsMacro(ACCOUNT_FILE_PATH)) {
+ private boolean validateServiceAccount(FailureCollector collector) {
+ if (isServiceAccountFilePath() != null && isServiceAccountFilePath()) {
Contributor

store the return value in a variable so you don't have to call isServiceAccountFilePath() twice. Same thing for isServiceAccountJson() below.

Contributor Author

PR updated.
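
For illustration, the suggestion above (read each getter once and keep the result in a local variable) could look roughly like the sketch below. The `Config` stub, the `getterCalls` counter, and the `describe` method are hypothetical and exist only to make the pattern observable; the type values "filePath" and "JSON" follow the constants shown later in this PR. `Boolean.TRUE.equals(...)` covers the null case that the original `!= null &&` check guarded against.

```java
// Hypothetical stand-in for the plugin config; only the shape of the check matters here.
class ServiceAccountCheckSketch {

  /** Minimal config stub. The Boolean getters may return null when the type is unset. */
  static class Config {
    private final String serviceAccountType; // "filePath", "JSON", or null
    int getterCalls = 0; // counts getter calls so the single-call pattern is observable

    Config(String serviceAccountType) {
      this.serviceAccountType = serviceAccountType;
    }

    Boolean isServiceAccountFilePath() {
      getterCalls++;
      return serviceAccountType == null ? null : serviceAccountType.equals("filePath");
    }

    Boolean isServiceAccountJson() {
      getterCalls++;
      return serviceAccountType == null ? null : serviceAccountType.equals("JSON");
    }
  }

  /** Returns the branch taken; each getter is called at most once. */
  static String describe(Config config) {
    // Store the result instead of calling the getter twice.
    Boolean isFilePath = config.isServiceAccountFilePath();
    if (Boolean.TRUE.equals(isFilePath)) {
      return "filePath";
    }
    Boolean isJson = config.isServiceAccountJson();
    if (Boolean.TRUE.equals(isJson)) {
      return "JSON";
    }
    return "unset";
  }

  public static void main(String[] args) {
    System.out.println(describe(new Config("filePath")));
    System.out.println(describe(new Config("JSON")));
    System.out.println(describe(new Config(null)));
  }
}
```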

@flakrimjusufi
Contributor Author

Please test source and sink plugins in the following scenarios:

  1. In CDF with file path set to auto-detect.
  2. In CDF, specify file path
  3. In CDF, specify Service Account JSON
  4. In CDAP, specify file path
  5. In CDAP, specify Service Account JSON
  6. Test with service account type, file path/JSON set as macro param.

  1. I'm unable to test with file path set to auto-detect in CDF because my keys have insufficient permissions. To be more specific, I used flakrim@cirus.co to test the plugin, since I don't have access to Google Drive with my partner account flakrim@88547414.corp-partner.google.com. In CDF, we only have access with partnership accounts.

  2. Is there a way to test the pipelines in CDF while specifying the file path? As far as I know, in CDF we can only test with Service Account as JSON. Please correct me if my assumption is wrong.

  3. Test passed successfully.

  4. Test passed successfully.

  5. Test passed successfully.

  6. There's an issue when you test with MACRO. We were getting this error message:

java.util.concurrent.ExecutionException: com.google.gson.JsonSyntaxException: com.google.gson.stream.MalformedJsonException: Unterminated object at line 1 column 5965
at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:294) ~[com.google.guava.guava-13.0.1.jar:na]
at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:281) ~[com.google.guava.guava-13.0.1.jar:na]
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[com.google.guava.guava-13.0.1.jar:na]

That's because of this block of code:

this.conf = new ImmutableMap.Builder<String, String>()
.put(PROPERTY_CONFIG_JSON, GSON.toJson(config))
.build();
}

After debugging this issue, we found out that the entire config was serialized.

Considering that we would have to deserialize the entire config, which is time-consuming (after this fix, the entire functionality of the source and sink plugins needs to be retested), we removed macros for now. We were told that the requirements were prioritized and that the fix needs to be delivered as soon as possible.

However, we can proceed with deserializing the config and include macros in this PR if needed.

@CuriousVini
Member

@flakrimjusufi
Do you know what permissions are missing from your account?

@CuriousVini
Member

Also, are you noticing permission issues at pipeline runtime? If so, could you please make sure the Dataproc service account (GCE default service account) has permission to access the Drive directories/files?

@rmstar
Contributor

rmstar commented May 5, 2021

Considering that we would have to deserialize the entire config, which is time-consuming (after this fix, the entire functionality of the source and sink plugins needs to be retested), we removed macros for now. We were told that the requirements were prioritized and that the fix needs to be delivered as soon as possible.

Macros are supported, so we should make sure they work correctly. Please fix it as part of this PR.

@flakrimjusufi
Contributor Author

flakrimjusufi commented May 5, 2021

@flakrimjusufi
Do you know what permissions are missing from your account?

With our partnership accounts, we don't have access to Gmail, Google Drive, etc. So, to test the Google Drive Source/Sink in Dataproc, I need one of the following:

  • My account in cirus (which I've used to create a project and work with the Google Drive API) to have access to Dataproc.
  • My partnership account to have access to the Google Drive API.

Permission issues show up during validation when the file path is set to auto-detect in Data Fusion, so I can't run the pipeline.

@flakrimjusufi
Contributor Author

Considering that we would have to deserialize the entire config, which is time-consuming (after this fix, the entire functionality of the source and sink plugins needs to be retested), we removed macros for now. We were told that the requirements were prioritized and that the fix needs to be delivered as soon as possible.

Macros are supported, so we should make sure they work correctly. Please fix it as part of this PR.

@rmstar

In the last update, we deserialized the config and added Macros.


@Nullable
public String getServiceAccountFilePath() {
if (containsMacro(ACCOUNT_FILE_PATH) || accountFilePath == null ||
Contributor

nit: replace accountFilePath == null || accountFilePath.isEmpty() -> Strings.isNullOrEmpty(accountFilePath)
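
Guava's `Strings.isNullOrEmpty(String)` returns true when the argument is null or the empty string, so the replacement keeps the behavior of the two-part check. A dependency-free sketch of the same semantics, with a hypothetical local helper standing in for the Guava call:

```java
// Hypothetical local helper mirroring the semantics of Guava's Strings.isNullOrEmpty.
class NullOrEmptySketch {

  /** True when the value is null or has zero length; whitespace-only strings are not empty. */
  static boolean isNullOrEmpty(String value) {
    return value == null || value.isEmpty();
  }

  public static void main(String[] args) {
    System.out.println(isNullOrEmpty(null));
    System.out.println(isNullOrEmpty(""));
    System.out.println(isNullOrEmpty("/path/to/key.json"));
  }
}
```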

}

@Nullable
public String getServiceAccount() {
Contributor

This is only used in GoogleBasicAuthConfig.java where you're checking if the returned string is null or empty. The string value is unused.

It seems this is unnecessary. You just need something like this:

if (config.isServiceAccountFilePath()) {
...
} else if (config.isServiceAccountJson()) {
...
} else {
// use default credentials
}

Contributor Author

Removed in the last update.
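
As a rough sketch of the three-way branch suggested above, the key material can be opened as a stream from either the file path or the inline JSON, with an empty result meaning "use default credentials". The class name, the `openKey` method, and its parameters are hypothetical; the plugin's actual credential loading is not shown in this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper; names and signature are assumptions for illustration only.
class CredentialStreamSketch {

  /**
   * Opens the service account key for the configured type.
   * An empty Optional signals that default credentials should be used.
   */
  static Optional<InputStream> openKey(boolean isFilePath, boolean isJson,
                                       String filePath, String json) {
    if (isFilePath) {
      try {
        return Optional.of(Files.newInputStream(Paths.get(filePath)));
      } catch (IOException e) {
        throw new UncheckedIOException("Cannot read service account file: " + filePath, e);
      }
    } else if (isJson) {
      // The JSON key is passed inline, so no file access is needed.
      return Optional.of(new ByteArrayInputStream(json.getBytes(StandardCharsets.UTF_8)));
    } else {
      // Neither type selected: fall back to default credentials.
      return Optional.empty();
    }
  }

  public static void main(String[] args) {
    Optional<InputStream> key = openKey(false, true, null, "{\"type\":\"service_account\"}");
    System.out.println("key present: " + key.isPresent());
  }
}
```

With this shape, the caller never needs the raw string returned by a getter such as getServiceAccount(); it only needs to know which branch applies.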

import org.junit.Test;

public class GoogleAuthBaseConfigTest {

Contributor

can you also add tests for validation error handling?

Contributor Author

Added tests for validation error handling in the last update.

@flakrimjusufi
Contributor Author

flakrimjusufi commented May 11, 2021

@rmstar

In the last commit, GoogleSheets plugins are also updated with serviceAccountType as JSON, macros and documentation.

Both Source and Sink were tested thoroughly and everything is working well.

Contributor

@rmstar rmstar left a comment

Can you add an ETL test with macros for google drive and sheets?

public static final String NAME_SERVICE_ACCOUNT_JSON = "serviceAccountJSON";
public static final String SERVICE_ACCOUNT_FILE_PATH = "filePath";
public static final String SERVICE_ACCOUNT_JSON = "JSON";
public static final String SCHEMA = "schema";
Contributor

where is this used?

Contributor Author

SCHEMA is used in deserialization of config:

if (properties.has(GoogleDriveSourceConfig.SCHEMA)) {
googleDriveSourceConfig.setSchema(properties.get(GoogleDriveSourceConfig.SCHEMA).getAsString());
}

/**
* Returns the instance of Schema.
* @return The instance of Schema
*/
public Schema getSchema() {
if (schema == null) {
if (dataSchemaInfo.isEmpty()) {
throw new RuntimeException("There are no headers to process. " +
Contributor

Is this not relevant anymore?

Contributor Author

It doesn't look relevant to me because we are already checking whether the source folder (directory) contains any spreadsheet files here:

private void validateSourceFolder(FailureCollector collector, List<File> spreadsheetsFiles) {
if (spreadsheetsFiles.isEmpty()) {
collector.addFailure(String.format("No spreadsheets found in '%s' folder with '%s' filter.",
getDirectoryIdentifier(), getFilter()), null)
.withConfigProperty(DIRECTORY_IDENTIFIER).withConfigProperty(FILTER);
}
}

...but I will revert this RuntimeException in getSchema().

Contributor

hm why are we throwing a RuntimeException here? If config validation fails, we should add the failure to the failure collector (pass in FailureCollector, call collector.addFailure()).

Contributor Author

PR updated.
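
To illustrate the point about collecting failures instead of throwing, here is a minimal, dependency-free sketch. `SimpleFailureCollector` is a hypothetical stand-in for CDAP's FailureCollector (only an `addFailure(message, correctiveAction)` method is modeled), and `validateHeaders` along with the corrective-action text are assumptions, not the plugin's code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: record validation problems rather than throwing RuntimeException.
class SchemaValidationSketch {

  /** Stripped-down stand-in for a failure collector. */
  static class SimpleFailureCollector {
    private final List<String> failures = new ArrayList<>();

    void addFailure(String message, String correctiveAction) {
      failures.add(correctiveAction == null ? message : message + " " + correctiveAction);
    }

    List<String> getFailures() {
      return failures;
    }
  }

  /** Returns true when headers are present; otherwise records a failure and returns false. */
  static boolean validateHeaders(List<String> headers, SimpleFailureCollector collector) {
    if (headers == null || headers.isEmpty()) {
      collector.addFailure("There are no headers to process.",
          "Check that the source folder contains spreadsheets matching the filter.");
      return false;
    }
    return true;
  }

  public static void main(String[] args) {
    SimpleFailureCollector collector = new SimpleFailureCollector();
    boolean valid = validateHeaders(new ArrayList<>(), collector);
    System.out.println("valid: " + valid);
    System.out.println("failures: " + collector.getFailures());
  }
}
```

Collecting the failure lets validation report every problem at once, whereas an exception stops at the first one.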

throw new RuntimeException("There are no headers to process. " +
"Perhaps no validation step was executed before schema generation.");
}
if (schema == null && !containsMacro("serviceAccountType")) {
Contributor

What happens if service account type is not macro (set to default, i.e. file path), but the file path is a macro?

Contributor Author

Thanks for catching this. In this scenario the validation would fail. In the next update, I will also check whether the file path or service account JSON contains a macro.
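
A minimal sketch of the broader guard discussed here: schema resolution is skipped when any credential-related property is still a macro, not only the service account type. The property names "serviceAccountType" and "serviceAccountJSON" appear in this PR; "accountFilePath" is an assumed name for the file path property, and the set of macro properties is a hypothetical stand-in for the config's containsMacro(...) lookups.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a macro guard covering all credential-related properties.
class MacroGuardSketch {

  // "accountFilePath" is an assumed property name; the other two appear in the PR.
  static final List<String> CREDENTIAL_PROPERTIES =
      Arrays.asList("serviceAccountType", "accountFilePath", "serviceAccountJSON");

  /** True only when none of the credential properties is provided as a macro. */
  static boolean canResolveSchema(Set<String> macroProperties) {
    for (String property : CREDENTIAL_PROPERTIES) {
      if (macroProperties.contains(property)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Set<String> macros = new HashSet<>();
    System.out.println(canResolveSchema(macros));
    // Type left at its default, but the file path is a macro.
    macros.add("accountFilePath");
    System.out.println(canResolveSchema(macros));
  }
}
```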

@flakrimjusufi
Contributor Author

Can you add an ETL test with macros for google drive and sheets?

I'm seeing lots of dependencies that are missing in pom.xml. Should I create a separate PR only with ETL tests? It will take some time because I will first need to include all the missing dependencies, create the ETL tests, and then thoroughly test both the Google Drive and Google Sheets plugins.

@rmstar
Contributor

rmstar commented May 18, 2021

Should I create a separate PR only with ETL Tests?

You can add a test in a separate PR if that's easier for you.

Contributor

@rmstar rmstar left a comment

please squash commits

…ions, documentation and re-factoring of GoogleDrive and GoogleSheets plugin
@flakrimjusufi
Contributor Author

please squash commits

Commits squashed.

4 participants