feat: add translator #1108

Merged
serena-ruan merged 16 commits into microsoft:master from the serena/addTranslator branch on Jul 13, 2021

serena-ruan
Contributor

Add translator into mmlspark

@serena-ruan
Contributor Author

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

@serena-ruan serena-ruan marked this pull request as ready for review July 6, 2021 10:06
@serena-ruan serena-ruan requested a review from mhamilton723 as a code owner July 6, 2021 10:06
@serena-ruan
Contributor Author

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

@serena-ruan serena-ruan changed the title feat: add text translation feat: add translator Jul 8, 2021
@serena-ruan
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@codecov

codecov bot commented Jul 8, 2021

Codecov Report

Merging #1108 (82f026e) into master (d287be6) will decrease coverage by 0.18%.
The diff coverage is 78.16%.

@@            Coverage Diff             @@
##           master    #1108      +/-   ##
==========================================
- Coverage   85.55%   85.37%   -0.19%     
==========================================
  Files         254      257       +3     
  Lines       11805    12053     +248     
  Branches      625      629       +4     
==========================================
+ Hits        10100    10290     +190     
- Misses       1705     1763      +58     
Impacted Files                                          Coverage Δ
.../microsoft/ml/spark/cognitive/FormRecognizer.scala   81.00% <ø> (ø)
.../microsoft/ml/spark/cognitive/TextTranslator.scala   76.24% <76.24%> (ø)
.../microsoft/ml/spark/cognitive/ComputerVision.scala   78.57% <76.92%> (ø)
...rosoft/ml/spark/cognitive/DocumentTranslator.scala   81.96% <81.96%> (ø)
...crosoft/ml/spark/cognitive/TranslatorSchemas.scala   100.00% <100.00%> (ø)
...ala/org/apache/spark/ml/param/DataFrameParam.scala   66.66% <0.00%> (-16.67%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d287be6...82f026e. Read the comment docs.

@serena-ruan
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@serena-ruan
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@serena-ruan
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@serena-ruan
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@serena-ruan
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@serena-ruan
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

))).toJson.compactPrint, ContentType.APPLICATION_JSON))
}

private def queryForResult(key: Option[String],
Collaborator

Are there any similarities that would allow us to abstract this and other async querying logic into the same function?

Contributor Author

Done.
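
A minimal sketch of what a shared async-querying helper could look like, not the code that was actually merged. It assumes each long-running service exposes some status check that either reports "still running" or returns a result; `AsyncPolling`, `PollResult`, `Pending`, `Finished` and `pollUntilDone` are hypothetical names introduced for illustration only.

```scala
import scala.annotation.tailrec

object AsyncPolling {
  /** Outcome of one status check against a long-running operation (hypothetical type). */
  sealed trait PollResult[+A]
  case object Pending extends PollResult[Nothing]
  final case class Finished[A](value: A) extends PollResult[A]

  /**
   * Calls `check` until it reports Finished or `maxTries` attempts are used up,
   * sleeping `delayMillis` between attempts. Returns None when the attempts run out.
   */
  @tailrec
  def pollUntilDone[A](check: () => PollResult[A], maxTries: Int, delayMillis: Long): Option[A] =
    if (maxTries <= 0) None
    else check() match {
      case Finished(value) => Some(value)
      case Pending =>
        // Only wait when another attempt will actually follow.
        if (maxTries > 1) Thread.sleep(delayMillis)
        pollUntilDone(check, maxTries - 1, delayMillis)
    }
}
```

Each service would then only supply its own `check` function (how to issue the status request and interpret the response), while the retry and back-off logic lives in one place.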

with HasInternalJsonOutputParser with HasCognitiveServiceInput with HasSubscriptionRegion
with HasSetLocation {

protected val subscriptionRegionHeaderName = "Ocp-Apim-Subscription-Region"
Collaborator

This is so strange, I can't believe they make you specify this in a header lol

Contributor Author

Yes, it's quite strange... The documentation says it's optional for a global translator resource, but if we don't add it to the header, the response is '{"error":{"code":401000,"message":"The request is not authorized because credentials are missing or invalid."}}'

Collaborator

Aren't cognitive services so great!
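
For illustration, a small sketch of a request that carries the region header discussed in this thread. It uses the JDK's `java.net.http.HttpRequest` rather than the HTTP client mmlspark uses internally, and only builds the request without sending it. The object and function names, the key header constant, and the choice of the public Translator v3 `translate` endpoint are assumptions made for the example.

```scala
import java.net.URI
import java.net.http.HttpRequest

object TranslatorRequestSketch {
  val subscriptionKeyHeaderName = "Ocp-Apim-Subscription-Key"
  val subscriptionRegionHeaderName = "Ocp-Apim-Subscription-Region"

  /** Builds (but does not send) a translate request that includes the region header. */
  def buildTranslateRequest(key: String, region: String, toLanguage: String, text: String): HttpRequest = {
    // Minimal escaping for the sketch; real code would use a JSON library.
    val escaped = text.replace("\\", "\\\\").replace("\"", "\\\"")
    val body = s"""[{"Text": "$escaped"}]"""
    HttpRequest.newBuilder(URI.create(
        s"https://api.cognitive.microsofttranslator.com/translate?api-version=3.0&to=$toLanguage"))
      .header(subscriptionKeyHeaderName, key)
      // Per the thread above, omitting this header produced a 401000 error.
      .header(subscriptionRegionHeaderName, region)
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build()
  }
}
```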


def setToLanguageCol(v: String): this.type = setVectorParam(toLanguage, v)

val fromLanguage = new ServiceParam[String](this, "fromLanguage", "Specifies the language of the input" +
Collaborator

You can abstract the fromLanguage and toLanguage methods into separate traits and "mix together" to reduce code. Also be on the lookout for other params that can be "factored" out in this manner, as it's less maintenance work for us later on ;)

Contributor Author

Although there are several fromLanguage & toLanguage params in this file, the 'isRequired' parameter and the type might be different, so I didn't factor these two out of Translate. Or do you have another approach for mixing them together?

Collaborator

Okay! One idea would be to add protected fromLanguageRequired: Boolean = true on base class and just override that
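
A rough sketch of that idea, assuming a simplified stand-in for `ServiceParam`. `ParamSpec`, `HasFromLanguage`, `StrictService` and `AutoDetectingService` are hypothetical names, and the doc string is illustrative rather than the one in the PR.

```scala
object LanguageParamSketch {
  // Hypothetical, simplified stand-in for mmlspark's ServiceParam.
  final case class ParamSpec(name: String, doc: String, isRequired: Boolean)

  trait HasFromLanguage {
    // Subclasses override this flag instead of redefining the whole param.
    // A def (plus the lazy val below) avoids trait initialization-order surprises.
    protected def fromLanguageRequired: Boolean = true

    lazy val fromLanguage: ParamSpec =
      ParamSpec("fromLanguage", "language of the input (illustrative doc string)", fromLanguageRequired)
  }

  // Keeps the default: fromLanguage is required.
  class StrictService extends HasFromLanguage

  // Overrides the flag: fromLanguage becomes optional.
  class AutoDetectingService extends HasFromLanguage {
    override protected def fromLanguageRequired: Boolean = false
  }
}
```

This only covers the `isRequired` difference; params whose value type also differs between services would still need separate definitions or a type parameter on the trait.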

class TranslateSuite extends TransformerFuzzing[Translate]
with TranslatorKey with Flaky with TranslatorUtils {

lazy val translate: Translate = new Translate()
Collaborator

You can factor out the common setters into a base method, then just .set only the differing params. Isn't the fluent API nifty!

Contributor Author

Done.


lazy val textDf2: DataFrame = Seq(List("Hello, what is your name?", "Bye")).toDF("text")

lazy val textDf3: DataFrame = Seq(List("This is bullshit.")).toDF("text")
Collaborator

lol

.setOutputCol("translation")
.setConcurrency(5)

test("Translate multiple pieces of text with language autodetection") {
Collaborator

All of these tests have a similar structure; you might want to factor this structure out so that the tests are smaller to write. I know it's just test code and who cares, but it will make your life easier when you need to update, I promise

Contributor Author

Done.

}

test("Handle profanity") {
val results = translate
Collaborator

When re-using the same estimator, make translate a def, not a lazy val. I know it's weird, but the setters actually MODIFY state globally, which is a weird thing on SparkML's part

Contributor Author

Done. That solves my pain on creating multiple translators lol!
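
To make the `lazy val` versus `def` point concrete, here is a sketch with a hypothetical `FakeTranslate` class that mimics the SparkML setter style (mutate the instance, return `this.type`). It is not the real `Translate` transformer, and the setter names are only borrowed for flavor.

```scala
object SharedStateSketch {
  // Hypothetical stand-in for a SparkML stage: setters mutate the instance and return it.
  class FakeTranslate {
    private var toLanguage: Seq[String] = Seq.empty
    private var concurrency: Int = 1
    def setToLanguage(v: Seq[String]): this.type = { toLanguage = v; this }
    def setConcurrency(v: Int): this.type = { concurrency = v; this }
    def getToLanguage: Seq[String] = toLanguage
    def getConcurrency: Int = concurrency
  }

  // One shared instance: a setter call in one test leaks into every later test.
  lazy val sharedTranslate: FakeTranslate = new FakeTranslate().setConcurrency(5)

  // A fresh instance per call: common settings live here, tests set only what differs.
  def freshTranslate: FakeTranslate = new FakeTranslate().setConcurrency(5)
}
```

With the `def`, each test can write something like `freshTranslate.setToLanguage(Seq("de"))` without affecting any other test, which also covers the earlier suggestion of keeping the common setters in one base method.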

import spark.implicits._

// TODO: Replace all of those SAS urls after 2022-07-07
lazy val sourceUrl: String = "https://mmlspark.blob.core.windows.net/datasets?sp=rl&st=2021-07-06T06" +
Collaborator

Are you able to get away without SAS URLs? The datasets blob is public, I believe. Also, you might want to pull the root URL into a val to keep the code DRY

Contributor Author

Hmm, I tried, but it seems we can't access it directly using "https://mmlspark.blob.core.windows.net/datasets". It returns:

"error": {
  "code": "InvalidRequest",
  "message": "Cannot access source document location with the current permissions.",
  "target": "Operation",
  "innerError": {
    "code": "InvalidDocumentAccessLevel",
    "message": "Cannot access source document location with the current permissions."
  }
},

Collaborator

On folders it might not work, but it should work on files. Is this manually inspecting a folder and loading content from it, or is it taking it URL by URL?

@@ -48,6 +48,8 @@ object Secrets {
lazy val AnomalyApiKey: String = getSecret("anomaly-api-key")
lazy val AzureSearchKey: String = getSecret("azure-search-key")
lazy val BingSearchKey: String = getSecret("bing-search-key")
lazy val TranslatorKey: String = getSecret("translator-key")
lazy val TranslatorName: String = getSecret("translator-name")
Collaborator

What does the translator name do? Is it state or a secret?

Contributor Author

It's the service name

Collaborator

Weird

@mhamilton723
Collaborator

Awesome stuff, so great to see this flying out of your fingertips!

@serena-ruan
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


lazy val transDf: DataFrame = Seq(List("こんにちは", "さようなら")).toDF("text")

lazy val transliterate: Transliterate = new Transliterate()
Collaborator

@mhamilton723 Jul 9, 2021

Can you change lazy val -> def here and elsewhere? I know it's just used once, but if we add more tests then it might be important for avoiding subtle errors

class DetectSuite extends TransformerFuzzing[Detect]
with TranslatorKey with Flaky with TranslatorUtils {

lazy val detect: Detect = new Detect()
Collaborator

lazy val -> def

import spark.implicits._

// TODO: Replace root SAS urls after 2022-07-07
lazy val sourceRoot: String = "?sp=rl&st=2021-07-06T06:28:26Z&se=2022-07-07T06:28:00Z" +
Collaborator

@mhamilton723 Jul 9, 2021

You might want to consider factoring this in the following way:

lazy val containerSasToken = ... (if this needs to be a SAS; otherwise we should try to remove the SAS)
lazy val urlRoot = "https://mmlspark.blob.core.windows.net/"

and use the same container for all experiments. If you find the container SAS is used all over, we might consider breaking this into a single trait for all cog service tests so that we only need to update it in one location if it expires. For file-based APIs we probably don't need the SAS, but for container and folder listing operations the SAS is probably needed
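
One way this suggestion might be sketched is a single trait that owns the root URL and the container SAS token. `HasContainerSas`, `containerUrl`, `fileUrl` and `TestUrls` are hypothetical names, the token is a placeholder rather than a real SAS, and the assumption that plain file URLs work without a SAS comes from the comment above rather than from anything verified here.

```scala
object BlobUrlSketch {
  trait HasContainerSas {
    // Supplied once per test suite; updating it here covers every test that mixes the trait in.
    def containerSasToken: String

    val urlRoot: String = "https://mmlspark.blob.core.windows.net/"

    // Container and folder listing operations: append the SAS token as the query string.
    def containerUrl(container: String): String = s"$urlRoot$container?$containerSasToken"

    // Single files in a public container: assumed to work without a SAS.
    def fileUrl(container: String, path: String): String = s"$urlRoot$container/$path"
  }

  object TestUrls extends HasContainerSas {
    // Placeholder value, not a real token.
    val containerSasToken: String = "<container-sas-token>"
  }
}
```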

TargetInput(None, None, targetFileUrl2, "de", None))))
.toDF("sourceUrl", "storageType", "targets")

lazy val documentTranslator: DocumentTranslator = new DocumentTranslator()
Collaborator

lazy val -> def and factor out shared structure

Collaborator

@mhamilton723 left a comment

Thanks for making these changes. I have a few more ideas for how to continue tidying. Awesome work, and I appreciate the iterations on this :)

@serena-ruan
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@serena-ruan
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@serena-ruan serena-ruan merged commit 84d8d24 into microsoft:master Jul 13, 2021
@serena-ruan serena-ruan deleted the serena/addTranslator branch July 13, 2021 03:42
2 participants