-
Notifications
You must be signed in to change notification settings - Fork 834
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add translator #1108
feat: add translator #1108
Conversation
/azp run |
No pipelines are associated with this pull request. |
/azp run |
No pipelines are associated with this pull request. |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Codecov Report
@@ Coverage Diff @@
## master #1108 +/- ##
==========================================
- Coverage 85.55% 85.37% -0.19%
==========================================
Files 254 257 +3
Lines 11805 12053 +248
Branches 625 629 +4
==========================================
+ Hits 10100 10290 +190
- Misses 1705 1763 +58
Continue to review full report at Codecov.
|
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
…xemption in fuzzingTest
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
))).toJson.compactPrint, ContentType.APPLICATION_JSON)) | ||
} | ||
|
||
private def queryForResult(key: Option[String], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any similarities that would allow us to abstract this and other async querying logic into the same function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
with HasInternalJsonOutputParser with HasCognitiveServiceInput with HasSubscriptionRegion | ||
with HasSetLocation { | ||
|
||
protected val subscriptionRegionHeaderName = "Ocp-Apim-Subscription-Region" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is so strange i cant believe they make you specify this in a header lol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it's quite strange... the document says it's optional for global translator resource, but if we don't add it into header the response will be '{"error":{"code":401000,"message":"The request is not authorized because credentials are missing or invalid."}}'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arent cognitive services so great!
|
||
def setToLanguageCol(v: String): this.type = setVectorParam(toLanguage, v) | ||
|
||
val fromLanguage = new ServiceParam[String](this, "fromLanguage", "Specifies the language of the input" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can abstract the fromLanguage and toLanguage methods into separate traits and "mix together" to reduce code. Also be on the lookout for other params that can be "factored" out in this manner as its less maintenance work for us later on ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although there're several fromLanguage & toLanguage in this file, the 'isRequired' parameter and type might be different, so didn't factor these two within Translate out. Or do you have other approaches to mix them together?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay! One idea would be to add protected fromLanguageRequired: Boolean = true on base class and just override that
class TranslateSuite extends TransformerFuzzing[Translate] | ||
with TranslatorKey with Flaky with TranslatorUtils { | ||
|
||
lazy val translate: Translate = new Translate() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can factor out common setters into a base method then just .set only the differing params. Isnt the fluent API nifty!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
lazy val textDf2: DataFrame = Seq(List("Hello, what is your name?", "Bye")).toDF("text") | ||
|
||
lazy val textDf3: DataFrame = Seq(List("This is bullshit.")).toDF("text") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol
.setOutputCol("translation") | ||
.setConcurrency(5) | ||
|
||
test("Translate multiple pieces of text with language autodetection") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all of these tests have a similiar structure, might want to factor this structure out so that the tests are smaller to write. I know it's just test code and who cares, but it will make your life easier when you need to update I promise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
} | ||
|
||
test("Handle profanity") { | ||
val results = translate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when re-using the same estimator make translate a def not a lazy val. I know its weird but the setters actually MODIFY state globally which is a weird thing on SparkML's part
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. That solves my pain on creating multiple translators lol!
import spark.implicits._ | ||
|
||
// TODO: Replace all of those SAS urls after 2022-07-07 | ||
lazy val sourceUrl: String = "https://mmlspark.blob.core.windows.net/datasets?sp=rl&st=2021-07-06T06" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you able to get away without SAS URLs? the datasets blob is public I believe. Also you might want to pull the root URL into a val to keep code DRY
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Emmm I tried but seems can't access it directly using '"https://mmlspark.blob.core.windows.net/datasets", it returns "error": {
"code": "InvalidRequest",
"message": "Cannot access source document location with the current permissions.",
"target": "Operation",
"innerError": {
"code": "InvalidDocumentAccessLevel",
"message": "Cannot access source document location with the current permissions."
}
},
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On folders it might not work, but it should work on files. Is this manually inspecting a folder and loading content from it, or is it taking it URL by URL?
@@ -48,6 +48,8 @@ object Secrets { | |||
lazy val AnomalyApiKey: String = getSecret("anomaly-api-key") | |||
lazy val AzureSearchKey: String = getSecret("azure-search-key") | |||
lazy val BingSearchKey: String = getSecret("bing-search-key") | |||
lazy val TranslatorKey: String = getSecret("translator-key") | |||
lazy val TranslatorName: String = getSecret("translator-name") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does the translator name do? is it state or a secret?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weird
Awesome stuff, so great to see this flying out of your fingertips! |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
|
||
lazy val transDf: DataFrame = Seq(List("こんにちは", "さようなら")).toDF("text") | ||
|
||
lazy val transliterate: Transliterate = new Transliterate() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you change lazy val -> def here and elsewhere? I know its just used once but if we add more tests then it might be important for avoiding subtle errors
class DetectSuite extends TransformerFuzzing[Detect] | ||
with TranslatorKey with Flaky with TranslatorUtils { | ||
|
||
lazy val detect: Detect = new Detect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lazy val -> def
import spark.implicits._ | ||
|
||
// TODO: Replace root SAS urls after 2022-07-07 | ||
lazy val sourceRoot: String = "?sp=rl&st=2021-07-06T06:28:26Z&se=2022-07-07T06:28:00Z" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might want to consider factor this in the following way
lazy val containerSasToken = ... (If this needs to be a SAS otherwise we should try to remove SAS)
lazy val urlRoot = "https://mmlspark.blob.core.windows.net/"
and use the same container as for all experiments. If you find the container SAS to be used all over we might consider breaking this into a single trait for all cog service tests so that we only need to update in one location if it expires. For file based APIs we probably don't need the SAS but for container and folder listing operations the SAS is probably needede
TargetInput(None, None, targetFileUrl2, "de", None)))) | ||
.toDF("sourceUrl", "storageType", "targets") | ||
|
||
lazy val documentTranslator: DocumentTranslator = new DocumentTranslator() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lazy val -> def and factor out shared structure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making these changes have a few more ideas for how to continue tidying, awesome work and appreciate the iterations on this :)
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Add translator into mmlspark