-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes data invariant format problems #109
Conversation
Thanks for writing code to fix the tests @veikkoeeva ! A more philosophical question, one I don't know the answer to, maybe you or other reviewers could chime in: For things like output format data it's important that we write that in the invariant culture (since it would be weird if someone wrote data from this library, then someone else was unable to read it, merely because they were in a different region). For this reason, I'd say (just to pick an example), the changes to I start to become slightly more uncomfortable, with the logs (again, just to pick one single example, the changes in I wonder if, while keeping your changes, it is possible to "solve" this problem, in the baseline tests at least, setting the culture explicitly to |
9e88532
to
1e6afe4
Compare
@TomFinley Good points. There are other problems too, I put https://github.com/dotnet/machinelearning/pull/109/files#diff-8be2040d2cb5bab8e8be76992fc763c9R179 since otherwise it would take the results as string, parse it to a float and have a completely different result when again printed out as string (if I recall correctly). I'm not sure if this would handle such a situation, but let's pause and think. Should we fix this now in some particular fashion and open a few more tickets for further refactoring? See previous point about Optimizer -- and your point may work. I'll check and see. This is how I would like to change the system: Have clearly separated data structures for computation and results, they're likely mostly there. Then have some sort of a separate formatter to the results to desired format. This might require something that traverses the data structure and then format individual entries while traversing, ideally writing to the sink as it goes so that if the sink doesn't accumulate results, they get written out immediately. Then the tests would use exactly the same facilities as any other code. As for example classes that might be useful to frame discussion, perhaps there's room to introduce IFormatProvider (one another exampe here). To do the aforementioned, one would need to separate code that logs/traces for "technical purpose" and then to write a result. I.e. write result clearly in the end as in choosing a suitable I think the data structures are like this gist: https://gist.github.com/veikkoeeva/50c8f38ec46b0a3ce70467d16af00dc1. If the columns there would be heterogenous (with type information), it could be processed with other standard facilities. Then one could write something like at https://stackoverflow.com/questions/2758850/make-csv-from-list-of-string-in-linq. The idea is to make plain data structures and separate printing and other handling logic off of them. I apologize also as I likely come across wrongly here. I'm unfamiliar with the code, but to throw around some ideas. I'm unfortunately somewhat busy for the next few weeks, but I think this would be nice discussion to have together with logging/tracing. |
@TomFinley Excellent suggestion. The tests are green with all the changes you see now and in addition to I'll remove changes and see how well this change holds, taking a note of <edit: Changed locale switch in tests to the base class instead of the inherited one. |
1e6afe4
to
f1b02ca
Compare
@@ -19,6 +19,7 @@ | |||
using Microsoft.ML.Runtime.Model.Pfa; | |||
using Microsoft.ML.Runtime.Internal.Internallearn; | |||
using Newtonsoft.Json.Linq; | |||
using System.Globalization; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Sort order?
@@ -8,6 +8,7 @@ | |||
using System.Collections.Generic; | |||
using Microsoft.ML.Runtime.Data; | |||
using Microsoft.ML.Runtime.Internal.Utilities; | |||
using System.Globalization; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sort order?
@@ -13,6 +13,7 @@ | |||
using Microsoft.ML.Runtime.Data; | |||
using Microsoft.ML.Runtime.Internal.Calibration; | |||
using Microsoft.ML.Runtime.Internal.Utilities; | |||
using System.Globalization; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sort order?
Cool, sounds great -- yes makes sense, thank you! In reply to: 388157235 [](ancestors = 388157235) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @veikkoeeva , is your note about thinking it's inappropriate to merge still current?
@TomFinley Don't merge yet. :) I had to sleep and need to take care day job in between, but I should get the test today (likely you should have day there at that day). |
f1b02ca
to
5bfa668
Compare
Should we explicitly be passing the "CurrentCulture" to those places? That way it is marked as "I thought about the usage of this formatting, and it is intended for human consumption." |
@eerhardt The problem was/is the results are printed out to a file in different formating as the ones that are used in the control set for comparison. Imbuing I don't think it's a correct idea to think these files should print in the locale of the user. Firstly, the text is in English either way so having numbers, dates and so on in different local would print them out in that way whereas currently there are these It looks like partially the problem here is also a separation of what is tracing and logging and what should be writing results. This is something I may get wrong, but writing to output "here and there" and type transformations are a bit of a problem. Hence the currently in the commit message it reads Fixes data invariant format problems The tests do not pass on machines that have different formatting than English language. The error happens since the results are written in different than expected format. 1. The main fix is to imbue en-US culture to the test thread so that results will be output in format that is comparable with the test format. 2. A secondary fix is to make comparisons between culture sensitive data type representations invariant when they do not have human readable dimensions. In OptimizationMonitor.cs case the cast between culture sensitive floating point and string will cause orders of magnitudes of error in output results. The intention of this path is not to offer a robust solution and remove future issues. There is room for refactoring where, for instance, locale information would be applied to input and output and logging/tracing would be clearly separated from another kind of locale sensitive handling. This way culture sensitive parts would be separated and particular output formats could be tested as separate cases if so desired. |
5bfa668
to
135791d
Compare
@eerhardt Structured logging would nice, by the way. :) And all those places that use Hence the comment it'd be easier to debug if one had the result as a set and then just compared it in the end and if needed, output using given locale. I would expect this is something the users of the libraries want to do, though. If there's something like |
@TomFinley I'm off to sleep now. I'll be around tomorrow (and the weekend) if this warrants more discussion or other changes (I might have missed some feedback too). It's OK to me to step back and consider the cases as much as needed. :) |
@eerhardt are more changes requested here? I'm unclear. @veikkoeeva looks like you will have to rebase/merge master |
@danmosemsft A good question. It could be just imbuing the locale will unblock the tests, but I'm not sure if it'd leave runtime problems (as I had to change the Optimizer too). Without imbuing the locale almost all the test passes just by changing the output to invariant culture, but a few issues remained. So the options could be:
I can rebase too tomorrow. It's almost midnight here, hence pushing for tomorrow. :) |
I'm not sure I'm convinced that all the places that now pass in InvariantCulture are right. For example: _ch.Info("{0}\t Time elapsed(s): {1}\n\n", DateTime.Now.ToString(CultureInfo.InvariantCulture), elapsedSeconds.ToString(CultureInfo.InvariantCulture)); This really looks like something that a user would be reading, and not another program. Thus, it shouldn't be using the InvariantCulture. (Although, obviously we are only using English words here, so localizing these messages would also be necessary if we wanted to be correct for any user.) |
Might be more correct to use, say, |
I disagree. In the cases where we are writing to a log file that a human will read, it is better to write using the current culture, that way the format of the date is written in the format the current user is used to. For example, in the US it is common to have For testing - I agree we can fix our tests by explicitly using a hard-coded culture (which you've done). But this change should only happen in the tests. All the changes in the product code, should be doing what is best for the user and the product. Can you revert the changes you made to the product code and just leave the test changes? Does that fix the issue? Also, there are now conflicts that need to be resolved. Can you take care of that? |
@eerhardt I'll revert changes to non-test code. Most likely I'll just try imbuing the locale in tests and see if it fixes the problems of tests not passing. If not, have to see why not. |
The tests do not pass on systems with locale other than en-US. The error happens since the results are written to files and the contents of the files are compared to set of correct results produced under en-US locale. The fix is to imbue en-US culture to the test thread so that results will be output in format that is comparable with the test format. This patch fixes only tests, but do not guarantee calculation will be correct in production systems using a locale different than en-US. In particular, there can be problems in reading data and then conversing data from characters to numeric format. Fixes dotnet#74
135791d
to
611858f
Compare
@eerhardt I updated to imbue locale only. It seems to fix the tests, but see the notes I don't think this guarantees runtime is OK under different locales. Might be, but it's not sure and the fix kind of masks the problem if there are. If so, maybe that's only temporary.
I disagree with that. I like to read consistently in one language so I don't have the mental load of multiple ones in the same text. Maybe also, if human readable reporting is the thing here, produce JSON and have a parser and another tool for viewing that output. But it looks to me the discussion should be about comparing in-memory data structures and then optionally test files are written according to some locale. Maybe there should be tests reading data in user specified locale and printing in user specified locale and then a comparison test to see if the results were valid. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking care of this, @veikkoeeva. This looks good to me.
My comment wasn't referring to language, but it was referring to the format of the dates. Should the month go first? Or should the day go first? There are plenty of countries that use English as the language, but don't use the US date format. |
Thanks @veikkoeeva , @eerhardt and @TomFinley. I will merge this tomorrow. |
* Bump master to v0.3 (#269) * RocketEngine fix for selecting top learners (#270) * Changes to RocketEngine to fix take top k logic. * Add namespace information to allow file to reference correct version of Formatting object. * small code cleanup (#271) * Preparation for syncing sources with internal repo (#275) * make class partial so I can add constuctor in separate file. add constructros for testing * formatting * Changes to use evaluator metrics names in PipelineSweeperSupportedMetrics. Made the private const strings in two classes public. (#276) * add missing subcomponents to sweepers (#278) * add missing subcomponents * right one * more cleanup * remove lotus references. (#252) * Random seed and concurrency for tests (#277) * first attempt * add comments * specify seed for random. make constructor internal. * Fix SupportedMetric.ByName() method (#280) * Fix for SupportedMetric.ByName() method. Include new unit test for function. * Fix for SupportedMetric.ByName() method. Include new unit test for function. * Fix for SupportedMetric.ByName() method. Include new unit test for function. * Removed unnecessary field filter, per review comment. * ML.NET-242: FastTreeRanking per-iteration loss metrics are empty (#289) When training a FastTreeRanker using the `testFrequency` parameter, it is expected that NDCG is prented every testFrequency iterations. However, instead of NDCG, only empty strings are printed. The root cause was that the MaxDCG property of the dataset was never calculated, so the NDCG calculation is aborted, leaving an empty string as a result. This PR fixes the problem by computing the MaxDCG for the dataset when the Tests are defined (so that if the tests are not defined, the MaxDCG will never be calculated). Closes #242 * Fixed typo in the method summary (#296) * Remove stale line of code from test. (#297) * Update release notes link to use aka.ms. (#294) Our release notes link is broken because the `Documentation` was renamed to `docs`. Fix this for the future to use a redirection link. * Add release notes for ML.NET 0.2 (#301) * Add release notes for ML.NET 0.2 * Adding release note about TextLoader changes and additional issue/PR references * Addressing comments: fixing typos, changing formatting, and adding references * Get the cross validation macro to work with non-default column names (#291) * Add label/grou/weight column name arguments to CV and train-test macros * Fix unit test. * Merge. * Update CSharp API. * Fix EntryPointCatalog test. * Address PR comments. * update sample in README.MD with 0.2 features. (#304) * update sample with new text loader API. * update with 0.2 stuff. * OVA should respect normalization in underlying learner (#310) * Respect normalization in OVA. * some cleanup * fix copypaste issues * Export to ONNX and cross-platform command-line tool to script ML.NET training and inference (#248) * Export to ONNX and Maml cross-platform executable. * Add Cluster evaluator (#316) * Add Cluster evaluator * fix copypaste * address comments * formatting * Fixes locale dependent test output comparisons (#109) The tests do not pass on systems with locale other than en-US. The error happens since the results are written to files and the contents of the files are compared to set of correct results produced under en-US locale. The fix is to imbue en-US culture to the test thread so that results will be output in format that is comparable with the test format. This patch fixes only tests, but do not guarantee calculation will be correct in production systems using a locale different than en-US. In particular, there can be problems in reading data and then conversing data from characters to numeric format. Fixes #74 * Add PartitionedFileLoader (#61) * Remove unexisting project from solution (#335) * GetSummaryDataView/Row implementation for Pca and Linear Predictors (#185) * Implement `ICanGetSummaryAsIDataView` on `PcaPredictor` class * Implement `ICanGetSummaryAsIRow` on `LinearPredictor` class * Disable ordinary least squares by removing the entry point (#286) * Disable ols by temporarily removing the entry point. It may be added again once we figure out how to ship MKL as part of this project. * add append function to pipeline (#284) Add `Append` function to pipeline for more fluent API than that allowed by `Add` * Removed field/column name checking of input type in TextLoader. (#327) * fix namespace issue in CSharpGenerator and some refactoring (#339) fix namespace issue and refactoring * Using named-tuple in OneToOneTransforms' constructor to make API more readable. (#324) * Minor formatting in CollectionDataSourceTests.cs (#348) * Create CalibratedPredictor instead of SchemaBindableCalibratedPredictor (#338) `CalibratorUtils.TrainCalibrator` and `TrainCalibratorIfNeeded` now creates `CalibratedPredictor` instead of `SchemaBindableCalibratedPredictor` whenever the predictor implements `IValueMapper`. * Remove reference and dependency on System.ValueTuple (#351) * Add link to samples (#355) * Use HideEnumValueAttribute for both manifest and C# API generation. (#356) * Use HideEnumValueAttribute for both manifest and C# API generation. * Unhide NAReplaceTransform.ReplacementKind.SpecifiedValue. This may require some other PR to resolve the corresponding issues. * Move the NuGet package build files into a TFM specific directory. (#370) When installing Microsoft.ML on an unsupported framework (like net452), it is currently getting installed successfully. However, users should be getting an error stating that net452 is not supported by this package. The cause is the build files exist for any TFM, which NuGet interprets as this package supports any TFM. Moving the build files to be consistent with the 'lib' folder support. Fix #357 * `Stream` subclasses now have `Close` call `base.Close` to ensure disposal. (#369) * Subclasses of `Stream` now have `Close` call `base.Close` to ensure disposal. * Add DeleteOnClose to File opening. * Remove explicit delete of file. * Remove explicit close of substream. * Since no longer deleting explicitly, no longer need `_overflowPath` member. * Return distinct array of ParameterSet when ProposeSweep is called (#368) * Changed List to HashSet to ensure that there are no duplicates * Update fast tree argument help text (#372) * Update fast tree argument help text * Update wording * Update API to fix test * Update core manifest JSON to update help text * Combine multiple tree ensemble models into a single tree ensemble (#364) * Add a way to create a single tree ensemble model from multiple tree ensemble models. * Address PR comments, and fix bugs in serializing/deserializing RegressionTrees. * Address PR comments. * add pipelineitem for Ova (#363) add pipelineitem for Ova * Fix CV macro to output the warnings data view properly. (#385) * Link to an example on using converting ML.NET model to ONNX. (#386) * Adding documentation about entry points, and entry points graphs: EntryPoints.md and GraphRunner.md (#295) * Adding EntryPoints.md and GraphRunner.md * addressing PR feedback * Updating the title of the GraphRunner.md file * adressing Tom's feedback * adressing feedback * code formatting for class names * Addressing Gal's comments * Adding an example of an entry point. Fixing casing on ML.NET * fixing link * Adding LDA Transform (#377) * Revert to using the native code (#413) Corrects an unintentional "typo" in FastTreeRanking.cs where there was mistakenly a USE_FASTTREENATIVE2 instead of USE_FASTTREENATIVE. This resulted in some obscure hidden ranking options (distance weighting, normalize query lambdas, and a few others) being unavailable. These are important for some applications. * LightGBM (#392) * LightGBM and test. * add test baselines and nuget source for lightGBM binaries. * Add entrypoint for lightGBM. * add unsafe flag for release build. * update nuget version. * make lightgbm test single threaded. * install gcc on OS machines to resolve dependencies on openmp thatis needed by lightgbm native code. * PR comments. Leave BREW and GCC in bash script to verify macOS tests work. * remove brew and gcc from build script. * PR feedback. * disable test on macOS. * disable test on macOS. * PR feedback. * Adding Factorization Machines (#383) * Adding Factorization Machines * ONNX API documentation. (#419) * ONNX API documentation. * Bring ensembles into codebase (#379) Introduce Ensemble codebase * enable macOS tests for LightGBM. (#422) * Create a shorter temp file name for model loading. (#397) Create a shorter temp file name for model loading, as well as remove the potential for a race condition among multiple openings by using the creation of a lock file. * removing extraneous character that broke the linux build, and with it unecessary cmake version requirement (#425) * EvaluatorUtils to handle label column of type key without text key values (#394) * Fix EvaluatorUtils to handle label column of type key without text key values. * Removing non source files from solution (#362)
The tests do not pass on systems with locale other than en-US. The error happens since the results are written to files and the contents of the files are compared to set of correct results produced under en-US locale. The fix is to imbue en-US culture to the test thread so that results will be output in format that is comparable with the test format. This patch fixes only tests, but do not guarantee calculation will be correct in production systems using a locale different than en-US. In particular, there can be problems in reading data and then conversing data from characters to numeric format. Fixes dotnet#74
Looks like current solution doesn't work always.
|
The tests do not pass on machines that have different formatting than English language. The error happens since the results are written in different than expected format.
Fixes #74