Batch transform: retain input values #358
Comments
When you do a non-batch transform (inference), are you able to retain your input values? From my understanding, based on https://aws.amazon.com/blogs/aws/sagemaker-nysummit2018/, batch transform is meant to be inference over large amounts of data. It still uses the same inference code as the non-batch transform, so if your inference code normally outputs the inputs it used, they should show up in your batch transform output. I don't know if it makes sense to output the input along with the inference. Is this a common practice? |
I was under the impression that this could be used for model analysis, in which case it does make sense to append the input values to the output values. |
Gotcha. So, as I understand it, you want the input values added to the output file. I am not sure we will be able to honor this, but I'll add it as a feature request. My concern is that it may increase the size of the output file; however, maybe we can add an optional flag to output the inputs too. Otherwise, I believe the output is ordered, meaning it follows the same order as the input, top down. So, while annoying, you might be able to read your input file and match it line by line with your output. |
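The order-based workaround described above can be sketched in Python. This is only a minimal illustration, assuming both files are line-delimited and the output really does preserve input order; the helper name `pair_by_position` and the sample records are hypothetical. It streams both files rather than loading them into memory, and it fails loudly if the line counts differ instead of silently misaligning rows.

```python
import io
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple

_MISSING = object()  # sentinel marking the shorter of the two streams


def pair_by_position(
    inputs: Iterable[str], outputs: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (input_line, output_line) pairs, matched purely by line position.

    Assumes the transform output follows the same order as the input.
    Raises ValueError as soon as one stream runs out before the other.
    """
    pairs = zip_longest(inputs, outputs, fillvalue=_MISSING)
    for line_no, (inp, out) in enumerate(pairs, start=1):
        if inp is _MISSING or out is _MISSING:
            raise ValueError(f"line count mismatch at line {line_no}")
        yield inp.rstrip("\n"), out.rstrip("\n")


# Hypothetical sample data standing in for the input file and the .out file.
input_file = io.StringIO("id1,0.5,1.2\nid2,0.7,3.4\n")
output_file = io.StringIO("0.91\n0.13\n")

joined = [f"{inp},{out}" for inp, out in pair_by_position(input_file, output_file)]
for row in joined:
    print(row)
```

With real data, the two `io.StringIO` objects would be replaced by open file handles for the downloaded input and output files.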
This would definitely be useful. I have a postprocessing step that compares one of the input features to the prediction and tweaks the prediction if a certain condition is met. Including all the input columns might result in a very big file, so it would be nice to be able to only choose one or two of the input columns to include. |
Working around this has been somewhat tedious. Our team settled on zipping lines from the output and input files using AWS Glue or Lambda. We first tried a Glue job that matched lines from both files, but we parsed the files as Spark dataframes and there was no function to enumerate rows (due to the parallel nature of Spark, I guess) so that we could match them. We then thought Lambda would not be a problem, since the zipping should be an easy task, but we realized there was too much data to process. The Lambda timed out and would sometimes run out of memory due to an issue with Boto3 (boto/boto3#1670). We then went back to a Glue job, this time without the usual Glue APIs, just plainly loading the files from S3 and merging the lines, which also lets us perform further transformations to convert the time series categories back to meaningful values. The last issue we've had is inconsistency in the number of lines output by DeepAR, since sometimes the input and output line counts don't match. Our team has lost about 3 days of work on this because the output files contain no meaningful (and custom) information that links them back to their corresponding input. It would be nice to have these features in the short term to improve the usability of the product. |
Hi @chang2394 @peacej @Gloix , Thanks for contributing your thoughts. SageMaker is working on this feature request. We cannot provide any ETA now but we will get back to this issue when it's complete. |
Hi! @chang2394 |
This would be extremely helpful! We are looking for a way to retain the input values (mainly for identification purposes) so that we can match the corresponding input to the output results. |
Agreed! We would also benefit from this tremendously. Please keep us posted when you have an ETA @yangaws, thanks! |
+1, Need to map the prediction result back to the original input |
I solved it using tf.contrib.estimator.forward_features |
I would also like an option that adds the original values or a specified column to the result, because in my case I do not use TensorFlow. |
Is there an update on the feature request? I am using the pre-built XGBoost algorithm and would love to see the feature there. |
Is there any update on this feature? Ideally, we could have an option to specify a list of columns (normally IDs and keys) to make sure the batch inference results are mapped correctly to the input. In addition, an option to operate on the predictions would be great as well (for calculating percentiles, generating a probability or a label from a probability, etc.). |
Hi all, please check out last week's update to SageMaker Batch Transform, which will satisfy these use cases. Currently, we support CSV, JSON and JSON Lines formats. Feature Documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform-data-processing.html Blog Post showing an example with a standard UCI dataset: https://aws.amazon.com/blogs/machine-learning/associating-prediction-results-with-input-data-using-amazon-sagemaker-batch-transform/ Companion Notebooks: https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker_batch_transform/batch_transform_associate_predictions_with_input Please let us know your feedback on this feature! @tnaduc @MrYoungblood @madmadmadman @saritajoshi9389 @chrispruitt @lincolmr @joseramoncajide @Gloix @peacej @chang2394 |
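For readers who want a rough idea of what this looks like in practice, here is a minimal sketch. It assumes the `DataProcessing` block of the `CreateTransformJob` request with its `InputFilter`, `JoinSource` and `OutputFilter` fields, as described in the linked feature documentation; the helper `build_data_processing` is hypothetical and only assembles the dictionary, and the filter expressions assume a CSV input whose first column is a record ID that the model should not see.

```python
from typing import Dict, Optional


def build_data_processing(
    input_filter: Optional[str] = None,
    join_source: str = "Input",
    output_filter: Optional[str] = None,
) -> Dict[str, str]:
    """Assemble a DataProcessing block for a batch transform request.

    Hypothetical helper: it only builds a dict and does not call AWS.
    """
    if join_source not in ("Input", "None"):
        raise ValueError("join_source must be 'Input' or 'None'")
    config = {"JoinSource": join_source}
    if input_filter is not None:
        config["InputFilter"] = input_filter
    if output_filter is not None:
        config["OutputFilter"] = output_filter
    return config


data_processing = build_data_processing(
    input_filter="$[1:]",     # drop column 0 (the ID) before calling the model
    join_source="Input",      # append each prediction to its input record
    output_filter="$[0,-1]",  # keep only the ID and the prediction
)
print(data_processing)

# With boto3, the dict would be passed alongside the other required
# arguments (job name, model name, input/output locations, resources):
#   boto3.client("sagemaker").create_transform_job(
#       ..., DataProcessing=data_processing)
```

In the SageMaker Python SDK, the same three settings are passed to `Transformer.transform` as `input_filter`, `join_source` and `output_filter`. The blog post's example also sets the input `split_type` and the output `assemble_with` to `Line` so that records are joined one by one.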
I think this is a great and needed feature. |
Batch inference jobs on a SageMaker Autopilot model produce only labels as output. Can we get scores as well? Is there an option for that? I need to do some post-processing on the scores. |
We need to improve the overall solution for associating inputs with outputs. A common use case is an inference pipeline; in my case it is a Spark ML container followed by XGBoost. I pass a CSV file as input, which gets converted internally to a sparse vector frame that is passed on to the XGBoost image. At that point, I cannot associate inputs with outputs. Also, all the hosted algorithms need to be enhanced to accept the same set of input formats. XGBoost accepts only CSV and libsvm, but association works for CSV only. I cannot use CSV because I have sparse vectors in my input frame, and making them dense is not an option owing to the size of the result when parsing and saving as CSV. |
I am running a batch transform on a DeepAR model, but the output contains only the mean and quantile values. Is there some way to retain the input values from which the output was produced?
This is required since I have to plot prediction vs. actual for records matching specified filtering criteria.