Batch transform: retain input values #358

Closed
chang2394 opened this issue Aug 20, 2018 · 18 comments

Comments

@chang2394

chang2394 commented Aug 20, 2018

I am running a batch transform on a DeepAR model, but the output contains only the mean and quantile values. Is there some way to retain the input values from which the output was produced?
This is required because I have to plot prediction vs. actual for records matching specified filtering criteria.

@ChoiByungWook
Contributor

When you do a non-batch transform (inference), are you able to retain your input values?

From my understanding based on https://aws.amazon.com/blogs/aws/sagemaker-nysummit2018/, batch transform is meant for running inference over lots of data. It still uses the same inference code as a non-batch transform, so if your inference code normally outputs the inputs used, they should show up in your batch transform output.

I don't know if it makes sense to output the input along with the inference. Is this a common practice?

@chang2394
Author

I was under the impression that this could be used for model analysis, in which case it does make sense to append the input to the output values.
If that is not the case, can you suggest some other way to solve this? I am looking for something similar to tensorflow-model-analysis for SageMaker models.

@ChoiByungWook
Contributor

Gotcha.

So from my understanding, you want the input values to be added into the output file.

I am not too sure if we will be able to honor this, but I'll add it as a feature request. My concern is that it may increase the size of the output file; however, maybe we can add an optional flag to output the inputs too.

Otherwise, I believe the output is ordered, meaning that it follows the same top-down order as the input. So, while annoying, you might be able to read your input file and match each record with the corresponding line of the output.
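
The positional matching described above could look roughly like the sketch below. It assumes both files are JSON Lines with one record per line and that the output really does preserve input order. The function name `join_by_position` and the sample records are made up for illustration and are not part of any SageMaker API.

```python
import json


def join_by_position(input_lines, output_lines):
    """Pair each input record with the prediction on the same line number."""
    # Skip blank lines so a trailing newline does not shift the pairing.
    inputs = [json.loads(line) for line in input_lines if line.strip()]
    outputs = [json.loads(line) for line in output_lines if line.strip()]
    if len(inputs) != len(outputs):
        raise ValueError(
            f"line count mismatch: {len(inputs)} inputs vs {len(outputs)} outputs"
        )
    return [{"input": i, "prediction": o} for i, o in zip(inputs, outputs)]


# Illustrative records only; real files would be read from disk or S3.
input_lines = [
    '{"start": "2018-01-01", "target": [1.0, 2.0, 3.0]}',
    '{"start": "2018-01-01", "target": [4.0, 5.0, 6.0]}',
]
output_lines = [
    '{"mean": [3.5]}',
    '{"mean": [6.5]}',
]

joined = join_by_position(input_lines, output_lines)
print(joined[1]["input"]["target"], joined[1]["prediction"]["mean"])
```

The length check matters: if the two files ever differ in line count, a silent `zip` would pair records with the wrong predictions.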

@peacej

peacej commented Sep 5, 2018

This would definitely be useful. I have a postprocessing step that compares one of the input features to the prediction and tweaks the prediction if a certain condition is met. Including all the input columns might result in a very big file, so it would be nice to be able to only choose one or two of the input columns to include.

apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this issue Nov 15, 2018
Removed attachedModel in cleanup to avoid error
@Gloix

Gloix commented Jan 28, 2019

Working around this has been somewhat tedious. Our team was deciding whether to zip lines from the output and input files on AWS Glue or on Lambda. We first tried a Glue job that matched lines from both files, but we parsed the files as Spark dataframes and found no function to enumerate rows (due to the parallel nature of Spark, I guess) so that we could match them.

We then thought Lambda would not be a problem, since the zipping should be an easy task, but we realized there was too much data to process. The Lambda timed out and would sometimes run into memory problems due to an issue with Boto3 (boto/boto3#1670).

We then went back to a Glue job, this time without the usual Glue APIs: just plainly loading the files from S3 and merging the lines. We could even perform further transformations to convert the time series categories back to meaningful values.

The last issue we've had is that the number of lines output by DeepAR is inconsistent: sometimes it doesn't match the number of input lines.
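
A line-by-line merge that keeps memory use flat and reports a line-count mismatch instead of silently misaligning records might be sketched as follows. It assumes JSON Lines files with no blank lines. `merge_streams` is a hypothetical helper, and the `io.StringIO` objects stand in for files streamed from S3.

```python
import io
import json
from itertools import zip_longest

_MISSING = object()  # sentinel marking that one stream ended before the other


def merge_streams(input_fp, output_fp):
    """Yield one merged JSON line per record without loading either file fully."""
    pairs = zip_longest(input_fp, output_fp, fillvalue=_MISSING)
    for lineno, (inp, out) in enumerate(pairs, start=1):
        if inp is _MISSING or out is _MISSING:
            shorter = "input" if inp is _MISSING else "output"
            raise ValueError(f"{shorter} file ended early at line {lineno}")
        record = json.loads(inp)
        # Attach the prediction under a key of our own choosing.
        record["prediction"] = json.loads(out)
        yield json.dumps(record)


# Illustrative data only.
input_fp = io.StringIO('{"cat": 0, "target": [1, 2]}\n{"cat": 1, "target": [3, 4]}\n')
output_fp = io.StringIO('{"mean": [2.5]}\n{"mean": [4.5]}\n')

merged = list(merge_streams(input_fp, output_fp))
for line in merged:
    print(line)
```

Because the function is a generator, each merged line can be written out as soon as it is produced, so only one record from each file is held in memory at a time.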

Our team has lost about 3 days of work on this due to not having meaningful (and custom) information in the output files which links them back to their corresponding input.

It would be nice to have these features in the short term to improve the usability of the product.

@yangaws
Contributor

yangaws commented Jan 29, 2019

Hi @chang2394 @peacej @Gloix ,

Thanks for contributing your thoughts. SageMaker is working on this feature request. We cannot provide any ETA now but we will get back to this issue when it's complete.

@joseramoncajide

Hi! @chang2394
This is just what I need, the way to return the features along with the model predicted results!

@lincolmr

This would be extremely helpful! We are looking for a way to retain the input values (mainly for identification purposes) so that we can match the corresponding input to the output results.

@chrispruitt

Agreed! We would also benefit from this tremendously. Please keep us posted when you have an ETA @yangaws, thanks!

@saritajoshi9389

+1, Need to map the prediction result back to the original input

@joseramoncajide

I solved it using tf.contrib.estimator.forward_features.
I asked the AWS team to update TF to 1.12 on their SageMaker instances, and now it works. You can tie batch predictions to incoming data using a feature key, as described here: https://www.tensorflow.org/api_docs/python/tf/contrib/estimator/forward_features
Hope it helps!

@madmadmadman

madmadmadman commented Mar 27, 2019

I would also like an option that adds the original values or a specified column to the result, because in my case I do not use TensorFlow.

@MrYoungblood

Is there an update on the feature request? I am using the pre-built XGBoost algorithm and would love to see the feature there.

@tnaduc

tnaduc commented May 28, 2019

Is there any update on this feature?

Ideally, we would have an option to specify a list of columns (normally IDs and keys) so that the batch inference results are mapped correctly to the input. An option to operate on the predictions would be great as well (for calculating percentiles, generating a probability, deriving a label from a probability, etc.).
Thanks

@j3ffreyjohn

j3ffreyjohn commented Jul 8, 2019

Hi all, please check out last week's update to SageMaker Batch Transform, which should satisfy these use cases. Currently, we support the CSV, JSON and JSON Lines formats.

Feature Documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform-data-processing.html

Blog Post showing an example with a standard UCI dataset: https://aws.amazon.com/blogs/machine-learning/associating-prediction-results-with-input-data-using-amazon-sagemaker-batch-transform/

Companion Notebooks: https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker_batch_transform/batch_transform_associate_predictions_with_input

Please let us know your feedback on this feature!
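
As a rough sketch of how the feature is configured: the `DataProcessing` block of a `CreateTransformJob` request carries the `InputFilter`, `JoinSource` and `OutputFilter` fields. The helper `build_data_processing` below is hypothetical, and the CSV layout (an ID in column 0, features after it) is an assumption made for illustration. Check the linked documentation for the authoritative field semantics.

```python
import json


def build_data_processing(input_filter="$", join_source="None", output_filter="$"):
    """Build the DataProcessing block of a CreateTransformJob request."""
    if join_source not in ("Input", "None"):
        raise ValueError("JoinSource must be 'Input' or 'None'")
    return {
        "InputFilter": input_filter,
        "JoinSource": join_source,
        "OutputFilter": output_filter,
    }


# Assumed CSV layout: column 0 is a record ID, the remaining columns are features.
data_processing = build_data_processing(
    input_filter="$[1:]",     # drop the ID column before the row reaches the model
    join_source="Input",      # append the prediction to the original input row
    output_filter="$[0,-1]",  # keep only the ID and the appended prediction
)

# This dict would be passed as the DataProcessing argument of a
# CreateTransformJob call; nothing is sent to AWS here.
print(json.dumps(data_processing, indent=2))
```

In the SageMaker Python SDK, the corresponding arguments to `Transformer.transform` are `input_filter`, `join_source` and `output_filter`.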

@tnaduc @MrYoungblood @madmadmadman @saritajoshi9389 @chrispruitt @lincolmr @joseramoncajide @Gloix @peacej @chang2394

@mvsusp mvsusp closed this as completed Jul 9, 2019
@martinhammar

I think this is a great and needed feature.
Can't get it working with BlazingText and jsonlines data, though. Is that supported?

@adarsh-dattatri

Batch inference jobs on a SageMaker Autopilot model produce only labels as output. Is there an option to get the scores as well? I need to do some post-processing on the scores.

@maddy2u

maddy2u commented Apr 28, 2020

The existing solution for associating inputs with outputs needs to be improved overall.

A common use case is to have an inference pipeline set up; in my case it is a Spark ML container followed by XGBoost. I pass in a CSV file, which is converted internally to a sparse vector frame that is passed on to the XGBoost image. At this point, I cannot associate inputs with outputs. Also, all the hosted algorithms need to be enhanced to accept the same set of input formats. XGBoost accepts only CSV and libsvm, but association works for CSV only. I cannot use CSV because I have sparse vectors in my input frame, and making them dense is not an option owing to the size that results when parsing and saving as CSV.
