Replace sparkdl's ImageSchema with Spark2.3's version #85
Conversation
smurching left a comment:
This looks pretty good, just had a few small comments + some thoughts regarding error messages.
python/sparkdl/__init__.py
Outdated
'TFImageTransformer', 'TFInputGraph', 'TFTransformer',
'DeepImagePredictor', 'DeepImageFeaturizer', 'KerasImageFileTransformer', 'KerasTransformer',
'imageInputPlaceholder']


Nit: remove the extra newlines here
project/plugins.sbt
Outdated
@@ -1,4 +1,5 @@
// You may use this file to add plugin dependencies for sbt.
resolvers += "Local Spark repo" at "file:///Users/tomas/.m2/repository"
TODO: Eventually remove this
python/sparkdl/graph/pieces.py
Outdated
image_float = tf.decode_raw(image_buffer, tf.float32, name="decode_raw")
else:
    raise ValueError('unsupported image data type "%s"' % img_dtype)
+1 for this. How hard would it be to print a list of supported image data types here (and potentially in other places where we validate data types)? Can we just use supportedOcvTypes from imageIO.py?
Yeah, good idea. I'll update the error message. I can't use the OcvTypes though; currently this code is independent, and it only knows (it's hardcoded) how to handle uint8 and float32.
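The improved error message discussed here could look like the following sketch. The helper name is hypothetical; per the thread, only uint8 and float32 are actually handled in this code path:

```python
import numpy as np

# Only these two dtypes are handled (hardcoded), per the discussion above.
_SUPPORTED_DTYPES = ("uint8", "float32")

def check_image_dtype(img_dtype):
    """Validate a dtype name, listing the supported values in the error message."""
    if img_dtype not in _SUPPORTED_DTYPES:
        raise ValueError(
            'unsupported image data type "%s", expected one of: %s'
            % (img_dtype, ", ".join(_SUPPORTED_DTYPES)))
    return np.dtype(img_dtype)
```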
python/sparkdl/image/imageIO.py
Outdated
| """ | ||
| return udf(_resizeFunction(size), imageSchema) | ||
| if len(size) != 2: | ||
| raise ValueError("New image size should have for [height, width] but got {}".format(size)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"for" -> "format"?
yup yup, thanks for noticing!
python/sparkdl/image/imageIO.py
Outdated
:param imageDirectory: str, file path.
:param numPartition: int, number or partitions to use for reading files.
:return: DataFrame, with columns: (filepath: str, image: imageSchema).
def readImagesWithCustomLib(path, decode_f,numPartition = None):
Nit (style): Space after commas
* @return Row image in spark.ml.image format with 3 channels in BGR order.
*/
private[sparkdl] def spImageFromBufferedImage(image: BufferedImage): Row = {
private[sparkdl] def spImageFromBufferedImage(image: BufferedImage, origin:String = null): Row = {
Nit (style): Space after ":" (i.e. origin: String)
outputCol="prediction",)
fullPredict = transformer.transform(self.imageDF).collect()
fileOrder = self.fileOrder
fullPredict = sorted(transformer.transform(self.imageDF).collect(),key=lambda x:fileOrder[x['image']['origin'].split('/')[-1]])
Suggestion: Put this sorting into a helper function since it appears multiple times and/or add a comment explaining what it does?
Yup, both suggestions make sense.
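The helper suggested above might look like this sketch. The function name is hypothetical; it mirrors the inline `sorted(...)` call in the diff, and plain dicts stand in for collected Spark Row objects:

```python
def sort_by_file_order(rows, file_order):
    """Sort collected prediction rows deterministically by original file order.

    file_order maps image file basename -> index; each row carries the image
    origin path under row['image']['origin'].
    """
    return sorted(rows, key=lambda r: file_order[r["image"]["origin"].split("/")[-1]])

# Example usage with dicts standing in for Rows:
rows = [
    {"image": {"origin": "/data/b.jpg"}, "prediction": 1},
    {"image": {"origin": "/data/a.jpg"}, "prediction": 0},
]
ordered = sort_by_file_order(rows, {"a.jpg": 0, "b.jpg": 1})
```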
def builder(children: Seq[Expression]) = udf.apply(children.map(cx => new Column(cx)) : _*).expr
val registry = sqlCtx.sessionState.functionRegistry
registry.registerFunction(name, builder)
registry.registerFunction(sqlCtx.sessionState.sqlParser.parseFunctionIdentifier(name), builder)
This line is a mouthful :P but it does seem like the right way to parse UDF names based on apache/spark#17518
Thanks for referencing the PR!
Technically you could pass the identifier straight through without calling the SQL parser, but I hit an issue with that before: an identifier with embedded dashes would fail silently, and later you'd get a weird error that the UDF is not found. So I think it's better/safer to parse it like this.
.drop(tfs_output_col)
)
mode = imageIO.imageTypeByName('CV_32FC%d' % orig_image.nChannels)
return Row(origin="",mode=mode.ord, height=height, width=width, nChannels=orig_image.nChannels, data=bytearray(np.array(numeric_data).astype(np.float32).tobytes()))
Nit (style): Try to keep line length < 100 chars if possible
We have a Scala style guide (https://github.com/databricks/scala-style-guide) but I can't find one for Python -- AFAIK for Python we just follow PEP8 conventions.
kk, makes sense :) I'll try to follow that in the future.
python/sparkdl/image/imageIO.py
Outdated
def imageTypeByOrdinal(ord):
    if not ord in __ocvTypesByOrdinal:
        raise KeyError("unsupported image type with ordinal " + ord)
Suggestion: mention that we're working with OpenCV image types, e.g. "received unsupported OpenCV image type with ordinal " + ord
This might be another place where it's useful to include a list of supported values (ordinals) in the error message.
good point
python/sparkdl/image/imageIO.py
Outdated
img = Image.open(BytesIO(imageData))
except IOError:
    return None
def resizeImage_jvm(size):
Discussed offline: Let's go ahead and remove this method, since it's not currently being used.
sueann left a comment:
Some rough comments for you - I haven't looked at everything in detail. I believe @smurching has left you some comments as well, so I'll send you these for now. In particular, next time I will take a closer look at the logic changes in the transformers to make sure there's nothing we've missed. Thanks!
Note: Let's make a release (0.3.0) before this gets merged in and mark it as the last one backwards-compatible with Spark 2.2. We should also send an email to the mailing-list (now that we have one) about the breaking changes coming in 1.0.0.
python/sparkdl/__init__.py
Outdated
from .graph.input import TFInputGraph
from .image.imageIO import imageSchema, imageType, readImages
from pyspark.ml.image import ImageSchema
move this import above this group, into its own group
python/sparkdl/__init__.py
Outdated
(the diff adds several blank lines here)
?
build.sbt
Outdated
// http://www.scala-sbt.org/0.13/docs/index.html

val sparkVer = sys.props.getOrElse("spark.version", "2.1.1")
I'm going to ignore the build file for now while you work on the dependency problem. Let me know if you'd like me to look at it.
README.md
Outdated
```
The resulting DataFrame contains a string column named "filePath" containing the path to each image file, and a image struct ("`SpImage`") column named "image" containing the decoded image data.
or alternatively, using PIL in python
Does this way return the images in the same schema (BGR)? If not, I don't think we should recommend this (i.e. remove from the README).
yup it does, it produces the same image schema with the same channel ordering. This is the one used in tests.
As a user, when should I use the Spark version vs the PIL version? It seems confusing to have both here.
OK, I can remove it from the readme. I put it there because we actually did need to use the Python read to match Keras, so I thought users should know about the option.
elif img_dtype == 'float32':
    image_float = tf.decode_raw(image_buffer, tf.float32, name="decode_raw")
else:
in the new schema, are there legitimate types that have float64 (or any other dtypes) as img_dtype?
AFAIK the schema does not specify types; it only specifies a field with an OpenCV type number. There are OpenCV types which have float64. Technically the schema includes an openCvTypes map with only a subset of types, but we already need types outside of this subset (TF-produced images are stored as float32).
So does ImageSchema support OpenCV types that have float64? If so, should we support them here?
Currently, as far as I know, there is no way to get a float64 image.
ImageSchema as a data format supports it, in that it has a mode field which is supposed to hold an OpenCV type, and there are OpenCV types with float64. However, float64 is not listed in the list of OpenCV types in their Scala code (and neither are any float32 types, which we need), and as it stands now, readImages can only ever produce images stored in unsigned bytes (both the Scala and the PIL version), so one of the CV_8U* formats. We also need the float32 formats, since that's what we return when returning images from TF, so I added those to our Python side.
The Python code from ImageSchema can only handle unsigned byte images; that's why I use our own version in imageIO (imageArrayToStruct and imageStructToArray).
From offline discussion: The ImageSchema utilities in Spark only support uint8 types. Ideally float32 types would also be supported natively in Spark so we don't have to have special logic in this package to handle it. We'll create a Jira in Spark for that and try to address it there.
Do we have a Jira for this already? If so, could you link from here?
Yes we do: https://issues.apache.org/jira/browse/SPARK-22730
You mean you want it in the code? That would probably go in imageIO; I'll put it there.
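The float32 handling described above amounts to a (dtype, nChannels) -> OpenCV mode lookup. A minimal sketch, assuming the standard OpenCV ordinals (CV_8UC3 = 16, CV_32FC3 = 21); the function and table names are illustrative, not sparkdl's actual ones:

```python
import numpy as np

# Illustrative subset of OpenCV type ordinals; sparkdl's imageIO keeps the
# real table, including the float32 entries Spark's ImageSchema lacks.
_MODES = {("uint8", 3): 16,    # CV_8UC3
          ("float32", 3): 21}  # CV_32FC3

def mode_for_array(arr):
    """Infer the OpenCV mode ordinal from the array's dtype and channel count."""
    key = (arr.dtype.name, arr.shape[-1])
    if key not in _MODES:
        raise ValueError("unsupported (dtype, nChannels): %r" % (key,))
    return _MODES[key]
```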
class ImageUtilsSuite extends FunSuite {
// We want to make sure to test ImageUtils in headless mode to ensure it'll work on all systems.
assert(System.getProperty("java.awt.headless") === "true")
// assert(System.getProperty("java.awt.headless") === "true")
?
graphic.dispose()

spImageFromBufferedImage(tgtImg)
spImageFromBufferedImage(tgtImg,origin=ImageSchema.getOrigin(spImage))
nit: space between arguments
}

def getResizeImageUDF(h:Int,w:Int): UserDefinedFunction = udf( (x:Row) => {
  resizeImage(h,w,3 /** hardcoded for now, currently resize code only accepts 3 channels **/,x);
nit: space between arguments
from sparkdl.utils import jvmapi as JVMAPI
from sparkdl.image.imageIO import imageSchema, imageArrayToStruct
from sparkdl.image.imageIO import imageArrayToStruct
from pyspark.ml.image import ImageSchema
move to the pyspark section above
assert isinstance(img_arr_reloaded, np.ndarray), \
    "expect preprocessor to return a numpy array"
img_arr_reloaded = img_arr_reloaded.astype(np.uint8)
img_arr_reloaded = img_arr_reloaded.astype(np.uint8)[...,::-1]
i think we don't use ImageSchema in the context of the udfs, do we? i forget if the user ever gets access to the column with the image loaded throughout the UDF-using process. i'm wondering if we need to do this BGR - RGB back-and-forth here, especially given that a typical user will probably define a keras_load function using the RGB format.
The udf returns the image schema, so it has to be reversed.
Force-pushed 13f0e1e to ec1a2f7
…hema with Spark2.3's version
Force-pushed 83cb7f7 to b6d0a27
Codecov Report
@@ Coverage Diff @@
## master #85 +/- ##
==========================================
- Coverage 85.99% 82.49% -3.51%
==========================================
Files 30 33 +3
Lines 1692 1879 +187
Branches 17 35 +18
==========================================
+ Hits 1455 1550 +95
- Misses 237 329 +92
Continue to review full report at Codecov.
Force-pushed b6d0a27 to 7c016dc
… spark-deep-learning, so we can use it against current production version of spark. Should be removed once Spark2.3 is released.
…ameter to TFImage to specify current channel order. Spark deep learning will automatically fix given graph to match the image schema input.
Force-pushed b190125 to 1ab34bf
…rk.ml.image to sparkdl.image to avoid patching the __path__ variable of the pyspark.ml.image module
@sueann
sueann left a comment:
moar questions & comments for you.
for i, img in enumerate(images):
    assert img is not None and img.mode == "RGB"
    imageArray[i] = np.array(img.resize(shape))
    imageArray[i] = imageIO._reverseChannels(np.array(img.resize(shape)))
Let's have getSampleImageList() return images in BGR. It's only used here, and there is nothing about that function that suggests it should return RGB (e.g. not keras-related or anything).
It returns PIL images, not image-schema images; that's what dictates the RGB order. I think it's fine.
From this test's perspective it doesn't need an image to come in as RGB and the logic here is not relevant to it. Is there any reason why getSampleImageList() shouldn't just return imageArray itself?
resized_images = tf.image.resize_images(image_arr, InceptionV3Constants.INPUT_SHAPE)
processed_images = preprocess_input(resized_images)
# keras expects array in RGB order, we get it from image schema in BGR => need to flip
processed_images = preprocess_input(imageIO._reverseChannels(resized_images))
TODO(sueann); check why we need this
In our previous discussion we decided that all Keras code is RGB, so the preprocessor returns RGB and we need to flip it to return a valid image schema.
ok i think that's fine
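The BGR <-> RGB flip being discussed is just a reversal of the channel axis. A sketch of what `imageIO._reverseChannels` presumably does (the standalone helper name here is illustrative):

```python
import numpy as np

def reverse_channels(arr):
    """Swap BGR <-> RGB by reversing the last (channel) axis."""
    return arr[..., ::-1]

# One pure-blue pixel in BGR becomes (0, 0, 255) in RGB; the flip is its own inverse.
bgr = np.array([[[255, 0, 0]]], dtype=np.uint8)
rgb = reverse_channels(bgr)
```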
InceptionV3Constants.INPUT_SHAPE)
preprocessed = preprocess_input(resized_images)
# keras expects array in RGB order, we get it from image schema in BGR => need to flip
preprocessed = preprocess_input(imageIO._reverseChannels(resized_images))
TODO(sueann): check why we need this and whether we can put it somewhere else
preppedImage = cls.appModel._testPreprocess(imageArray.astype('float32'))
cls.kerasPredict = cls.appModel._testKerasModel(include_top=True).predict(preppedImage)
cls.preppedImage = preppedImage
cls.kerasPredict = cls.appModel._testKerasModel(include_top=True).predict(preppedImage,batch_size=1)
isn't preppedImage BGR? if so, wouldn't keras_model.predict() on it be wrong? (i'm confused why these tests are passing)
python/sparkdl/graph/pieces.py
Outdated
| """ | ||
|
|
||
| def buildSpImageConverter(img_dtype): | ||
| def buildSpImageConverter(channelOrder,img_dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
space after comma
python/tests/graph/test_pieces.py
Outdated
xcpt_model = Xception(weights="imagenet")
stages = [('spimage', gfac.buildSpImageConverter(SparkMode.RGB_FLOAT32)),
stages = [('spimage', gfac.buildSpImageConverter('BGR','float32')),
space after ,
python/tests/image/test_imageIO.py
Outdated
def test_resize(self):
imgAsRow = imageIO.imageArrayToStruct(array)
smaller = imageIO._resizeFunction([4, 5])
imgAsPIL = PIL.Image.fromarray(obj=imageIO._reverseChannels(array)).resize((5,4))
Clearer reordering of lines (for readability):
self.assertRaises(ValueError, imageIO.createResizeImageUDF, [1, 2, 3])
make_smaller = imageIO.createResizeImageUDF([4, 5]).func
imgAsRow = imageIO.imageArrayToStruct(array)
smallerImg = make_smaller(imgAsRow)
self.assertEqual(smallerImg.height, 4)
self.assertEqual(smallerImg.width, 5)
# Compare to PIL resizing
imgAsPIL = PIL.Image.fromarray(obj=imageIO._reverseChannels(array)).resize((5,4))
smallerAry = imageIO._reverseChannels(np.asarray(imgAsPIL))
np.testing.assert_array_equal(smallerAry, imageIO.imageStructToArray(smallerImg))
# I'm not sure what the following is testing so I don't know where they should go yet
for n in ImageSchema.imageSchema['image'].dataType.names:
smallerImg[n]
sameImage = imageIO.createResizeImageUDF((imgAsRow.height, imgAsRow.width)).func(imgAsRow)
self.assertEqual(imgAsRow, sameImage)
Sure I'll change the order. I believe the last part of the test tests that the image has the correct schema which this might not be the best way to test. I left it in since it was there before.
python/tests/image/test_imageIO.py
Outdated
self.assertEqual(imgAsStruct.width, width)
self.assertEqual(imgAsStruct.data, array.tobytes())
imgReconstructed = imageIO.imageStructToArray(imgAsStruct)
np.testing.assert_array_equal(array,imgReconstructed)
space after ,
self.assertEqual(imgAsStruct.width, width)
self.assertEqual(len(imgAsStruct.data), array.size * 4)

# Check channel mismatch
do we want to keep these assertRaises tests?
The assertRaises assertions are not needed anymore. The data type is inferred from the array type instead of being passed as an argument, so there is no invalid conversion anymore.
python/tests/image/test_imageIO.py
Outdated
_test(np.random.random_sample((10,11,nChannels)).astype('float32'))

def test_image_round_trip(self):
    # Test round trip: array -> png -> sparkImg -> array
I think we may not need this test anymore, but let me think about it (or we can chat offline).
I think we need it until it gets moved out to Spark's ImageSchema.
python/sparkdl/image/imageIO.py
Outdated
__ocvTypesByName = {m.name:m for m in supportedOcvTypes}
__ocvTypesByOrdinal = {m.ord:m for m in supportedOcvTypes}
__ocvTypesByName = {m.name:m for m in _supportedOcvTypes}
double underscore is only for python constructs?
raise TypeError(err_msg.format(type(value), value))

return value
@staticmethod
newline between methods
^this
python/sparkdl/param/converters.py
Outdated
return value
@staticmethod
def toChannelOrder(value):
    if not value in ('RGB','BGR'):
'L' (or whatever the one-channel one is) is also supported no?
^this
# Test that resize with the same size is a no-op
sameImage = imageIO.createResizeImageUDF((imgAsRow.height, imgAsRow.width)).func(imgAsRow)
self.assertEqual(imgAsRow, sameImage)
# Test that we have a valid image schema (all fields are in)
thanks for the comments!
python/tests/image/test_imageIO.py
Outdated
_test(np.random.randint(0, 256, (10, 11, nChannels), 'uint8'))
_test(np.random.random_sample((10,11,nChannels)).astype('float32'))

def test_image_round_trip(self):
let's remove this one since the tests directly above and below test what this is testing more clearly.
shape = cls.appModel.inputShape()

imgFiles, images = getSampleImageList()
imageArray = np.empty((len(images), shape[0], shape[1], 3), 'uint8')
let's move getSampleImageList() into this test class as a "private" method and have it take care of resizing & reverseChannels as well, so a reader doesn't have to worry about getSampleImageList's return image type.
Force-pushed 6e09669 to 9236fc8
self.assertEqual(processed_images.shape[2], InceptionV3Constants.INPUT_SHAPE[1])

transformer = TFImageTransformer(inputCol="image", outputCol=outputCol, graph=g,
transformer = TFImageTransformer(channelOrder='BGR',inputCol="image", outputCol=outputCol, graph=g,
should we also test the RGB version separately?
oh i see you have it above
self.assertTrue( (processed == keras_processed).all() )

# TODO: I believe this is already tested in named_image_test, should we remove it?
let's keep this test and the one above as simple test cases for TFTransformer, one for BGR and one for RGB. what do you think?
Force-pushed 9236fc8 to a3d386f
sueann left a comment:
There are a couple of comments that require your attention - could you take a look? I have re-marked them (there are two); the rest of the new ones are all about the space after commas 😂. Otherwise it looks good to me. Thanks!
raise TypeError(err_msg.format(type(value), value))

return value
@staticmethod
^this
python/sparkdl/param/converters.py
Outdated
return value
@staticmethod
def toChannelOrder(value):
    if not value in ('RGB','BGR'):
^this
graph=keras_graph)
image_df = self.loadImagesInternal(dataset, self.getInputCol())
transformer = TFImageTransformer(inputCol=self._loadedImageCol(),
transformer = TFImageTransformer(channelOrder='RGB',inputCol=self._loadedImageCol(),
space
outputTensor=modelGraphSpec["outputTensorName"],
outputMode=modelGraphSpec["outputMode"])
resizeUdf = resizeImage(modelGraphSpec["inputTensorSize"])
tfTransformer = TFImageTransformer(
the previous styling was correct
it's still under 100 characters though. there are definitely longer lines in the file.
|
|
||
| output_col = "transformed_image" | ||
| transformer = TFImageTransformer(inputCol="image", outputCol=output_col, graph=g, | ||
| transformer = TFImageTransformer(channelOrder='RGB',inputCol="image", outputCol=output_col, graph=g, |
space
python/sparkdl/image/imageIO.py
Outdated
| # dtype - data type of the image's array, sorted as a numpy compatible string. | ||
| # | ||
| # NOTE: likely to be migrated to Spark ImageSchema code in the near future. | ||
| _OcvType = namedtuple("OcvType",["name","ord","nChannels","dtype"]) |
spaces....
python/sparkdl/image/imageIO.py
Outdated
| """ | ||
| return sparkModeLookup[imageRow.mode] | ||
| return Row(origin=origin,mode=imageType.ord, height=height, width=width, nChannels=nChannels, data=data) |
space
python/sparkdl/image/imageIO.py
Outdated
| if imgType.nChannels != 1: | ||
| ary = _reverseChannels(ary) | ||
| if imgType.nChannels == 1: | ||
| return Image.fromarray(obj=ary,mode='L') |
space here and below
python/sparkdl/image/imageIO.py
Outdated
| imgAsPil = imageStructToPIL(imgAsRow).resize(sz) | ||
| # PIL is RGB based while image schema is BGR based => we need to flip the channels | ||
| imgAsArray = _reverseChannels(np.asarray(imgAsPil)) | ||
| return imageArrayToStruct(imgAsArray,origin=imgAsRow.origin) |
space
python/sparkdl/image/imageIO.py
Outdated
| return None | ||
| decodeImage = udf(_decode, ImageSchema.imageSchema['image'].dataType) | ||
| imageData = filesToDF(sc, path, numPartitions=numPartition) | ||
| return imageData.select(decodeImage("filePath","fileData").alias("image")) |
space
sueann
left a comment
Just one tiny thing to fix. We can merge after. Thanks!
| schema = StructType([StructField("filePath", StringType(), False), | ||
| StructField("fileData", BinaryType(), False)]) | ||
| rdd = sc.binaryFiles(path, minPartitions=numPartitions).repartition(numPartitions) | ||
| rdd = sc.binaryFiles( |
lol i think autopep8's spacing for function calls doesn't match pep8... but oh well, good enough for me.
python/sparkdl/param/converters.py
Outdated
| raise ValueError("Unsupported channel order. Expected one of ('RGB','BGR') but got '%s'") % value | ||
| if not value in ('L', 'RGB', 'BGR'): | ||
| raise ValueError( | ||
| "Unsupported channel order. Expected one of ('RGB','BGR') but got '%s'") % value |
Add 'L' here
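Besides the missing `'L'`, note that in the quoted diff the `% value` sits outside the `ValueError(...)` call, so it would be applied to the exception object instead of the message string. A sketch of the corrected raise, under the assumption the converter looks like the diff:

```python
def toChannelOrder(value):
    # Interpolate inside the ValueError call; the diff's trailing
    # ") % value" tries to format the exception object, not the string.
    if value not in ('L', 'RGB', 'BGR'):
        raise ValueError(
            "Unsupported channel order. Expected one of ('L', 'RGB', 'BGR') "
            "but got '%s'" % value)
    return value
```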
@tomasatdatabricks do you mind sending an email to the sparkdl mailing list (google groups linked from the readme) about this change:
Use Spark 2.3's ImageSchema as the image interface.
- Preserved the ability to read and resize images in Python using PIL to match Keras
(resize gives a different result, and reading JPEGs also produced images that were off by 1 on some green pixels).
[TODO] - In order to run on Spark < 2.3, the image schema files have been copied here; they need to be removed in the future.