[SPARK-35142][PYTHON][ML] Fix incorrect return type for `rawPredictionUDF` in `OneVsRestModel` #32245

harupy · 2021-04-20T02:33:51Z

What changes were proposed in this pull request?

Fixes incorrect return type for rawPredictionUDF in OneVsRestModel.

Why are the changes needed?

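Bug fix. `rawPredictionUDF` was created without a return type, so the `rawPrediction` column was output as a string column rather than a vector column.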

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test.

python/pyspark/ml/classification.py

harupy · 2021-04-20T03:12:45Z

@@ -3151,7 +3151,7 @@ def func(predictions):
                    predArray.append(x)
                return Vectors.dense(predArray)

-            rawPredictionUDF = udf(func)
+            rawPredictionUDF = udf(func, VectorUDT())


Should I add a test here to ensure that the rawPrediction column is no longer a string column?

spark/python/pyspark/ml/tests/test_algorithms.py

Lines 108 to 117 in 0494dc9

    def test_output_columns(self):
        df = self.spark.createDataFrame([(0.0, Vectors.dense(1.0, 0.8)),
                                         (1.0, Vectors.sparse(2, [], [])),
                                         (2.0, Vectors.dense(0.5, 0.5))],
                                        ["label", "features"])
        lr = LogisticRegression(maxIter=5, regParam=0.01)
        ovr = OneVsRest(classifier=lr, parallelism=1)
        model = ovr.fit(df)
        output = model.transform(df)
        self.assertEqual(output.columns, ["label", "features", "rawPrediction", "prediction"])

Yeah, I think we'd better add a test if possible.

Got it, added a test.

@HyukjinKwon
Why does only transformed_df.head() trigger this error?
Does it indicate a bug in the PySpark SQL UDF implementation?

It seems pred.show() triggers an exception too. What does it return with other methods?

HyukjinKwon · 2021-04-20T03:14:39Z

ok to test

HyukjinKwon · 2021-04-20T03:14:44Z

add to whitelist

HyukjinKwon · 2021-04-20T03:14:51Z

cc @WeichenXu123 FYI

SparkQA · 2021-04-20T03:44:52Z

Test build #137665 has finished for PR 32245 at commit 3f75ab2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-20T04:00:28Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42193/

SparkQA · 2021-04-20T04:00:29Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42193/

SparkQA · 2021-04-20T04:02:06Z

Test build #137666 has finished for PR 32245 at commit 5e05b50.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-20T05:01:01Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42194/

SparkQA · 2021-04-20T05:01:02Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42194/

WeichenXu123 · 2021-04-20T05:38:49Z

CC @zhengruifeng

SparkQA · 2021-04-20T06:24:58Z

Test build #137668 has finished for PR 32245 at commit 3c2ac95.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-20T06:56:44Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42196/

SparkQA · 2021-04-20T06:56:45Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42196/

WeichenXu123

LGTM

python/pyspark/ml/tests/test_algorithms.py

SparkQA · 2021-04-21T02:20:31Z

Test build #137708 has finished for PR 32245 at commit b6fabb3.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-21T02:50:33Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42236/

SparkQA · 2021-04-21T04:25:28Z

Test build #137713 has finished for PR 32245 at commit ed26d2c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-21T04:49:39Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42241/

SparkQA · 2021-04-21T04:54:20Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42241/

WeichenXu123 · 2021-04-21T07:33:30Z

LGTM

HyukjinKwon · 2021-04-21T07:34:42Z

Looks good. @harupy, would you mind filling the PR description per the template?

HyukjinKwon · 2021-04-21T07:43:09Z

@viirya, are you preparing the Spark 2.4 RC now? This is supposed to be in Spark 2.4 too, but it isn't a regression, so it doesn't block the release. It's just a nice-to-have, so if you're already preparing the RC, it's fine not to backport it.

viirya · 2021-04-21T07:48:27Z

@viirya, are you preparing the Spark 2.4 RC now? This is supposed to be in Spark 2.4 too, but it isn't a regression, so it doesn't block the release. It's just a nice-to-have, so if you're already preparing the RC, it's fine not to backport it.

#32256 was just merged, so I have not started a new RC yet. I can wait for this.

HyukjinKwon · 2021-04-21T07:55:43Z

BTW, the tests passed at https://github.com/harupy/spark/actions/runs/769366516. For some reason, GitHub Actions didn't link that run properly.

I will leave it to @WeichenXu123 then.

…nUDF` in `OneVsRestModel`

### What changes were proposed in this pull request?

Fixes incorrect return type for `rawPredictionUDF` in `OneVsRestModel`.

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #32245 from harupy/SPARK-35142.

Authored-by: harupy <17039389+harupy@users.noreply.github.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
(cherry picked from commit b6350f5)
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 · 2021-04-21T08:31:40Z

@harupy

Backporting to branch-3.1 causes conflicts.
Could you create a PR against apache/spark branch-3.1?

++<<<<<<< HEAD
 +    def test_parallelism_doesnt_change_output(self):
++=======
+     def test_raw_prediction_column_is_of_vector_type(self):
+         # SPARK-35142: `OneVsRestModel` outputs raw prediction as a string column
+         df = self.spark.createDataFrame([(0.0, Vectors.dense(1.0, 0.8)),
+                                          (1.0, Vectors.sparse(2, [], [])),
+                                          (2.0, Vectors.dense(0.5, 0.5))],
+                                         ["label", "features"])
+         lr = LogisticRegression(maxIter=5, regParam=0.01)
+         ovr = OneVsRest(classifier=lr, parallelism=1)
+         model = ovr.fit(df)
+         row = model.transform(df).head()
+         self.assertIsInstance(row["rawPrediction"], DenseVector)
+ 
+     def test_parallelism_does_not_change_output(self):
++>>>>>>> b6350f5bb0... [SPARK-35142][PYTHON][ML] Fix incorrect return type for `rawPredictionUDF` in `OneVsRestModel`

harupy · 2021-04-21T08:40:44Z

@WeichenXu123 Opened a PR: #32269

viirya · 2021-04-21T08:59:28Z

I don't see a backport to 2.4. Do you plan to backport it, @WeichenXu123 @harupy?

harupy · 2021-04-21T09:04:23Z

@viirya Got it. I'll open another PR for 2.4.

Wait, does OneVsRestModel in 2.4 output the raw prediction column? Looks like it doesn't.

spark/python/pyspark/ml/classification.py

Lines 1964 to 2009 in 1630d64

    
    def _transform(self, dataset):
        # determine the input columns: these need to be passed through
        origCols = dataset.columns

        # add an accumulator column to store predictions of all the models
        accColName = "mbc$acc" + str(uuid.uuid4())
        initUDF = udf(lambda _: [], ArrayType(DoubleType()))
        newDataset = dataset.withColumn(accColName, initUDF(dataset[origCols[0]]))

        # persist if underlying dataset is not persistent.
        handlePersistence = dataset.storageLevel == StorageLevel(False, False, False, False)
        if handlePersistence:
            newDataset.persist(StorageLevel.MEMORY_AND_DISK)

        # update the accumulator column with the result of prediction of models
        aggregatedDataset = newDataset
        for index, model in enumerate(self.models):
            rawPredictionCol = model._call_java("getRawPredictionCol")
            columns = origCols + [rawPredictionCol, accColName]

            # add temporary column to store intermediate scores and update
            tmpColName = "mbc$tmp" + str(uuid.uuid4())
            updateUDF = udf(
                lambda predictions, prediction: predictions + [prediction.tolist()[1]],
                ArrayType(DoubleType()))
            transformedDataset = model.transform(aggregatedDataset).select(*columns)
            updatedDataset = transformedDataset.withColumn(
                tmpColName,
                updateUDF(transformedDataset[accColName], transformedDataset[rawPredictionCol]))
            newColumns = origCols + [tmpColName]

            # switch out the intermediate column with the accumulator column
            aggregatedDataset = updatedDataset\
                .select(*newColumns).withColumnRenamed(tmpColName, accColName)

        if handlePersistence:
            newDataset.unpersist()

        # output the index of the classifier with highest confidence as prediction
        labelUDF = udf(
            lambda predictions: float(max(enumerate(predictions), key=operator.itemgetter(1))[0]),
            DoubleType())

        # output label and label metadata as prediction
        return aggregatedDataset.withColumn(
            self.getPredictionCol(), labelUDF(aggregatedDataset[accColName])).drop(accColName)
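
To make the 2.4 logic above concrete, here is a pure-Python sketch (no Spark needed) of what `updateUDF` and `labelUDF` compute for a single row. The score values are made up for illustration, plain lists stand in for the raw prediction vectors, and `update` and `label` are hypothetical names for the two lambdas.

```python
import operator


def update(predictions, prediction):
    # Mirrors updateUDF: append the positive-class score (index 1)
    # produced by one binary model.
    return predictions + [prediction[1]]


def label(predictions):
    # Mirrors labelUDF: index of the classifier with the highest score,
    # returned as a float.
    return float(max(enumerate(predictions), key=operator.itemgetter(1))[0])


# Hypothetical raw predictions [negative, positive] from three binary
# models for one row.
raw_predictions = [[0.3, -0.3], [-1.2, 1.2], [0.5, -0.5]]

acc = []  # plays the role of the "mbc$acc..." accumulator column
for raw in raw_predictions:
    acc = update(acc, raw)

print(acc)
print(label(acc))
```

In the 2.4 code the accumulated scores are used only to compute the prediction and the accumulator column is then dropped, so no raw prediction column, and no untyped `rawPredictionUDF`, exists there.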

HyukjinKwon · 2021-04-21T11:44:43Z

Okay, looks like we can skip Spark 2.4.

viirya · 2021-04-21T15:45:11Z

Thanks for confirming. @harupy @HyukjinKwon

specify return type for rawPredictionUDF · 3f75ab2

github-actions bot added CORE ML PYTHON labels Apr 20, 2021

harupy force-pushed the harupy:SPARK-35142 branch from bb641f5 to 3f75ab2 Apr 20, 2021

harupy marked this pull request as ready for review Apr 20, 2021

harupy reviewed Apr 20, 2021

python/pyspark/ml/classification.py

harupy reviewed Apr 20, 2021

harupy added 2 commits Apr 20, 2021

Add test · 383c84f

Fix incorrect variable name · 5e05b50

import VectorUDT · 3c2ac95

WeichenXu123 approved these changes Apr 21, 2021

WeichenXu123 reviewed Apr 21, 2021

python/pyspark/ml/tests/test_algorithms.py Outdated

harupy added 3 commits Apr 21, 2021

Create a separate test · 98d241e

rename test · 2f12765

add comment · b6fabb3

Fix test failure · ed26d2c

HyukjinKwon changed the title ~~[SPARK-35142][ML] Fix incorrect return type for rawPredictionUDF in OneVsRestModel~~ [SPARK-35142][PYTHON][ML] Fix incorrect return type for rawPredictionUDF in OneVsRestModel Apr 21, 2021

WeichenXu123 closed this in b6350f5 Apr 21, 2021

harupy mentioned this pull request Apr 21, 2021

[SPARK-35142][PYTHON][ML][3.1] Fix incorrect return type for rawPredictionUDF in OneVsRestModel #32269 · Closed

harupy mentioned this pull request Apr 21, 2021

[SPARK-35142][PYTHON][ML][3.0] Fix incorrect return type for rawPredictionUDF in OneVsRestModel #32275 · Open
