I have a Spark DataFrame with the following schema:

```
root
 |-- id: string (nullable = true)
 |-- elements: struct (nullable = true)
 |    |-- created: string (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- items: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- field: string (nullable = true)
 |    |    |    |-- fieldId: string (nullable = true)
 |    |    |    |-- fieldtype: string (nullable = true)
 |    |    |    |-- from: string (nullable = true)
 |    |    |    |-- fromString: string (nullable = true)
 |    |    |    |-- tmpFromAccountId: string (nullable = true)
 |    |    |    |-- tmpToAccountId: string (nullable = true)
 |    |    |    |-- to: string (nullable = true)
 |    |    |    |-- toString: string (nullable = true)
```


I want to overwrite every field inside each "items" element (field, fieldId, etc.) with a fixed value ("Issue"), regardless of whether the field is empty or already filled. The data should change from:

```
+--------+--------------------------------------------------------------------------------+
| id     | elements                                                                       |
+--------+--------------------------------------------------------------------------------+
|ABCD-123|[2023-01-16T20:25:30.875+0700, 5388402, [[field, , status,,,,, 23456, Yes]]]    |
+--------+--------------------------------------------------------------------------------+
```


To:

```
+--------+----------------------------------------------------------------------------------------------------------+
| id     | elements                                                                                                 |
+--------+----------------------------------------------------------------------------------------------------------+
|ABCD-123|[2023-01-16T20:25:30.875+0700, 5388402, [[Issue, Issue, Issue, Issue, Issue, Issue, Issue, Issue, Issue]]]|
+--------+----------------------------------------------------------------------------------------------------------+
```


What I have tried:

I have already tried the following in a Python script, but neither attempt worked:

```
from pyspark.sql.functions import col, lit, struct

replace_list = ['field', 'fieldtype', 'fieldId', 'from', 'fromString', 'to', 'toString', 'tmpFromAccountId', 'tmpToAccountId']

# Didn't work 1
for col_name in replace_list:
    df = df.withColumn(f"items.element.{col_name}", lit("Issue"))

# Didn't work 2
for col_name in replace_list:
    df = df.withColumn(
        "elements.items.element",
        struct(col("elements.items.element.*"), lit("Issue").alias(col_name)),
    )
```


I'm using Spark 2.4.8. I don't want to use explode, because I want to avoid joining DataFrames afterwards. Is it possible to perform this kind of operation directly in Spark? Thank you.
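
One possible direction, offered as a sketch rather than a confirmed answer: Spark 2.4 exposes the SQL higher-order function `transform`, which can be called through `expr` and rebuilds each array element in place, so no explode or join is needed. As far as I understand it, `withColumn` with a dotted name such as "items.element.field" adds a new top-level column with that literal name instead of touching the nested field, which would explain why the first attempt has no visible effect. The helper names below (`ITEM_FIELDS`, `build_items_expr`, `overwrite_items`) are made up for illustration, and the code assumes the replacement value contains no single quote.

```python
# Field names of one `items` element, in the order shown in the schema.
ITEM_FIELDS = [
    "field", "fieldId", "fieldtype", "from", "fromString",
    "tmpFromAccountId", "tmpToAccountId", "to", "toString",
]


def build_items_expr(fields, value, array_path="elements.items"):
    """Return a Spark SQL expression that replaces every array element with a
    struct whose fields all hold the same string literal.

    Assumption: `value` contains no single quote, so it can be embedded
    directly as a SQL string literal.
    """
    pairs = ", ".join(f"'{name}', '{value}'" for name in fields)
    return f"transform({array_path}, x -> named_struct({pairs}))"


def overwrite_items(df, fields=ITEM_FIELDS, value="Issue"):
    """Rebuild `elements`, keeping `created` and `id` and replacing `items`."""
    # Imported here so build_items_expr can be tried without Spark installed.
    from pyspark.sql import functions as F

    return df.withColumn(
        "elements",
        F.struct(
            F.col("elements.created").alias("created"),
            F.col("elements.id").alias("id"),
            F.expr(build_items_expr(fields, value)).alias("items"),
        ),
    )
```

With the DataFrame from the question this would be called as `df = overwrite_items(df)`. Because the whole `elements` struct is rebuilt, a row whose `elements` is null would come back as a struct with null `created`, `id` and `items` rather than a null struct, so that case may need a separate `when(...)` guard.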

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


