I have a housing dataset that contains both categorical and numerical variables. From it I created a second dataset that holds only the numeric attributes, stored as an array: Numeric_attributes = [No. of bedrooms, Price, Age].

Now I want to loop over the Numeric_attributes array and, for each element in it, calculate the mean of that numeric attribute.

Dataset 1:

    Age   Price   Location
    20    56000   ABC
    30    58999   XYZ

Dataset 2 (array in the dataframe):

    Numeric_attributes = [Age, Price]

Expected output:

    Mean(Age)   Mean(Price)
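To make the expected result concrete, here is a minimal plain-Python sketch of the "loop over the attributes, then over the values" idea, using the two sample rows from Dataset 1. It assumes the Numeric_attributes array holds column names; the names `rows` and `column_means` are hypothetical and only serve the illustration. It does not use Spark at all.

```python
# Sample rows taken from "Dataset 1" in the question.
rows = [
    {"Age": 20, "Price": 56000, "Location": "ABC"},
    {"Age": 30, "Price": 58999, "Location": "XYZ"},
]

# Names of the numeric attributes, as listed in "Dataset 2" (assumed to be column names).
numeric_attributes = ["Age", "Price"]


def column_means(rows, attributes):
    """Return {attribute: mean of that attribute over all rows}."""
    means = {}
    for name in attributes:                    # outer loop: one attribute at a time
        values = [row[name] for row in rows]   # inner loop: that attribute in every row
        means[name] = sum(values) / len(values)
    return means


means = column_means(rows, numeric_attributes)
print(means)  # {'Age': 25.0, 'Price': 57499.5}
```

On a real PySpark DataFrame the same per-column aggregate is usually expressed with `pyspark.sql.functions.mean` inside `select` or `agg`, rather than with a Python loop over collected rows.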

What I have tried:

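from pyspark.sql import functions as F

def minimum_value(df2):
    # Cast the array column so its elements compare as integers.
    values = F.col('Numerical_attributes').cast('array<int>')
    # array_min (Spark 2.4+) returns the smallest element of each row's array,
    # so no collect() loop or UDF is needed. withColumn returns a new DataFrame,
    # which has to be returned (or reassigned) for the change to take effect.
    return df2.withColumn('minimum_value', F.array_min(values))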
Updated 21-Apr-22 6:43am
Richard MacCutchan 21-Apr-22 5:28am    
What is the problem with that code?

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)