forked from jleetutorial/python-spark-tutorial
HousePriceProblem.py
if __name__ == "__main__":
    '''
    Create a Spark program to read the house data from in/RealEstate.csv,
    group by location, aggregate the average price per SQ Ft and sort by average price per SQ Ft.

    The houses dataset contains a collection of recent real estate listings in
    San Luis Obispo county and around it.

    The dataset contains the following fields:
    1. MLS: Multiple listing service number for the house (unique ID).
    2. Location: city/town where the house is located. Most locations are in
       San Luis Obispo county and northern Santa Barbara county (Santa Maria-Orcutt, Lompoc,
       Guadalupe, Los Alamos), but there are some out-of-area locations as well.
    3. Price: the most recent listing price of the house (in dollars).
    4. Bedrooms: number of bedrooms.
    5. Bathrooms: number of bathrooms.
    6. Size: size of the house in square feet.
    7. Price/SQ.ft: price of the house per square foot.
    8. Status: type of sale. Three types are represented in the dataset: Short Sale,
       Foreclosure and Regular.

    Each field is comma separated.

    Sample output:

    +----------------+-----------------+
    |        Location| avg(Price SQ Ft)|
    +----------------+-----------------+
    |          Oceano|             95.0|
    |         Bradley|            206.0|
    | San Luis Obispo|            359.0|
    |      Santa Ynez|            491.4|
    |         Cayucos|            887.0|
    |................|.................|
    |................|.................|
    |................|.................|
    '''
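
One possible way to approach the exercise is sketched below; it is not the tutorial's reference solution. It assumes that in/RealEstate.csv has a header row and that the relevant columns are named "Location" and "Price SQ Ft" (inferred from the sample output header, not confirmed by the field list above). The names `average_price_per_sqft` and `main` are hypothetical helpers introduced only for illustration.

```python
# Column names are assumptions inferred from the sample output header.
LOCATION = "Location"
PRICE_SQ_FT = "Price SQ Ft"


def average_price_per_sqft(spark, csv_path):
    """Return a DataFrame of the average price per square foot per location,
    sorted in ascending order of that average."""
    # header=True takes column names from the first row;
    # inferSchema=True lets Spark parse the price column as a number.
    listings = spark.read.csv(csv_path, header=True, inferSchema=True)
    return (
        listings.groupBy(LOCATION)
        .avg(PRICE_SQ_FT)                 # produces a column named "avg(Price SQ Ft)"
        .orderBy(f"avg({PRICE_SQ_FT})")   # ascending by default
    )


def main():
    # Imported here so the helper above can be loaded without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("HousePriceProblem")
        .master("local[*]")  # assumption: run locally on all available cores
        .getOrCreate()
    )
    try:
        average_price_per_sqft(spark, "in/RealEstate.csv").show()
    finally:
        spark.stop()
```

Calling `main()` from the `if __name__ == "__main__":` block would print the table. Keeping the aggregation in a function that takes the session as a parameter makes it easier to reuse with a different input path or an existing session.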