DataAnalysisWithPythonAndPySpark/code/Ch02/word_count_submit.py at trunk · DomD7/DataAnalysisWithPythonAndPySpark

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.appName(
    "Counting word occurrences from a book."
).getOrCreate()
spark.sparkContext.setLogLevel("WARN")
# If you need to read multiple text files, replace `1342-0` by `*`.
results = (
    spark.read.text("../../data/gutenberg_books/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .groupby(F.col("word"))
    .count()
)
results.orderBy("count", ascending=False).show(10)
results.coalesce(1).write.csv("./results_single_partition.csv")
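
For checking the pipeline on a handful of lines without starting Spark, the same steps (split on spaces, lowercase, keep the leading run of `[a-z']`, drop empty strings, count) can be sketched in plain Python. This is only an approximation for small inputs, not part of the original script; the function name `count_words` and the sample lines are hypothetical.

```python
import re
from collections import Counter


def count_words(lines):
    """Plain-Python approximation of the Spark word-count pipeline above."""
    counts = Counter()
    for line in lines:
        # Mirrors F.split(col, " "): consecutive spaces yield empty tokens.
        for token in line.split(" "):
            # Mirrors F.lower followed by F.regexp_extract(col, "[a-z']*", 0):
            # the pattern matches at the start of the token and may be empty.
            word = re.search(r"[a-z']*", token.lower()).group(0)
            # Mirrors .where(F.col("word") != "").
            if word != "":
                counts[word] += 1
    return counts


if __name__ == "__main__":
    sample = ["The cat, the hat", '"Hello world']
    for word, count in count_words(sample).most_common():
        print(word, count)
```

Note that a token starting with a character outside `[a-z']`, such as an opening double quote, produces an empty match and is dropped entirely, which is also how the regular expression in the Spark version behaves.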
