DataAnalysisWithPythonAndPySpark/code/Ch03/word_count.py at trunk · irmablanco/DataAnalysisWithPythonAndPySpark

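from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col,
    explode,
    lower,
    regexp_extract,
    split,
)
spark = SparkSession.builder.appName(
    "Analyzing the vocabulary of Pride and Prejudice."
).getOrCreate()
# Read the book; each line of text becomes one row in the "value" column.
book = spark.read.text("./data/gutenberg_books/1342-0.txt")
# Split each line on single spaces into an array of tokens.
lines = book.select(split(book.value, " ").alias("line"))
# Explode the arrays so that each token gets its own row.
words = lines.select(explode(col("line")).alias("word"))
# Lowercase every token so that counting is case-insensitive.
words_lower = words.select(lower(col("word")).alias("word"))
# Keep the leading run of lowercase letters and apostrophes; a token that
# starts with any other character becomes an empty string.
words_clean = words_lower.select(
    regexp_extract(col("word"), "[a-z']*", 0).alias("word")
)
# Drop the empty strings left behind by the previous step.
words_nonull = words_clean.where(col("word") != "")
# Count the occurrences of each word.
results = words_nonull.groupby(col("word")).count()
# Display the ten most frequent words.
results.orderBy("count", ascending=False).show(10)
# Merge into one partition so the output directory holds a single CSV part file.
results.coalesce(1).write.csv("./simple_count_single_partition.csv")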
