What techniques ensure deduplication in large datasets?
Spark/Big Data · medium
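One common approach is to choose a business key and keep exactly one record per key, for example the most recent one. The following is a minimal single-machine sketch of that idea in plain Python, not a distributed implementation; the field names `user_id`, `email`, and `updated_at` are hypothetical. In Spark the same idea is usually expressed with `dropDuplicates` on the key columns or with a `row_number()` window ordered by a timestamp.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def dedupe_keep_latest(
    records: Iterable[dict],
    key_fields: Tuple[str, ...],
    order_field: str,
) -> List[dict]:
    """Keep one record per key: the one with the largest order_field value."""
    latest: Dict[Hashable, dict] = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        current = latest.get(key)
        # Replace the stored record only when the new one is strictly newer,
        # so the first record seen wins on ties.
        if current is None or rec[order_field] > current[order_field]:
            latest[key] = rec
    return list(latest.values())


# Hypothetical sample data: user 1 appears twice with different timestamps.
rows = [
    {"user_id": 1, "email": "a@example.com", "updated_at": 10},
    {"user_id": 1, "email": "a+new@example.com", "updated_at": 20},
    {"user_id": 2, "email": "b@example.com", "updated_at": 5},
]
deduped = dedupe_keep_latest(rows, key_fields=("user_id",), order_field="updated_at")
```

The dictionary holds one entry per key, so memory grows with the number of distinct keys rather than the number of rows; at larger scale the same grouping is typically done by partitioning the data on the key.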
2
What trade-offs would you consider when choosing between batch processing and real-time streaming?
Spark/Big Data · hard
3
What's the difference between narrow and wide transformations?
Spark/Big Data · medium
4
Which Spark property controls the number of shuffle partitions?
Spark/Big Data · medium
5
Write PySpark code to extract data from a CSV and create a table.
Spark/Big Data · medium
6
Write PySpark code to save a DataFrame in Parquet format to an S3 bucket.
Spark/Big Data · medium
7
Write a PySpark job that calculates the number of unique users who logged in per day, but exclude any logins from inactive users listed in a separate file.
Spark/Big Data · medium
8
Write a PySpark script to check for missing values and duplicate rows in a DataFrame. How would you ensure data quality before saving it to a storage system?
Spark/Big Data · hard