Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0
Original price: $50. Current price: $35.
Exam Code | Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 |
Exam Name | Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 |
Questions | 300 Questions & Answers with Explanations |
Update Date | April 30, 2025 |
Sample Questions
Section 1: Spark Architecture & Execution
Q1: What is the role of the Driver in a Spark application?
A. Runs tasks on worker nodes
B. Converts user code into Spark jobs and schedules tasks
C. Stores cached data in memory
D. Acts as a distributed file system
✅ Correct Answer: B
Explanation: The Driver converts user code into Spark jobs, schedules tasks, and collects results, while Executors run tasks and store cached data.
Q2: What is a Spark Executor responsible for?
A. Managing cluster resources
B. Running tasks and storing cached data
C. Defining the DAG
D. Handling user authentication
✅ Correct Answer: B
Explanation: Executors run tasks assigned by the Driver and store cached data in memory/disk.
Q3: What is Lazy Evaluation in Spark?
A. Spark executes transformations immediately
B. Spark delays execution until an action is called
C. Spark skips failed tasks automatically
D. Spark optimizes disk storage
✅ Correct Answer: B
Explanation: Spark lazily evaluates transformations (e.g., filter(), select()) and only executes them when an action (e.g., count(), collect()) is triggered.
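As a minimal sketch (assuming an active SparkSession named spark and a hypothetical people.csv file), nothing runs until the action on the last line:
df = spark.read.csv("people.csv", header=True, inferSchema=True)
adults = df.filter(df.age >= 18)   # transformation: only recorded in the plan
print(adults.count())              # action: triggers the actual job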
Q4: What is a DAG in Spark?
A. A distributed file system
B. A logical execution plan of transformations
C. A storage format for DataFrames
D. A type of join operation
✅ Correct Answer: B
Explanation: The Directed Acyclic Graph (DAG) represents the sequence of transformations before optimization.
Q5: How does Spark recover from an Executor failure?
A. Restarts the Driver
B. Recomputes lost partitions using lineage
C. Switches to a backup cluster
D. Ignores the failed tasks
✅ Correct Answer: B
Explanation: Spark uses RDD lineage to recompute lost data partitions from the source.
Section 2: Spark DataFrame API
Q6: What is the difference between cache() and persist()?
A. cache() is used for disk storage, persist() for memory
B. cache() defaults to MEMORY_ONLY, persist() allows custom storage levels
C. persist() is only for streaming DataFrames
D. cache() is deprecated in Spark 3.0
✅ Correct Answer: B
Explanation:
- cache() = persist(StorageLevel.MEMORY_ONLY)
- persist() allows custom levels like MEMORY_AND_DISK.
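A short sketch of the difference, assuming two existing DataFrames with the hypothetical names df1 and df2:
from pyspark import StorageLevel

df1.cache()                                  # shorthand for the default storage level
df2.persist(StorageLevel.MEMORY_AND_DISK)    # explicitly chosen storage level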
Q7: How do you read a CSV file in Spark?
A. spark.read.text("file.csv")
B. spark.read.csv("file.csv", header=True, inferSchema=True)
C. spark.load.csv("file.csv")
D. spark.csv.read("file.csv")
✅ Correct Answer: B
Explanation: The correct method is spark.read.csv() with optional parameters like header and inferSchema.
Q8: Which operation is a transformation in Spark?
A. count()
B. collect()
C. filter()
D. show()
✅ Correct Answer: C
Explanation:
- Transformations (lazy): filter(), select(), groupBy()
- Actions (eager): count(), collect(), show()
Q9: What does coalesce() do?
A. Merges small partitions into fewer partitions without shuffling
B. Increases the number of partitions with a full shuffle
C. Converts a DataFrame to an RDD
D. Drops NULL values from a DataFrame
✅ Correct Answer: A
Explanation: coalesce() reduces partitions without a full shuffle, unlike repartition().
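For illustration, a hedged sketch assuming a DataFrame df that currently has many partitions:
fewer = df.coalesce(10)        # narrows to 10 partitions without a full shuffle
balanced = df.repartition(10)  # full shuffle, evenly distributed partitions
print(fewer.rdd.getNumPartitions(), balanced.rdd.getNumPartitions())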
Q10: How do you rename a column in a DataFrame?
A. df.withColumnRenamed("old", "new")
B. df.rename("old", "new")
C. df.select("old").alias("new")
D. df.changeColumnName("old", "new")
✅ Correct Answer: A
Explanation: The correct method is withColumnRenamed().
Section 3: Spark SQL & Optimizations
Q11: How do you register a DataFrame as a SQL table?
A. df.saveAsTable("table")
B. df.createOrReplaceTempView("table")
C. df.registerTable("table")
D. df.toSQL("table")
✅ Correct Answer: B
Explanation: createOrReplaceTempView() registers a temporary SQL view.
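A quick sketch, assuming a DataFrame df with a hypothetical department column:
df.createOrReplaceTempView("employees")
spark.sql("SELECT department, COUNT(*) AS cnt FROM employees GROUP BY department").show()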
Q12: Which join is most efficient when joining large tables?
A. broadcast join
B. sort-merge join
C. cross join
D. full outer join
✅ Correct Answer: B
Explanation: Sort-merge join is Spark’s default for large tables, while broadcast join is best for small tables.
Q13: What is predicate pushdown?
A. Pushing filters to the storage layer to reduce I/O
B. Increasing partition size for faster reads
C. Converting SQL queries to RDDs
D. A type of shuffle operation
✅ Correct Answer: A
Explanation: Predicate pushdown applies filters early (e.g., in Parquet files) to minimize data read.
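One way to observe this (a sketch assuming a hypothetical Parquet dataset at /data/events with an event_date column) is to look for the PushedFilters entry in the physical plan:
df = spark.read.parquet("/data/events")
df.filter(df.event_date == "2023-01-01").explain()
# The scan node should list the filter under PushedFilters, so non-matching row groups are skipped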
Section 4: Debugging & Performance Tuning
Q14: How do you check the execution plan of a DataFrame?
A. df.explain()
B. df.showPlan()
C. df.plan()
D. df.debug()
✅ Correct Answer: A
Explanation: explain() shows the logical and physical execution plans.
Q15: What does spark.sql.shuffle.partitions control?
A. Number of input partitions when reading data
B. Number of partitions after a shuffle operation
C. Size of cached DataFrames
D. Number of executors in the cluster
✅ Correct Answer: B
Explanation: This setting controls shuffle partition count (default: 200).
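The setting can be changed at runtime; a minimal sketch:
spark.conf.set("spark.sql.shuffle.partitions", "64")   # subsequent shuffles produce 64 partitions
print(spark.conf.get("spark.sql.shuffle.partitions"))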
Section 5: Delta Lake (Bonus for Databricks)
Q16: What is Delta Lake?
A. A distributed file system
B. An optimized storage layer with ACID transactions
C. A streaming framework
D. A machine learning library
✅ Correct Answer: B
Explanation: Delta Lake provides ACID transactions, schema enforcement, and time travel.
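A minimal read/write sketch, assuming a hypothetical output path and Delta Lake available on the cluster (as on Databricks):
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")
spark.read.format("delta").load("/tmp/delta/events").show()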
Section 6: Advanced DataFrame Operations
Q17: You need to add a new column with a conditional value. Which method should you use?
A) df.withColumn("status", when(col("age") >= 18, "adult").otherwise("minor"))
B) df.addColumn("status", ifelse(col("age") >= 18, "adult", "minor"))
C) df.select(col("*"), expr("IF(age >= 18, 'adult', 'minor') AS status"))
D) df.updateColumn("status", col("age") >= 18 ? "adult" : "minor")
✅ Correct Answer: A
Explanation: when().otherwise() is Spark’s conditional expression. Option C (SQL-style) also works but is less idiomatic for the DataFrame API.
Q18: How do you remove duplicates from a DataFrame?
A) df.distinct()
B) df.dropDuplicates()
C) df.unique()
D) Both A and B
✅ Correct Answer: D
Explanation: Both distinct() and dropDuplicates() work, but dropDuplicates(subset=["col1", "col2"]) allows targeting specific columns.
Section 7: Performance Tuning
Q19: A join operation is slow. What’s the FIRST optimization to try?
A) Increase spark.sql.shuffle.partitions
B) Use broadcast() if one table is small
C) Cache both DataFrames
D) Switch to RDDs
✅ Correct Answer: B
Explanation: Broadcasting the smaller table avoids shuffling (most impactful fix). Tuning shuffle partitions (A) helps but is secondary.
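A hedged sketch of the broadcast hint, assuming hypothetical large_df and small_df DataFrames that share a customer_id column:
from pyspark.sql.functions import broadcast

joined = large_df.join(broadcast(small_df), "customer_id")   # small_df is shipped to every executor; large_df avoids a shuffle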
Q20: What does .repartition(10, col("country")) do?
A) Creates 10 partitions randomly
B) Repartitions into 10 partitions, hashing the “country” column
C) Coalesces to 10 partitions
D) Sorts the DataFrame by “country”
✅ Correct Answer: B
Explanation: This partitions data by hash of “country”, ensuring same-country records land in the same partition.
Section 8: Spark SQL Deep Dive
Q21: How do you calculate the average salary by department in SQL?
A) SELECT AVG(salary) FROM employees
B) SELECT department, AVG(salary) FROM employees GROUP BY department
C) SELECT department, MEAN(salary) FROM employees
D) SELECT department, salary.mean() FROM employees
✅ Correct Answer: B
Explanation: SQL uses AVG() for averages and GROUP BY for aggregations.
Q22: What’s wrong with this query?
SELECT user_id, COUNT(*) FROM transactions WHERE COUNT(*) > 5 GROUP BY user_id
A) Missing HAVING clause
B) WHERE can’t filter aggregates
C) Both A and B
D) Nothing – it’s correct
✅ Correct Answer: C
Explanation: Aggregate filters require HAVING, not WHERE. Correct version:
SELECT user_id, COUNT(*) FROM transactions GROUP BY user_id HAVING COUNT(*) > 5
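The same logic in the DataFrame API (a sketch, assuming a hypothetical transactions DataFrame) filters after the aggregation:
from pyspark.sql.functions import count, col

transactions.groupBy("user_id").agg(count("*").alias("cnt")).filter(col("cnt") > 5)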
Section 9: Debugging & Errors
Q23: You get “Out of Memory” errors. What’s the LEAST likely solution?
A) Increase spark.executor.memory
B) Use coalesce() to reduce partitions
C) Cache fewer datasets
D) Set spark.sql.shuffle.partitions=10
✅ Correct Answer: D
Explanation: Too few shuffle partitions (D) causes oversized partitions (worsens OOMs). Solutions A-C directly address memory.
Q24: A task fails with “FileNotFoundException”. What’s the most likely cause?
A) Driver ran out of memory
B) Input file was moved/deleted during job
C) Executor CPU overloaded
D) Shuffle partitions too small
✅ Correct Answer: B
Explanation: Spark reads file paths at runtime. If files disappear mid-job, this error occurs.
Section 10: Delta Lake (Databricks-Specific)
Q25: How do you time-travel to view Delta table data as of yesterday?
A) spark.read.delta("path", versionAsOf=1)
B) spark.read.format("delta").option("timestampAsOf", "2023-01-01").load("path")
C) spark.sql("SELECT * FROM delta.
path VERSION AS OF 1")
D) Both B and C
✅ Correct Answer: D
Explanation: Delta supports time travel via timestamp or version number (both syntaxes work).
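Both forms, sketched with a hypothetical table path:
spark.read.format("delta").option("versionAsOf", 1).load("/path/to/table")
spark.sql("SELECT * FROM delta.`/path/to/table` TIMESTAMP AS OF '2023-01-01'")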
Section 11: Structured Streaming
Q26: You’re processing a live stream of IoT data. How do you handle late-arriving data?
A) Increase spark.streaming.blockInterval
B) Use withWatermark() and window functions
C) Set spark.sql.streaming.forceDeleteTempCheckpointLocation=true
D) Disable spark.streaming.backpressure.enabled
✅ Correct Answer: B
Explanation: Watermarking (withWatermark()) defines how late data can arrive while still being included in windowed aggregations. Example:
df.withWatermark("eventTime", "10 minutes").groupBy(window("eventTime", "5 minutes")).count()
Q27: What happens if you don’t specify a watermark in streaming aggregation?
A) Spark automatically sets a 1-minute watermark
B) State information grows indefinitely
C) The query fails immediately
D) Late data is always processed
✅ Correct Answer: B
Explanation: Without watermarking, Spark must maintain all state forever (memory risk). Watermark tells Spark when to discard old state.
Section 12: Advanced Optimizations
Q28: Your DataFrame join has skew (one key has 10M rows, others have 100). How do you fix it?
A) df.repartition(200)
B) spark.sql.adaptive.skewJoin.enabled=true (Spark 3.0+)
C) df.hint("skew", "join_key", [10000000])
D) Both B and C
✅ Correct Answer: D
Explanation: Spark 3.0+ can handle skew automatically (B), or you can manually salt keys (C). Example manual fix:
from pyspark.sql.functions import concat, col, lit, floor, rand
# Add a random salt to the skewed join key
skewed_df = df.withColumn("salted_key", concat(col("join_key"), lit("_"), floor(rand() * 10)))
Q29: When should you use .persist(StorageLevel.DISK_ONLY)?
A) When RAM is limited but SSD storage is fast
B) For iterative machine learning workflows
C) When recomputation is cheaper than disk I/O
D) Never – MEMORY_ONLY is always better
✅ Correct Answer: A
Explanation: DISK_ONLY is useful when cluster memory is exhausted but you have fast SSDs. Common in large shuffle stages.
Section 13: Spark SQL Internals
Q30: What does this query plan indicate?
== Physical Plan ==
*(2) Project [id#0L, (id#0L * 2) AS double_id#2L]
+- *(1) Filter (id#0L > 5)
   +- *(1) Scan csv [id#0L]
A) Projection happens before filtering
B) Filtering happens before projection
C) The query uses a broadcast join
D) The plan is invalid
✅ Correct Answer: B
Explanation: Spark optimizes by filtering (id > 5) first, then projecting (id * 2). The *(1)/*(2) notation shows execution steps.
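A plan like the one above can be reproduced roughly as follows (a sketch assuming a hypothetical CSV of ids read with an explicit schema):
df = spark.read.csv("/tmp/ids.csv", schema="id LONG")
df.filter(df.id > 5).select(df.id, (df.id * 2).alias("double_id")).explain()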
Section 14: Error Handling
Q31: Your job fails with “ExecutorLostFailure: Container killed by YARN for exceeding memory limits”. What’s the best fix?
A) Decrease spark.executor.memoryOverhead
B) Increase spark.executor.memory
C) Add more filter() operations early
D) All of the above
✅ Correct Answer: D
Explanation: This error means executors exceeded memory. Solutions:
- Increase memory allocation (B)
- Reduce overhead (A)
- Process less data via early filtering (C)
Section 15: Delta Lake Deep Dive
Q32: How do you compact small files in Delta Lake?
A) OPTIMIZE delta.`/path/to/table` ZORDER BY timestamp
B) VACUUM delta.`/path/to/table` RETAIN 168 HOURS
C) ALTER TABLE delta.`/path/to/table` COMPACT SMALLFILES
D) spark.sql("REPAIR TABLE delta.`/path/to/table`")
✅ Correct Answer: A
Explanation: OPTIMIZE combines small files and can co-locate data via ZORDER BY. VACUUM (B) only removes old files.
Section 16: PySpark Specifics
Q33: Why is this UDF slow compared to native Spark functions?
from pyspark.sql.functions import udf
slow_udf = udf(lambda x: x.upper())
df.withColumn("upper", slow_udf("name"))
A) UDFs can’t be vectorized
B) Data must be serialized to Python and back
C) No predicate pushdown
D) All of the above
✅ Correct Answer: D
Explanation: Python UDFs (vs. native Spark SQL functions) suffer from:
- Serialization overhead (B)
- No whole-stage code generation (A)
- Broken optimization (C)
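Two commonly suggested alternatives (a sketch, assuming the same df with a name column): prefer the built-in upper() function, or, if custom Python logic is unavoidable, a vectorized pandas UDF that transfers data in Arrow batches:
import pandas as pd
from pyspark.sql.functions import col, upper, pandas_udf

df.withColumn("upper", upper(col("name")))    # native function: no Python round-trip

@pandas_udf("string")
def upper_udf(s: pd.Series) -> pd.Series:     # vectorized UDF: batched Arrow transfer
    return s.str.upper()

df.withColumn("upper", upper_udf(col("name")))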
Section 17: Cluster Configuration
Q34: Your 100-core cluster uses only 10 cores during execution. What’s likely misconfigured?
A) spark.executor.cores=10 with 10 executors
B) spark.default.parallelism=10
C) spark.dynamicAllocation.enabled=true
D) spark.sql.shuffle.partitions=200
✅ Correct Answer: B
Explanation: spark.default.parallelism controls the initial partition count for RDDs. Too low = underutilization. For 100 cores, set it to at least 200-300.
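Because spark.default.parallelism is read when the context starts, it is usually set at session-build or spark-submit time; a hedged sketch:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("parallelism-example")
         .config("spark.default.parallelism", "300")
         .config("spark.sql.shuffle.partitions", "300")
         .getOrCreate())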
Section 18: Advanced Troubleshooting
Q35: Your Spark UI shows tasks stuck in “Scheduling” state. What’s the most likely cause?
A) Too many executors idle
B) Driver network latency
C) Insufficient cluster resources
D) DataFrame caching is enabled
✅ Correct Answer: C
Diagnosis: When tasks remain scheduled but not executing, the cluster lacks resources (CPU/memory). Check:
- spark.executor.instances vs. available cluster nodes
- YARN/Mesos/K8s resource queues
- spark.dynamicAllocation.enabled=true to auto-scale
Section 19: Query Execution Mysteries
Q36: Why does this query scan all files despite the filter?
df = spark.read.parquet("/data/year=*/month=*")
df.filter(col("year") == 2023).count()
A) Partition pruning isn’t supported with wildcards
B) The filter is applied after file scan
C) Parquet file metadata is corrupted
D) Need to use .where() instead of .filter()
✅ Correct Answer: A
Solution: Wildcards disable partition pruning. Instead use:
spark.read.parquet("/data/year=2023/month=*")
Section 20: Performance Paradoxes
Q37: Adding .cache() makes your job SLOWER. Why?
A) Cache overhead exceeds recomputation cost
B) Executors are over-provisioned
C) Disk persistence is enabled
D) Shuffle partitions are too small
✅ Correct Answer: A
Rule of Thumb: Only cache when:
- Data is reused multiple times
- Recomputation is expensive (complex transformations)
- The cluster has available memory
Section 21: API Gotchas
Q38: Why does this fail at runtime?
df.select("price").filter(col("price") > 100).show()
A) Column object vs. string mismatch
B) Missing parentheses
C) show() isn’t an action
D) Need to register temp view first
✅ Correct Answer: A
Fixed Version:
df.select(col("price")).filter(col("price") > 100).show() # OR df.select("price").filter("price > 100").show()
Section 22: Resource Tuning
Q39: Your executors keep crashing. Which config is most critical to adjust?
A) spark.executor.memoryOverhead
B) spark.sql.shuffle.partitions
C) spark.default.parallelism
D) spark.log.level
✅ Correct Answer: A
Memory Tuning Guide:
- Set spark.executor.memory (heap)
- Add 10-20% as memoryOverhead (off-heap)
- Monitor peak usage in the Spark UI
Section 23: The Infamous OOM
Q40: Driver OOMs during collect(). What’s the safest alternative?
A) Increase driver memory
B) Use .take(1000) or .show()
C) Switch to RDD API
D) Disable DAG visualization
✅ Correct Answer: B
Golden Rule: Never collect() large datasets to the driver. Prefer:
- .take(N) for samples
- Writing to storage (df.write.parquet()) for full exports
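A minimal sketch of both alternatives, with a hypothetical output path:
sample_rows = df.take(1000)                               # small, bounded sample on the driver
df.write.mode("overwrite").parquet("/tmp/full_export")    # full export stays distributed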
Section 24: Schema Nightmares
Q41: Merging 1000 Parquet files fails with schema mismatch. How to resolve?
A) spark.sql.parquet.mergeSchema=true
B) Manually edit each file’s schema
C) Convert to CSV first
D) Use df.schema() to override
✅ Correct Answer: A
Alternative: spark.read.option("mergeSchema", "true").parquet("/path")
Section 25: The Ultimate Optimization
Q42: Your 4-hour job must run in 1 hour. What’s the first thing to try?
A) Increase executor count and memory
B) Rewrite in Scala
C) Switch to Pandas UDFs
D) Pre-warm the cluster
✅ Correct Answer: A
Optimization Hierarchy:
- Throw resources at it (easiest)
- Fix data skew
- Optimize shuffle partitions
- Improve algorithms (last resort)
Bonus: 3 Pro Tips for the Exam
- Memorize these key configs:
spark.sql.shuffle.partitions      # Default: 200
spark.executor.memory             # Default: 1g
spark.dynamicAllocation.enabled   # Default: false
- Know the execution order: WHERE → GROUP BY → HAVING → SELECT → ORDER BY
- When in doubt:
  - For performance: check the Spark UI
  - For errors: search the 300+ MB of Spark logs
Final Challenge
Q43: What’s the output of this code?
spark.range(5).repartition(3).rdd.glom().collect()
A) [[0,1,2,3,4]]
B) [[0], [1], [2,3,4]]
C) Array of 3 sub-arrays with distributed values
D) Throws an error
✅ Correct Answer: C
Explanation:
- glom() shows partition contents
- Repartitioning splits the range unpredictably
- Example output: [[0, 3], [1, 4], [2]]
Why is Pass4Certs the best choice for certification exam preparation?
Pass4Certs is dedicated to providing practice test questions with answers, free of charge, unlike other websites. To see the full study material, you simply need to sign up for a free account on Pass4Certs. Many customers around the world earn high grades using our dumps. We offer a 100% passing and money-back guarantee on the exam. PDF files are available immediately after purchase.
A Central Tool to Help You Prepare for Exam
Pass4Certs.com is a one-stop resource for exam preparation. We carefully curate exam questions and answers, which are regularly updated and verified by experts. Our exam dumps experts, who come from a variety of well-known organizations, are qualified professionals who have reviewed the most important exam questions and answers to help you understand the concepts and pass the certification exam with good marks. Braindumps are the most effective way to prepare for your test in only one day.
User Friendly & Easily Accessible on Mobile Devices
The exam platform is easy to use and accessible from mobile devices. Our primary goal is to provide the latest, accurate, updated, and genuinely helpful study material. Students can use this material to study and successfully navigate the implementation and support of systems. Students get access to authentic test questions and answers, available for download in PDF format immediately after purchase. As long as your mobile device has an internet connection, you can study on this mobile-friendly website.
Dumps Are Verified by Industry Experts
Get Access to the Most Recent and Accurate Questions and Answers Right Away:
Our exam database is frequently updated throughout the year to include the most recent exam questions and answers. Each exam page shows the date at the top of the page along with the updated list of questions and answers. You will pass the test on your first attempt thanks to the authenticity of the current exam questions.
Exam dumps are checked by industry professionals who are dedicated to providing the right test questions and answers with brief explanations. Every question and answer is verified by experts: highly qualified individuals with extensive professional experience in the vendor's examinations.
Pass4Certs.com delivers the best exam questions with detailed explanations, in contrast to many other exam web portals.
Money Back Guarantee
Pass4Certs.com is committed to providing quality braindumps that will help you pass the test and earn certification. To give you the best method of preparation for the exam, we provide the most recent and realistic test questions from current examinations. If you purchase the entire PDF file but fail the vendor exam, you can get your money back or have your exam replaced. Visit our guarantee page for more information on our straightforward money-back guarantee.