Spark provides a machine learning library (Spark ML/MLlib) that simplifies building and training models at scale.
1. Regression
Supports algorithms like Linear Regression, Random Forest, and Gradient-Boosted Trees to predict continuous values (e.g., price or quantity).
Workflow Steps:
- Prepare data and create a SparkSession.
- Read and preprocess the CSV data.
- Use VectorAssembler to combine features into a single vector column.
- Split the data into train/test sets.
- Fit the regression model.
- Evaluate using metrics (RMSE, R², MAE).
# 1. Create Spark Session & Import Libraries
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
# Create Spark session
spark = SparkSession.builder.appName("RegressionExample").getOrCreate()
# 2. Load and Inspect Data
# Replace 'my_regression_data.csv' with the actual data file
data = spark.read.csv("my_regression_data.csv", header=True, inferSchema=True)
data.show(5)
data.printSchema()
# 3. Assemble Features and Label
assembler = VectorAssembler(
    inputCols=["feature1", "feature2", "feature3"],
    outputCol="features"
)
assembled_data = assembler.transform(data)
# Rename the actual target column to "label" if needed
final_data = assembled_data.withColumnRenamed("target", "label")
final_data.select("features", "label").show(5, truncate=False)
# 4. Split Data into Train/Test
train_data, test_data = final_data.randomSplit([0.8, 0.2], seed=42)
# 5. Build and Train the Model
lr = LinearRegression(featuresCol="features", labelCol="label")
model = lr.fit(train_data)
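# Note: Random Forest and Gradient-Boosted Trees (mentioned above) share the
# same fit/transform API; a minimal sketch swapping in a Random Forest:
from pyspark.ml.regression import RandomForestRegressor
rf_model = RandomForestRegressor(featuresCol="features", labelCol="label").fit(train_data)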
# 6. Make Predictions & Consider
predictions = mannequin.remodel(test_data)
evaluator_r2 = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="r2")
r2 = evaluator_r2.consider(predictions)
evaluator_rmse = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator_rmse.consider(predictions)
print(f"R^2: {r2}")
print(f"RMSE: {rmse}")
# 7. Stop the Session
spark.stop()
2. Classification
- Ideal for tasks like fraud detection, spam detection, or image classification.
- Common algorithms: Logistic Regression, Decision Trees, Random Forest, Gradient-Boosted Trees.
- Evaluate models with metrics like Accuracy, Precision, Recall, and F1-score.
# 1. Create Spark Session & Import Libraries
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
spark = SparkSession.builder.appName("ClassificationExample").getOrCreate()
# 2. Load and Inspect Data
data = spark.read.csv("my_classification_data.csv", header=True, inferSchema=True)
data.show(5)
data.printSchema()
# 3. Handle the Label Column and Assemble Features
# Convert the categorical label column (e.g., Class) to a numeric index.
indexer = StringIndexer(inputCol="Class", outputCol="label")
indexed_data = indexer.fit(data).transform(data)
assembler = VectorAssembler(
    inputCols=["feature1", "feature2", "feature3"],
    outputCol="features"
)
assembled_data = assembler.transform(indexed_data)
final_data = assembled_data.select("features", "label")
final_data.show(5, truncate=False)
# 4. Split Data into Train/Test
train_data, test_data = final_data.randomSplit([0.7, 0.3], seed=42)
# 5. Train a Logistic Regression Model
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(train_data)
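# The other classifiers listed above (Decision Trees, Random Forest,
# Gradient-Boosted Trees) are drop-in replacements here; a minimal sketch:
from pyspark.ml.classification import RandomForestClassifier
rf_model = RandomForestClassifier(featuresCol="features", labelCol="label").fit(train_data)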
# 6. Predict & Consider
predictions = mannequin.remodel(test_data)
# Accuracy
acc_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy"
)
accuracy = acc_evaluator.evaluate(predictions)
print(f"Accuracy = {accuracy}")
# F1 Score
f1_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="f1"
)
f1_score = f1_evaluator.evaluate(predictions)
print(f"F1 Score = {f1_score}")
# 7. Stop Spark Session
spark.stop()
3. Clustering
- Uses unsupervised learning to group data points based on similarity.
- Popular algorithms: K-Means, Gaussian Mixture Models.
- Spark ML simplifies cluster creation and evaluation (e.g., by computing “silhouette” scores, as shown below).
# 1. Create Spark Session & Import Libraries
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
spark = SparkSession.builder.appName("ClusteringExample").getOrCreate()
# 2. Load and Inspect Data
data = spark.read.csv("my_clustering_data.csv", header=True, inferSchema=True)
data.show(5)
data.printSchema()
# 3. Assemble Features
assembler = VectorAssembler(
    inputCols=["feature1", "feature2", "feature3"],
    outputCol="features"
)
final_data = assembler.transform(data)
final_data.show(5, truncate=False)
# 4. Train the K-Means Model
kmeans = KMeans(k=3, seed=42)  # 3 clusters, for example
model = kmeans.fit(final_data)
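# Gaussian Mixture Models (mentioned above) follow the same pattern; a sketch:
from pyspark.ml.clustering import GaussianMixture
gmm_model = GaussianMixture(k=3, seed=42).fit(final_data)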
# 5. Make Predictions & Study Outcomes
predictions = mannequin.remodel(final_data)
predictions.choose("options", "prediction").present(5, truncate=False)
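# The “silhouette” score mentioned above can be computed with ClusteringEvaluator;
# a minimal sketch (the default metric is silhouette with squared Euclidean distance):
from pyspark.ml.evaluation import ClusteringEvaluator
silhouette = ClusteringEvaluator(featuresCol="features", predictionCol="prediction").evaluate(predictions)
print(f"Silhouette score: {silhouette}")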
# Optional: cluster centers
centers = model.clusterCenters()
print("Cluster Centers:")
for idx, center in enumerate(centers):
    print(f"Center of cluster {idx}: {center}")
# 6. Stop Spark Session
spark.stop()