Spark provides a machine learning library (Spark ML/MLlib) that simplifies building and training models at scale.
1. Regression
Supports algorithms like Linear Regression, Random Forest, and Gradient-Boosted Trees to predict continuous values (e.g., price or quantity).
Workflow Steps:
- Prepare data and create a SparkSession.
- Read and preprocess the CSV data.
- Use VectorAssembler to combine features into a single vector column.
- Split the data into train/test sets.
- Fit the regression model.
- Evaluate using metrics (RMSE, R², MAE).
# 1. Create Spark Session & Import Libraries
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
# Create Spark session
spark = SparkSession.builder.appName("RegressionExample").getOrCreate()
# 2. Load and Inspect Data
# Replace 'my_regression_data.csv' with the actual data file
data = spark.read.csv("my_regression_data.csv", header=True, inferSchema=True)
data.show(5)
data.printSchema()
# 3. Assemble Features and Label
assembler = VectorAssembler(
    inputCols=["feature1", "feature2", "feature3"],
    outputCol="features"
)
assembled_data = assembler.transform(data)
# Rename the actual target column to "label" if needed
final_data = assembled_data.withColumnRenamed("target", "label")
final_data.select("features", "label").show(5, truncate=False)
# 4. Split Data into Train/Test
train_data, test_data = final_data.randomSplit([0.8, 0.2], seed=42)
# 5. Build and Train the Model
lr = LinearRegression(featuresCol="features", labelCol="label")
model = lr.fit(train_data)
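# Note: Random Forest and Gradient-Boosted Trees (mentioned above) share the
# same fit/transform API; a minimal sketch swapping in a Random Forest:
from pyspark.ml.regression import RandomForestRegressor
rf_model = RandomForestRegressor(featuresCol="features", labelCol="label").fit(train_data)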
# 6. Make Predictions & Consider
predictions = mannequin.remodel(test_data)
evaluator_r2 = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="r2")
r2 = evaluator_r2.consider(predictions)
evaluator_rmse = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator_rmse.consider(predictions)
print(f"R^2: {r2}")
print(f"RMSE: {rmse}")
# 7. Stop the Session
spark.stop()
2. Classification
- Ideal for tasks like fraud detection, spam detection, or image classification.
- Common algorithms: Logistic Regression, Decision Trees, Random Forest, Gradient-Boosted Trees.
- Evaluate models with metrics like Accuracy, Precision, Recall, and F1-score.
# 1. Create Spark Session & Import Libraries
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
spark = SparkSession.builder.appName("ClassificationExample").getOrCreate()
# 2. Load and Inspect Data
data = spark.read.csv("my_classification_data.csv", header=True, inferSchema=True)
data.show(5)
data.printSchema()
# 3. Handle the Label Column and Assemble Features
# Convert the categorical label column (e.g., Class) to a numeric index.
indexer = StringIndexer(inputCol="Class", outputCol="label")
indexed_data = indexer.fit(data).transform(data)
assembler = VectorAssembler(
    inputCols=["feature1", "feature2", "feature3"],
    outputCol="features"
)
assembled_data = assembler.transform(indexed_data)
final_data = assembled_data.select("features", "label")
final_data.show(5, truncate=False)
# 4. Split Data into Train/Test
train_data, test_data = final_data.randomSplit([0.7, 0.3], seed=42)
# 5. Train a Logistic Regression Model
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(train_data)
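# The other classifiers listed above (Decision Trees, Random Forest,
# Gradient-Boosted Trees) are drop-in replacements here; a minimal sketch:
from pyspark.ml.classification import RandomForestClassifier
rf_model = RandomForestClassifier(featuresCol="features", labelCol="label").fit(train_data)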
# 6. Predict & Consider
predictions = mannequin.remodel(test_data)
# Accuracy
acc_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy"
)
accuracy = acc_evaluator.evaluate(predictions)
print(f"Accuracy = {accuracy}")
# F1 Score
f1_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="f1"
)
f1_score = f1_evaluator.evaluate(predictions)
print(f"F1 Score = {f1_score}")
# 7. Stop Spark Session
spark.stop()
3. Clustering
- Uses unsupervised learning to group data points based on similarity.
- Popular algorithms: K-Means, Gaussian Mixture Models.
- Spark ML simplifies cluster creation and evaluation (e.g., by computing “silhouette” scores, as shown below).
# 1. Create Spark Session & Import Libraries
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
spark = SparkSession.builder.appName("ClusteringExample").getOrCreate()
# 2. Load and Inspect Data
data = spark.read.csv("my_clustering_data.csv", header=True, inferSchema=True)
data.show(5)
data.printSchema()
# 3. Assemble Features
assembler = VectorAssembler(
    inputCols=["feature1", "feature2", "feature3"],
    outputCol="features"
)
final_data = assembler.transform(data)
final_data.show(5, truncate=False)
# 4. Train the K-Means Model
kmeans = KMeans(k=3, seed=42)  # 3 clusters, for example
model = kmeans.fit(final_data)
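# Gaussian Mixture Models (mentioned above) follow the same pattern; a sketch:
from pyspark.ml.clustering import GaussianMixture
gmm_model = GaussianMixture(k=3, seed=42).fit(final_data)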
# 5. Make Predictions & Study Outcomes
predictions = mannequin.remodel(final_data)
predictions.choose("options", "prediction").present(5, truncate=False)
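# The “silhouette” score mentioned above can be computed with ClusteringEvaluator;
# a minimal sketch (the default metric is silhouette with squared Euclidean distance):
from pyspark.ml.evaluation import ClusteringEvaluator
silhouette = ClusteringEvaluator(featuresCol="features", predictionCol="prediction").evaluate(predictions)
print(f"Silhouette score: {silhouette}")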
# Optional: cluster centers
centers = model.clusterCenters()
print("Cluster Centers:")
for idx, center in enumerate(centers):
    print(f"Center of cluster {idx}: {center}")
# 6. Stop Spark Session
spark.stop()