
Apache Spark for Machine Learning
by Shitanshu Pandey | April 2025



Spark provides a machine learning library (Spark ML/MLlib) that simplifies building and training models at scale.

    1. Regression

Supports algorithms like Linear Regression, Random Forest, and Gradient-Boosted Trees to predict continuous values (e.g., price or quantity).

    Workflow Steps:

1. Prepare the data and create a SparkSession.
2. Read and preprocess the CSV data.
3. Use VectorAssembler to combine features into a single vector column.
4. Split the data into train/test sets.
5. Fit the regression model.
6. Evaluate using metrics (RMSE, R², MAE); MAE is shown in the short sketch after the example.
# 1. Create Spark Session & Import Libraries
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Create Spark session
spark = SparkSession.builder.appName("RegressionExample").getOrCreate()

# 2. Load and Inspect Data
# Replace 'my_regression_data.csv' with the actual data file
data = spark.read.csv("my_regression_data.csv", header=True, inferSchema=True)
data.show(5)
data.printSchema()

# 3. Assemble Features and Label
assembler = VectorAssembler(
    inputCols=["feature1", "feature2", "feature3"],
    outputCol="features"
)
assembled_data = assembler.transform(data)

# Rename the actual target column to "label" if needed
final_data = assembled_data.withColumnRenamed("target", "label")
final_data.select("features", "label").show(5, truncate=False)

# 4. Split Data into Train/Test
train_data, test_data = final_data.randomSplit([0.8, 0.2], seed=42)

# 5. Build and Train the Model
lr = LinearRegression(featuresCol="features", labelCol="label")
model = lr.fit(train_data)

# 6. Make Predictions & Evaluate
predictions = model.transform(test_data)

evaluator_r2 = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="r2")
r2 = evaluator_r2.evaluate(predictions)

evaluator_rmse = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator_rmse.evaluate(predictions)

print(f"R^2: {r2}")
print(f"RMSE: {rmse}")

# 7. Stop the Session
spark.stop()
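
Step 6 lists MAE alongside RMSE and R², and the section opener names Random Forest and Gradient-Boosted Trees as alternatives to Linear Regression. A minimal sketch of both, assuming it runs before spark.stop() and reuses train_data, test_data, predictions, and evaluator_rmse from the example above:

from pyspark.ml.regression import RandomForestRegressor

# MAE is another built-in RegressionEvaluator metric name
evaluator_mae = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="mae")
print(f"MAE: {evaluator_mae.evaluate(predictions)}")

# Tree-based regressors are drop-in replacements with the same fit/transform API;
# numTrees=50 is an arbitrary illustrative choice
rf = RandomForestRegressor(featuresCol="features", labelCol="label", numTrees=50)
rf_model = rf.fit(train_data)
rf_predictions = rf_model.transform(test_data)
print(f"Random Forest RMSE: {evaluator_rmse.evaluate(rf_predictions)}")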

    2. Classification

• Ideal for tasks like fraud detection, spam detection, or image classification.
• Common algorithms: Logistic Regression, Decision Trees, Random Forest, Gradient-Boosted Trees.
• Evaluate models with metrics like Accuracy, Precision, Recall, and F1-score; precision and recall are shown in the short sketch after the example.
# 1. Create Spark Session & Import Libraries
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("ClassificationExample").getOrCreate()

# 2. Load and Inspect Data
data = spark.read.csv("my_classification_data.csv", header=True, inferSchema=True)
data.show(5)
data.printSchema()

# 3. Handle the Label Column and Assemble Features
# Convert the categorical label column (e.g., Class) to a numerical one.
indexer = StringIndexer(inputCol="Class", outputCol="label")
indexed_data = indexer.fit(data).transform(data)

assembler = VectorAssembler(
    inputCols=["feature1", "feature2", "feature3"],
    outputCol="features"
)
assembled_data = assembler.transform(indexed_data)

final_data = assembled_data.select("features", "label")
final_data.show(5, truncate=False)

# 4. Split Data into Train/Test
train_data, test_data = final_data.randomSplit([0.7, 0.3], seed=42)

# 5. Train a Logistic Regression Model
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(train_data)

# 6. Predict & Evaluate
predictions = model.transform(test_data)

# Accuracy
acc_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy"
)
accuracy = acc_evaluator.evaluate(predictions)
print(f"Accuracy = {accuracy}")

# F1 Score
f1_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="f1"
)
f1_score = f1_evaluator.evaluate(predictions)
print(f"F1 Score = {f1_score}")

# 7. Stop Spark Session
spark.stop()
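
The bullet list above also names Precision and Recall. A minimal sketch, assuming it runs before spark.stop() and reuses the predictions DataFrame from the example; MulticlassClassificationEvaluator exposes the weighted (per-class-averaged) variants as metric names:

# Weighted precision and recall are additional built-in metric names
precision_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="weightedPrecision"
)
recall_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="weightedRecall"
)
print(f"Weighted Precision = {precision_evaluator.evaluate(predictions)}")
print(f"Weighted Recall = {recall_evaluator.evaluate(predictions)}")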

    3. Clustering

• Uses unsupervised learning to group data points based on similarity.
• Popular algorithms: K-Means, Gaussian Mixture Models.
• Spark ML simplifies cluster creation and evaluation (e.g., by computing “silhouette” scores; see the sketch after the example).
# 1. Create Spark Session & Import Libraries
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("ClusteringExample").getOrCreate()

# 2. Load and Inspect Data
data = spark.read.csv("my_clustering_data.csv", header=True, inferSchema=True)
data.show(5)
data.printSchema()

# 3. Assemble Features
assembler = VectorAssembler(
    inputCols=["feature1", "feature2", "feature3"],
    outputCol="features"
)
final_data = assembler.transform(data)
final_data.show(5, truncate=False)

# 4. Train the K-Means Model
kmeans = KMeans(k=3, seed=42)  # 3 clusters, for example
model = kmeans.fit(final_data)

# 5. Make Predictions & Examine Results
predictions = model.transform(final_data)
predictions.select("features", "prediction").show(5, truncate=False)

# Optional: cluster centers
centers = model.clusterCenters()
print("Cluster Centers:")
for idx, center in enumerate(centers):
    print(f"Center of cluster {idx}: {center}")

# 6. Stop Spark Session
spark.stop()
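
The bullet list above mentions “silhouette” scores. A minimal sketch, assuming it runs before spark.stop() and reuses the predictions DataFrame from the example; Spark ML's ClusteringEvaluator computes the silhouette score by default:

from pyspark.ml.evaluation import ClusteringEvaluator

# ClusteringEvaluator defaults to the silhouette metric
# (with squared Euclidean distance)
evaluator = ClusteringEvaluator(featuresCol="features", predictionCol="prediction")
silhouette = evaluator.evaluate(predictions)
print(f"Silhouette score: {silhouette}")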


