
    Apache Spark for Machine Learning | by Shitanshu Pandey | Apr, 2025

    By Team_AIBS News | April 10, 2025 | 3 Mins Read


    Spark provides a machine learning library (Spark ML / MLlib) that simplifies building and training models at scale.

    1. Regression

    Supports algorithms like Linear Regression, Random Forest, and Gradient-Boosted Trees to predict continuous values (e.g., price or quantity).

    Workflow Steps:

    1. Prepare data and create a SparkSession.
    2. Read and preprocess CSV data.
    3. Use VectorAssembler to combine features.
    4. Split data into train/test sets.
    5. Fit the regression model.
    6. Evaluate using metrics (RMSE, R², MAE).
    # 1. Create Spark Session & Import Libraries
    # findspark is only needed when pyspark is not already importable
    # (for example, with a manually unpacked Spark installation).
    import findspark
    findspark.init()

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.evaluation import RegressionEvaluator

    # Create Spark session
    spark = SparkSession.builder.appName("RegressionExample").getOrCreate()

    # 2. Load and Inspect Data
    # Replace 'my_regression_data.csv' with the actual data file
    data = spark.read.csv("my_regression_data.csv", header=True, inferSchema=True)
    data.show(5)
    data.printSchema()

    # 3. Assemble Features and Label
    # Replace the placeholder feature names with your numeric columns
    assembler = VectorAssembler(
        inputCols=["feature1", "feature2", "feature3"],
        outputCol="features"
    )
    assembled_data = assembler.transform(data)

    # Rename the actual target column to "label" if needed
    final_data = assembled_data.withColumnRenamed("target", "label")
    final_data.select("features", "label").show(5, truncate=False)

    # 4. Split Data into Train/Test
    train_data, test_data = final_data.randomSplit([0.8, 0.2], seed=42)

    # 5. Build and Train the Model
    lr = LinearRegression(featuresCol="features", labelCol="label")
    model = lr.fit(train_data)

    # 6. Make Predictions & Evaluate
    predictions = model.transform(test_data)

    evaluator_r2 = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="r2")
    r2 = evaluator_r2.evaluate(predictions)

    evaluator_rmse = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
    rmse = evaluator_rmse.evaluate(predictions)

    # metricName="mae" reports mean absolute error in the same way
    print(f"R^2: {r2}")
    print(f"RMSE: {rmse}")

    # 7. Stop the Session
    spark.stop()
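
To make the regression metrics named in the workflow (RMSE, R², MAE) concrete, here is a minimal sketch in plain Python, independent of Spark. The function name `regression_metrics` and the sample numbers are hypothetical and exist only for illustration; they do not come from the article's dataset.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")

    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # RMSE: square root of the mean squared error.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    # MAE: mean of the absolute errors.
    mae = sum(abs(e) for e in errors) / n

    # R^2: 1 minus the ratio of residual to total sum of squares.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all labels are identical")
    r2 = 1.0 - ss_res / ss_tot

    return rmse, mae, r2


# Hypothetical labels and predictions, for illustration only.
labels = [3.0, 5.0, 7.0, 9.0]
preds = [2.0, 5.0, 8.0, 9.0]
rmse, mae, r2 = regression_metrics(labels, preds)
print(f"RMSE={rmse:.4f} MAE={mae:.4f} R2={r2:.4f}")
```

RMSE penalizes large errors more heavily than MAE because errors are squared before averaging, which is one reason to look at both.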

    2. Classification

    • Ideal for tasks like fraud detection, spam detection, or image classification.
    • Common algorithms: Logistic Regression, Decision Trees, Random Forest, Gradient-Boosted Trees.
    • Evaluate models with metrics like Accuracy, Precision, Recall, and F1-score.
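
For intuition about these metrics, the following plain-Python sketch computes them for a binary problem from true and predicted labels. The function name `binary_classification_metrics` and the sample labels are hypothetical. Note that Spark's `MulticlassClassificationEvaluator` reports `f1` as an average weighted across labels, so its value will generally differ from the positive-class F1 computed here.

```python
def binary_classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for the `positive` class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero (no predicted/actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels and predictions, for illustration only.
labels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
preds = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
for name, value in binary_classification_metrics(labels, preds).items():
    print(f"{name} = {value:.4f}")
```

Precision answers "of the rows flagged positive, how many were right?", while recall answers "of the truly positive rows, how many were found?"; F1 is their harmonic mean.
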
    # 1. Create Spark Session & Import Libraries
    # findspark is only needed when pyspark is not already importable
    import findspark
    findspark.init()

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler, StringIndexer
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    spark = SparkSession.builder.appName("ClassificationExample").getOrCreate()

    # 2. Load and Inspect Data
    # Replace 'my_classification_data.csv' with the actual data file
    data = spark.read.csv("my_classification_data.csv", header=True, inferSchema=True)
    data.show(5)
    data.printSchema()

    # 3. Handle the Label Column and Assemble Features
    # Convert categorical label column (e.g., Class) to numerical.
    indexer = StringIndexer(inputCol="Class", outputCol="label")
    indexed_data = indexer.fit(data).transform(data)

    assembler = VectorAssembler(
        inputCols=["feature1", "feature2", "feature3"],
        outputCol="features"
    )
    assembled_data = assembler.transform(indexed_data)

    final_data = assembled_data.select("features", "label")
    final_data.show(5, truncate=False)

    # 4. Split Data into Train/Test
    train_data, test_data = final_data.randomSplit([0.7, 0.3], seed=42)

    # 5. Train a Logistic Regression Model
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    model = lr.fit(train_data)

    # 6. Predict & Evaluate
    predictions = model.transform(test_data)

    # Accuracy
    acc_evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy"
    )
    accuracy = acc_evaluator.evaluate(predictions)
    print(f"Accuracy = {accuracy}")

    # F1 Score
    # metricName also accepts "weightedPrecision" and "weightedRecall"
    f1_evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="f1"
    )
    f1_score = f1_evaluator.evaluate(predictions)
    print(f"F1 Score = {f1_score}")

    # 7. Stop Spark Session
    spark.stop()
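
The label-indexing step in the example above turns string classes into numeric labels. As a rough illustration of what that mapping looks like, here is a plain-Python sketch that assigns index 0.0 to the most frequent value, 1.0 to the next, and so on, which mirrors how `StringIndexer`'s default frequency-descending ordering is documented. The helper name `fit_label_index`, the sample classes, and the alphabetical tie-breaking are assumptions of this sketch; check the behaviour of your Spark version before relying on a specific index.

```python
from collections import Counter


def fit_label_index(values):
    """Map each distinct string to a float index, most frequent value first.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


# Hypothetical class column, for illustration only.
classes = ["spam", "ham", "ham", "spam", "ham", "promo"]
mapping = fit_label_index(classes)
indexed = [mapping[c] for c in classes]
print(mapping)
print(indexed)
```

Because the index depends on label frequencies in the data the indexer was fitted on, the same string can receive a different number on a different dataset, so the fitted mapping should be reused rather than refitted at prediction time.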

    3. Clustering

    • Uses unsupervised learning to group data points based on similarity.
    • Popular algorithms: K-Means, Gaussian Mixture Models.
    • Spark ML simplifies cluster creation and evaluation (e.g., by computing “silhouette” scores).
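
To show what "grouping by similarity" means for K-Means, here is a small plain-Python sketch of the basic assign-and-update loop. It is not Spark's implementation: Spark's `KMeans` is distributed and uses its own initialization by default, whereas this sketch takes the initial centers as given. All function names and sample points are hypothetical.

```python
import math


def assign_clusters(points, centers):
    """Return, for each point, the index of its nearest center (Euclidean)."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]


def update_centers(points, assignments, centers):
    """Move each center to the mean of its points; empty clusters stay put."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, a in zip(points, assignments) if a == j]
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(tuple(old))
    return new_centers


def kmeans(points, initial_centers, max_iter=20):
    """Alternate assignment and update until the centers stop moving."""
    centers = [tuple(c) for c in initial_centers]
    assignments = assign_clusters(points, centers)
    for _ in range(max_iter):
        new_centers = update_centers(points, assignments, centers)
        if new_centers == centers:
            break
        centers = new_centers
        assignments = assign_clusters(points, centers)
    return centers, assignments


# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, assignments = kmeans(points, initial_centers=[(0.0, 0.0), (10.0, 10.0)])
print(assignments)
print(centers)
```
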
    # 1. Create Spark Session & Import Libraries
    # findspark is only needed when pyspark is not already importable
    import findspark
    findspark.init()

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("ClusteringExample").getOrCreate()

    # 2. Load and Inspect Data
    # Replace 'my_clustering_data.csv' with the actual data file
    data = spark.read.csv("my_clustering_data.csv", header=True, inferSchema=True)
    data.show(5)
    data.printSchema()

    # 3. Assemble Features
    assembler = VectorAssembler(
        inputCols=["feature1", "feature2", "feature3"],
        outputCol="features"
    )
    final_data = assembler.transform(data)
    final_data.show(5, truncate=False)

    # 4. Train the K-Means Model
    kmeans = KMeans(featuresCol="features", k=3, seed=42)  # 3 clusters, for example
    model = kmeans.fit(final_data)

    # 5. Make Predictions & Examine Results
    predictions = model.transform(final_data)
    predictions.select("features", "prediction").show(5, truncate=False)

    # Optional: cluster centers
    centers = model.clusterCenters()
    print("Cluster Centers:")
    for idx, center in enumerate(centers):
        print(f"Center of cluster {idx}: {center}")

    # 6. Stop Spark Session
    spark.stop()
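
The clustering example stops at printing the cluster centers, so here is a plain-Python sketch of the silhouette score mentioned earlier, to show what the number measures. For each point it compares the mean distance to its own cluster (`a`) with the mean distance to the nearest other cluster (`b`). The function name and sample points are hypothetical, and this sketch uses plain Euclidean distance. In Spark the equivalent evaluation is available through `pyspark.ml.evaluation.ClusteringEvaluator`, which according to the PySpark documentation defaults to squared Euclidean distance, so its values will not match this sketch exactly.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        largest = max(a, b)
        scores.append((b - a) / largest if largest else 0.0)
    return sum(scores) / len(scores)


# Hypothetical points: two tight pairs that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(silhouette_score(points, [0, 0, 1, 1]))  # sensible grouping
print(silhouette_score(points, [0, 1, 0, 1]))  # poor grouping
```

Scores close to 1 indicate compact, well-separated clusters, while scores near or below 0 suggest points sit closer to another cluster than to their own; comparing the score across several values of `k` is one common way to choose the number of clusters.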


