
    Revise Smarter: Using Machine Learning to Unlock GCSE Maths. Part 2 | by Riley K | Jul, 2025

By Team_AIBS News | July 31, 2025 | 11 Mins Read


Scaling the Project | Automation

In the first part of this project we explored using machine learning to classify Maths GCSE questions by category:

Number

Algebra

Ratio, proportion, and rates of change

Geometry and measures

Statistics & Probability

If the reader is curious how we reached a proof of concept that machine learning is well equipped for classification problems such as the one described, then please refer to the first article here:

In Part 2, we look at scaling the project. Using automation, we grow our data set from 6 past papers to the entire catalogue of available Edexcel past papers, from 2010 onwards. This increased the data set from around 200 individual questions to 2117. We also sharpened the parsing process when converting the PDFs to text, allowing us to build our machine learning model faster and with ease. We see an improvement from 85% to 87% accuracy over the previous model. How did this happen? Let's get to it!

In Part 1, we used a small sample database of only 6 papers, with some filler questions generated by ChatGPT. Downloading the papers manually using links from the Maths Genie website was fine, as this wasn't very time consuming. However, growing the project to scale meant downloading all the papers available: for the Foundation papers this was 36, reaching back to 2017, and for the Higher papers the total was 68, reaching further back to 2009. Manually download all these papers?! Sounds like fun, right?

So we look at using automation, with a small lightweight script in Python to do the task for us. After inspecting the HTML code to understand the logic of how the download links are presented, we import the requests library (lets you easily download web pages and files from the internet) and the Beautiful Soup library (helps you extract specific information from messy web pages (HTML)):

import requests
from bs4 import BeautifulSoup
import os
from urllib.parse import urljoin

# CONFIGURE THIS:
base_url = "https://www.mathsgenie.co.uk/papers.php"  # Change to actual base URL
download_dir = "gcse_papers"

# Create folder to save PDFs
os.makedirs(download_dir, exist_ok=True)

# Get the HTML content
response = requests.get(base_url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find all tags with href ending in .pdf
pdf_links = []
for a_tag in soup.find_all('a', href=True):
    href = a_tag['href']
    if href.endswith('.pdf') and ('foundation' in href.lower() or 'f' in href.lower()):
        full_url = urljoin(base_url, href)
        pdf_links.append((full_url, a_tag.text.strip()))

print(f"Found {len(pdf_links)} foundation paper(s). Downloading...")

# Download the PDFs
for url, label in pdf_links:
    filename = os.path.join(download_dir, os.path.basename(url))
    if os.path.exists(filename):
        print(f"Already exists: {filename}")
        continue
    print(f"Downloading {label} -> {filename}")
    try:
        file_response = requests.get(url)
        with open(filename, 'wb') as f:
            f.write(file_response.content)
    except Exception as e:
        print(f"Failed to download {url}: {e}")

print("Download complete.")

The logic of the code is simple: if a filename with the extension .pdf is found, and the file contains an "f", then download it! This is because all the Foundation papers were in a format such as 1f2022, where the first character denotes the paper (1, 2 or 3), the second character denotes the level (Foundation or Higher), and the remaining characters give the year the paper was published. We download all the Foundation papers available, place them in a folder, and repeat the process for the Higher papers.
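To make the naming scheme concrete, here is a small hypothetical helper (my own illustration, not part of the project's code) that decodes a paper code under the convention just described:

```python
# Illustrative sketch: decode a filename such as "1f2022.pdf" under the
# convention described above (this helper is not from the article itself).
def decode_paper_code(name):
    stem = name.rsplit(".", 1)[0]    # drop the .pdf extension
    paper = int(stem[0])             # first character: paper 1, 2 or 3
    level = "Foundation" if stem[1].lower() == "f" else "Higher"
    year = int(stem[2:])             # remaining characters: year of publication
    return paper, level, year

print(decode_paper_code("1f2022.pdf"))  # (1, 'Foundation', 2022)
```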

With our Foundation and Higher papers organised in their respective folders, we turn to parsing the questions from PDF to text, to prepare them into a format which our machine learning model can understand and work with. Using the code from Part 1 as a starting point, we begin by manually extracting the questions from a small number of papers, to see if we are getting the results we expect.

Unfortunately, there appears to be a lot of noise in the text, including elements such as long runs of punctuation marks ("…….") which appear at the end of questions and sometimes in between questions, as well as non-printable/control characters left over from the PDF extraction process. These characters, we learn, are a common artifact when dealing with Mathematics or Science PDFs.

Surprisingly, there appears to be little in the way of open source material which addresses these issues. And yet having clean text, with symbols relating faithfully to the Mathematics, made all the difference between poor and well functioning training data. As a small example, so the reader can better understand, take the statement:

    ℰ = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

Here we see the special symbol ℰ, meaning Universal Set. This is what students see in the PDF version of the exam, but our text version, when parsed, came out as:

E = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

We use regex to ensure the symbol is faithfully recreated in the text:

    text = re.sub(r"\bE\b", "ℰ", text)

For the runs of punctuation marks ("….."), we define a rule using regex to eliminate any sequence of four or more dots:

    text = re.sub(r"\.{4,}", "", text)
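Putting the two substitutions together on a made-up line shows the effect (a toy example, not taken from a real paper):

```python
import re

line = "List the members of set E ........ (Total for Question 1 is 2 marks)"
line = re.sub(r"\.{4,}", "", line)   # strip the run of answer dots
line = re.sub(r"\bE\b", "ℰ", line)   # restore the universal-set symbol
print(line)
```

Note that in the full parser, non-ASCII characters are stripped before the ℰ substitution runs, which is why the symbol is restored from the plain E rather than preserved from the PDF.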

After around 1–2 hours of work, the parser was optimised and ready to perform its function. The full code for parsing GCSE Edexcel Maths papers is below:

import fitz  # PyMuPDF
import re
import unicodedata
import os

# BEST PARSER TO DATE 30/7/2025
folder_path = r"gcse_papers\higher_papers"
output_file = "all_questions_master.txt"

question_end_pattern = re.compile(r"\(Total for Question\s+\d+\s+is\s+\d+\s+mark[s]?\)")

instruction_phrases = [
    "answer all questions",
    "write your answers in the spaces",
    "you must write down all the stages",
    "turn over",
    "blank page",
    "page of",
    "do not write in this area"
]

def clean_line(line):
    # Normalise or remove the various Unicode spaces and zero-width characters
    line = line.replace('\u2002', ' ').replace('\u2003', ' ').replace('\u00A0', ' ')
    line = line.replace('\u200B', ' ').replace('\u200A', '').replace('\u200C', '')
    line = line.replace('\u200D', '').replace('\u2060', '').replace('\uFEFF', '')
    line = line.replace('\u2009', '')
    line = line.replace('\u0008', '')
    line = re.sub(r'[\u2060\uFEFF]', '', line)
    # Drop combining marks left over from PDF extraction
    line = ''.join(c for c in line if not unicodedata.category(c).startswith('M'))
    line = re.sub(r'[ \t]+', ' ', line)
    line = re.sub(r'[\x00-\x1F\x7F]', '', line)
    return line.strip()

def fix_math_symbols(text):
    text = text.replace('−', '-').replace('–', '-').replace('—', '-')
    text = text.replace('×', '*').replace('✕', '*')
    text = re.sub(r'[\u200B-\u200D\u2060\uFEFF]', '', text)
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r"\.{4,}", "", text)
    text = re.sub(r"\bE\b", "ℰ", text)
    text = re.sub(r"set\s*E\b", "set ℰ", text)
    text = re.sub(r'\x07', '', text)
    return text.strip()

def fix_currency_symbols(text):
    # e.g. "50p" -> "£0.50"
    text = re.sub(r'\b(\d+)\s?p\b', r'£0.\1', text)
    text = re.sub(r'(price is|cost is|priced at|costs)\s+(\d+(\.\d+)?)\b', r'\1 £\2', text, flags=re.I)
    text = re.sub(r'((normal price|price of|cost of|buy for|priced at|costs)\s+)(\d+(\.\d+)?)(?!\s?p\b|\s*£)', r'\1£\3', text, flags=re.I)
    return text

def check_question_order(questions):
    prev_num = None
    errors = []
    for i, q in enumerate(questions):
        match = re.match(r'\s*(\d+)', q)
        if not match:
            errors.append((i, q, "No leading number"))
            continue
        num = int(match.group(1))
        if prev_num is None:
            prev_num = num
            continue
        if num != prev_num + 1:
            msg = f"Expected {prev_num + 1}, found {num} at index {i}:"
            errors.append((i, q, msg))
        prev_num = num
    if not errors:
        print("✅ All question numbers appear in order!")
    else:
        print(f"⚠️ Found {len(errors)} ordering issue(s):\n")
        for idx, q, msg in errors:
            print(f"{msg}\n  '{q[:80]}...'")

def renumber_questions(questions):
    new_questions = []
    for q in questions:
        match = re.search(r'\(Total for Question (\d+) is', q)
        if match:
            q_num = match.group(1)
            cleaned = re.sub(r'^\s*-?\d+\s*([).]|\])?\s*', '', q)
            fixed = f"{q_num} {cleaned}"
            new_questions.append(fixed)
        else:
            new_questions.append(q)
    return new_questions

all_questions = []

for filename in os.listdir(folder_path):
    if not filename.lower().endswith(".pdf"):
        continue
    pdf_path = os.path.join(folder_path, filename)
    doc = fitz.open(pdf_path)
    question_texts = []
    current_question = ""
    for i, page in enumerate(doc):
        if i == 0:
            continue  # skip the cover page
        text = page.get_text()
        lines = text.split("\n")
        for line in lines:
            stripped = clean_line(line)
            if (
                (stripped.startswith("*P") and stripped.endswith("*")) or
                re.match(r"^\d{1,3}$", stripped) or
                all(c in "." for c in stripped) or  # lines made up only of answer dots
                len(stripped) < 2 or
                any(phrase in stripped.lower() for phrase in instruction_phrases)
            ):
                continue
            current_question += stripped + " "
            if question_end_pattern.search(stripped):
                cleaned = current_question.strip()
                cleaned = re.sub(r"[.,;:\s]+(?=\(Total for Question)", "", cleaned)
                if not re.search(r"\.\s*\(Total for Question", cleaned):
                    cleaned = re.sub(r"\s*(?=\(Total for Question)", ". ", cleaned)
                cleaned = fix_math_symbols(cleaned)
                cleaned = fix_currency_symbols(cleaned)
                question_texts.append(cleaned)
                current_question = ""
    renumbered_questions = renumber_questions(question_texts)
    for q in renumbered_questions:
        all_questions.append(f'"{q}"')  # each question as a quoted string

with open(output_file, "w", encoding="utf-8") as f:
    f.write("[\n")
    for i, q in enumerate(all_questions):
        comma = "," if i < len(all_questions) - 1 else ""
        f.write(f" {q}{comma}\n")
    f.write("]\n")

print(f"✅ Extracted {len(all_questions)} questions from all PDFs. Saved to {output_file} :)")
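Because each question is written as a double-quoted string between square brackets, the output file happens to be valid Python literal syntax, so it can be read back with the standard library. A sketch, assuming the format above and that no question text itself contains a double quote:

```python
import ast

# Stand-in for the contents of all_questions_master.txt (illustrative only).
sample_output = (
    '[\n'
    ' "1 Work out 5 + 3 (Total for Question 1 is 1 mark)",\n'
    ' "2 Solve x + 2 = 7 (Total for Question 2 is 1 mark)"\n'
    ']\n'
)
questions = ast.literal_eval(sample_output)
print(len(questions))  # 2
```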

With our GCSE questions now numbering over 2000, we decide to turn to ChatGPT to classify them into their respective categories. As in Part 1, we used the following prompt:

Here is a text file with a series of questions taken from GCSE Foundation Mathematics past papers. For each question, give it a single categorisation of either:

1. Number
2. Algebra
3. Ratio, proportion, and rates of change
4. Geometry and measures
5. Statistics & Probability

Cross reference the following documents in order to assist with classification.
https://assets.publishing.service.gov.uk/media/5a7cb5b040f0b6629523b52c/GCSE_mathematics_subject_content_and_assessment_objectives.pdf https://qualifications.pearson.com/content/dam/pdf/GCSE/mathematics/2015/specification-and-sample-assesment/gcse-maths-2015-specification.pdf

Make sure there is exactly one category for each entry and no more. For example if there are 232 Questions, there should be 232 Categories responding directly to the Questions. Please place each assigned category in double quotation marks, separated by commas, with each entry appearing on a new line.

ChatGPT returned a text file of 2117 listed categories corresponding to each of our questions:

"Algebra",
"Number",
"Statistics",
"Geometry and measures",
"Ratio, proportion, and rates of change",
    ...
    ...
    ...
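Since the prompt insists on a strict one-to-one correspondence, it is worth checking that the counts line up before training. A trivial sketch with stand-in lists (not the real data):

```python
questions = ["q1", "q2", "q3"]  # stand-ins for the 2117 parsed questions
labels = ["Number", "Algebra", "Statistics & Probability"]  # stand-ins for the returned categories

# Refuse to train on misaligned data: every question needs exactly one label.
if len(labels) != len(questions):
    raise ValueError(f"{len(labels)} labels for {len(questions)} questions")
print("counts match:", len(labels))
```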

We train our model on the new data set and get the following result:

[Figure: classification report (precision, recall, F1 per category) for the retrained model]

Okay, so what do the results mean?

• Precision: When the model predicts a topic, how often is it right?
• Recall: Of all actual questions from a topic, how many did the model find?
• F1-score: Combines both precision and recall into a single score (higher is better).
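As a quick illustration of how these metrics are computed (a toy example with made-up labels, not the project's actual data):

```python
def precision_recall_f1(y_true, y_pred, label):
    # True positives, false positives and false negatives for one class
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = ["Number", "Algebra", "Number", "Probability"]
y_pred = ["Number", "Algebra", "Algebra", "Probability"]
print(precision_recall_f1(y_true, y_pred, "Number"))  # precision 1.0, recall 0.5
```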

We see an improvement on last week's model, moving from 85% accuracy to 87%. In areas such as Probability the model is particularly strong: 100% at spotting probability questions in the test! In some areas like Number, 77% of questions were correctly classified, a good success rate but one that also leaves some room for improvement.

This week's technical challenges were mostly in the parsing of PDFs to text. What did we learn?

1. When using small datasets, removing noise from, say, 5 papers is trivial and can be done manually. But when scaling a project, all these little artifacts build up and can crush resources in the short to medium term. This is why an understanding of regex is important; learn it as a human developer! Don't rely on AI to hand you the answer or a solution on a plate.
2. Considering the low amount of open source material relating to the parsing of Mathematics/Science PDFs to text, we start to see the complexity of text extraction and the particular challenges it presents. On my model, the Python script responded with around 98% accuracy. Occasionally question order would be corrupted; I had to write a debugging script to tell me when this happened and then adjust the ordering as needed. But strangely, although the script worked fine on GCSE Foundation/Higher past papers, when applied to the Specimen Sample papers the output was junk! Completely unfit for purpose. So maybe this is the reason why only a limited amount of open source material exists in this area: whilst a parser can be fine tuned to work on specific material such as ours, when applied to the general case it is much less likely to succeed. This is interesting and something we must return to.

What went well? Any surprises?

The model improving from 85% to 87% was very welcome. We now see the potential of automation as a tool for data analytics and sampling, and a powerful asset and ally in scaling a project. I realised today that the accuracy of ChatGPT's responses in categorising the questions was mirrored in the performance of the machine learning model. There appears to be consistency and logic in its classification. Reliably predicting 100% of Probability questions strongly demonstrates the ability to classify those questions effectively. The machine learning model adds weight to the notion that LLMs like ChatGPT can be very effective, and this hybrid approach to classification (machine learning + LLM) hopefully demonstrates the potential!

Okay, that's great. What's next?

So as of today we have 2117 questions categorised by topic area (Number, Algebra…), by difficulty (more on this soon…), and by curriculum level (Foundation/Higher), all currently represented in a JSON file. The next steps are:

1. Align individual questions, with their attributes, to their PDF image counterparts.
2. Get experimental! Design a dynamic tracking system which follows a user's ability to answer questions, and adjusts to their ability.

That's all for now, thanks for reading.


