In the first part of this project we explored using machine learning to classify Maths GCSE questions by category:
Number
Algebra
Ratio, proportion, and rates of change
Geometry and measures
Statistics & Probability
If the reader is interested in how we reached a proof of concept that machine learning is well equipped for classification problems such as the one described, then please refer to the first article here:
In Part 2, we look at scaling the project. Using automation, we increase our data set of 6 past papers to cover the entire availability of Edexcel past papers, from the year 2010 onwards. This grew the data set from around 200 individual questions to 2117. We also sharpened the parsing process used to convert the PDFs to text, allowing us to build our machine learning model more quickly and easily. We see an improvement from 85% to 87% accuracy over the previous model. How did this happen? Let’s get to it!
In Part 1, we used a small sample database, only 6 papers, with some filler questions generated by ChatGPT. Downloading the papers manually using links from the Maths Genie website was fine, as this wasn’t very time consuming. However, scaling the project up meant downloading all the papers available: for the Foundation papers this was 36, reaching back to the year 2017, and for the Higher papers the total was 68, reaching further back to the year 2009. Manually download all these papers?! Sounds like fun, right?
So we look at using automation, with a small lightweight script in Python to do the task for us. After inspecting the HTML code to understand the logic of how the download links are presented, we import the requests library into Python (lets you easily download web pages and files from the internet) and the Beautiful Soup library (helps you extract specific information from messy web pages (HTML)):
import requests
from bs4 import BeautifulSoup
import os
from urllib.parse import urljoin

# CONFIGURE THIS:
base_url = "https://www.mathsgenie.co.uk/papers.php"  # Change to actual base URL
download_dir = "gcse_papers"

# Create folder to save PDFs
os.makedirs(download_dir, exist_ok=True)

# Get the HTML content
response = requests.get(base_url)
soup = BeautifulSoup(response.text, 'html.parser')

# Collect (url, label) pairs for the Foundation papers: .pdf links whose
# filename contains an "f" (this step was lost in extraction and is
# reconstructed from the logic described below)
pdf_links = []
for a in soup.find_all('a', href=True):
    href = a['href']
    stem = os.path.splitext(os.path.basename(href))[0].lower()
    if href.lower().endswith('.pdf') and 'f' in stem:
        pdf_links.append((urljoin(base_url, href), a.get_text(strip=True)))

print(f"Found {len(pdf_links)} foundation paper(s). Downloading...")

# Download the PDFs
for url, label in pdf_links:
    filename = os.path.join(download_dir, os.path.basename(url))
    if os.path.exists(filename):
        print(f"Already exists: {filename}")
        continue
    print(f"Downloading {label} -> {filename}")
    try:
        file_response = requests.get(url)
        with open(filename, 'wb') as f:
            f.write(file_response.content)
    except Exception as e:
        print(f"Failed to download {url}: {e}")

print("Download complete.")
The logic of the code is simple: if a filename with the extension .pdf is found, and the filename contains an “f”, then download it! This is because all the Foundation papers were in a format such as 1f2022, where the first character denotes the paper (1, 2 or 3), the second character denotes the tier (Foundation or Higher) and the remaining characters give the year the paper was published. We download all the Foundation papers available, place them in a folder and repeat the process for the Higher papers.
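As a small sketch of that filename logic (the helper below is illustrative, not part of the original script), the paper code can be unpacked with a single regex:

```python
import re

def parse_paper_code(filename):
    # Hypothetical helper: unpack a paper code such as "1f2022.pdf" into
    # (paper number, tier, year), following the format described above.
    match = re.match(r"(?i)^([123])([fh])(\d{4})", filename)
    if not match:
        return None
    paper, tier, year = match.groups()
    tier_name = "Foundation" if tier.lower() == "f" else "Higher"
    return int(paper), tier_name, int(year)

print(parse_paper_code("1f2022.pdf"))  # (1, 'Foundation', 2022)
print(parse_paper_code("3h2019.pdf"))  # (3, 'Higher', 2019)
```

Anything that does not match the pattern (a specimen paper, say) simply returns None, which is a handy way to filter out files that don’t follow the naming scheme.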
With our Foundation and Higher papers organised in their respective folders, we turn to parsing the questions from PDF to text, to prepare them into a format which our machine learning model can understand and work with. Using the code from Part 1 as a starting point, we begin by manually extracting the questions from a small number of papers to see if we’re getting the results we expect.
Unfortunately, there appears to be a lot of noise in the text, including elements such as runs of leader dots (“…….”) which appear at the end of questions and sometimes between questions, as well as non-printable/control characters left over from the PDF extraction process. These characters, we learn, are a common artifact when dealing with Mathematics or Science PDFs.
Surprisingly, there appears to be little in the way of open source material which addresses these issues. And yet having clean text, with symbols relating faithfully to the Mathematics, makes all the difference between poor and well functioning training data. As a small example, so the reader can better understand, take the statement:
ℰ = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
Here we see the special symbol ℰ, meaning Universal Set. This is what students see in the PDF version of the exam, but our text version, when parsed, came out as:
E = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
We use regex to ensure the symbol is faithfully recreated in the text:
text = re.sub(r"\bE\b", "ℰ", text)
For the leader dots (“…..”), we define a rule using regex to eliminate any run of four or more dots:
text = re.sub(r"\.{4,}", "", text)
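Putting the two fixes together on a single noisy line (the sample text below is invented purely for illustration):

```python
import re

# Invented sample of one noisy extracted line (for illustration only)
raw = "List the members of E that are even ............ (Total for Question 3 is 2 marks)"

# Rule 1: strip runs of four or more dots left over from answer lines
text = re.sub(r"\.{4,}", "", raw)
# Rule 2: restore the universal-set symbol for a standalone capital E
text = re.sub(r"\bE\b", "ℰ", text)

print(text)
```

The dots vanish and the standalone E becomes ℰ; any stray double spaces left behind are collapsed later by the parser’s whitespace rule.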
After around 1–2 hours’ work, the parser was optimised and ready to perform its function. The full code for parsing GCSE Edexcel Maths papers is below:
import fitz  # PyMuPDF
import re
import unicodedata
import os

# BEST PARSER TO DATE 30/7/2025
folder_path = r"gcse_papers\higher_papers"
output_file = "all_questions_master.txt"

question_end_pattern = re.compile(r"\(Total for Question\s+\d+\s+is\s+\d+\s+mark[s]?\)")

instruction_phrases = [
    "answer all questions",
    "write your answers in the spaces",
    "you must write down all the stages",
    "turn over",
    "blank page",
    "page of",
    "do not write in this area"
]

def clean_line(line):
    # Normalise exotic spaces and strip zero-width/control characters
    line = line.replace('\u2002', ' ').replace('\u2003', ' ').replace('\u00A0', ' ')
    line = line.replace('\u200B', ' ').replace('\u200A', '').replace('\u200C', '')
    line = line.replace('\u200D', '').replace('\u2060', '').replace('\uFEFF', '')
    line = line.replace('\u2009', '')
    line = line.replace('\u0008', '')
    line = re.sub(r'[\u2060\uFEFF]', '', line)
    # Drop combining marks left over from PDF extraction
    line = ''.join(c for c in line if not unicodedata.category(c).startswith('M'))
    line = re.sub(r'[ \t]+', ' ', line)
    line = re.sub(r'[\x00-\x1F\x7F]', '', line)
    return line.strip()

def fix_math_symbols(text):
    text = text.replace('−', '-').replace('–', '-').replace('—', '-')
    text = text.replace('×', '*').replace('✕', '*')
    text = re.sub(r'[\u200B-\u200D\u2060\uFEFF]', '', text)
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r"\.{4,}", "", text)   # remove leader dots
    text = re.sub(r"\bE\b", "ℰ", text)   # restore the universal-set symbol
    text = re.sub(r"setE\b", "set ℰ", text)
    text = re.sub(r'\x07', '', text)
    return text.strip()

def fix_currency_symbols(text):
    # e.g. "45p" -> "£0.45"
    text = re.sub(r'\b(\d+)\s?p\b', r'£0.\1', text)
    text = re.sub(r'(price is|cost is|priced at|costs)\s+(\d+(\.\d+)?)\b', r'\1 £\2', text, flags=re.I)
    text = re.sub(r'((normal price|price of|cost of|buy for|priced at|costs)\s+)(\d+(\.\d+)?)(?!\s?p\b|\s*£)', r'\1£\3', text, flags=re.I)
    return text

def check_question_order(questions):
    prev_num = None
    errors = []
    for i, q in enumerate(questions):
        match = re.match(r'\s*(\d+)', q)
        if not match:
            errors.append((i, q, "No leading number"))
            continue
        num = int(match.group(1))
        if prev_num is None:
            prev_num = num
            continue
        if num != prev_num + 1:
            msg = f"Expected {prev_num + 1}, found {num} at index {i}:"
            errors.append((i, q, msg))
        prev_num = num
    if not errors:
        print("✅ All question numbers appear in order!")
    else:
        print(f"⚠️ Found {len(errors)} ordering issue(s):\n")
        for idx, q, msg in errors:
            print(f"{msg}\n  '{q[:80]}...'")

def renumber_questions(questions):
    new_questions = []
    for q in questions:
        match = re.search(r'\(Total for Question (\d+) is', q)
        if match:
            q_num = match.group(1)
            cleaned = re.sub(r'^\s*-?\d+\s*([).]|\])?\s*', '', q)
            fixed = f"{q_num} {cleaned}"
            new_questions.append(fixed)
        else:
            new_questions.append(q)
    return new_questions

all_questions = []
for filename in os.listdir(folder_path):
    if not filename.lower().endswith(".pdf"):
        continue
    pdf_path = os.path.join(folder_path, filename)
    doc = fitz.open(pdf_path)
    question_texts = []
    current_question = ""
    for i, page in enumerate(doc):
        if i == 0:
            continue  # skip the front cover
        text = page.get_text()
        lines = text.split("\n")
        for line in lines:
            stripped = clean_line(line)
            if (
                stripped.startswith("*P") and stripped.endswith("*") or
                re.match(r"^\d{1,3}$", stripped) or
                all(c in "." for c in stripped) or  # lines that are only leader dots
                len(stripped) < 2 or
                any(phrase in stripped.lower() for phrase in instruction_phrases)
            ):
                continue
            current_question += stripped + " "
            if question_end_pattern.search(stripped):
                cleaned = current_question.strip()
                cleaned = re.sub(r"[.,;:\s]+(?=\(Total for Question)", "", cleaned)
                if not re.search(r"\.\s*\(Total for Question", cleaned):
                    cleaned = re.sub(r"\s*(?=\(Total for Question)", ". ", cleaned)
                cleaned = fix_math_symbols(cleaned)
                cleaned = fix_currency_symbols(cleaned)
                question_texts.append(cleaned)
                current_question = ""
    renumbered_questions = renumber_questions(question_texts)
    for q in renumbered_questions:
        all_questions.append(f'"{q}"')  # each question as a quoted string

with open(output_file, "w", encoding="utf-8") as f:
    f.write("[\n")
    for i, q in enumerate(all_questions):
        comma = "," if i < len(all_questions) - 1 else ""
        f.write(f"  {q}{comma}\n")
    f.write("]\n")

print(f"✅ Extracted {len(all_questions)} questions from all PDFs. Saved to {output_file} :)")
With our GCSE questions now numbering over 2000, we decide to turn to ChatGPT to classify them into their respective categories. As in Part 1, we used the following prompt:
Here is a text file with a series of questions taken from GCSE Foundation Mathematics past papers. For each question, give it a single categorisation of either: 1. Number
2. Algebra
3. Ratio, proportion, and rates of change
4. Geometry and measures
5. Statistics & Probability
Cross reference the following documents in order to assist with classification.
https://assets.publishing.service.gov.uk/media/5a7cb5b040f0b6629523b52c/GCSE_mathematics_subject_content_and_assessment_objectives.pdf https://qualifications.pearson.com/content/dam/pdf/GCSE/mathematics/2015/specification-and-sample-assesment/gcse-maths-2015-specification.pdf
Make sure there is exactly one category for each entry and no more. For example if there are 232 Questions, there should be 232 Categories corresponding directly to the Questions. Please place each assigned category in double quotation marks, separated by commas, with each entry appearing on a new line.
ChatGPT returned a text file of 2117 listed categories corresponding to each of our questions:
"Algebra",
"Number",
"Statistics",
"Geometry and measures",
"Ratio, proportion, and rates of change",
...
...
...
We train our model on the new data set and get the following result:
Okay, so what do the results mean?
- Precision: When the model predicts a topic, how often is it right?
- Recall: Of all the actual questions from a topic, how many did the model find?
- F1-score: Combines both precision and recall into a single score (higher is better).
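As a toy illustration of these metrics (the labels below are invented, not our real data), they can be computed by hand for a single class:

```python
# Toy example: true vs predicted topics for six questions (invented data)
y_true = ["Probability", "Number", "Probability", "Algebra", "Number", "Probability"]
y_pred = ["Probability", "Probability", "Probability", "Algebra", "Number", "Probability"]

def scores(label, y_true, y_pred):
    # Count true positives, false positives and false negatives for one label
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f1 = scores("Probability", y_true, y_pred)
print(f"Probability: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# prints: Probability: precision=0.75 recall=1.00 f1=0.86
```

Here the model found every Probability question (perfect recall) but also mislabelled one Number question as Probability, which is what drags precision down to 0.75.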
We see an improvement on last week’s model, moving from 85% to 87% accuracy. In areas such as Probability the model is particularly strong: 100% at spotting probability questions in the test! In some areas like Number, 77% of questions were correctly classified, a good success rate but one which also gives us some room for improvement.
This week’s technical challenges were mostly in the parsing process of PDFs to text. What did we learn?
- When using small datasets, removing noise from, say, 5 papers is trivial and can be done manually. But when scaling a project, all these little artifacts build up and can crush resources in the short to medium term. This is why an understanding of regex is important: learn it as a human developer! Don’t rely on AI to hand you the answer or a solution on a plate.
- Considering the low amount of open source material relating to parsing Mathematics/Science PDFs to text, we start to see the complexity of text extraction and the particular challenges it presents. On my material, the Python script responded with around 98% accuracy. Occasionally question order would be corrupted; I had to write a debugging script to tell me if this happened and then adjust the ordering as needed. But strangely, although the script worked fine on the GCSE Foundation/Higher past papers, when applied to the Specimen Sample papers the output was junk! Completely unfit for purpose. So maybe this is the reason why only a limited amount of open source material exists in this area. Whilst a parser can be fine tuned to work on specific material such as ours, when applied to the general case it is much less likely to succeed. This is interesting and something we must return to.
What went well? Any surprises?
The model improving from 85% to 87% was extremely welcome. We now see the potential of automation as a tool for data analytics and sampling, and as a powerful asset/ally in scaling a project. I realised today that the accuracy of ChatGPT’s responses in categorising the questions was mirrored in the performance of the machine learning model. There appears to be consistency and logic in its classification. Reliably predicting 100% of Probability questions strongly demonstrates its ability to classify these questions effectively. The machine learning model adds weight to the notion that LLMs like ChatGPT can be very effective, and this hybrid approach to classification (machine learning + LLM) hopefully demonstrates the potential!
Okay, that’s great. What’s next?
As of today we have 2117 questions categorised by subject area (Number, Algebra, …), by difficulty (more on this soon…), and by curriculum level (Foundation/Higher), all currently represented in a JSON file. The next steps are:
- Align individual questions, with their attributes, to their PDF image counterparts.
- Get experimental! Design a dynamic tracking system which follows a user’s ability to answer questions and adjusts to their ability.
That’s all for now, thanks for reading.