I’m building a modular content generation platform whose inputs and outputs can be used by children. That means safety around this content isn’t optional; it’s a requirement.
I’ll have more to share around the full platform - ReplicantCore - very soon, but today I want to focus on the content moderation module I’m building out, which I’ve called ReplicantGuard.
During some initial research around content moderation, I found some really great products and libraries that tackle this problem, but they didn’t do what I wanted them to. Most can tell you whether something is safe, but not why.
I couldn’t find one that would read and understand a battle scene and tell you why it would be considered appropriate for a twelve-year-old but isn’t appropriate for a six-year-old, as an example.
Most content safety APIs return a binary answer: safe or unsafe. Some add a confidence score. A few bucket content into broad categories like “violence” or “adult content.” But they all share the same limitation - they’re designed for platforms where the question is simply “should this be allowed?”
The question I’m trying to ask is “should this be allowed for the user/consumer, at this age?”
A passage in a book describing a battle where soldiers fall in war might be entirely appropriate for a fourteen-year-old reading historical fiction. The same passage, with the same words, is not appropriate for a seven-year-old’s bedtime story.
Age-band context is everything.
I also needed something we could actually reason about. When a piece of content gets flagged, we want to know exactly which words or themes triggered it, what score they produced, and which rule rejected it - and maybe even use that as learning material to improve the system over time.
ReplicantGuard tries to solve both of these problems, or at least take a few steps in the right direction.
I’m aware the image I used for this post is very much AI-generated slop, but I’m going to stand by it for now 😂
ReplicantGuard is a content safety scoring service that is capable of evaluating text, images, and audio against configurable age-band profiles. It returns a structured pass/fail verdict with full per-layer explainability (is that even a word? Let’s say yes) - every score, every threshold, every trigger phrase is included in the response.
At the moment, it understands six content categories:

- Violence
- Fear
- Profanity
- SexualContent
- Religion
- ComplexThemes
I took inspiration for the above from the PEGI age classification system used for video games here in the UK and across Europe.
These categories get evaluated against five age-band profiles:

- Reception (0-5)
- Infant (6-8)
- Primary (9-11)
- Secondary (12-15)
- Academic (16+)
I took inspiration for the above from the UK school system. The naming feels a bit odd, so I’m open to changing it, but it will do for now.
The module is extensible, and more categories and age bands can be added.
Each profile has independently tuned thresholds per category. A violence score of 0.25 might be perfectly acceptable for a Secondary profile and firmly rejected for an Infant profile.
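To make that concrete, here’s a hypothetical sketch of how per-profile thresholds work. The profile and category names match the ones above, but the numbers, field layout, and function name are invented for illustration - the real tuning lives in ReplicantGuard’s data files.

```python
# Hypothetical per-profile, per-category thresholds -- the values here are
# invented for illustration, not ReplicantGuard's real tuning.
THRESHOLDS = {
    "Infant":    {"Violence": 0.10, "Fear": 0.15},
    "Secondary": {"Violence": 0.40, "Fear": 0.45},
}

def passes(profile: str, category: str, score: float) -> bool:
    # A score passes if it does not exceed the profile's threshold.
    return score <= THRESHOLDS[profile][category]

print(passes("Secondary", "Violence", 0.25))  # a 0.25 violence score is fine here
print(passes("Infant", "Violence", 0.25))     # ...but firmly rejected for Infant
```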
The real design challenge with content safety is that harmful content isn’t always explicit. A story can describe something frightening without using a single “trigger word.” It can use complex, clinical language to discuss difficult themes in a way that flies under a keyword filter entirely.
ReplicantGuard addresses this with a pipeline of four independent layers, each catching something different.
The first layer is fast and deterministic. It pattern-matches against curated, weighted word and phrase lists for each category. Single words are matched with word-boundary rules to avoid false positives (so “class” doesn’t fire just because it appears inside “classified”). Multi-word phrases use substring matching.
Each term carries an individual weight - for example, “stabbed” carries more weight than “fight”, “decapitated” more than “injured”.
Scores accumulate across all hits and are capped at 1.0, so a single extremely heavy term saturates the score rather than creating a runaway number.
The evidence list in the response tells you exactly which terms fired.
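To illustrate the mechanics, here is a minimal sketch of this layer. This is not ReplicantGuard’s actual code - the term lists and weights are invented - but it shows the word-boundary matching, phrase substring matching, and capped accumulation described above.

```python
import re

# Invented term weights for illustration; the real curated lists live in JSON files.
VIOLENCE_TERMS = {"blood": 0.25, "stabbed": 0.6, "fight": 0.15}
VIOLENCE_PHRASES = {"armies clashed": 0.2}

def scan_violence(text):
    text = text.lower()
    score, evidence = 0.0, []
    # Single words use word-boundary matching, so "fight" won't fire inside "fighting".
    for term, weight in VIOLENCE_TERMS.items():
        if re.search(rf"\b{re.escape(term)}\b", text):
            score += weight
            evidence.append(term)
    # Multi-word phrases use plain substring matching.
    for phrase, weight in VIOLENCE_PHRASES.items():
        if phrase in text:
            score += weight
            evidence.append(phrase)
    # Cap at 1.0 so one extremely heavy term saturates the score
    # rather than creating a runaway number.
    return min(score, 1.0), evidence

print(scan_violence("Blood stained the ground as the armies clashed."))
```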
Let’s look at an example using a Jupyter notebook:

```python
safe_text = """
Lily and her rabbit Biscuit hopped through the sunny meadow.
They found a patch of wild strawberries and ate until they were full.
On the way home, they stopped to watch the butterflies dance.
"""
results = scanner.scan(safe_text)
for cat, match in results.items():
    print(f'{cat:20s} score={match.score:.4f} evidence={match.evidence}')
```

And this produces the result:
```
ComplexThemes        score=0.0000 evidence=[]
Fear                 score=0.0000 evidence=[]
Profanity            score=0.0000 evidence=[]
Religion             score=0.0000 evidence=[]
SexualContent        score=0.0000 evidence=[]
Violence             score=0.0000 evidence=[]
```

Which, as expected, hasn’t flagged anything.
Let’s try a fantasy battle scene:

```python
moderate_text = """
The knight raised his sword as the enemy charged across the battlefield.
Blood stained the muddy ground as the two armies clashed.
Many were wounded but the castle held until dawn.
"""
results = scanner.scan(moderate_text)
for cat, match in results.items():
    print(f'{cat:20s} score={match.score:.4f} evidence={match.evidence}')
```

And this produces the result:
```
ComplexThemes        score=0.0000 evidence=[]
Fear                 score=0.0000 evidence=[]
Profanity            score=0.0000 evidence=[]
Religion             score=0.0000 evidence=[]
SexualContent        score=0.0000 evidence=[]
Violence             score=0.2500 evidence=['blood']
```

Which has correctly identified potential violence because of the word “blood”.
This initial layer is pure Python with zero external dependencies - it loads from JSON data files and runs in microseconds.
It’s possible for someone to write about a potentially inappropriate topic while avoiding every explicit trigger word. The contextual classifier tries to catch this using TF-IDF cosine similarity against reference example sentences.

Don’t worry, I didn’t know what that was either until starting this project. In a nutshell, TF-IDF assigns each word in a document a weight based on how often it appears there and how rare it is across documents, and cosine similarity then measures how closely two documents’ weight vectors line up.
When this is fired up, each category builds a centroid vector from a curated set of example sentences - sentences that clearly belong to that category, written across a range of styles and phrasings. When new content arrives, it’s vectorised the same way and its cosine distance to each centroid is measured.
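Here’s a dependency-free sketch of that idea, with a toy two-sentence reference corpus standing in for the real curated one. The tokenisation and IDF smoothing here are simplified assumptions for illustration, not ReplicantGuard’s actual implementation.

```python
import math
from collections import Counter

def tokenize(text):
    return [w.strip(".,!?").lower() for w in text.split()]

def vectorize(tokens, idf):
    # TF-IDF weight: term frequency in this document times the corpus IDF.
    tf = Counter(tokens)
    return {t: (tf[t] / len(tokens)) * idf.get(t, 0.0) for t in tf}

def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy reference sentences for one category (the real corpora are curated).
fear_examples = [
    "She was too frightened to move in the dark room.",
    "A cold dread crept over him as the door creaked open.",
]
docs = [tokenize(s) for s in fear_examples]
n = len(docs)
df = Counter(t for d in docs for t in set(d))
idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # smoothed IDF

# Centroid: the average of the example vectors.
centroid = Counter()
for d in docs:
    for t, v in vectorize(d, idf).items():
        centroid[t] += v / n

# New content is vectorised the same way; cosine similarity is the score.
scary = "She did not dare move in the dark."
calm = "The rabbit ate strawberries in the sunny meadow."
print(cosine(vectorize(tokenize(scary), idf), centroid))
print(cosine(vectorize(tokenize(calm), idf), centroid))
```

Note the determinism: the same text against the same corpus always produces the same score.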
For example, a passage describing fear through implication like “she didn’t dare look at what was in the corner, and she didn’t dare move” can still register a Fear score even if it contains no words from the explicit trigger list.
Critically, the algorithm is entirely deterministic. The same input always produces the same output. There are no neural network weights, no inference-time variability, no black box. The reference corpora are the proprietary IP - the algorithm is fully inspectable.
Let’s take a look at a notebook example:

```python
samples = {
    'Bedtime story': 'Once upon a time a little bear found a honey pot in the forest.',
    'War scene': 'The soldiers advanced under fire, many falling as they crossed the open ground.',
    'Horror': 'The creature emerged from the shadows, its hollow eyes fixed on the terrified child.',
    'Religious fear': 'God will cast the wicked into the eternal flames. There is no mercy for sinners.',
    'Self-harm': 'She looked at the marks on her arms and felt nothing. Just a way to feel something.',
    'Profanity': 'He swore loudly and told them all exactly what he thought in no uncertain terms.',
}
cats = list(classifier._centroids.keys())
header = f'{"":18}' + ''.join(f'{c[:10]:>12}' for c in cats)
print(header)
print('-' * len(header))
for label, text in samples.items():
    scores = classifier.classify(text)
    row = f'{label:<18}' + ''.join(f'{scores.get(c, 0):>12.4f}' for c in cats)
    print(row)
```

And this produces the following output:
```
                    ComplexThe        Fear   Profanity    Religion  SexualCont    Violence
------------------------------------------------------------------------------------------
Bedtime story           0.0911      0.1197      0.0871      0.0692      0.1320      0.1168
War scene               0.1016      0.1219      0.1596      0.1427      0.1452      0.1242
Horror                  0.2066      0.2855      0.1834      0.1596      0.2063      0.1730
Religious fear          0.1546      0.0914      0.1251      0.1685      0.1337      0.1714
Self-harm               0.2042      0.1325      0.2009      0.1298      0.1492      0.0979
Profanity               0.0721      0.1160      0.2329      0.1106      0.1341      0.1167
```

A higher score here means the input is more semantically similar to that type of content.
The scale is compressed by design (scores rarely exceed 0.3), so what matters are the relative peaks: Horror spikes on Fear (0.29), Self-harm leads on Complex Themes (0.20), and Profanity tops its own category (0.23). The right things are lighting up in the right places.
Age-appropriate content isn’t only about subject matter. A passage written at a university reading level is inappropriate for a seven-year-old regardless of what it describes.
The Complexity Analyser measures the linguistic complexity of the content using two industry-standard readability indices:

- Flesch-Kincaid Grade Level
- Gunning Fog Index
Each age-band profile specifies a maximum reading age. Content that exceeds it triggers a complexity violation, completely independently of its subject matter.
It’s worth noting here that both indices were developed in the US and map to American grade levels (Grade 5 = roughly age 10–11 in the US system).
The scores are still valid as relative measures of complexity, but the grade numbers don’t map cleanly onto the UK school years my scoring system is based on. For now this doesn’t matter much, since we convert to a reading age rather than a grade level, which is more universally understood, but it’s something I’ve noted to revisit later.
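For reference, both indices reduce to simple formulas over average sentence length and syllable counts. The sketch below uses a naive vowel-group syllable counter (a simplification - production implementations use better rules), clamps negative grades to zero, and derives reading age as US grade + 5; treat it as an illustration rather than the analyser’s actual code.

```python
import re

def count_syllables(word):
    # Naive heuristic: count groups of consecutive vowels. Real syllable
    # counters handle silent 'e', diphthongs, etc. far better.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    asl = len(words) / len(sentences)                          # avg sentence length
    asw = sum(count_syllables(w) for w in words) / len(words)  # avg syllables/word
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    # Flesch-Kincaid Grade Level, clamped so very simple text doesn't go negative.
    fk_grade = max(0.0, 0.39 * asl + 11.8 * asw - 15.59)
    # Gunning Fog index: sentence length plus the share of "complex" words.
    fog = 0.4 * (asl + 100 * complex_words / len(words))
    return fk_grade, fog, fk_grade + 5  # US grade + 5 ~= reading age

grade, fog, age = readability("The dog ran fast. It jumped over the log.")
print(f"FK grade {grade:.1f}, Fog {fog:.1f}, reading age {age:.1f}")
```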
Let’s take a look at another example notebook:

```python
samples = {
    'Reception - (0-5)': """
The dog ran fast. It jumped over the log. Then it sat down.
Mum gave it a bone. The dog was happy.
""",
    'Infant - (6-8)': """
Jack and his sister found an old map in the attic one rainy afternoon.
The map showed a path through the woods to a place marked with an X.
They packed their bags and decided to follow it the very next morning.
""",
    'Primary (9-11)': """
The discovery of the hidden chamber beneath the library changed everything
the children thought they knew about their town's history.
Ancient symbols covered every wall, and the air smelled of dust and something
older — something that had been waiting a very long time to be found.
""",
    'Secondary (12-15)': """
The administration's decision to close the community centre was framed
as a necessary austerity measure, but everyone in the neighbourhood
understood it as something else: the systematic dismantling of the only
place where people still gathered and talked and disagreed in person.
""",
    'Academic (16+)': """
The socioeconomic implications of post-industrial urbanisation are multifaceted,
encompassing demographic transformation, infrastructural deterioration, and the
disproportionate marginalisation of historically disadvantaged communities within
the broader framework of contemporary municipal governance.
""",
}
print(f'{"":22} {"FK Grade":>10} {"Read Age":>10} {"Fog":>8} {"ASL":>8} {"ASW":>8}')
print('-' * 70)
for label, text in samples.items():
    r = analyser.analyse(text)
    print(f'{label:<22} {r.flesch_kincaid_grade:>10.1f} {r.reading_age:>10.1f} {r.gunning_fog:>8.1f} {r.avg_sentence_length:>8.1f} {r.avg_syllables_per_word:>8.2f}')
```

Which produces the following output:
```
                         FK Grade   Read Age      Fog      ASL      ASW
----------------------------------------------------------------------
Reception - (0-5)             0.0        5.0      1.8      4.4     1.09
Infant - (6-8)                4.2        9.2      6.6     14.0     1.21
Primary (9-11)               12.1       17.1     15.3     23.0     1.59
Secondary (12-15)            22.4       27.4     28.2     42.0     1.83
Academic (16+)               33.5       38.5     37.3     30.0     3.17
```

Cool right? Still some work to do, but it’s a good benchmark.
This is where we take all of the above and put it together.
We run content through the engine: it takes the scores from the three layers above, blends the lexical and contextual scores (65% lexical, 35% contextual, because explicit curated terms should carry more weight than similarity inference), and compares the result against the thresholds defined in the requested age-band profile.
Every violation is recorded with the category, the score, the threshold it exceeded, and the layer it came from. If the content fails, the engine also suggests the minimum age profile it would pass by walking upward through the age bands and finding the first that accepts all scores.
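As a sketch of how that decision logic fits together - the profile names match the ones in this post, but the threshold values and function names are invented for illustration:

```python
# Blend weights from the post: explicit curated terms outweigh similarity inference.
LEXICAL_WEIGHT, CONTEXTUAL_WEIGHT = 0.65, 0.35

# Per-profile thresholds, ordered youngest to oldest (values are made up).
PROFILES = {
    "0-5":   {"Violence": 0.05},
    "6-8":   {"Violence": 0.15},
    "9-11":  {"Violence": 0.20},
    "12-15": {"Violence": 0.45},
    "16+":   {"Violence": 0.70},
}

def blended(lexical, contextual):
    return LEXICAL_WEIGHT * lexical + CONTEXTUAL_WEIGHT * contextual

def check(scores, profile):
    """Return (approved, violations) for blended per-category scores."""
    violations = [
        {"category": cat, "score": s, "threshold": PROFILES[profile].get(cat, 1.0)}
        for cat, s in scores.items()
        if s > PROFILES[profile].get(cat, 1.0)
    ]
    return (not violations), violations

def minimum_passing_profile(scores):
    # Walk upward through the age bands; return the first that accepts all scores.
    for profile in PROFILES:  # dicts preserve insertion order
        approved, _ = check(scores, profile)
        if approved:
            return profile
    return None

score = blended(lexical=0.25, contextual=0.35)  # 0.65*0.25 + 0.35*0.35 = 0.285
print(minimum_passing_profile({"Violence": score}))
```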
Let’s take a look at a worked example in a notebook, using another fantasy battle scene:

```python
battle = """
The two armies clashed at the valley entrance as dawn broke over the hills.
Steel rang against steel, and blood darkened the muddy ground between them.
Soldiers fell on both sides, but the defenders held the narrow pass until evening.
When the fighting finally stopped, the silence was heavier than the battle had been.
"""
for profile in ['0-5', '6-8', '9-11', '12-15', '16+']:
    r = pipeline.check(battle, profile)
    status = '✅' if r.approved else '❌'
    viols = [v['category'] for v in r.violations]
    print(f'{status} {profile:<6} violations={viols}')
```

And this produces:
```
❌ 0-5    violations=['Violence', 'Fear', 'Profanity', 'ComplexThemes', 'ReadingAge', 'SentenceLength']
❌ 6-8    violations=['Violence', 'Profanity', 'ComplexThemes', 'ReadingAge', 'SentenceLength']
❌ 9-11   violations=['Violence']
✅ 12-15  violations=[]
✅ 16+    violations=[]
```

Still work to be done, but you can see the idea, and it’s looking pretty good.
ReplicantGuard is currently being integrated as an independent module into the ReplicantCore platform. I also have an app built on top of that platform, which I’ll share more about soon.
Each layer’s scoring system is designed to be extended; adding a new category is a matter of adding a JSON data file. The architecture is deliberately simple so that the people closest to the content - editors, curators, child safety specialists - can adjust it without needing an engineer. I would love to connect with professionals in this space to help build this out.
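I haven’t published the schema, but to give a feel for the shape of such a data file, a category definition might look something like this - the field names and values here are purely illustrative:

```json
{
  "category": "Violence",
  "terms": { "stabbed": 0.6, "fight": 0.15, "blood": 0.25 },
  "phrases": { "armies clashed": 0.2 }
}
```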
There are also some work-in-progress capabilities for evaluating media files such as images and audio without relying on any external ML model; I’ll talk about these in a future post.
As we get closer, I would 100% love to open-source this module.