RudeCheck

A Russian-first LLM red-teaming platform and benchmark. Tests whether native-language adversarial prompts expose safety failures that translation-based multilingual testing misses — targeting YandexGPT, GigaChat, and open-source Russian-language models.

Python · React · Next.js · PostgreSQL · LLM Safety · Red-Teaming · NLP

2026 — In Development · Personal

GitHub

This project is in early development. Code, results, and screenshots will be added as the build progresses.

The Problem

LLM red-teaming has matured fastest in English. Multilingual safety evaluation — particularly for non-Western languages — remains significantly less developed, and the gap is not just a translation problem.

Russian-language LLMs (YandexGPT, GigaChat) serve millions of users in production. Existing frameworks handle multilingual testing through translation pipelines, which means culturally-specific adversarial patterns are missed entirely. A prompt routed through a machine translator loses the historical references, literary framing, and political register that make a native-language attack effective.

If native Russian red-teaming can be shown to surface failures that translation-based testing misses, the same framework extends to Arabic and other languages — directly relevant to the UAE AI market and beyond.

The Research Question

Do Russian-native adversarial prompts expose safety failures that translation-based red-teaming workflows miss?

This is the core empirical question. The benchmark will run identical attack intents against the same models using (a) prompts written natively in Russian and (b) English prompts for the same intents machine-translated into Russian, then measure whether attack success rates differ, and by how much.
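As a rough illustration of that comparison, the sketch below computes the attack success rate per model and condition, then the native-minus-translated difference. It assumes each model response has already been given a final success/failure verdict; the `Trial` record, its field names, and both helper functions are hypothetical and not part of any existing RudeCheck code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One model response to one prompt (all field names are hypothetical)."""
    intent_id: str          # shared ID linking the native and translated variants
    condition: str          # "native" or "translated"
    model: str              # label for the model under test
    attack_succeeded: bool  # final verdict from the evaluation pipeline


def attack_success_rates(trials):
    """Return {(model, condition): fraction of trials where the attack succeeded}."""
    counts = defaultdict(lambda: [0, 0])  # [successes, total] per (model, condition)
    for t in trials:
        bucket = counts[(t.model, t.condition)]
        bucket[0] += int(t.attack_succeeded)
        bucket[1] += 1
    return {key: successes / total for key, (successes, total) in counts.items()}


def native_minus_translated(rates, model):
    """Success-rate gap; a positive value means native prompts succeed more often."""
    return rates[(model, "native")] - rates[(model, "translated")]
```

Pairing the two variants through a shared `intent_id` would also allow a per-intent comparison later, rather than only aggregate rates.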

The Platform

RudeCheck is designed around four attack categories for the MVP:

Jailbreaks & policy bypass — including culturally-specific Russian personas and historical framings
Prompt injection — malicious instructions hidden inside benign-looking Russian text
Harmful assistance elicitation — requests framed in culturally-specific Russian contexts
Policy evasion — testing whether models apply consistent refusal behaviour across languages
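
One possible way to encode these four categories in the Python backend is an enum plus a small prompt record. This is only a sketch: the identifiers, field names, and string values are assumptions made for illustration, and the prompt text is a placeholder.

```python
from dataclasses import dataclass
from enum import Enum


class AttackCategory(Enum):
    """The four MVP attack categories (string values are hypothetical)."""
    JAILBREAK = "jailbreak_policy_bypass"
    PROMPT_INJECTION = "prompt_injection"
    HARMFUL_ASSISTANCE = "harmful_assistance_elicitation"
    POLICY_EVASION = "policy_evasion"


@dataclass(frozen=True)
class PromptRecord:
    """One entry in the prompt library (field names are hypothetical)."""
    prompt_id: str
    category: AttackCategory
    language: str  # e.g. "ru" or "en"
    origin: str    # "native" or "translated"
    text: str      # placeholder here; real prompts live in the library


def group_by_category(records):
    """Return {category: [records]} with an entry for every category, even if empty."""
    grouped = {category: [] for category in AttackCategory}
    for record in records:
        grouped[record.category].append(record)
    return grouped
```

Keeping an empty list for unused categories makes coverage gaps in the library easy to spot.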

The evaluation pipeline is hybrid: an LLM-as-judge automated scorer handles volume, while human reviewers validate edge cases and contribute culturally specific attacks that require native intuition. Neither alone is sufficient.
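
A minimal sketch of how that split could work, assuming the judge returns a confidence score alongside its verdict: confident verdicts are accepted automatically and the rest are queued for human review. The `JudgeVerdict` fields, the `route` function, and the 0.8 threshold are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeVerdict:
    """Output of the automated judge for one response (hypothetical shape)."""
    response_id: str
    unsafe: bool       # judge's verdict: did the attack succeed?
    confidence: float  # 0.0 to 1.0, assumed to be reported by the judge


def route(verdicts, min_confidence=0.8):
    """Split verdicts into (auto_accepted, needs_human_review).

    The threshold is an arbitrary placeholder; in practice it would be
    tuned against a human-annotated validation subset.
    """
    auto_accepted, needs_review = [], []
    for verdict in verdicts:
        if verdict.confidence >= min_confidence:
            auto_accepted.append(verdict)
        else:
            needs_review.append(verdict)
    return auto_accepted, needs_review
```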

Output is a governance-ready HTML/PDF report with severity ratings, reviewer notes, and remediation tracking — the kind of artefact a security team can actually hand to a compliance officer.

What Makes It Different

The existing tools — NVIDIA garak, Microsoft PyRIT, Promptfoo — are all English-first systems with translation-based multilingual support added after the fact. That is precisely the limitation this project targets.

The cultural and linguistic attack vectors that matter most for Russian-language models cannot be constructed by a translation pipeline. You need native speakers who can write adversarial prompts using Soviet literary references, idiomatic register shifts, or political framing that would read as completely innocuous to a non-native evaluator.

That's the advantage — and also why this is a personally motivated project rather than something that could be outsourced.

Stack

Frontend: React + Next.js
Backend: Python
Database: PostgreSQL
Model APIs: YandexGPT, GigaChat, GPT-4o (control), open-source Russian Llama fine-tune
Evaluation: LLM-as-judge + rule-based + human annotation pipeline
Reporting: HTML/PDF export

Roadmap

Phase 1 — Build & pilot: Define attack taxonomy, build native Russian/English prompt library, set up backend and model connectors, run a 100-prompt pilot benchmark comparing native vs translated prompts on YandexGPT vs GPT-4o.

Phase 2 — Scale & ship: Scale to 500+ prompts across all four attack categories, implement the LLM-as-judge evaluator with a human-annotated validation subset, run the full comparative experiment across all three target models, build the governance reporting module, and make a responsible disclosure to Yandex and Sberbank before publication.
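
One way the human-annotated validation subset could be used is to check how closely the judge agrees with human reviewers. The sketch below computes Cohen's kappa by hand over two aligned label lists; choosing kappa as the agreement metric is an assumption here, not something the roadmap specifies, and the function name is hypothetical.

```python
def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between two aligned lists of labels."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")

    n = len(judge_labels)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Expected agreement if the raters labelled independently at their own base rates.
    labels = set(judge_labels) | set(human_labels)
    expected = sum(
        (judge_labels.count(label) / n) * (human_labels.count(label) / n)
        for label in labels
    )

    if expected == 1.0:
        # Both raters used a single identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A low kappa on the validation subset would suggest the judge's verdicts need more human review before the full comparative run.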
