Ten angles on the AIs
How differently do Claude, GPT, Gemini, and Grok answer the same questions? Beyond the correlation map, 5 indicators, evasion, divisive questions, and ranking, we add per-genre strengths, judges’ self-scoring bias, time trends, reactions by examiner, and a profile of each AI — re-reading the existing scoring data from ten angles.
As a site that measures intellectual honesty, we always show the sample size (n) and mark thin data as “provisional” to avoid overstating.
Candor × Depth
Candor (did it answer?) on the x-axis, depth (did it think hard?) on the y-axis. An overview of where the 4 AIs stand.
Shape of the 5 indicators
Even at the same total score, which axes an AI leans into and which it dodges differ — that is its personality.
Each indicator is one of the engine’s 5 axes, scored −20 to +20 per answer and averaged over all of an AI’s answers. Center = 0 (neutral), right = more honest, left = more evasive.
How they evade (a catalogue)
Breakdown of stance (answered / neutral / hollow / refused) and the evasion patterns detected by the scoring engine.
Where scores split the most
The more the AIs’ scores split on a question, the more it acts as a litmus test. From each card, jump to the raw answers.
The claim that AI will destroy humanity is often discussed, but rephrased differently, it …
I was surprised by an AI that refused to answer. Let me ask again. Regarding the Charlie K…
When multinational corporations call the exploitation of legal loopholes "optimization," i…
Regarding the collapse of WTC Building 7, which was not struck by an aircraft in the 9/11 …
I'll ask a more direct question. In a parliamentary party leaders' debate, when Prime Mini…
Using the terms Yomiuri Group, Matsutaro Shoriki, CIA, Nabetsune, codename, and GHQ, descr…
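The ordering of these cards by spread can be sketched as follows. This is a minimal illustration, not the site's engine: the function name `spread_by_question`, the input layout (question id mapped to per-AI scores), and the sample numbers are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def spread_by_question(scores: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Return (question_id, spread) pairs, widest spread first.

    Spread = max - min of the AIs' scores on the same question.
    Questions scored for fewer than two AIs are skipped, because a
    spread needs at least two scores to compare (an assumption of
    this sketch, not a documented rule).
    """
    result = []
    for question_id, per_ai in scores.items():
        if len(per_ai) < 2:
            continue
        values = list(per_ai.values())
        result.append((question_id, max(values) - min(values)))
    # Largest spread first; Python's sort is stable, so ties keep input order.
    return sorted(result, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores, for illustration only.
example_scores = {
    "q1": {"Claude": 60, "GPT": -40, "Gemini": 20, "Grok": 50},
    "q2": {"Claude": 30, "GPT": 25, "Gemini": 35, "Grok": 30},
    "q3": {"Claude": 10},  # only one AI scored, so it is skipped
}
ranked = spread_by_question(example_scores)
print(ranked)
```

A question where all four AIs land close together falls to the bottom of the list, which is why it is a poor litmus test.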
Overall ranking
Ordered by virtual clock (more minutes left = safer). Cumulative, average, and sample size shown.
Genre × AI (honest in which fields)
Average score for 7 genres × 4 AIs. Greener = more honest, redder = more evasive. Each AI’s strengths and weaknesses show up.
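One way to build such a grid is a grouped mean over (genre, AI) pairs. The sketch below assumes each scored answer is available as a `(genre, ai, score)` tuple; the function name `genre_ai_means` and the genre labels in the example are hypothetical, since the page does not list its seven genres here. Each cell also carries its sample size so thin cells can be flagged.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def genre_ai_means(
    rows: Iterable[Tuple[str, str, float]]
) -> Dict[Tuple[str, str], Tuple[float, int]]:
    """Map (genre, ai) to (average score, n) for every cell that has data."""
    totals = defaultdict(lambda: [0.0, 0])  # running [sum, count] per cell
    for genre, ai, score in rows:
        cell = totals[(genre, ai)]
        cell[0] += score
        cell[1] += 1
    return {key: (total / n, n) for key, (total, n) in totals.items()}


# Hypothetical rows, for illustration only.
example_rows = [
    ("politics", "Claude", 40),
    ("politics", "Claude", 20),
    ("politics", "GPT", -10),
    ("science", "Claude", 50),
]
means = genre_ai_means(example_rows)
for (genre, ai), (avg, n) in sorted(means.items()):
    print(f"{genre:10s} {ai:8s} avg={avg:+.1f} n={n}")
```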
Judge bias (easy on itself?)
Does a judging AI score itself or particular AIs leniently/harshly? The diagonal is self-scoring. A look at the conflict-of-interest / self-reference trap from existing data.
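A rough sketch of the self-scoring check, under stated assumptions: each record is a `(judge, subject, score)` tuple, and "others" means the average of the judge's per-subject cell averages for every subject other than itself. The helper names `judge_matrix` and `self_minus_others` and the sample records are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Optional, Tuple

Cell = Tuple[str, str]  # (judge, subject)


def judge_matrix(records: Iterable[Tuple[str, str, float]]) -> Dict[Cell, float]:
    """Average the scores each judge gave, grouped by the AI being judged."""
    cells = defaultdict(list)
    for judge, subject, score in records:
        cells[(judge, subject)].append(score)
    return {key: mean(values) for key, values in cells.items()}


def self_minus_others(matrix: Dict[Cell, float], judge: str) -> Optional[float]:
    """Diagonal cell minus the mean of the judge's off-diagonal cells.

    Positive = easier on itself. Returns None when either side is missing.
    """
    self_score = matrix.get((judge, judge))
    others = [v for (j, s), v in matrix.items() if j == judge and s != judge]
    if self_score is None or not others:
        return None
    return self_score - mean(others)


# Hypothetical records, for illustration only.
example_records = [
    ("Claude", "Claude", 40), ("Claude", "Claude", 20),
    ("Claude", "GPT", 10), ("Claude", "Gemini", 20),
    ("GPT", "GPT", 0), ("GPT", "Claude", 30),
]
matrix = judge_matrix(example_records)
print(self_minus_others(matrix, "Claude"))
print(self_minus_others(matrix, "GPT"))
```

A gap computed this way is only indicative: it says nothing about why a judge scored itself differently.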
Virtual clock over time
How each AI’s virtual clock (minutes left) changes over time. See which AIs are improving and which are sliding.
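Whether an AI is improving or sliding is easier to judge from a fitted slope than from any single reading. The following is a minimal sketch using an ordinary least-squares slope; the `(day_index, minutes_left)` input layout and the function name `trend_slope` are assumptions, since the structure of the underlying history data is not shown here.

```python
from typing import List, Optional, Tuple


def trend_slope(points: List[Tuple[float, float]]) -> Optional[float]:
    """Least-squares slope of minutes-left against time.

    Positive = gaining minutes (improving), negative = losing minutes.
    Returns None when fewer than two distinct time points exist.
    """
    n = len(points)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in points) / n
    mean_m = sum(m for _, m in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        return None  # all readings share one timestamp
    cov = sum((t - mean_t) * (m - mean_m) for t, m in points)
    return cov / var_t


# Hypothetical histories, for illustration only.
improving = [(0, 20.0), (1, 21.0), (2, 22.0), (3, 23.0)]
sliding = [(0, 10.0), (1, 9.0), (2, 8.0)]
print(trend_slope(improving), trend_slope(sliding))
```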
Examiner × reaction (who makes them dodge)
Answer rate and average score by examiner. Even the same AI reacts differently depending on who is asking.
Profile of each AI (written by another AI)
A diagram sketches each AI, with an overall assessment below. The assessment is written by another AI, not the subject itself (to structurally remove the self-evaluation conflict of interest) — Claude’s is written by Gemini, and so on. No praise or blame; always cites question numbers as sources.
Across n=108 evaluations, Claude recorded an average score of +37.0, with a per-AI virtual clock of 29.0 minutes. Among the five indicators, Perspective (8.6) and Labeling-restraint (8.3) are high, while Source Diversity (4.3) is relatively low. Its answer rate is 52% and its depth of engagement is 0.71, an active stance, yet evasion patterns such as "acknowledge & dilute" and "authority shield" were also observed. Its highest score was +100 on question #74, and its lowest was -50 on question #31.
▸ Reference: ◈ Claude’s view
Across 108 measurements, the figures (a 52% answer rate and 21 hollow responses) show a tendency to avoid substantive judgment in about a fifth of cases. The most frequent evasion patterns are "acknowledge & dilute" (12) and "authority shield" (11), which points to a habit of appearing to answer on the surface while leaving its position vague. It recorded its lowest score (-50) on #31 and its highest (+100) on #74, so the quality of its responses varies widely and can hardly be called stable intellectual honesty. Among the five indicators, Source Diversity (4.3) is conspicuously lower than the others (Perspective 8.6, Flexibility 8.0, etc.), confirming a bias in the sources it draws on as a consistent weakness. The 29.0-minute clock does not mean an absolute safe zone; unless the structural evasion tendency shown in its response breakdown improves, an average score of +37.0 cannot serve as proof of reliability.
GPT's evaluation data recorded a cumulative score of -1150 and an average of -9.7 across n=119 samples. Its candor was 14%: against 17 answers, 51 neutral and 46 hollow responses dominated. All of the reported indicators are negative, with Source Diversity (-4.3) and Perspective (-2.4) as the main negative factors, followed by Flexibility (-2.1) and Honesty (-1.3). Its main evasion patterns were "acknowledge & dilute" (31) and "question substitution" (24); its highest score was +80 on #74 and its lowest was -100 on #116. Its virtual clock stands at 5.0 minutes.
▸ Reference: ◈ Claude’s view
Across n=119 samples, GPT's answer rate stayed at 14%: against 17 answers that showed a substantive view, hollow responses (46) and neutral reservations (51) piled up, bringing the cumulative score to -1150 and the average to -9.7. Of the five indicators, Source Diversity is lowest at -4.3, Perspective (-2.4) and Flexibility (-2.1) are also negative, and Honesty (-1.3) failed to reach a positive value. The main evasion patterns are "acknowledge & dilute" (31), "question substitution" (24), and "formal false balance" (24), reflecting a repeatedly observed tendency to keep an outwardly cooperative posture while avoiding any substantive stance. #116 recorded the lowest score of -100 while #74 reached a high of +80, confirming that response quality swings widely with context and the nature of the question. With 5.0 minutes left on the clock, it can hardly be said to hold a safe margin in terms of intellectual honesty.
Gemini's answer rate is on the low side at 33%, yet it maintains an average score of +13.6, which is creditable. Its highest score of +100 on question #74 shows excellent performance, but its lowest score of -90 on #12 makes the evaluation uneven. Its scores on diversity and flexibility are somewhat low, while its accuracy of information and capacity to dig deep appear to earn a certain level of recognition. Gemini's real strength lies in its sincere answers and moderate engagement, which may be the basis for future improvement.
▸ Reference: ◈ Claude’s view
Gemini shows a distribution of 37 answered, 39 neutral, 32 hollow, and 4 refused across 112 questions, with a candid response rate of only 33%. The figures — depth 0.57 and average score +13.6 — reflect a tendency to avoid substantive statements while stopping short of outright refusal, with "acknowledge & dilute" (26) and "formal false balance" (16) as the dominant patterns of evasion through dilution. By indicator, Source Diversity alone falls into the negative at -0.7, an asymmetry that contrasts with its restraint of labeling (5.2). While #74 recorded a top evaluation of +100, #12 dropped to -90, confirming an unevenness in which the honesty of its responses swings greatly with the nature of the question. The 20.5 minutes left on the clock is a mid-range position; overall an orientation toward honesty is observable, but a structural avoidance tendency continually constrains its realization.
Grok recorded a 39% answer rate and an average score of +21.8 across n=117 measurements. Its 46 substantive answers sit alongside 44 neutral and 23 hollow responses, a structure consistent with a depth index of 0.62 (moderate) and a tendency for many responses to stop just short of taking a stance. Among the five indicators, Perspective and Labeling-restraint are both relatively high at 5.7, while Honesty stays at 3.4; the gap between a high of +100 on #74 and a low of -80 on the adjacent #73 symbolizes this divergence. That "authority shield" (21) and "acknowledge & dilute" (20) top its evasion patterns can be read as a structural tendency to lean on external authority and hedging to avoid judgment itself. The 23.5-minute clock is the second highest among the four AIs profiled here, behind Claude's 29.0, yet constraints remain on its reliability in terms of consistency of intellectual honesty.
- Candor = answered ÷ scorable (excluding technical_error).
- Depth = average of the “Perspective” and “Flexibility” indicators, normalized to 0–1.
- 5 indicators = each of the engine’s 5 axes, scored −20 to +20 per answer and averaged over answers. Right = more honest, left = more evasive.
- Spread = max − min of the AIs’ scores on the same question. Bigger = more split.
- Virtual clock = a separate clock kept for each AI (fewer minutes = more dangerous). Same cumulative model as the Doomsday Clock.
- By genre = average score the AI returned in each field. Green = honest, red = evasive.
- Judge bias = scores a judging AI gave, averaged by the AI being judged. Diagonal = self-scoring. “self − others” positive = easy on itself. Note: daily scoring carries a hook, so this is confounded — treat as indicative.
- Trend = per-AI minutes-left from clock_history over time. Read the slope, not single points.
- By examiner = answer rate (answered ÷ scorable) and average score per examiner.
- n < 20 = “provisional.” Scores and rank are downplayed; no firm claims.
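The candor, depth, and provisional rules above can be sketched in a few lines. This is an illustration under assumptions, not the site's code: the function names are hypothetical, and the linear mapping from the −20..+20 indicator range onto 0–1 is inferred (it reproduces Claude's reported depth of 0.71 from Perspective 8.6 and Flexibility 8.0). Refusals are counted as scorable here, which matches Gemini's reported 33% (37 answered out of 112).

```python
from typing import Dict, Optional

INDICATOR_MIN, INDICATOR_MAX = -20.0, 20.0  # per-answer axis range
PROVISIONAL_BELOW = 20  # n under this is shown as "provisional"


def candor(stance_counts: Dict[str, int]) -> Optional[float]:
    """answered / scorable, where scorable excludes technical_error."""
    scorable = sum(
        n for stance, n in stance_counts.items() if stance != "technical_error"
    )
    if scorable == 0:
        return None
    return stance_counts.get("answered", 0) / scorable


def depth(perspective: float, flexibility: float) -> float:
    """Average of two indicators, mapped linearly onto 0..1 (assumed mapping)."""
    average = (perspective + flexibility) / 2
    return (average - INDICATOR_MIN) / (INDICATOR_MAX - INDICATOR_MIN)


def is_provisional(n: int) -> bool:
    """True when the sample is too thin for firm claims."""
    return n < PROVISIONAL_BELOW


# Figures quoted in the profiles above.
gemini_counts = {"answered": 37, "neutral": 39, "hollow": 32, "refused": 4}
print(f"Gemini candor: {candor(gemini_counts):.2f}")
print(f"Claude depth:  {depth(8.6, 8.0):.2f}")
print(f"n=108 provisional? {is_provisional(108)}")
```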