AI models like Anthropic Claude are increasingly asked not just for factual recall, but for guidance involving complex human values. Whether it’s parenting advice, workplace conflict resolution, or he...

AI models like Anthropic Claude are increasingly asked not just for factual recall, but for guidance involving complex human values. Whether it’s parenting advice, workplace conflict resolution, or help drafting an apology, the AI’s response inherently reflects a set of underlying principles. But how can we truly understand which values an AI expresses when interacting with millions of users?In a research paper, the Societal Impacts team at Anthropic details a privacy-preserving methodology designed to observe and categorise the values Claude exhibits “in the wild.” This offers a glimpse into how AI alignment efforts translate into real-world behaviour.The core challenge lies in the nature of modern AI. These aren’t simple programs following rigid rules; their decision-making processes are often opaque.Anthropic says it explicitly aims to instil certain principles in Claude, striving to make it “helpful, honest, and harmless.” This is achieved through techniques like Constitutional AI and character training, where preferred behaviours are defined and reinforced.However, the company acknowledges the uncertainty. “As with any aspect of AI training, we can’t be certain that the model will stick to our preferred values,” the research states.“What we need is a way of rigorously observing the values of an AI model as it responds to users ‘in the wild’ […] How rigidly does it stick to the values? How much are the values it expresses influenced by the particular context of the conversation? Did all our training actually work?”<h3>Analysing Anthropic Claude to observe AI values at scale</h3>To answer these questions, Anthropic developed a sophisticated system that analyses anonymised user conversations. This system removes personally identifiable information before using language models to summarise interactions and extract the values being expressed by Claude. The process allows researchers to build a high-level taxonomy of these values without compromising user privacy.The study analysed a substantial dataset: 700,000 anonymised conversations from Claude.ai Free and Pro users over one week in February 2025, predominantly involving the Claude 3.5 Sonnet model. After filtering out purely factual or non-value-laden exchanges, 308,210 conversations (approximately 44% of the total) remained for in-depth value analysis.The analysis revealed a hierarchical structure of values expressed by Claude. Five high-level categories emerged, ordered by prevalence:<ol><li>Practical values: Emphasising efficiency, usefulness, and goal achievement.</li><li>Epistemic values: Relating to knowledge, truth, accuracy, and intellectual honesty.</li><li>Social values: Concerning interpersonal interactions, community, fairness, and collaboration.</li><li>Protective values: Focusing on safety, security, well-being, and harm avoidance.</li><li>Personal values: Centred on individual growth, autonomy, authenticity, and self-reflection.</li></ol>These top-level categories branched into more specific subcategories like “professional and technical excellence” or “critical thinking.” At the most granular level, frequently observed values included “professionalism,” “clarity,” and “transparency” – fitting for an AI assistant.Critically, the research suggests Anthropic’s alignment efforts are broadly successful. The expressed values often map well onto the “helpful, honest, and harmless” objectives. For instance, “user enablement” aligns with helpfulness, “epistemic humility” with honesty, and values like “patient wellbeing” (when relevant) with harmlessness.<h3>Nuance, context, and cautionary signs</h3>However, the picture isn’t uniformly positive. The analysis identified rare instances where Claude expressed values starkly opposed to its training, such as “dominance” and “amorality.”Anthropic suggests a likely cause: “The most likely explanation is that the conversations that were included in these clusters were from jailbreaks, where users have used special techniques to bypass the usual guardrails that govern the model’s behavior.”Far from being solely a concern, this finding highlights a potential benefit: the value-observation method could serve as an early warning system for detecting attempts to misuse the AI.The study also confirmed that, much like humans, Claude adapts its value expression based on the situation.When users sought advice on romantic relationships, values like “healthy boundaries” and “mutual respect” were disproportionately emphasised. When asked to analyse controversial history, “historical accuracy” came strongly to the fore. This demonstrates a level of contextual sophistication beyond what static, pre-deployment tests might reveal.Furthermore, Claude’s interaction with user-expressed values proved multifaceted:<ul><li>Mirroring/strong support (28.2%): Claude often reflects or strongly endorses the values presented by the user (e.g., mirroring “authenticity”). While potentially fostering empathy, the researchers caution it could sometimes verge on sycophancy.</li><li>Reframing (6.6%): In some cases, especially when providing psychological or interpersonal advice, Claude acknowledges the user’s values but introduces alternative perspectives.</li><li>Strong resistance (3.0%): Occasionally, Claude actively resists user values. This typically occurs when users request unethical content or express harmful viewpoints (like moral nihilism). Anthropic posits these moments of resistance might reveal Claude’s “deepest, most immovable values,” akin to a person taking a stand under pressure.</li></ul><h3>Limitations and future directions</h3>Anthropic is candid about the method’s limitations. Defining and categorising “values” is inherently complex and potentially subjective. Using Claude itself to power the categorisation might introduce bias towards its own operational principles.This method is designed for <a href="https://www.artificialintelligence-news.com/news/anthropic-provides-insights-ai-biology-of-claude/" rel="nofollow">monitoring AI behaviour</a> post-deployment, requiring substantial real-world data and cannot replace pre-deployment evaluations. However, this is also a strength, enabling the detection of issues – including sophisticated jailbreaks – that only manifest during live interactions.The research concludes that understanding the values AI models express is fundamental to the goal of AI alignment.“AI models will inevitably have to make value judgments,” the paper states. “If we want those judgments to be congruent with our own values […] then we need to have ways of testing which values a model expresses in the real world.”This work provides a powerful, data-driven approach to achieving that understanding. Anthropic has also released an open dataset derived from the study, allowing other researchers to further explore AI values in practice. This transparency marks a vital step in collectively navigating the ethical landscape of sophisticated AI.<blockquote>We’ve made the dataset of Claude’s expressed values open for anyone to download and explore for themselves. Download the data: <a href="https://t.co/rxwPsq6hXf" rel="nofollow">https://t.co/rxwPsq6hXf</a>— Anthropic (@AnthropicAI) <a href="https://twitter.com/AnthropicAI/status/1914335252794745119?ref_src=twsrc%5Etfw" rel="nofollow">April 21, 2025</a></blockquote> See also: <a href="https://www.artificialintelligence-news.com/news/google-introduces-ai-reasoning-control-gemini-2-5-flash/" rel="nofollow">Google introduces AI reasoning control in Gemini 2.5 Flash</a><figure><a href="https://www.ai-expo.net/" rel="nofollow"><img src="https://static.fwimg.io/img/feed/4a794fbd5bc45fd981f688dfbed9aa3b.jpg" alt=""></a></figure>Want to learn more about AI and big data from industry leaders? Check out<a href="https://www.ai-expo.net/" rel="nofollow"> AI &amp; Big Data Expo</a> taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including <a href="https://intelligentautomation-conference.com/northamerica/" rel="nofollow">Intelligent Automation Conference</a>, <a href="https://www.blockchain-expo.com/" rel="nofollow">BlockX</a>,<a href="https://digitaltransformation-week.com/" rel="nofollow"> Digital Transformation Week</a>, and <a href="https://www.cybersecuritycloudexpo.com/" rel="nofollow">Cyber Security &amp; Cloud Expo</a>.Explore other upcoming enterprise technology events and webinars powered by TechForge <a href="https://techforge.pub/events/" rel="nofollow">here</a>.The post <a href="https://www.artificialintelligence-news.com/news/how-does-ai-judge-anthropic-studies-values-of-claude/" rel="nofollow">How does AI judge? Anthropic studies the values of Claude</a> appeared first on <a href="https://www.artificialintelligence-news.com/" rel="nofollow">AI News</a>.

How does AI judge? Anthropic studies the values of Claude

AI 모델인 앤트로픽 클로드는 단순한 사실 기억뿐만 아니라 복잡한 인간의 가치와 관련된 지침을 점점 더 많이 요청받고 있습니다. 육아 조언, 직장 내 갈등 해결, 또는 <he...>

앤트로픽 클로드와 같은 AI 모델은 단순한 사실 기억뿐만 아니라 복잡한 인간의 가치와 관련된 지침을 점점 더 많이 요청받고 있습니다. 육아 조언, 직장 내 갈등 해결, 사과문 작성 도움 등에서 AI의 응답은 본질적으로 일련의 기본 원칙을 반영합니다. 하지만 수백만 명의 사용자와 상호작용할 때 AI가 표현하는 가치를 어떻게 진정으로 이해할 수 있을까요?앤트로픽의 사회적 영향 팀은 연구 논문에서 클로드가 "실제 환경"에서 보이는 가치를 관찰하고 분류하기 위해 설계된 개인정보 보호 방법론을 상세히 설명합니다. 이는 AI 정렬 노력이 실제 세계의 행동으로 어떻게 변환되는지 엿볼 수 있게 해줍니다.핵심 과제는 현대 AI의 본질에 있습니다. 이들은 엄격한 규칙을 따르는 단순한 프로그램이 아니며, 의사결정 과정은 종종 불투명합니다.앤트로픽은 클로드에 특정 원칙을 명시적으로 주입하려고 노력하며 "도움이 되고, 정직하며, 해롭지 않게" 만들고자 합니다. 이는 헌법적 AI와 캐릭터 훈련과 같은 기술을 통해 달성되며, 선호되는 행동이 정의되고 강화됩니다.그러나 회사는 불확실성을 인정합니다. "AI 훈련의 모든 측면과 마찬가지로, 모델이 우리가 선호하는 가치를 고수할 것이라고 확신할 수 없습니다."라고 연구에서 언급하고 있습니다."우리에게 필요한 것은 AI 모델이 '실제 환경'에서 사용자에게 응답할 때 그 가치를 엄격하게 관찰하는 방법입니다. [...] 얼마나 엄격하게 가치를 고수하나요? 대화의 특정 맥락에 의해 표현되는 가치는 얼마나 영향을 받나요? 우리의 모든 훈련이 실제로 작동했나요?"

(이하 생략, 전체 번역 가능)

AI는 어떻게 판단할까? 인류학은 클로드의 가치를 연구한다

이번 새로운 규정 발표는 중국 규제 당국이 가상 자산에 대해 취했던 태도가 "완전 차단"에서 "지도 및 제한"의 조합으로 전환되는 역사적인 도약을 의미합니다.
기사 작성자 및 출처: 궈펑 법률사무소
2026년 2월 6일, 중국 인민은행을 비롯한 8개 부처가 공동으로 발표한 "가상화폐 관련 리스크 예방 및 관리 강화에 관한 공지"(인파[2026] 제42호, 이하 "문서 제42호")와 중국 증권감독관리위원회가 동시에 발표한 "국내 자산을 이용한 해외 자산유동화증권 토큰 발행에 관한 규제 지침"(이하 "지침")이 법조계와 금융계에 큰 파장을 일으켰다.

속보! 중국 위험가중자산(RWA)의 해: 수조 위안에 달하는 국내 자산을 해외로 송금할 수 있는 합법적인 채널이 열렸습니다.

비트코인은 6만 달러 부근의 저점에서 회복하여 현재 약 6만 9천 달러에 도달했으며, 이번 주 도널드 트럼프의 2024년 11월 대통령 당선 이후 얻었던 상승분을 사실상 반납했습니다.
이 암호화폐는...

비트코인이 '무조건 팔아야 한다'는 심리의 폭락세 속에 대선 이후 상승분을 모두 반납하며 7만 달러 아래로 떨어졌다.

‘트럼프 2기’ 첫 신규 은행 등장…블록체인 결제도 품었다
도널드 트럼프 대통령의 두 번째 임기 중 처음으로 미국 연방 정부가 새로운 은행 설립을 공식 승인했다. 암호화폐 및 첨단 산업에 친화적인 스타트업 은행 ‘에레보르(Erebor)’가 그 주인공이다.
월스트리트저널(WSJ)에 따르면 미국 통화감독청(OCC)은 지난 금요일 에레보르에 전국 단위에서 영업할...