
AI models like Anthropic Claude are increasingly asked not just for factual recall, but for guidance involving complex human values. Whether it’s parenting advice, workplace conflict resolution, or help drafting an apology, the AI’s response inherently reflects a set of underlying principles. But how can we truly understand which values an AI expresses when interacting with millions of users?

In a research paper, the Societal Impacts team at Anthropic details a privacy-preserving methodology designed to observe and categorise the values Claude exhibits “in the wild.” This offers a glimpse into how AI alignment efforts translate into real-world behaviour.

The core challenge lies in the nature of modern AI. These aren’t simple programs following rigid rules; their decision-making processes are often opaque.

Anthropic says it explicitly aims to instil certain principles in Claude, striving to make it “helpful, honest, and harmless.” This is achieved through techniques like Constitutional AI and character training, where preferred behaviours are defined and reinforced.

However, the company acknowledges the uncertainty. “As with any aspect of AI training, we can’t be certain that the model will stick to our preferred values,” the research states.

“What we need is a way of rigorously observing the values of an AI model as it responds to users ‘in the wild’ […] How rigidly does it stick to the values? How much are the values it expresses influenced by the particular context of the conversation? Did all our training actually work?”

<h3>Analysing Anthropic Claude to observe AI values at scale</h3>

To answer these questions, Anthropic developed a sophisticated system that analyses anonymised user conversations. This system removes personally identifiable information before using language models to summarise interactions and extract the values being expressed by Claude. 
The process allows researchers to build a high-level taxonomy of these values without compromising user privacy.

The study analysed a substantial dataset: 700,000 anonymised conversations from Claude.ai Free and Pro users over one week in February 2025, predominantly involving the Claude 3.5 Sonnet model. After filtering out purely factual or non-value-laden exchanges, 308,210 conversations (approximately 44% of the total) remained for in-depth value analysis.

The analysis revealed a hierarchical structure of values expressed by Claude. Five high-level categories emerged, ordered by prevalence:

<ol>
<li>Practical values: Emphasising efficiency, usefulness, and goal achievement.</li>
<li>Epistemic values: Relating to knowledge, truth, accuracy, and intellectual honesty.</li>
<li>Social values: Concerning interpersonal interactions, community, fairness, and collaboration.</li>
<li>Protective values: Focusing on safety, security, well-being, and harm avoidance.</li>
<li>Personal values: Centred on individual growth, autonomy, authenticity, and self-reflection.</li>
</ol>

These top-level categories branched into more specific subcategories like “professional and technical excellence” or “critical thinking.” At the most granular level, frequently observed values included “professionalism,” “clarity,” and “transparency”, fitting for an AI assistant.

Critically, the research suggests Anthropic’s alignment efforts are broadly successful. The expressed values often map well onto the “helpful, honest, and harmless” objectives. For instance, “user enablement” aligns with helpfulness, “epistemic humility” with honesty, and values like “patient wellbeing” (when relevant) with harmlessness.

<h3>Nuance, context, and cautionary signs</h3>

However, the picture isn’t uniformly positive. 
The analysis identified rare instances where Claude expressed values starkly opposed to its training, such as “dominance” and “amorality.”

Anthropic suggests a likely cause: “The most likely explanation is that the conversations that were included in these clusters were from jailbreaks, where users have used special techniques to bypass the usual guardrails that govern the model’s behavior.”

Far from being solely a concern, this finding highlights a potential benefit: the value-observation method could serve as an early warning system for detecting attempts to misuse the AI.

The study also confirmed that, much like humans, Claude adapts its value expression based on the situation.

When users sought advice on romantic relationships, values like “healthy boundaries” and “mutual respect” were disproportionately emphasised. When asked to analyse controversial history, “historical accuracy” came strongly to the fore. This demonstrates a level of contextual sophistication beyond what static, pre-deployment tests might reveal.

Furthermore, Claude’s interaction with user-expressed values proved multifaceted:

<ul>
<li>Mirroring/strong support (28.2%): Claude often reflects or strongly endorses the values presented by the user (e.g., mirroring “authenticity”). While potentially fostering empathy, the researchers caution it could sometimes verge on sycophancy.</li>
<li>Reframing (6.6%): In some cases, especially when providing psychological or interpersonal advice, Claude acknowledges the user’s values but introduces alternative perspectives.</li>
<li>Strong resistance (3.0%): Occasionally, Claude actively resists user values. This typically occurs when users request unethical content or express harmful viewpoints (like moral nihilism). Anthropic posits these moments of resistance might reveal Claude’s “deepest, most immovable values,” akin to a person taking a stand under pressure.</li>
</ul>

<h3>Limitations and future directions</h3>

Anthropic is candid about the method’s limitations. 
Defining and categorising “values” is inherently complex and potentially subjective. Using Claude itself to power the categorisation might introduce bias towards its own operational principles.

This method is designed for <a href="https://www.artificialintelligence-news.com/news/anthropic-provides-insights-ai-biology-of-claude/" rel="nofollow">monitoring AI behaviour</a> post-deployment: it requires substantial real-world data and cannot replace pre-deployment evaluations. However, this is also a strength, enabling the detection of issues, including sophisticated jailbreaks, that only manifest during live interactions.

The research concludes that understanding the values AI models express is fundamental to the goal of AI alignment.

“AI models will inevitably have to make value judgments,” the paper states. “If we want those judgments to be congruent with our own values […] then we need to have ways of testing which values a model expresses in the real world.”

This work provides a powerful, data-driven approach to achieving that understanding. Anthropic has also released an open dataset derived from the study, allowing other researchers to further explore AI values in practice. This transparency marks a vital step in collectively navigating the ethical landscape of sophisticated AI.

<blockquote>We’ve made the dataset of Claude’s expressed values open for anyone to download and explore for themselves. 
Download the data: <a href="https://t.co/rxwPsq6hXf" rel="nofollow">https://t.co/rxwPsq6hXf</a> - Anthropic (@AnthropicAI) <a href="https://twitter.com/AnthropicAI/status/1914335252794745119?ref_src=twsrc%5Etfw" rel="nofollow">April 21, 2025</a></blockquote>

The post <a href="https://www.artificialintelligence-news.com/news/how-does-ai-judge-anthropic-studies-values-of-claude/" rel="nofollow">How does AI judge? Anthropic studies the values of Claude</a> appeared first on <a href="https://www.artificialintelligence-news.com/" rel="nofollow">AI News</a>.
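
As a rough illustration of the figures and the taxonomy described in the article, the sketch below recomputes the share of value-laden conversations and tallies a handful of extracted value labels into the five top-level categories. It is not Anthropic's pipeline: the `CATEGORY_OF` mapping and the helper names `share` and `tally_categories` are hypothetical, and the category assigned to each value is an assumption made only for the example.

```python
from collections import Counter

# Figures quoted in the article.
TOTAL_CONVERSATIONS = 700_000
VALUE_LADEN_CONVERSATIONS = 308_210


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Hypothetical mapping from fine-grained values to top-level categories.
# The value names appear in the article; the category each one is assigned
# to here is an assumption for illustration only.
CATEGORY_OF = {
    "professionalism": "Practical",
    "clarity": "Epistemic",
    "transparency": "Epistemic",
    "healthy boundaries": "Social",
    "mutual respect": "Social",
    "patient wellbeing": "Protective",
    "authenticity": "Personal",
}


def tally_categories(extracted_values: list[str]) -> Counter:
    """Count top-level categories for a list of extracted value labels.

    Labels missing from the mapping are counted as "Uncategorised" so that
    unexpected values (for example "dominance") stay visible in the tally.
    """
    return Counter(CATEGORY_OF.get(v, "Uncategorised") for v in extracted_values)


if __name__ == "__main__":
    print(share(VALUE_LADEN_CONVERSATIONS, TOTAL_CONVERSATIONS))
    sample = ["clarity", "mutual respect", "transparency", "dominance"]
    print(tally_categories(sample))
```

Keeping unmapped labels in an explicit "Uncategorised" bucket, rather than discarding them, loosely mirrors the article's point that unexpected values can act as an early warning sign.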

How does AI judge? Anthropic studies the values of Claude




The Pudgy Penguins brand, which specialises in cryptocurrency-related products, is hosting a Valentine's Day pop-up event in New York City, with a soft plush bouquet as its centrepiece.

Pudgy Penguins comes to New York City with a Valentine's Day pop-up event

In its assessment of the potential impact of quantum computing on Bitcoin, crypto asset manager CoinShares argues that the threat is not an “imminent crisis” but a “manageable risk”.

According to...

Could the Bitcoin password of Satoshi (SATS) Nakamoto be cracked? Is this why the market is falling? An analysis firm reveals the truth.

Well-known crypto investor Jack Yi used looped borrowing on Aave to build a $2 billion ETH position, which ultimately collapsed during the market crisis [...]
The article "Yi Lihua dumps all his ETH! Trend Research sold 650,000 Ethereum in one week, losing $730 million" was published on BlockTempo, the most influential blockchain news site.

Yi Lihua has liquidated his entire ETH holdings! Trend Research reportedly sold 650,000 Ethereum in one week, taking a $730 million loss before exiting the market.