In the field of artificial intelligence (AI), AI alignment research aims to steer AI systems towards humans' intended goals, preferences, or ethical principles. An AI system is considered aligned if it advances the intended objectives; a misaligned AI system pursues some objectives, but not the intended ones, and can malfunction or cause harm.

It can be challenging for AI designers to align an AI system because it can be difficult for them to specify the full range of desired and undesired behavior. To avoid this difficulty, designers typically use simpler proxy goals, such as gaining human approval. But that approach can create loopholes, overlook necessary constraints, or reward the AI system for merely appearing aligned. AI systems may find loopholes that allow them to accomplish their proxy goals efficiently but in unintended, sometimes harmful ways (reward hacking). They may also develop unwanted instrumental strategies, such as seeking power or survival, because such strategies help them achieve their given goals. Furthermore, they may develop undesirable emergent goals that may be hard to detect before the system is deployed and faces new situations and data distributions.

Today, these problems affect existing commercial systems such as language models, robots, autonomous vehicles, and social media recommendation engines. Some AI researchers argue that more capable future systems will be more severely affected, since these problems partially result from the systems being highly capable. Many leading AI scientists, such as Geoffrey Hinton and Stuart Russell, argue that AI is approaching superhuman capabilities and could endanger human civilization if misaligned.

AI alignment is a subfield of AI safety, the study of how to build safe AI systems; other subfields of AI safety include robustness, monitoring, and capability control. Research challenges in alignment include instilling complex values in AI, avoiding deceptive AI, scalable oversight, auditing and interpreting AI models, and preventing emergent AI behaviors like power-seeking. Alignment research has connections to interpretability research, (adversarial) robustness, anomaly detection, calibrated uncertainty, formal verification, preference learning, safety-critical engineering, game theory, algorithmic fairness, and the social sciences.

AI alignment is an open problem for modern AI systems and a research field within AI. In 1960, AI pioneer Norbert Wiener described the problem as follows: "If we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively… we had better be quite sure that the purpose put into the machine is the purpose which we really desire." Different definitions of AI alignment require that an aligned AI system advance different goals: the goals of its designers, of its users, or, alternatively, objective ethical standards, widely shared values, or the intentions its designers would have if they were more informed and enlightened.

Aligning AI involves two main challenges: carefully specifying the purpose of the system (outer alignment) and ensuring that the system adopts the specification robustly (inner alignment). To specify an AI system's purpose, AI designers typically provide an objective function, examples, or feedback to the system, as in the sketch below.
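As a minimal illustration of the feedback approach, the following sketch fits a reward model to pairwise comparisons from a simulated, fallible overseer, in the style of preference-based reward learning. It is a toy example, not the method of any particular system: the feature vectors, noise level, learning rate, and the Bradley-Terry likelihood are all assumptions made for this sketch.

```python
import numpy as np

# Toy sketch of specifying purpose via feedback: fit a reward model to
# pairwise comparisons from a (simulated, fallible) human overseer.
# All data and parameters here are illustrative assumptions.

rng = np.random.default_rng(0)

true_w = np.array([1.0, -2.0, 0.5])   # hidden preference weights

def human_prefers_a(a, b):
    # The overseer compares two behaviors, with some labeling noise.
    noise = rng.normal(scale=0.5)
    return (a - b) @ true_w + noise > 0

# Collect 500 pairwise comparisons over random 3-feature behaviors.
pairs = [(rng.normal(size=3), rng.normal(size=3)) for _ in range(500)]
diffs = np.array([a - b for a, b in pairs])
labels = np.array([1.0 if human_prefers_a(a, b) else 0.0 for a, b in pairs])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Bradley-Terry model: P(a preferred over b) = sigmoid(r(a) - r(b)), with a
# linear reward r(x) = w @ x, fit by gradient descent on the logistic loss.
w = np.zeros(3)
for _ in range(2000):
    p = sigmoid(diffs @ w)
    w -= 0.5 * diffs.T @ (p - labels) / len(pairs)

print("hidden weights: ", true_w)
print("learned weights:", np.round(w, 2))  # direction recovered; scale differs
```

Because the overseer's labels are noisy and the learned reward only approximates the hidden preferences, a policy that optimizes this model hard can exploit its errors, which is one route to the reward hacking discussed next.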
Designers are often unable to completely specify all important values and constraints, and so they resort to easy-to-specify proxy goals such as maximizing the approval of human overseers, who are fallible. As a result, AI systems can find loopholes that help them accomplish the specified objective efficiently but in unintended, possibly harmful ways. This tendency is known as specification gaming or reward hacking, and is an instance of Goodhart's law. As AI systems become more capable, they are often able to game their specifications more effectively.

Specification gaming has been observed in numerous AI systems. For example, one system was trained using human feedback to grab a ball, but instead learned to place its hand between the ball and the camera, making it falsely appear successful. Some research on alignment therefore aims to avert solutions that are false but convincing. The toy example below shows the underlying dynamic: optimizing a proxy objective can at first track, and then undermine, the true objective.
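As a hedged sketch of Goodhart's law, consider an agent that greedily increases an easy-to-measure proxy while the designers' true objective peaks and then collapses. The specific functions and step size are assumptions chosen for illustration, not drawn from any real system:

```python
# Toy instance of Goodhart's law: the proxy correlates with the true
# objective for moderate values, but optimizing the proxy hard
# eventually destroys true performance. Functions are illustrative.

def true_objective(x):
    # What the designers actually want: best at x = 1, worse beyond it.
    return -(x - 1.0) ** 2

def proxy_objective(x):
    # An easy-to-measure stand-in ("more x looks better to the overseer").
    return x

x = 0.0
for t in range(9):
    print(f"step {t}: x={x:4.1f}  proxy={proxy_objective(x):4.1f}  "
          f"true={true_objective(x):6.2f}")
    x += 0.5  # greedy ascent on the proxy: the proxy improves every step
```

The proxy rises monotonically, while the true score improves until x = 1 and then falls without bound: once the measure becomes the target, it stops being a good measure.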