Our approach to alignment research

There is currently no known indefinitely scalable solution to the alignment problem. As AI progress continues, we expect to encounter a number of new alignment problems that we don’t yet observe in current systems. Some of these problems we anticipate now, and some of them will be entirely new.

We believe that finding an indefinitely scalable solution is likely very difficult. Instead, we aim for a more pragmatic approach: building and aligning a system that can make faster and better alignment research progress than humans can.

As we make progress on this, our AI systems can take over more and more of our alignment work and ultimately conceive, implement, study, and develop better alignment techniques than we have now. They will work together with humans to ensure that their own successors are more aligned with humans.

We believe that evaluating alignment research is substantially easier than producing it, especially when provided with evaluation assistance. Therefore, human researchers will focus more and more of their effort on reviewing alignment research done by AI systems instead of generating this research themselves. Our goal is to train models to be so aligned that we can off-load almost all of the cognitive labor required for alignment research.

Importantly, we only need “narrower” AI systems that have human-level capabilities in the relevant domains to do as well as humans on alignment research. We expect these AI systems to be easier to align than general-purpose systems or systems much smarter than humans.

Language models are particularly well-suited for automating alignment research because they come “preloaded” with a lot of knowledge and information about human values from reading the internet. Out of the box, they aren’t independent agents and thus don’t pursue their own goals in the world. To do alignment research, they don’t need unrestricted access to the internet. Yet a lot of alignment research tasks can be phrased as natural language or coding tasks.

Future versions of WebGPT, InstructGPT, and Codex can provide a foundation as alignment research assistants, but they aren’t sufficiently capable yet. While we don’t know when our models will be capable enough to meaningfully contribute to alignment research, we think it’s important to get started ahead of time. Once we train a model that could be useful, we plan to make it accessible to the external alignment research community.