Blog

Better NLP Models: OpenAI GPT-2

Peter Bruce

August 23, 2019


I’ve been told that, in conversation, I jump in and finish other people’s sentences for them. Now there’s an app for that: GPT-2, released by OpenAI, the research organization co-founded by Elon Musk. GPT-2 is a natural language program that, given a prompt, will write (mostly) intelligible content. OpenAI’s stated mission is “to ensure that artificial general intelligence (AGI) … benefits all of humanity.” Natural Language Processing (NLP) includes applications such as text classification, language generation, question answering, machine translation, and speech recognition.

GPT-2, as released, is a weaker version of the full model that OpenAI has developed but kept largely under wraps. The organization worried that the full model’s capabilities are so powerful that they could, as critic Jeremy Howard (an Australian data scientist and entrepreneur) put it, spread “the technology to totally fill Twitter, email, and the web up with reasonable-sounding, context-appropriate prose, which would drown out all other speech and be impossible to filter.”

As stated in this article on the feared dangers, OpenAI said its new natural language model, GPT-2, was trained to predict the next word in a sample of 40 gigabytes of internet text. The end result was a system that generates text that “adapts to the style and content of the conditioning text,” allowing the user to “generate realistic and coherent continuations about a topic of their choosing.”

I tried out a public interface to GPT-2, with a lead phrase of “When Churchill replaced Chamberlain…” and GPT-2 carried on:  

“with King George VI and George VII with King George VIII, the British Parliament passed what was called the Act for the Advancement of the Foreign Policy of Great Britain…” then switched gears after a bit with 

“In 1939, British intelligence was involved in the creation of a plan to assassinate British Prime Minister Neville Chamberlain and his associates ...The British were unable to achieve a full-scale victory against the American forces”

If you try it repeatedly, you’ll get dramatically different results each time. The algorithm is a prediction engine of sorts; its training data consists of 40 gigabytes of text drawn from web pages linked on Reddit. The great advance of the last five years has been to expand prediction beyond the local task of predicting the next word to the more global task of predicting a sensible sequence of words. Deep learning has played a key role here, first through recurrent neural networks (RNNs), which factor in sequential dependencies, and more recently through the Transformer architecture on which GPT-2 is built. Modeling those dependencies is key to producing phrases that make sense as a whole, both on their own and in conjunction with the supplied prompt. It also explains why random elements in the algorithm can yield text sequences, generated from the same prompt, that are so unlike one another: as long as they make sense and are logically connected to the prompt, they are equally valid results.
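The idea of a prediction engine that samples among plausible next words, so that the same prompt yields different continuations on each run, can be sketched in miniature. The toy below is not GPT-2 (which uses a large neural network, not word counts), just a hypothetical bigram model illustrating the predict-then-sample loop; the tiny corpus is made up for the example.

```python
import random
from collections import defaultdict

# A made-up training text for illustration only.
corpus = ("when churchill replaced chamberlain the war cabinet met "
          "when the war began the cabinet met in london").split()

# Count which words follow each word in the training text.
following = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev].append(nxt)

def generate(prompt_word, length=8, seed=None):
    """Repeatedly predict the next word, sampling among observed candidates."""
    rng = random.Random(seed)
    words = [prompt_word]
    for _ in range(length):
        candidates = following.get(words[-1])
        if not candidates:
            break  # no observed continuation for this word
        # The random choice is why the same prompt gives varied output.
        words.append(rng.choice(candidates))
    return " ".join(words)

print(generate("when", seed=1))
print(generate("when", seed=2))  # same prompt, potentially different text
```

Different seeds stand in for the random element in the real system: each continuation is consistent with the training data, so each is an equally valid result.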

If you think about it seriously, it is difficult to imagine why it would be a good thing for computers to be able to write text that most people would have a hard time distinguishing from something human-written. But the technology is here, and hoping that bad actors will not use it is like hoping that water will run uphill. Publicizing this AI capability, and its dangers, may be the best way to inoculate against its ill effects, by enabling countervailing technology to detect algorithmically written text.


Talk to one of our experts to discuss how Elder Research can use cutting-edge machine learning and AI techniques to deliver greater value on your next NLP project.

Want to dive deeper yourself? Statistics.com offers dozens of fully online courses.




About the Author

Peter Bruce is Founder and President of The Institute for Statistics Education at Statistics.com. Previously he taught statistics at the University of Maryland, and served in the U.S. Foreign Service. He is a co-author of Data Mining for Business Analytics, with Galit Shmueli and Nitin R. Patel (Wiley, 3rd ed. 2016; also JMP version 2017, R version 2018, Python version 2019; plus translations into Korean and Chinese), Introductory Statistics and Analytics (Wiley, 2015), and Practical Statistics for Data Scientists, with Andrew Bruce (O'Reilly, 2016). His blogs on statistics are featured regularly in Scientific American online.