TY - GEN
T1 - Measuring and Controlling Text Generation by Semantic Search
AU - Lee, Jieh-Sheng
N1 - Publisher Copyright:
© 2020 ACM.
PY - 2020/4/20
Y1 - 2020/4/20
N2 - Our motivation in this work is to measure patent text generation by semantic search, particularly by textual similarity in high-dimensional space for neural network models. The objective is to control patent text generation by semantic search. Conceptually, it is an attempt to integrate two subfields in NLP: text generation and semantic search. In our previous milestone of the PatentTransformer project, a prototype based on GPT-2 is capable of generating fluent patent titles, abstracts, independent claims, and dependent claims. However, beneath the surface form, the quality issue in the generated patent text was less explored. How to control text generation is also a hard problem in the NLP field. We would like to address these issues in this work and experiment with different approaches. On the measurement side, this work will address the quality measurement issue from the perspective of textual similarity. Based on that, the approaches we propose include two embedding spaces, span-based textual similarity, and a language model for patent claim spans. On the control side, we propose a knob-turning approach for controlling text generation based on measuring a range of textual similarity. In this way, we can search for a Goldilocks zone in which the similarity of generated patent text is close to but not too far from prior patents. We hypothesize that patent novelty may exist in such a zone.
AB - Our motivation in this work is to measure patent text generation by semantic search, particularly by textual similarity in high-dimensional space for neural network models. The objective is to control patent text generation by semantic search. Conceptually, it is an attempt to integrate two subfields in NLP: text generation and semantic search. In our previous milestone of the PatentTransformer project, a prototype based on GPT-2 is capable of generating fluent patent titles, abstracts, independent claims, and dependent claims. However, beneath the surface form, the quality issue in the generated patent text was less explored. How to control text generation is also a hard problem in the NLP field. We would like to address these issues in this work and experiment with different approaches. On the measurement side, this work will address the quality measurement issue from the perspective of textual similarity. Based on that, the approaches we propose include two embedding spaces, span-based textual similarity, and a language model for patent claim spans. On the control side, we propose a knob-turning approach for controlling text generation based on measuring a range of textual similarity. In this way, we can search for a Goldilocks zone in which the similarity of generated patent text is close to but not too far from prior patents. We hypothesize that patent novelty may exist in such a zone.
KW - GPT-2
KW - natural language generation
KW - natural language processing
KW - patent
KW - semantic search
KW - textual similarity
UR - http://www.scopus.com/inward/record.url?scp=85091695791&partnerID=8YFLogxK
U2 - 10.1145/3366424.3382086
DO - 10.1145/3366424.3382086
M3 - Conference contribution
AN - SCOPUS:85091695791
T3 - The Web Conference 2020 - Companion of the World Wide Web Conference, WWW 2020
SP - 269
EP - 273
BT - The Web Conference 2020 - Companion of the World Wide Web Conference, WWW 2020
PB - Association for Computing Machinery
T2 - 29th International World Wide Web Conference, WWW 2020
Y2 - 20 April 2020 through 24 April 2020
ER -