From oh.531 at buckeyemail.osu.edu Thu Apr 4 14:11:35 2024
From: oh.531 at buckeyemail.osu.edu (Oh, Byung-Doh)
Date: Thu, 4 Apr 2024 18:11:35 +0000
Subject: [CaCL] 4/11: Birth of a Transformer: A Memory Viewpoint
Message-ID:

Hi everyone,

Next week, we'll discuss the following paper:

Birth of a Transformer: A Memory Viewpoint
https://arxiv.org/abs/2306.00802

Large language models based on transformers have achieved great empirical successes. However, as they are deployed more widely, there is a growing need to better understand their internal mechanisms in order to make them more reliable. These models appear to store vast amounts of knowledge from their training data, and to adapt quickly to new information provided in their context or prompt. We study how transformers balance these two types of knowledge by considering a synthetic setup where tokens are generated from either global or context-specific bigram distributions. By a careful empirical analysis of the training process on a simplified two-layer transformer, we illustrate the fast learning of global bigrams and the slower development of an "induction head" mechanism for the in-context bigrams. We highlight the role of weight matrices as associative memories, provide theoretical insights on how gradients enable their learning during training, and study the role of data-distributional properties.

Best,
Byung-Doh

=================
Byung-Doh Oh (he/him/his)
Ph.D. Candidate
Department of Linguistics
The Ohio State University

From oh.531 at buckeyemail.osu.edu Fri Apr 5 09:57:36 2024
From: oh.531 at buckeyemail.osu.edu (Oh, Byung-Doh)
Date: Fri, 5 Apr 2024 13:57:36 +0000
Subject: [CaCL] 4/11: Birth of a Transformer: A Memory Viewpoint
In-Reply-To:
References:
Message-ID:

I forgot to mention that we'll be meeting in Oxley 102 on this day due to the planned construction in our usual meeting room.

=================
Byung-Doh Oh (he/him/his)
Ph.D. Candidate
Department of Linguistics
The Ohio State University
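As a concrete illustration of the setup described in the abstract above, here is a minimal sketch (not the authors' code; vocabulary size, number of trigger tokens, and sequence length are arbitrary choices) of data in which most transitions follow a shared global bigram distribution, while a few trigger tokens are followed by sequence-specific outputs that can only be predicted by looking back in the context:

import numpy as np

rng = np.random.default_rng(0)
VOCAB, TRIGGERS, SEQ_LEN = 64, 4, 256   # illustrative sizes, not the paper's

# Global bigram transition matrix, shared across all sequences.
global_bigram = rng.dirichlet(np.ones(VOCAB), size=VOCAB)

def sample_sequence():
    """One sequence: global bigrams everywhere, except that each trigger token
    is followed by a sequence-specific output token."""
    triggers = rng.choice(VOCAB, size=TRIGGERS, replace=False)
    outputs = rng.choice(VOCAB, size=TRIGGERS)          # fixed per sequence, not globally
    trigger_to_output = dict(zip(triggers.tolist(), outputs.tolist()))
    seq = [int(rng.integers(VOCAB))]
    for _ in range(SEQ_LEN - 1):
        prev = seq[-1]
        if prev in trigger_to_output:
            seq.append(trigger_to_output[prev])                          # in-context bigram
        else:
            seq.append(int(rng.choice(VOCAB, p=global_bigram[prev])))    # global bigram
    return seq, trigger_to_output

def induction_head_predict(seq, t):
    """Predict token t+1 by copying what followed the most recent earlier
    occurrence of seq[t] -- the lookup an induction head implements."""
    for s in range(t - 1, -1, -1):
        if seq[s] == seq[t]:
            return seq[s + 1]
    return None

seq, mapping = sample_sequence()
hits = [t for t in range(1, SEQ_LEN - 1) if seq[t] in mapping and seq[t] in seq[:t]]
correct = sum(induction_head_predict(seq, t) == seq[t + 1] for t in hits)
print(f"copy-from-context rule correct at {correct}/{len(hits)} repeated-trigger positions")

On data like this, the copy rule is exact at repeated trigger positions while the global bigram statistics carry no information about them, which is what separates the fast-learned global component from the slower-emerging induction head in the paper's analysis.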
From court.22 at buckeyemail.osu.edu Sun Apr 14 23:03:02 2024
From: court.22 at buckeyemail.osu.edu (Court, Sara)
Date: Mon, 15 Apr 2024 03:03:02 +0000
Subject: [CaCL] This Week: Dai et al. 2023
Message-ID:

Hi all,

This week, let's read Dai et al. (2023). The abstract and link are below.

Sara

Link: https://aclanthology.org/2023.findings-acl.247.pdf

Abstract

Large pretrained language models have shown surprising in-context learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without parameter updates. Despite this strong performance, the working mechanism of ICL still remains an open question. In this paper, we explain language models as meta-optimizers and understand in-context learning as implicit finetuning. Theoretically, we show that Transformer attention has a dual form of gradient descent. Building on this, we understand ICL as follows: GPT first produces meta-gradients according to the demonstration examples, and then these meta-gradients are applied to the original GPT to build an ICL model. We comprehensively compare the behaviors of in-context learning and explicit finetuning on real tasks to provide empirical evidence that supports our understanding. Experimental results show that in-context learning behaves similarly to explicit finetuning from multiple perspectives. Inspired by the dual form between Transformer attention and gradient descent, we design a momentum-based attention by analogy with gradient descent with momentum. Its improved performance over vanilla attention further supports our understanding from another perspective and, more importantly, shows the potential to utilize our understanding for future model design. The code is available at https://aka.ms/icl.
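The "dual form" mentioned in the abstract is easy to check numerically for unnormalized, softmax-free linear attention (the relaxation this kind of analysis works with): attention over the demonstration tokens is equivalent to applying a weight update built from outer products of their keys and values on top of the query-only output, the same outer-product form a gradient-descent step adds to a linear layer. The sketch below is an independent check with arbitrary dimensions, not the paper's released code at https://aka.ms/icl.

import numpy as np

rng = np.random.default_rng(0)
d, n_demo = 8, 5                               # arbitrary dimensions

W_V = rng.standard_normal((d, d))              # value projection
W_K = rng.standard_normal((d, d))              # key projection
X_demo = rng.standard_normal((d, n_demo))      # in-context demonstration tokens
x_query = rng.standard_normal((d, 1))          # current query token
q = rng.standard_normal((d, 1))                # attention query vector

# Linear (softmax-free) attention over [demonstrations, query token]:
X = np.concatenate([X_demo, x_query], axis=1)
attn_out = (W_V @ X) @ (W_K @ X).T @ q

# Dual form: the demonstrations contribute a weight update Delta_W (a sum of
# outer products, like a gradient-descent step on a linear layer) applied to q,
# on top of the "zero-shot" part that only sees the query token.
Delta_W = (W_V @ X_demo) @ (W_K @ X_demo).T
zero_shot_out = (W_V @ x_query) @ (W_K @ x_query).T @ q
dual_out = zero_shot_out + Delta_W @ q

print("max |attn_out - dual_out| =", np.max(np.abs(attn_out - dual_out)))  # equal up to floating-point error

Each demonstration token adds one outer-product term to Delta_W, which is the sense in which attention over the context can be read as applying meta-gradients to an otherwise zero-shot model; the softmax in real attention makes this an approximation rather than an exact identity.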