
Merrill 2021 – Formal Language Theory Meets Modern NLP

Notes for merrill21_formal_languag_theor_meets_moder_nlp

Really nice overview of formal languages and complexity

1 regular languages

  • built from union, concatenation, and Kleene star operations
  • Kleene's Theorem: regular languages are exactly the languages recognized by DFAs and NFAs (note that DFAs and NFAs recognize the same set of languages!) — see the sketch below
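
A minimal sketch of DFA recognition (illustrative, not from the paper): a DFA over \(\{a, b\}\) that accepts exactly the strings with an even number of a's, a regular language.

  # hypothetical example DFA: accepts strings over {a, b} with an even number of a's
  DFA = {
      "start": "even",
      "accept": {"even"},
      "delta": {
          ("even", "a"): "odd",   ("even", "b"): "even",
          ("odd", "a"): "even",   ("odd", "b"): "odd",
      },
  }

  def dfa_accepts(dfa, string):
      state = dfa["start"]
      for sym in string:
          state = dfa["delta"][(state, sym)]   # one table lookup per symbol
      return state in dfa["accept"]

  print(dfa_accepts(DFA, "abab"))  # True  (two a's)
  print(dfa_accepts(DFA, "ab"))    # False (one a)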

1.1 learning

  • learning the smallest possible DFA consistent with a given set of labeled strings is NP-complete
  • learning the smallest DFA is possible given a polynomial number of membership (acceptance) and equivalence queries

2 data structures

  • with an unbounded data structure (a stack or a counter), an automaton can recognize more complex languages
  • languages recognized by pushdown automata and counter automata include \(a^nb^n\). Note that this language can't be recognized by any DFA — see the sketch below
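
For concreteness, a sketch of a one-counter recognizer for \(a^nb^n\) (illustrative; the counter supplies the unbounded memory a DFA lacks):

  def accepts_anbn(string):
      counter = 0        # unbounded counter: number of unmatched a's
      seen_b = False     # finite-state part: have we entered the b-block?
      for sym in string:
          if sym == "a":
              if seen_b:
                  return False       # a's may not follow b's
              counter += 1
          elif sym == "b":
              seen_b = True
              counter -= 1
              if counter < 0:
                  return False       # more b's than a's so far
          else:
              return False
      return counter == 0

  print(accepts_anbn("aaabbb"))  # True
  print(accepts_anbn("aabbb"))   # False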

2.1 limitations of counter

  • the \(k\)-Dyck language is the set of all balanced bracket strings with \(k\) types of brackets
  • counter machines can't recognize the \(k\)-Dyck language (for \(k \geq 2\))
    • why can't you just keep a counter for each bracket type? Because it matters which bracket was opened earlier in the string, not just how many of each are open. For example, per-type counters won't be able to reject [(]) — see the sketch below
  • important: counters can't recognize context-free languages whose recognition requires this kind of hierarchical (compositional) matching
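
A small demonstration of that failure mode (a sketch under the obvious encoding, not from the paper): one counter per bracket type wrongly accepts [(]) because both counts balance, while a stack tracks the nesting order and rejects it.

  def counters_accept(s):
      round_open, square_open = 0, 0       # one counter per bracket type
      for ch in s:
          if ch == "(": round_open += 1
          if ch == ")": round_open -= 1
          if ch == "[": square_open += 1
          if ch == "]": square_open -= 1
          if round_open < 0 or square_open < 0:
              return False
      return round_open == 0 and square_open == 0

  def stack_accepts(s):
      match = {")": "(", "]": "["}
      stack = []
      for ch in s:
          if ch in "([":
              stack.append(ch)
          elif not stack or stack.pop() != match[ch]:
              return False
      return not stack

  print(counters_accept("[(])"))  # True  -- wrongly accepted
  print(stack_accepts("[(])"))    # False -- correctly rejected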

2.2 limitations of stack

  • a deterministic stack machine (deterministic pushdown automaton) can recognize any deterministic context-free language (DCFL), which is a strict subset of the CFLs
  • pushdown automata are not strictly more powerful than counter machines: with two counters you can recognize languages like \(a^nb^nc^n\) that are not even context-free — see the sketch below
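
A sketch of that point (a standard example, not necessarily the one from the talk): two counters plus finite control recognize \(a^nb^nc^n\), which is not context-free, so no single-stack PDA can recognize it.

  def accepts_anbncn(string):
      c1 = c2 = 0        # c1: a's minus b's, c2: b's minus c's
      phase = "a"        # finite control enforcing the order a* b* c*
      for sym in string:
          if sym == "a" and phase == "a":
              c1 += 1
          elif sym == "b" and phase in ("a", "b"):
              phase = "b"
              c1 -= 1
              c2 += 1
          elif sym == "c" and phase in ("b", "c"):
              phase = "c"
              c2 -= 1
          else:
              return False
          if c1 < 0 or c2 < 0:
              return False
      return c1 == 0 and c2 == 0

  print(accepts_anbncn("aabbcc"))  # True
  print(accepts_anbncn("aabbc"))   # False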

2.3 non-determinism

  • CFLs are a strict superset of DCFLs
  • the full class of CFLs is recognized by non-deterministic pushdown automata; DCFLs are the languages recognized by deterministic PDAs

2.4 learning

  • grammar induction problem: learn grammar rules from samples
  • one neural approach: frame the data structure (stack, counter) as a differentiable object
  • the model then learns how to use the data structure — see the sketch below
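
A minimal sketch of one way to make a stack differentiable (a simplified "superposition" stack; shapes and names are illustrative, not a particular paper's construction): apply push, pop, and no-op simultaneously and mix the results with soft action weights produced by the model.

  import numpy as np

  def soft_stack_update(stack, push_vec, a_push, a_pop, a_noop):
      """stack: (depth, d) array with row 0 as the top; action weights sum to 1."""
      zero_row = np.zeros((1, stack.shape[1]))
      pushed = np.vstack([push_vec[None, :], stack[:-1]])  # new element on top
      popped = np.vstack([stack[1:], zero_row])            # drop the top
      return a_push * pushed + a_pop * popped + a_noop * stack

  stack = np.zeros((4, 2))                                 # depth 4, element dim 2
  stack = soft_stack_update(stack, np.array([1.0, 0.0]), 0.9, 0.0, 0.1)
  print(stack[0])   # top is mostly the pushed vector: [0.9, 0.]

Because the update is a smooth function of the action weights, gradients flow through the stack and the controller can learn when to push and pop.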

3 aside: why is it called context free?

  • it's called context free because each production rule is applied without attending to context. That is, when I apply \(S \rightarrow a\), I don't need to know whether the \(S\) is being expanded as part of, say, \(aSa\) or \(aaSaa\)

4 where does language fall in chomsky hierarchy?

  • Turing machines, and Turing machines with linearly bounded tape (linear bounded automata, LBAs), sit at the top of the hierarchy
  • linguists believe natural language is mildly context-sensitive, which would require an LBA
  • what is the right formalism? Tree Adjoining Grammar or Combinatory Categorial Grammar?

5 hypercomputation

  • more powerful than TMs? Doesn't that violate the claim that TMs are a universal model of computation?
  • hypercomputation models are "infinite" machines, whereas TMs are finitely specified machines that can nonetheless recognize infinitely many strings
  • there are only countably many TMs, so only countably many languages are TM-recognizable; but there are uncountably many formal languages (the languages over a finite alphabet form the power set of \(\Sigma^*\)), so some languages are beyond any TM

6 weighted finite state automata

  • Goal: given a string, assign a number
  • Architecture: Like a DFA, but the transitions are weighted
  • How do you get the number? Sum the path scores over all paths the string induces in the WFSA — see the sketch below
  • How do you get the path score? Multiply the transition weights along the path
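
Equivalently, the whole computation can be written with an initial weight vector, a final weight vector, and one transition matrix per symbol; the matrix product sums over all paths automatically. A sketch with made-up weights:

  import numpy as np

  alpha = np.array([1.0, 0.0])             # weight of starting in each state
  beta  = np.array([0.0, 1.0])             # weight of accepting in each state
  A = {                                    # one transition matrix per symbol
      "a": np.array([[0.5, 0.5],
                     [0.0, 1.0]]),
      "b": np.array([[1.0, 0.0],
                     [0.2, 0.8]]),
  }

  def wfa_score(string):
      """Sum over all paths of the product of transition weights along the path."""
      v = alpha
      for sym in string:
          v = v @ A[sym]                   # advance every path by one symbol
      return float(v @ beta)

  print(wfa_score("ab"))   # 0.4: the only surviving path is 0 -> 1 -> 1, weight 0.5 * 0.8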

7 Hankel matrices

  • infinite matrix that can represent any formal language (rows indexed by prefixes, columns by suffixes)
  • connection to WFAs: For a function \(f: \Sigma^* \rightarrow \mathbb{Q}\), the Hankel matrix \(H_f\) has finite rank iff there exists a WFA computing \(f\). Furthermore, \(\text{rank}(H_f)\) is the number of states of the minimal such WFA.
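
A tiny concrete check of that connection (illustrative): take \(f(w) = \) number of a's in \(w\). A finite block of the Hankel matrix, with entries \(H_f[p, s] = f(ps)\), has rank 2, matching the 2-state WFA that counts a's.

  import numpy as np

  def f(w):
      return w.count("a")        # f(w) = number of a's in w

  strings = ["", "a", "b", "aa", "ab", "ba", "bb"]
  H = np.array([[f(p + s) for s in strings] for p in strings])
  print(np.linalg.matrix_rank(H))   # 2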

8 Neural networks

8.1 hypercomputation

Say that the neurons can be specified by real-valued weights of arbitrary precision. Then the network is capable of hypercomputation, since a single real weight can encode an infinite amount of information (for example, the answers to an undecidable problem).

8.2 restriction to rational number weights

If the weights are rational numbers of arbitrary precision, then the neural networks are Turing-complete. However, note that arbitrary precision is not attainable in any practical application, and it's unlikely that the right weights would be found by gradient descent.
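
To get a feel for why arbitrary precision buys so much, here is a minimal sketch (a simplified flavor of such constructions, not the exact one from the literature): a single arbitrary-precision rational value can encode an unbounded stack of bits, with push and pop as exact affine operations of the kind an RNN update can implement; two such stacks suffice to simulate a Turing machine.

  from fractions import Fraction

  # hypothetical encoding: a stack of bits is stored as a single rational
  # in [0, 1) using a base-4 / Cantor-style code, so the top bit is readable
  # by a threshold and push/pop are affine maps
  def push(c, bit):
      return c / 4 + Fraction(2 * bit + 1, 4)

  def top(c):
      return 1 if c >= Fraction(3, 4) else 0

  def pop(c):
      return 4 * c - (2 * top(c) + 1)

  c = Fraction(0)
  for b in [1, 0, 1]:
      c = push(c, b)
  print(top(c))   # 1  (the last bit pushed)
  c = pop(c)
  print(top(c))   # 0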

8.3 binarized neurons

  • weights can still be rational numbers, but activations are restricted to \(\{0,1\}\)
  • much weaker than a TM

9 saturated networks

  • previously, when we talked about NNs, we included the possibility of infinite precision weights
  • to relax this constraint, consider saturated networks, in which all the weights are multiplied by \(\rho\) where \(\rho\rightarrow\infty\).
  • then, e.g., a sigmoid (\(\sigma\)) activation becomes a step function — see the demo below
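
A quick numerical illustration (values here are arbitrary): scaling the weight by larger and larger \(\rho\) drives a sigmoid activation toward a hard threshold on the sign of the pre-activation.

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  x = np.array([-2.0, -0.1, 0.1, 2.0])
  w = 1.5                                   # an arbitrary (illustrative) weight
  for rho in [1, 10, 100]:
      print(rho, np.round(sigmoid(rho * w * x), 3))
  # by rho = 100 the outputs are essentially [0, 0, 1, 1]: a step function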

10 saturated networks as automata

  • empirical conjecture: the saturated network captures what is most stable about the original network

10.1 sRNNs

  • empirically equivalent to finite-state automata

10.2 sLSTMs

  • resemble counter automata

10.3 saturated stack RNNs

10.4 sCNNs

  • less expressive than FSA

10.5 sTransformers

  • limited form of counting

11 open questions

  • how to tell whether a model is learning deep formal knowledge or merely spurious surface patterns


Created: 2021-09-14 Tue 21:43