Mathematics,  Organic Chemistry

How to Count Isomers. Long Live Pólya.

Mathematics is rich in technique and arguments. In this great variety one of the most basic tools is counting. Strangely enough, it is one of the most difficult. – Herstein, Topics in Algebra

 

Orgo

Early in the first semester of every organic chemistry class (“orgo”), students are taught to enumerate the structural isomers of a given hydrocarbon. For example, butane, pentane, hexane, etc. It’s an often-infuriating exercise designed to drive home the concept of molecular shapes and bond counting. But this problem also contains a hidden math puzzle that tormented researchers for the better part of a century.

Before explaining, it helps to get a basic idea of the isomer exercise itself starting with the skeletal drawings shown. It’s a simple idea. Either end of any line segment represents a carbon atom. Every vertex (carbon) has four connections. What’s not shown is assumed to be hydrogen. That’s it. For more, see [IR]. 

The next part is a little more involved. The skeletal drawings below correspond to the chemical formula $C_6H_{14}$, which is a hydrocarbon in the alkane family (general formula $C_nH_{2n+2}$). Now the lines can be maneuvered in various ways. The challenge is to do so without altering the chemical formula or violating the valence of carbon. The chemistry details are best left to the various links in the reference section at the bottom ([KA], et al), but as shown, it’s mostly a visual puzzle. 

Figure 1a: Straight-chain (acyclic) isomers of hexane in skeletal form (hydrogens not shown). Notice how all retain their atomic scorecard of $C_{6}H_{14}$

Figure 1b: Isomers for n=3 through n=6

For orgo students, exercises like this tend to be a brute force endeavor and part of the challenge is knowing when to stop. Butane, for example, is fairly simple: there are only two solutions, as shown above. Yet that simplicity fades as the number of carbons increase. For example, looking at the answer for hexane, one may wonder whether a bond could be attached at the fourth spot too. This would then yield six isomers instead of five.

Figure 2: comparable isomers

But imagine both are spinning around in space. Could they be distinguished? If one flips around a bit, won’t it mirror the other? This is where orgo students can get in trouble: they fail to notice the symmetries.

Figure 3: symmetries on isomers

Symmetries are also the key to the underlying math conundrum: How does one enumerate the distinguishable elements in a set that has permutations? For the orgo problem, it’s how many distinct isomers are there for a given hydrocarbon? And how does one count these without overcounting?

These turned out to be difficult problems. It took almost seventy years from when they were first articulated to when they finally got resolved. And as it often happens with open questions in mathematics, the pursuit led to many innovations in the field with applications far beyond the problem itself.

 

History

[Counting] can sometimes be done by a brute force case-by-case exhaustion but such a routing is invariably dull and violates a mathematician’s sense of aesthetics. One prefers the light, deft, delicate touch to the hammer blow. – Herstein, Topics in Algebra

 

One of the early attempts at the isomer counting question was by Arthur Cayley in a series of papers that included “On the Mathematical Theory of Isomers,” [CI] and “On the Analytic Forms Called Trees” [AC]. It was in this latter paper that he introduced the notion of “trees,” i.e., graphs without cycles. He aligned these trees with molecular shapes and began exploring the isomer counting question. A few major outcomes. First, he conceded that there was no closed-form formula for the alkane counting problem. But he did derive an iterative formula for the number of “topologically different” planted trees. This means trees that are distinguishable (like the hexane question above) but with an identifiable root node (unlike the isomers above). His formula is shown below: the value of $T_i$ corresponds to the number of trees of size $i$.

$ 1 + \sum_{i=1}^n T_ix^i = \prod_{i=1}^n(1 – x^i)^{-T_i} $

It may not be obvious how to use this. First, start with known values (based on eyeballing trees of that size). That is, let $T_1 = T_2 = 1$ and then compute an expression for the next term, $T_3$. Set any higher terms to zero. That new term is found by aligning the coefficients of like powers after expanding. This is repeated one term at a time.

The algebra quickly becomes gruesome though. In fact, Cayley only got to $n=11$ before making numerical errors, but the formulation still proved useful to future researchers. Cayley later extended this work by producing other iterative formulas to compute tree counts with the restriction of valence added [AC3]. In doing so, he computed the first several values of the alkane (or “paraffin,” as it was known then) isomer counts. Cayley’s efforts helped shape modern graph theory. 

Cayley Excerpt

After a period of dormancy on this topic, Henry Henze, a chemist at University of Texas at Austin, published the 1931 paper with colleague C.M.J. Blair, “The Number of Isomeric Hydrocarbons of the Methane Series” [HH]. In it, they solved the counting question using a heuristic recursion algorithm. Specifically, they looked at the related problem of alcohols (primary, secondary, tertiary) from which they inferred isomer counts for similar alkane components. This paper and its novel approach re-energized study of the problem. Recursion is now the basis for most (if not all) computer-based solutions. 

A few years later, the Hungarian mathematician George Pólya published “Combinatorial Enumeration of Groups, Graphs and Chemical Compounds” [GP]. This landmark paper expanded on these pioneering works by introducing an innovative mathematical framework (Pólya Enumerations) that proved useful to a wide variety of permutation-aware counting problems, not just isomers.

 

Pólya and Symmetry

Pólya begins his 1936 masterpiece with a related question. Given a regular polygon, how many ways can the vertices be colored with a fixed set of colors? He considers an octahedron with three color choices (three red, two blue, one white). The figure below shows the various ways a octahedron can be rotated and reflected. It does not show the $\frac{6!}{3!2!1!}=60$ possible colorings suggested by this question.

Figure 4: symmetries of an octahedron (source: https://symotter.org/)

For an easier start, consider the ways in which two colors can be applied to the corners of a square. There are $2^4 = 16$ ways to do this (i.e., one of two colors in four locations). Let $X$ denote the set of all colorings. What patterns emerge? There’s a big hint in the yellow boxes. Colorings within those yellow boxes vary only by orientation per rotations and reflections on that square. It’s not possible to reorient any coloring such that it appears in a different yellow set. In other words, these sets form a partition of the set $X$.

 Figure 5: Two color square permutations partitioned by orbits

 

Now there are eight rigid motions possible, three rotations, four reflections and the identity (no change). These motions form an algebraic group called the dihedral group of a square, denoted $D_4$. Together, the group $D_4$ acting on $X$ defines an equivalence relation whose equivalence classes are called orbits–these are just the yellow sets above. Formally, the orbit of an element $x \in X$ is the set of objects in $X$ that $x$ is sent to under the action of those rigid motions. In this case, the elements in an orbit are indistinguishable since they can be transformed into one another through the rotations and reflections acting on them–much like the rotated isomers mentioned above. The number of orbits is then the number of distinguishable objects (with respect to these motions) in the set. Counting orbits is therefore the key to solving the isomer problem.

Figure 6: the dihedral group of a square

 

Permutations like those in $D_4$ can be expressed in a short-hand called cycle notation where the numbers refer to the labelled vertices, $\{1, 2, 3, 4\}$. In this form $D_4$ is the following set:

$\displaylines{D_4 = \{(1)(2)(3)(4), (1 2 3 4), (1 3)(2 4), (1 4 3 2), \\
(2)(4)(1 3), (1)(3)(2 4), (1 2)(3 4), (1 4)(2 3) \}}$

Above, $(1 3)(2)(4)$ indicates that element $1$ goes to $3$, $3$ goes to $2$, while $2$ and $4$ are unchanged.

This cycle notation can be further abbreviated by combining each cycle into a single term (a monomial) with a subscript for the order of the cycle and a superscript for the number of times it occurs. For example, $(2)(4)(1 3)$ becomes $f_1^2f_2$, meaning one two element cycle composed with two one element cycles.

$\displaylines{(1)(2)(3)(4) = f_1^4 \\
(1 2 3 4) = f_4 \\
(1 3)(2 4) = f_2^2 \\
(1 4 3 2) = f_4 \\
(2)(4)(1 3) =f_1^2f_2 \\
(1)(3)(2 4) =f_1^2f_2 \\
(1 2)(3 4) = f_2^2 \\
(1 4)(2 3) = f_2^2}$

Collectively these monomials can be grouped by like terms. This produces the following polynomial expression for $D_4$:

$ f_1^4 + 2f_1^2f_2 + 3f_2^2+ 2f_4 $

These can then be grouped by like terms and normalized by the size or order of the dihedral group, which is just $2n$ or $8$. This is called a cycle index of the permutation.

$ Z_{D_4}(f_1, f_2, f_3, f_4) = \frac{1}{8}(f_1^4 + 2f_1^2f_2 + 3f_2^2+ 2f_4 )$

Or more generally,

$ Z_G(f_1, f_2, …, f_m) = \frac{1}{|G|} \sum z_k \prod_l f^{j_l}_l$

Pólya had the ingenious idea to replace these monomials with representations of the content being applied to each permutation. He called these figures. They are also referred to as both weights and coloring functions. This is done by replacing $f_i$ with a polynomial of the same degree where each term corresponds to a corresponding configuration. If one such configuration is $(c_1, c_2, …, c_m)$ then $f_i = c_1^i + c_2^i + … + c_m^i$.

For the above example with blue or red vertices, the coloring function is just $b^i + r^i$ (for blue and red). Substituting gives

$  \frac{1}{8}((b + r)^4  + 2(b + r)^2(b^2 + r^2)  +3(b^2 + r^2)^2 + 2(b^4 + r^4))$

which after regrouping leaves

$ b^4 + b^3 r + 2 b^2 r^2 + b r^3 + r^4$

This final form is called a generating function [GF]. The magic here is that the coefficients of each term reveal the unique counts of each pattern or configuration. These counts are the orders of the corresponding orbits. These configurations counts are often called the pattern inventory.

For the above coloring problem, this generating function indicates that there is only one way to paint the vertices all red or all blue (the coefficients on $b^4$ and $r^4$), two ways to paint two red and two blue ($2b^2r^2$), one way to paint three blue and one red ($b^3r$), and one way to paint one blue and three red ($b r^3$).

The sum of the coefficients is six, as six of the sixteen total colorings are distinguishable, as expected (see figure 5). This latter step computes the same orbit counts as the function in Burnside’s Lemma but Pólya’s is arguably easier to construct and more intuitive too.

A formal statement of his enumeration theorem, its proof and more examples can be found in [JJ] and [KT]. Some of these examples are quite fun, like “how many three note chords can be formed from a twelve note musical scale?”

 

Recursion and Generating Functions

The generating function method described in the previous section works well for fixed sized permutations, but more is needed to produce a general expression of size $n$. Recall the brute force method of removing elements one by one and reattaching what remains onto a shorter chain (see Figure 1). This is an example of a recursive process. Mathematically, this is a function with one or more initial conditions that generates a new value from its previous (or next) terms. A famous example is $a_n = a_{n-1} + a_{n-2}$ with $a_0 = a_1 = 1$, which produces the Fibonacci series, $1, 1, 2, 3, 5, 8, 13, 21, …$.

The trick for the isomer problem will be to impose recursion on the previously described cycle indexes. As a setup, suppose that a recursive relationship exists between generating functions for successive values of $n$, say

$ A_n(x) = 1 + \frac{x}{6}(A_{n-1}(x)^3 + 3A_{n-1}(x)A_{n-1}(x^2) + 2A_{n-1}(x^3)) $

As a generating function, $A_n(x)$ may also be written as a polynomial per the previous discussion of cycle indices. 

$ A_n(x) = \sum_0^{n}{R_k x^k}$

To solve this recursion, apply the same principal described for solving Cayley’s equation. Start with a known value of $R_k$ and apply those values to the first terms of the equation, leaving a placeholder for the next value. Additionally, set all higher coefficients to zero. For example if $ R_1 = 1$ then

$A_2(x) = 1 + R_2x $

Next, substitute this expression into the recursive formula and expand terms.

$ A_2(x) = 1 + \frac{x}{6}(A_1(x)^3 + 3A_1(x)A_1(x^2) + 2A_1(x^3)) $

$ A_2(x) = 1 + \frac{x}{6}(1 + 3 + 2) = 1 + x $

It follows that $R_2$, the coefficient on $x$, is also 1. Continuing…

$ R_1 = 1, R_2 = 1 $ 

and set both into the first 3 terms of the equation, leaving a spot for the next value:

$A_3(x) = 1 + x +  R_3x^2 $

Now substitute this expression into the recursive formula and expand terms.

$ A_3(x) = 1 + \frac{x}{6}(A_2(x)^3 + 3A_2(x)A_2(x^2) + 2A_2(x^3)) $

$= 1 + \frac{x}{6}((1 + x)^3 + 3(1 + x)(1 + x^2) + 2(1 + x^3))  $

$ A_3(x)  = 1 + \frac{x}{6}(6 + 6x + 6x^2 + 6x^3) $ 

$= 1 + x + x^2 + x^3 + x^4$

Once the coefficients are aligned, the next value can be inferred, which is $R_3 = 1$. As for the other higher terms, they need to be discarded since it was assumed the higher order terms were all zero. Rinse, lather, repeat.

The algebra will become unbearable after a few iterations, but thankfully there are tools like Mathematica and MatLab that can churn on symbolic equations with ease. However, there are even better tactics for solving these types of equations, as will be discussed in the computation section below.

Details on these sorts of problems can be found at [GF]. There’s also an entire online textbook on this subject [HW].

 

Constructing an Isomer Solution

The technique for constructing a cycle index for an arbitrary alkane chain begins by examining alkyl groups, $C_nH_{2n-1}$, which are just carbon chain fragments with a free site available for bonding. 

Figure 7: Alkyl molecule

Each of these attachment points ($R_1$, etc.) is itself an alkyl group with at most $n-1$ carbon atoms. This implies recursion. The trick is to use the generating function for the (n-1)th term, $A_{n-1}(x)$, as the coloring function for the nth term, $A_n(x)$.

This can be constructed as follows. To adjust for the carbon difference between $n-1$ and $n$, multiply the cycle index by $x$, e.g., $xA_{n-1}(x)$. This shifts the terms by one ($x$ becomes $x^2$, and so on) thereby converting the previous count of $n-1$ to $n$. Because of this shift, the equation includes a replacement term for the $1$ that disappeared in the multiplication by $x$. This represents a lone hydrogen atom. Finally, the three locations suggest the $S_3$ symmetry. The resulting cycle index is

$ A_n(x) = 1 + xZ_{S_3}(A_{n-1}(x)) $

$ A_n(x) = 1 + \frac{x}{6}(A_{n-1}(x)^3 + 3A_{n-1}(x)A_{n-1}(x^2) + 2A_{n-1}(x^3)) $

The end result (by either iterative calculation as shown in the previous section or by software-based computation) is the following generating function whose coefficients again are just the counts for the respective alkyl molecules. This means that $x$ is the figure for an individual alkyl molecule. 

$ A(x) = 1 + x + x^2 + 2x^3 + 4x^4 +  8x^5 + 17x^6  + … $

For a full explanation of the above, see the video [PP]

There are several ways to attack the corresponding alkane isomer question, but any solution will again be recursive. The idea is to analyze the cases by which an alkane could be formed and then sum the components into a single cycle index. One early method focused on splitting the alkane into centered and bi-centered trees (later revisited in [RS]). Another analyzed the graph using the respective orbits of vertices and edges (details in [SR]).

As a rough sketch of that latter approach consider the two base cases shown below. The first labels a single carbon. The other labels an arbitrary bond between two carbons. The idea is to describe each case, then consider their respective orbits in a larger molecule.

Figure 8: Alkane base cases

Case 1 suggests an $S_4$ symmetry for the four elements that surround the labelled carbon atom (vertex). This gives the following cycle index, which as it was with the alkyl examples, requires multiplying the inner cycle index by $x$ in order to shift the exponents to an $n+1$ state.

$ P(x) =  xZ_{S_4}(A(x)) $

$ = \frac{x}{24}(A(x)^4 + 6A^2(x)A(x^2) + 3A^2(x)A(x^3) + 6A(x^4))  $

Case 2 splits the molecule at a labelled bond to form two alkyl groups. From the bond’s perspective though, there are two permutations, or an $S_2$ symmetry. But they also must both be alkyl radicals and not hydrogens. This means the “+1” term in $A(x)$ must be removed. There’s no need for an outer $x$ term since this is strictly a bond connection. The corresponding cycle index is

$ Q(x) =  Z_{S_2}(A(x) – \,1) $

$ = \frac{1}{2}[(A(x) – 1)^2 + A(x^2) – \,1]$

The rest is best explained in the Sandia paper [SR] or in the same video from above. It uses a theorem from graph theory that relates the size of the orbits for the labelled vertex and bond classes. This requires adding a factor for the alternating group $A_{n/2}$ to represent the even $n$ valued carbon chains and then summing to produce this final expression

$N(x) = P(x) \,- \,Q(x) \,+ \,\sum x^n A_{n/2}(x)$

$ = P(x) \,- \,Q(x) \,+ \,A(x^2) \,- \,1$

$ = \frac{x}{24}(A(x)^4 + 6A^2(x)A(x^2) + 3A^2(x)A(x^3) + 6A(x^4)) – \frac{1}{2}[(A(x) – 1)^2 + A(x^2) + \,1]$

The corresponding generating function is shown below with coefficients corresponding to the isomer counts for each value of $n$. 

$ N(x) = 1 + x + x^2 + x^3 + 2x^4 + 3x^5 + 5x^6 + 9x^7 + 18x^8 + 35x^9+ 75x^{10} + … $

 

A more modern variant of this was done in 2002 by a Cal Tech mathematician and his colleague at Bell Labs. It uses the original idea of Cayley and combines it cleverly with Polya’s enumerative technique. The end result is the same, but mathematically it’s arguably clearer. See [RS] for details.

 

Computations

There are numerous programs available online that churn out these isomer counts. Many are tree recursion algorithms–for example, this version written in Java (and a variant written in python). Both appear to be $O(n^3)$. But this version in python written by Mingda Zhang is using the cycle index method and it’s dead simple too–its performance is also $O(n^3)$.

A comment about implementing these symbolic expressions. The coefficients are of course stored in an array that grows as the program recurses. But dealing with composed terms like $A[x]*A[x^2]$ requires an understanding of how to compose (multiply) generating functions. A good reference is found starting on slide 24 of these lecture notes. Based on this rule, it follows that

 When $A(x) = \sum_{n=0}^{\infty}a_n x^n$ and $A(x^2) = \sum_{n=0}^{\infty}a_n x^2$.

 Then $C(x) = A(x)A(x^2)$ has coefficients of the form $c_n = \sum_{k=0}^{n}a_k a_{2*(n-k)}$

Note that the individual coefficients are finite sums. Thus computing these terms requires a loop per the above expression. This is exactly what Zhang’s python script is doing.

Lastly, this in-depth article (written in German, but the pdf can be uploaded to google translate) describes a recursive software implementation that’s also worth a look. The C++ code is available for download there too.

Note that a common theme in most of these programs is the careful handling of large integers (BigInteger in Java, CDN library in C++, etc.). Python handles this natively.

 

Asymptotics

Early studies of the problem were limited to small values of $n$ as anything larger proved too complex to compute by hand. This should give a hint to the final question: how fast does it grow?

Here’s a better hint. For $n=10$ (“decane”), there are 75 isomers. For $n=30$ (“triacontane”) there are 4,111,846,763. And at $n=100$ (“hectane”) it starts to approach the number atoms on our planet ~$10^{40}$. This of course is ridiculous: there can’t be more isomers than atoms but theoretically, this is how the counts grow. The exact number is shown below.

$5,921,072,038,125,809,849,884,993,369,103,538,010,139$ 

Pólya also answered the asymptotic question in his famed combinatorics paper. He did this by considering the radius of convergence of a related series and from that constructing upper bounds. It ends up being $O(q^n)$, where $q$ is inverse of the corresponding finite radius of convergence, $\rho$ and where $\frac{1}{e}>\frac{1}{\rho}$.

Specifically, he showed that the number of indistinguishable acyclic alkane isomers grows at $(\rho^n n^{5/2})^{-1}$. In his paper, he claimed a value of $0.35 < \rho < 0.36$. 

$ N(x) < (\rho^n n^{5/2})^{-1} \,\,, \rho \approx .35 $

Experiments in the graphs below show a good match of  $\rho = 0.35668$ up to $N=100$ at least. This was done using Google Sheets and includes error plots for the model fit.

Figure 9: straight-chain isomer counts modeled with $\rho = 0.35668$ up to $n=100$. Click to view full screen.

 

Closing Thoughts

To therefore answer the questions posed: a) yes it’s solvable, b) but there is no closed-form formula, and c) the values grow exponentially. All the not-so gory details are found in the surprisingly readable translation (from German) of Pólya’s original paper [GP] with supplemental essays by computational chemist, R. C. Read, another expert in the field [AB].

It should also be apparent that this technique can be applied to a wide range of molecules, not just alkanes. Similar cycle indices can be constructed for alcohols, phenols, alkenes, alkynes, etc. It has even been applied to stereoisomers. See the [SR] for a detailed dissection.

A final point. Chemists have pointed out that some of these isomers can not exist due to steric interference and other molecular constraints. So, while the math produces a well-argued numeric value, the physical reality is likely lower.

 

References

Detailed coverage of the math and chemistry in this post:

[SR] “Sandia Report: Enumerating Molecules,” Faulon, Visco, Roe, 2004 (excellent and comprehensive!).

[GP] Combinatorial Enumeration of Groups, Graphs and Chemical Compounds, Pólya, G., 1936. Translated to english with additional essays by Read, R. C., 1987. (textbook)

[AB] Chemical Applications of Graph Theory, Balaban. A. T., Read, R. C., and others, 1975 (textbook).

[HH] “The Number of Isomeric Hydrocarbons of the Methane Series,” Henze/Blair, 1931.

[CI] “On the Mathematical Theory of Isomers,” Cayley, A., 1874.

[AC] “On the Analytic Forms Called Trees,” Cayley, A., 1881.

[AC3] “On the analytical forms called Trees, with application to the theory of chemical combinations“, Cayley, A., 1896.

[KT] Explainer on Pólya’s Enumerations, Keller/Trotter (online book).

[RS] “On Cayley’s Enumeration of Alkanes (or 4-Valent Trees),” Rains/Sloane, 2002. 

[PP] Enumerating Chemical Compounds, Prochazka, P. YouTube.

[JJ] MIT Lecture on Enumerations, online pdf

[OS] Orbits and Stabilizers Primer, online pdf

[HW] GeneratingFunctionology, online textbook by H. Wilf.

[GF] Princeton CompSci Tutorial on Generating Functions

[LT] Libretexts Chemistry on Alkane Isomers

[KA] Khan Academy on Isomers

[IR] On Skeletal Drawings

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: