Chemical space is really big (2014) (chemistryworld.com)
183 points by optimalsolver on June 25, 2021 | 67 comments


It's important to note that although chemical space is quite large, most of this space is not easy to synthesize, nor is it chemically feasible, stable or desirable. Another interesting "small" subset of chemical space is ZINC [0], a database of about a billion commercially offered compounds, meaning that manufacturers at a minimum think they can easily make them (and in practice fulfilment is quite high when random compounds are ordered, e.g. 95% in this paper, where they ran molecular docking simulations on the entirety of this database to find new melatonin receptor modulators [1]). Concerning exploration of chemical space, one area that might be of interest here is the quite effective, smooth(ish) movement through structure-property space using VAEs. [2]

[0] https://zinc.docking.org/

[1] "Virtual discovery of melatonin receptor ligands to modulate circadian rhythms", https://www.nature.com/articles/s41586-020-2027-0.pdf

[2] "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules", https://arxiv.org/pdf/1610.02415.pdf
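As a feel for how the VAE approach in [2] gets used, here is a minimal sketch of latent-space interpolation. The encode/decode functions below are toy placeholders I made up so the script runs end to end; in the paper they would be a trained SMILES variational autoencoder, so the output here is purely illustrative.

    import numpy as np

    LATENT_DIM = 56  # roughly the latent size used in the paper

    def encode(smiles: str) -> np.ndarray:
        """Placeholder: derive a pseudo-latent vector from the SMILES string.
        A real pipeline would call the trained VAE encoder here."""
        seed = abs(hash(smiles)) % (2**32)
        return np.random.default_rng(seed).normal(size=LATENT_DIM)

    def decode(z: np.ndarray) -> str:
        """Placeholder: a trained decoder would emit a SMILES string here."""
        return f"<molecule decoded near |z| = {np.linalg.norm(z):.2f}>"

    # Walk a straight line in latent space between two known molecules and
    # decode along the way -- the "smooth(ish) movement" mentioned above.
    z_a = encode("CCO")                      # ethanol
    z_b = encode("CC(=O)Oc1ccccc1C(=O)O")    # aspirin
    for t in np.linspace(0.0, 1.0, 5):
        z = (1 - t) * z_a + t * z_b
        print(f"t={t:.2f}  ->  {decode(z)}")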


We've been working on these types of chemical search optimization problems across a variety of industries, and I'd like to echo this comment. Despite the fact that most of the space is unexplored, exploring it for its own sake is often unwise. The vast, vast majority of the time a naive or even statistically driven search will fail if the goal is to find something "new." The reality is that the path to a truly new, innovative chemical is hard to anticipate and even harder to optimize for, and the curse of dimensionality means our intuition about how hard that search really is tends to be hopelessly misguided.

If you're interested in related problems, my company, Uncountable, is looking for software engineers. https://www.uncountable.com/careers. We emphasize that the most important thing for organizations to do today is structure their data. It's the best chance to take specialized internal knowledge and put it to use to find new chemicals.


This reminds me of something I always wanted to ask. Are the majority of molecules in the body "named" or "purposeful" molecules, like haemoglobin, vitamins, water, lipids, DNA, etc., or is there a lot of random stuff, where some atoms are just arranged arbitrarily? Ignoring for a second the trivial fact that you can make really long polymers, have mutations in DNA and so on - I would count those in the first case. What is the ratio of "encyclopaedic" molecules (discovered or not) to "random stuff" (useless or not)?


Everything has a name, but it is generally a “systematic” name[1] rather than a one-off descriptive name. Even DNA is a systematic name: de-oxy-ribo-nucleic acid, a polymer whose monomers are nucleobases bound to a ribose sugar missing an oxygen at the 2-position carbon (plus a phosphate linker).

Biology uses an enormous space of small molecule structures (to say nothing of proteins, which have their own naming schemes) and few have names you might recognize generally, but all have useful systematic names that biologists and chemists can quickly parse.

As a twist, most systematic naming schemes don’t produce unique labels, so there are often multiple ways to say the same thing, and different discipline subcultures have different biases in this regard.
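To make that concrete, here is a small sketch (assuming the RDKit Python toolkit is installed): two different SMILES spellings of caffeine collapse to the same canonical SMILES and the same InChI, the kind of machine-parsable systematic label being described, even though neither looks anything like the trivial name "caffeine".

    from rdkit import Chem

    spellings = [
        "Cn1cnc2c1c(=O)n(C)c(=O)n2C",      # caffeine, one valid SMILES
        "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",    # the same molecule, spelled differently
    ]
    for smi in spellings:
        mol = Chem.MolFromSmiles(smi)
        print(Chem.MolToSmiles(mol))   # identical canonical SMILES for both
        print(Chem.MolToInchi(mol))    # identical InChI for both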

Edit: re-reading OP, another interpretation is that they’re asking what percent of molecules in the body aren’t involved in biology. The answer to that is probably something that approximates 0%. At the end of the day, the combined interaction of all of this chemistry is what biology is, and everything is more or less everywhere. (…concentration is everything.)

[1] https://en.m.wikipedia.org/wiki/Systematic_name#In_chemistry


Not everything can have, or does have, a "name". First, there are kinds of molecules which we have not yet imagined, for which we do not yet have an IUPAC naming system. One could be devised, but that's an ongoing task.

Think of the endohedral fullerenes — metal atoms stuck inside of buckyball cages — those have really only been directly in chemists' sights for 30 or so years. [0]

Another pair of extrema are 2D materials and network solids, which are effectively massive molecules. Again, we don't have a great naming system for them, even if we could properly catalogue all of the bonds (and enclosed species). And more practically, one is probably better off with a set of atomic coordinates to describe them.

[0] https://en.wikipedia.org/wiki/Endohedral_fullerene


I mean, sure.


> what percent of molecules in the body aren’t involved in biology. The answer to that is probably something that approximates 0%.

I think the answer might be different (and more interesting) if constrained to human biology. What percentage of the variety of “stuff” in our bloodstream/tissues, that came in from our air/water/food and then maybe got metabolized a bit, is now of a form entirely inapplicable to anything going on — or that could even potentially go on — in a human body? How much is pure “waste to be excreted” from the human perspective, with neither any use in keeping it around nor any danger in doing so?

(I know that this is at least what allantoin is for the species that make it; but we’re not one of those.)

I presume a lot of toxins that get inactivated by the liver end up in such a form.


That depends on your definition of "no use in keeping it around": apparently it's fairly useful as a moisturizer and as a skin protectant. You're still correct, of course, in the sense that it's not metabolically useful, but it's kind of amusing to me that it has these other, cosmetic, effects that are in some sense useful.


One example, we're discovering thousands of new proteins in human cells because we collectively assumed short protein chains wouldn't fold into anything functional: https://blogs.sciencemag.org/pipeline/archives/2020/03/11/ne...

"Mitochondria is the powerhouse of the cell", right? So it's supposed to be in a cell, right? Well, mitochondria seem to just hang out in blood, running just fine. Nobody knows why. There is even evidence they move between cells: https://blogs.sciencemag.org/pipeline/archives/2020/02/03/fr...

The cell is packed full of different species of RNA. We keep discovering new types of RNA, performing various unknown signaling tasks: https://blogs.sciencemag.org/pipeline/archives/2019/10/03/en...

As always, we don't know what we don't know. At the high school level, the textbooks will present a complete, sensible model of cell biology, because a textbook that lists off a thousand molecules each labeled with "Not sure what this does" would not be a very emotionally satisfying textbook.

The more I learn about biology, the more scared I am of taking drugs. The wikipedia article on any given drug will tell you it binds to receptor so-and-so and produces effect A and side effect B etc etc. Well sure, it does that, but it's not like it's a guided missile. The molecule ghosts through the cell membrane and rattles through the metabolic machinery, bouncing off transcription proteins, sticking and unsticking, until a few million out of the untold sextillion you ingested manage to find the target and bond. (Temporarily! Most drugs must not bind too tightly, or else they disable that receptor, and you die)

I have mild seasonal allergies. When the complaints from my coworkers about my persistent cough get loud enough, I take a Loratadine pill from the bottle on my desk, and fifteen minutes later, they go away. How on Earth does it do that? The immune system is a vast, poorly understood, constantly introspective machine. After a billion years of cellular warfare it has defenses on defenses on defenses all made specifically to not be disabled. And yet Loratadine reaches through the ranks of spinning buzzsaws and turns it off like flicking a switch. How?

And, why? It shuts off just the seasonal allergies. Why do I have a specific cellular pathway that makes me cough occasionally when I inhale pollen? It's certainly not disabling the entire immune system, or I would be rapidly eaten from the inside out by my gut bacteria, or the thousands of fungal species coating my skin and mucus membranes. Just the pollen receptors. It boggles the mind.



I couldn't give you a number, but I'd think the ratio is quite high. There probably aren't many molecules that don't have either a chemical influence (i.e. some function in a pathway) or a physicochemical one.


Not saying you're wrong, but why do you think that? I'd have thought the same thing about DNA, but I keep being told most DNA is junk (although I wouldn't be surprised to find out later we just don't know what it's for).


https://www.discovermagazine.com/health/our-cells-are-filled...

It looks like even the portion of DNA that isn't "metadata" could be there so that mutations and damage are less likely to hit critical portions.


I think this because:

(a) Access for all but the smallest molecules to cells is tightly regulated by e.g. membrane transport proteins.

(b) Similar to (a), tissues where exchange with the outside world occurs, such as the intestine and lungs, are even more tightly regulated and heavily guarded by the immune system. Apart from small molecules with the right lipophilicity, some minerals, potentially small peptides, or entities with other mechanisms of entry like viruses, nothing gets in (excluding endocytosis by e.g. immune cells).

(c) Anything entering the circulation will be processed by the liver eventually, where all kinds of enzymes target a broad range of structural motifs to break down molecules into 'non-foreign' building blocks to be reused.

(d) I can't think of any molecules not belonging to a particular known class. There are water, inorganic ions, carbohydrates/sugars, peptides/proteins, lipids, RNA/DNA, and small molecules (e.g. intermediate products). All of these except a subset of small molecules and heavy metals can be either broken down into 'known' parts or disposed of (not completely though; over time waste builds up, which is likely part of why we age). Now if there were many inert small molecules, they would show up in all kinds of analytical tests. Inert or not, we can classify all of them chemically. There is a lot of stuff for which we don't know the exact function, of course, and as these systems are highly complex and dynamic, functionality can be broad and context dependent.

By the way, the term junk DNA has various meanings in different contexts. In the context of non-coding DNA, this "junk" plays a role in e.g. epigenetic regulation, as it influences physical accessibility for transcription. Also, DNA is relatively stable and unlikely to interfere with other cellular processes in the same way that random small molecules would.

I could definitely be wrong though. I'm almost done with my biomedical engineering master's, but over the years I turned to software and all the chem and bio knowledge is becoming rusty very quickly. It's also a field in which knowledge doesn't age well, as it has been growing quite fast.


I wonder how effective AI will be at enabling us to navigate chemical space for certain desired compounds. Knowing nothing about the problem, is it something akin to the protein folding challenge that AlphaFold[0] recently did well at?

Side note: I love Derek Lowe's writings. I don't know what it is, but every time I see a chemistry related link bubble up in HN, I have a gut feeling it was written by him. And I'm usually impressed. His Things I Won't Work With[1] series is amazingly well written.

[0] https://deepmind.com/blog/article/alphafold-a-solution-to-a-...

[1] https://blogs.sciencemag.org/pipeline/search/Things+I+Wont+W...


Derek is right about the vastness of chemical space, but I go back and forth on the claim (frequently made by those in drug discovery) that AI cannot possibly extrapolate to spaces of this size, for at least three reasons:

* Image space and text space are also vast, and yet we've had good success applying AI in these areas. I have yet to see a convincing argument that these spaces aren't equally large.

* It's a bit of a red herring: actual drug discovery programs are not exploring "all of chemical space". They're usually focused on "lead series" of much more constrained molecules.

* There are actually context-independent signals that can be used to generalize AI methods. The generalization is far from perfect, but it's not like every one of those 10^60 molecules is entirely different from every other molecule in the set. There are clusters and patterns and trends that can be exploited for gain -- this is what makes "medicinal chemistry" an academic field, and not merely an exercise in fortune-telling.

Personally, I think the bigger problem applying AI and ML to drug discovery is less the "vastness of chemical space" (a proposition that makes med-chemists feel secure about their jobs), and more that the datasets in drug discovery suck. There's tons of siloing of data, none of it is consistent, and you can't even depend that two assays for the same target, measured in the same lab, years apart, will yield consistent data. It's a total mess.


So text space is trivially vectorizable at the character level, and even difficult languages like Russian are chunk-vectorisable with some care. How do you encode the difference between haouamine A and atrop-haouamine A, while keeping the similarities, without resorting to empirical measurements and classification, which could yield reasonable vectors but will take 2-5 years of a highly trained grad student's labor to obtain and put into the training corpus?


> How do you encode the difference between haouamine A and atrop-haouamine A,

There are now lots of ways of encoding molecules. So many, in fact, that it's not really worth debating the merits of any particular method.

ECFP fingerprints shoved into a fully connected NN work surprisingly well for a large class of problems. Molecular graph convolutions (of which there are now many flavors) also work well. The field is to the point where people are doing ensembles of different encodings, and seeing what works for any particular problem.
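For readers who want to see what "ECFP fingerprints shoved into a fully connected NN" looks like mechanically, here is a rough sketch using RDKit Morgan fingerprints (the ECFP analogue) and scikit-learn. The SMILES strings and activity labels are made-up placeholders, not a real assay or any particular published model.

    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import AllChem
    from sklearn.neural_network import MLPClassifier

    def ecfp(smiles: str, radius: int = 2, n_bits: int = 2048) -> np.ndarray:
        """Morgan (ECFP-like) bit-vector fingerprint for one molecule."""
        mol = Chem.MolFromSmiles(smiles)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        return np.array(fp)

    # Toy (SMILES, active?) pairs standing in for real assay data.
    data = [
        ("CCO", 0),
        ("CC(=O)Oc1ccccc1C(=O)O", 1),
        ("c1ccccc1", 0),
        ("CC(C)Cc1ccc(cc1)C(C)C(=O)O", 1),
    ]
    X = np.array([ecfp(smi) for smi, _ in data])
    y = np.array([label for _, label in data])

    clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500).fit(X, y)
    print(clf.predict(X))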

> without resorting to empirical measurements and classification, which could yield reasonable vectors, but will take 2-5 years of a highly trained grad student's labor to obtain and put into the training corpus

Well, you're sort of touching on my last paragraph with this. The classifier, featurization, etc., usually matters less (a lot less?) than the quality of the assay data. So I agree in that respect.


> Molecular graph convolutions "also work well"

What is the metric for "working well" here?

My point is a graph convolution will have a hard time distinguishing the haouamine atropisomers, because they have the same graph, but very different activities.

This sort of weird shit is the norm. Like some molecule that isn't the actual active form but needs cyp450 to be activated. But it won't be if you add a methyl group that reduces its toxicity, because that methyl blocks cyp (not even your target). But only in pre-phase-I volunteers, who happen to be predominantly white male college students looking for beer money.

In the end your corpus is going to have to fall back on experiment, which is just so much slower in chemistry than the trivially parallel "scrape every Wikipedia article, twitter tweet, and reddit comment" that you have access to for text corpora; it's more like trying to use ML to decode Chinese if we only had 3000 Chinese documents to work with.


> What is the metric for "working well" here?

Model fit. Validation on a holdout set. All of the usual standards for ML performance.

> My point is a graph convolution will have a hard time distinguishing the haouamine atropisomers, because they have the same graph, but very different activities.

Isomers can be represented. The typical problem with isomers is that you don't have good data on them, because real-world laboratory solutions are rarely pure.

> Like some molecule that isn't the actual active form but needs cyp450 to be activated.

Drug metabolism is a common set of endpoints for ML models. Nothing you're describing is so mysterious or difficult that the problems "break" machine learning in this domain.

> In the end your corpus is going to have to fall back to experiment

All ML models in this space depend on experiment.


There are a ton of interesting graph neural network architectures that have been developed over the last 5-6 years, and they are getting pretty good at modeling the properties of molecules and crystals.

Edit: saw your other comment on GNNs. I don’t think it’s like a solved problem or anything, but I also don’t think it’s intractable. e.g. https://arxiv.org/abs/2012.00094 seems relevant
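For anyone wondering what a graph neural network actually consumes, here's a minimal sketch (assuming RDKit) that turns a molecule into a node-feature matrix plus an adjacency matrix. Published models use much richer atom and bond features; this just shows the shape of the input.

    import numpy as np
    from rdkit import Chem

    def mol_to_graph(smiles: str):
        """Return (node features, adjacency matrix) for one molecule."""
        mol = Chem.MolFromSmiles(smiles)
        n = mol.GetNumAtoms()
        # Node features: atomic number and degree (deliberately simplistic).
        x = np.array(
            [[a.GetAtomicNum(), a.GetDegree()] for a in mol.GetAtoms()],
            dtype=float,
        )
        adj = np.zeros((n, n))
        for b in mol.GetBonds():
            i, j = b.GetBeginAtomIdx(), b.GetEndAtomIdx()
            adj[i, j] = adj[j, i] = 1.0
        return x, adj

    x, adj = mol_to_graph("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, 13 heavy atoms
    print(x.shape, adj.shape)                        # (13, 2) (13, 13)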


I think that the biggest issue isn't the chemical space but the complexity of biological systems. It's hard to tell what the molecule will do. We just don't have a good enough simulator. AlphaFold is definitely a helpful step but more are needed in the same direction.

(In 2010 - 2012 I worked in a laboratory that did small compounds screening and I was building some tools to explore the chemical space)


That’s specifically about chemicals presumed for physiological use/purpose, though, no?

Let’s take biology out of the equation. Can we predict chemical structures will give rise to interesting physical properties when you have a large amount of the compound around?

For some examples, could we ever potentially have a formula or model that would allow us to predict:

• whether any other molecules will possess the “infectious amalgamation” effect that gallium has on other metals it touches?

• what chemicals would create a “non-stick” surface like Teflon does?

• what types of chemicals would be both strongly colored in the visible spectrum and also chemically stable, i.e. what chemicals would make especially good pigments?

• whether something will be a room-temperature superconductor?

• whether something would make a better non-reactive glassware than borosilicate glass?

• what types of chemicals will be especially good at being electrolytes?

And I’d especially like to know, whether there’s some property that unites all “chemicals with interesting physical properties” like this, i.e. whether it’d be possible to have a model that tells you that a chemical is likely to have unusual physical properties in some way, despite not being able to predict the specific effect it’ll have.


Can anyone give an ELI5 for what the limitation is in terms of processing these computationally? Is the challenge that it's difficult to model how a molecule will interact with another molecule, so you have to do it with atoms and test the interaction across every other molecule in the search space?

For context I got a B- in high school chemistry and haven't looked back.

*Edit: "do it with atoms" is confusing in this context. I mean do it in the real world outside of bits.


Molecules obey known laws of physics and can in principle be simulated exactly. That is not practical with present-day computers because it's quantum-mechanical and has an exponentially large state space. Heuristic and approximate methods are used to pare this down, sacrificing absolute truth, leading to results that are not very reliable. That is why experiments are still done in chemistry labs even though everything that happens in a chemistry lab has been "understood" since the 1930s.

Chemists focus in on the least simulatable problems because most interesting chemistry happens right on the border of not happening at all. Molecules that are very easy to calculate are ones that small energy errors don't matter for. That makes them either incredibly stable or incredibly unstable, but chemistry happens near the boundary.
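To put a rough number on that exponential state space: a brute-force (full configuration interaction) treatment enumerates Slater determinants, and even a small molecule in a modest basis is hopeless. The orbital and electron counts below are my own assumptions for benzene in a cc-pVDZ-sized basis, purely to show the scale.

    from math import comb

    # Benzene: 42 electrons (21 alpha + 21 beta), ~114 spatial orbitals in cc-pVDZ.
    M, n_alpha, n_beta = 114, 21, 21
    n_determinants = comb(M, n_alpha) * comb(M, n_beta)
    print(f"{float(n_determinants):.3e}")  # roughly 1e45 determinants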


I upvoted this because it's an interesting take, but it doesn't feel like it's true. Chemists pick new molecules to experiment with primarily because they're similar to old ones that solve a particular problem (cure a disease, colour an emulsion, explode very fast...), not because they tried computational simulation of those molecules and couldn't overcome precision/chaos issues.


If a molecule solves a particular problem, then it is neither so unstable that a poor simulation could predict its instability, nor so stable that a poor simulation could predict its stability. The molecules that are the easiest to predict computationally tend to be the least interesting. The energy difference between the bound and unbound states has to be small for it to be interesting which automatically puts it in the realm of difficult simulation. So no, chemists do not pick difficult molecules on purpose - they do so because difficulty and interestingness originate from the same place.


>...not because they tried computational simulation of those molecules and couldn't overcome precision/chaos issues.

Existing simulations are slow and inaccurate. Why waste your time generating computational results that you do not trust when you can just iterate on existing molecules and have them in hand, ready for testing?


There's an open-source project run by a team out of IBM that's trying to use RL to navigate the space more effectively, looking for anti-cancer molecular compounds:

https://github.com/PaccMann/paccmann_rl


"There are, I think, two reactions to this. One is despair, of course, which is always an option in research, but not a very useful one."

These are words to live by.


You have got to have a critical mass of projects to maximize the odds that despair never becomes complete.


I also want to add that the programming space, or software specification space, is mind-bogglingly big as well, if not bigger than the chemical space. People should appreciate how many possible variations of small details exist for implementing even a teeny, trivial feature, because that's just the way it is. Everything around us has a billion-gazillion parameter space, and all we're seeing is just a chance occurrence.


>Everything around us has a billion-gazillion parameter space, and all we're seeing is just a chance occurrence.

So true.

Definitely one of the differentiators you can focus on when adding a software feature, is to consider the vast number of possibilities to accomplish it successfully.

And with natural science like this it can be the vast number of possibilities that will never work, if there even exists one viable solution. Plus the biggest space will always be where there is no documentation since no experiments have yet been done.

So the most promising stuff is when you don't know if what you want to do will even be possible.

Therefore these type projects must always be started against this exact headwind.

Which is why so few of them even get started.

And when they do, they face the most oppressive noise-to-signal ratio and often need to do it for an ungodly long time before any real sign of success can even be seen on the horizon.

When you think about it software projects might often be difficult to finish, while science projects more often are difficult to start.


Many have commented here on the computational challenges of enumerating, storing, and searching very large virtual libraries. While molecules can be represented and stored as strings, that's an oversimplification of the problem (from a CS perspective). Scientists often want to search these large libraries in 2D and 3D, which requires computing and storing those coordinates. It can be cost prohibitive just to store the 3D coordinates for massive virtual libraries, even for large pharmaceutical companies. We've done this for 10^10 molecules [0]. If you are reading this and find this type of problem interesting, check out www.eyesopen.com/careers

[0] https://pubs.acs.org/doi/10.1021/acs.jcim.9b00779 "Virtual Screening in the Cloud: How Big Is Big Enough?"


Also, a molecule doesn't just have a single set of 3D coordinates. You have conformers, plus solvent effects that affect their conformation as well... Often in a biological system it can be a minor conformer that is the strongly binding one (the case of taxol). And if you want to compute physical properties such as NMR, VCD, or ECD, you need to optimize your molecules in a completely different way (quantum mechanical calculations of electron shells) than with the simple mechanical modelling approach used for docking... we are talking about a 1 s calculation per molecule vs. 24 h or more for a mid-sized molecule.
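A small sketch of that "one molecule, many conformers" point, assuming RDKit (aspirin is just an arbitrary example): embed a handful of conformers and look at the force-field energy spread.

    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.AddHs(Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
    conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=10, randomSeed=42)
    results = AllChem.MMFFOptimizeMoleculeConfs(mol)  # [(not_converged, energy), ...]
    for cid, (_, energy) in zip(conf_ids, results):
        print(f"conformer {cid}: MMFF energy {energy:.2f} kcal/mol")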


> A startling percentage of their compound backbones consisted of three- and four-membered rings concatenated into structures that no one can be sure are even stable, and are certainly beyond the current powers of organic synthesis. Adding those in would make the compound set many orders of magnitude larger, since only 0.005% of the original compound graphs were even considered as starting points. ...

Some may question what's so hard about a few billion entries.

The thing to keep in mind is that enumeration like GDB necessarily yields a combinatorial explosion of molecules. Every molecule that gets passed to the next step can generate many millions or more children. Isomeric forms make the problem even more challenging. It's worth noting the very limited repertoire of elements: "C, N, O, S, and halogens." This excludes, for example, phosphates, which are part of the DNA and RNA backbone, and phospholipids, although even oligonucleotides and many medium-sized lipids are too large to be part of GDB-17. Each atom added pushes the enumeration further away from feasibility.

Then there's the problem of figuring out if you've already enumerated a molecule. That problem has no known polynomial-time solution (graph isomorphism), although GDB's low node count mitigates the problem.
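In practice the dedup problem is usually handled by canonicalization rather than by pairwise isomorphism tests: compute a canonical SMILES (or InChI) for each enumerated structure and use it as a set key. A sketch, assuming RDKit:

    from rdkit import Chem

    candidates = ["OCC", "CCO", "C(C)O", "CCN"]   # the first three are all ethanol
    seen, unique = set(), []
    for smi in candidates:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue                              # skip unparsable structures
        key = Chem.MolToSmiles(mol)               # canonical form as dedup key
        if key not in seen:
            seen.add(key)
            unique.append(key)
    print(unique)   # ['CCO', 'CCN']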

There are ways to deal with the combinatorial explosion so that you can search a vast molecular space without actually enumerating all of the molecule products. But to compile statistics like those in the paper that Lowe was writing about, you need to enumerate to some degree.

https://pubs.acs.org/doi/10.1021/ci300415d


> Then there's the problem of figuring out if you've already enumerated a molecule. That problem has no known polynomial-time solution (graph isomorphism), although GDB's low node count mitigates the problem.

It's not just the low vertex count that makes this an easier problem than general GI. The state of the art GI algorithm for general graphs runs in quasipolynomial time 2^{O((log n)^c)}, with c=3 claimed. [0]

However, structural constraints implied by chemical plausibility should (probably[1]) usefully limit the class of graphs that need be considered to those of bounded treewidth. A polynomial time GI algorithm was recently published for graphs of bounded treewidth. [2]

[0]: https://en.wikipedia.org/wiki/Graph_isomorphism_problem#Stat...

[1]: If nothing else, maximum degree restrictions imposed by chemical valence should impose a useful upper bound on treewidth. Although I (somewhat surprisingly) can't find a paper offhand stating such an explicit bound for treewidth in terms of maximum degree, I did find one stating an asymptotic bound of (1 - C * exp(-4.06√d))|V| for sufficiently large max degree d, and some constant C, which lends some plausibility to this line of thought: https://journals.plos.org/plosone/article?id=10.1371/journal...

[2]: https://arxiv.org/abs/1803.06858
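For what it's worth, at molecular sizes an off-the-shelf matcher is already fast enough in practice; here's a toy check with networkx's VF2 matcher (worst-case exponential, but fine at tens of atoms), using element labels as node attributes. The quasipolynomial and treewidth results above are about worst-case guarantees for general graphs.

    import networkx as nx
    from networkx.algorithms.isomorphism import categorical_node_match

    def labeled_graph(atoms, bonds):
        """Build a graph with an 'el' (element) attribute on each node."""
        g = nx.Graph()
        g.add_nodes_from((i, {"el": el}) for i, el in atoms)
        g.add_edges_from(bonds)
        return g

    # Ethanol drawn twice with different node numbering (heavy atoms only).
    g1 = labeled_graph([(0, "C"), (1, "C"), (2, "O")], [(0, 1), (1, 2)])
    g2 = labeled_graph([(10, "O"), (11, "C"), (12, "C")], [(10, 11), (11, 12)])

    print(nx.is_isomorphic(g1, g2, node_match=categorical_node_match("el", None)))  # True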


If you'd like to check out what current chemical database files look like, "Enamine REAL" is a fairly widely-known one.[1] My understanding is that this file is a mix of their ACTUAL in-stock inventory, as well as the product of running a small number of high-reliability reactions on each compound in that inventory. So it serves as a "vendor catalog" file, where everything in here can be ordered from Enamine and synthesized+delivered to your door in a few weeks.

Another approach I've heard of for iterating through every molecule in a large region of chemical space is to START with a large molecule dataset, then for each molecule, predict the result of performing simple reactions on it. For each reaction product, do your full analysis, and only store the result if the analysis indicates it is noteworthy. This, in effect, lets you scan over a larger region of chemical space than you can fit in memory.

[1] https://enamine.net/compound-collections/real-compounds/real...
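Here's a minimal sketch of that second approach, assuming RDKit. The acid-to-methyl-ester transform and the molecular-weight cutoff are arbitrary stand-ins for whatever reaction set and "noteworthy" scoring a real pipeline would use; the point is just the stream-and-filter shape of the loop.

    from rdkit import Chem
    from rdkit.Chem import AllChem, Descriptors

    # Toy transform: carboxylic acid -> methyl ester.
    rxn = AllChem.ReactionFromSmarts("[C:1](=[O:2])[OH]>>[C:1](=[O:2])OC")

    seeds = ["CC(=O)O", "OC(=O)c1ccccc1"]          # acetic acid, benzoic acid
    kept = set()
    for smi in seeds:
        mol = Chem.MolFromSmiles(smi)
        for (product,) in rxn.RunReactants((mol,)):
            Chem.SanitizeMol(product)              # products come out unsanitized
            if Descriptors.MolWt(product) < 200:   # placeholder "noteworthy" filter
                kept.add(Chem.MolToSmiles(product))
    print(kept)   # {'COC(C)=O', 'COC(=O)c1ccccc1'}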


What makes 1 billion rows a large search space?

What makes 150 billion rows incomprehensibly large?

With a molecular weight of 500 we're talking something on the order of terabytes of data (for 1b molecules)?

It certainly sounds like a tractable amount of data.

What computer problems are stopping us from generating a compound, computationally testing it for stability, and adding it to the list and then searching?
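Back-of-envelope on the storage part of that question (the byte counts are my own assumptions, not numbers from the thread): a billion molecules as SMILES strings is tens of gigabytes, and even one 3D conformer each only takes it to a fraction of a terabyte. Storage really is tractable; as the replies point out, the cost is in evaluating each candidate.

    # Assumed sizes: ~60 bytes per SMILES string; ~35 heavy atoms with
    # 3 float32 coordinates each for a single 3D conformer.
    n_molecules = 1_000_000_000
    smiles_bytes = 60
    coords_bytes = 35 * 3 * 4

    print(f"SMILES only   : {n_molecules * smiles_bytes / 1e12:.2f} TB")                   # ~0.06 TB
    print(f"+ 1 conformer : {n_molecules * (smiles_bytes + coords_bytes) / 1e12:.2f} TB")  # ~0.48 TB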


Even at 500Da or less, it's much, much larger than that.

You may be interested in the work of Jean-Louis Reymond's group, who have done more or less exactly what you suggest: https://gdb.unibe.ch/downloads/

GDB-11, with 11 heavy atoms, has 26.4M structures (110.9M stereoisomers -- molecules aren't 2D). Going up to 13 gives you 970M molecules. Going up to 17 (still mostly below 250Da) is 166,400M.

There's a lot of space up there below 500Da.


Also note that this only includes organic molecules. There are another ~85 naturally occurring elements in the periodic table that could be important, but they are much harder to synthesize with or compute.

Although including heavier elements can blow past 500 Da pretty easily.


The heavier elements start acting more like continuous systems and less like quanta legos, as you get more and more states per eV. Transition metals and lanthanides don't get their own combinatoric explosion until literal sticks are stuck on the smooth balls in coordination chemistry.


I have learned over time that state space size has nothing to do with problem difficulty. Sorting finds an answer in a space with n! possibilities in n log(n) time.

Chemistry is difficult not because of the large number of chemicals but because there hasn't been a lot of structure discovered in them to allow the sort of compounding subcase solving that makes searching a sorted list tractable. The structure that has been discovered can be found in chemistry textbooks and has names like "so-and-so's rule" which can be applied to boroalkanes with between 5 and 12 vertices excepting 6, unless the cage is charged in which case you should treat it like it has one fewer vertex, unless the charge is -2 and the original vertex count is between 8 and 13, in which case...

Those rules are much better than a table as measured by information compression, but you can't discover them unless you start with the table mostly filled out already.


Except, 1 billion isn't typically considered _big_ to even perform full scan searches.

It's only big if it's expensive to search all of them, and that problem is very easily parallelizable.

Which makes it a $$->time equation? If it costs 1s for 1 machine to check an item, you will need 32 machines to check a billion in a year, or 12000 to search in a day.

So, it's the is_match operator that's expensive, rather than data storage? Is the match check minutes long or needing human oversight?


Checking a molecule can take years of computer time on a supercomputer, depending on what you are checking for. There is no upper bound to the difficulty of checking because you can always ask for a more complex situation, and you are usually asking for those situations because the simple things are usually already discovered.


Chemical simulation is very hard (it involves solving multiple NP-complete problems).

But that is one of the expected applications of quantum computers (simulate quantum systems).


The number of possible chemical compounds (organic and inorganic) far exceeds the number of atoms in the observable universe. The 166 billion number is just a small selection of promising candidates from that space, not the entire set.


150 billion is the size of a pretty restricted subset. The estimate for general druglike compounds is 10^60


"I mean, you may think it's a long way down the road to the chemist's, but that's just peanuts to space.”


Like they say at the NASA museum, "There's space for everyone."


Alexander Shulgin explored potential psychedelic drugs and points out that thousands may still be out there, waiting to be discovered.

In one example he looked at nutmeg oil and found over 5000 compounds (!) which is a lot to explore.


What kind of chemicals are we currently searching for in the chemical space?


Cure for cancer? Better batteries. Materials for fusion reactors. Replacement for plastic. All sorts of things.


Catalysts!



A better LSD


Which was created by accident over 70 years ago.

Since then diverse enthusiasts have pursued focused progress in many directions and have come up with numerous compounds they find worthwhile that would not even have been possible back then.

When they can get it they still seem to prefer the same old stuff though.

And that's quite a long time and a lot of experimentation.


LSD wasn't exactly created "by accident". Hofmann already knew that molecules with a similar structure had psychoactive effects; he was trying to maximize the potency.


Was he the first one to take it, and did he ingest it by accident?

That might be what I am mis-remembering.


>When they can get it they still seem to prefer the same old stuff though.

Brand recognition.


If chemical space is so big, how does mass-spec accurately determine compounds?


The article is talking about the space of possible chemicals, which is vast. Mass spectrometry separates the ionized components of a sample by their mass-to-charge ratio; a biological sample can easily contain thousands of compounds. Having separated the components by mass, you can then do further analysis on them.


There's a couple of xkcd comics relevant to this. Anyway the space of possible compounds is mind-bogglingly huge and that's impressive. At the same time it's countable, and as countable things go, it's not even so big. The kind of hugeness that keeps me up at night is the "long line" or the phase space of the cosmic fluid.


Genuine question: what are the “long line” or the phase space of the cosmic fluid?

I found the Wikipedia page https://en.m.wikipedia.org/wiki/Long_line_(topology) but like a lot of such pages, they are opaque unless you are familiar with the topic. Consequently, I have no idea if this is the long line you are referring to.

Oh, and which xkcd comics?


Good questions. You got the right long line. The normal number line, the "real line", can be thought of as taking a half-open unit interval [0,1) and putting one at each integer to "fill in the gaps". This is already very long. If you instead put a copy of [0,1) at every countable ordinal (an uncountable, well-ordered collection of positions), you'd get the long line.

Cosmic fluid is a model of the universe as a fluid, like an ideal gas or something. The phase space is the space of all the possible states it can get into (shape, flow, pressure, etc.). You may call this state space depending on how you think about spacetime. See e.g. https://en.m.wikipedia.org/wiki/Equation_of_state_(cosmology... .

xkcd: 435, 794, probably others


Anything new on this since 2014?


BTW nukes are femtotech.



