Introduction

The role of experiments is crucial to confirming hypotheses and making discoveries in chemical science. However, the procedures used may take a long time due to limitations of the method, costs of reagents/catalysts, difficulties in waste handling, operational delays, and a considerable amount and complexity of the analyzed data. Therefore, two strategies are primarily used to decrease time and human resources for performing experiments: automation of data acquisition (e.g., in automated chemical syntheses1,2,3, in mass spectrometry-based proteomics4,5 or high-throughput microscopy6,7) and automation of data interpretation (chemical space exploration8,9,10, NMR data11,12,13, and mass spectrometry—MS—data14,15,16,17,18).

However, one can think about a third feasible strategy, to use previous results (already existing data) for hypothesis testing, thus reducing the number of experiments. The fundamental limitations of the strategy include the possible lack of accessible scientific data and its management with FAIR19 (findable, accessible, interoperable, and reusable) principles. This disadvantage can be eliminated by maintaining common open databases of experimental data20,21 with detailed descriptions of experiments within the laboratory or by using web applications that enable remote collaborative research in a shared analysis environment22. Another important disadvantage is the lack of dedicated software for the implementation and deployment of chemically efficient algorithms to search/extract data.

In a typical organic synthesis workflow, chemists select particular experimental conditions for the optimization of a reaction to achieve the maximal outcome for the desired product (Figure S1, as an example). Next, the reaction and sample preparation are carried out, followed by detection and characterization of the chemical compositions of the studied system using an appropriate analytical system (Fig. 1a). High-resolution mass spectrometry (HRMS) is an excellent, very often used method to execute this strategy due to its high speed of analysis, sensitivity and ease of data accumulation23. HRMS is widely used in analytical chemistry24, organic25,26,27 and inorganic chemistry28, proteomics29, metabolomics30, petroleomics31, metal complex catalysis32,33,34,35,36, organocatalysis37, polymer science38, and material science39, among many other directions.

Fig. 1: Overview of the “experimentation in the past” concept.

a A commonly used standard experimental workflow for reaction optimization with limited manual human interpretation of results and incomplete reaction space investigation. As an example, we highlight the analysis of high-resolution mass spectrometry (HRMS) data, which contain a significant amount of information about the reaction mixture. b The “Experimentation in the past” concept is realized using a machine learning (ML)-powered search engine, which led to the discovery of new pathways in Mizoroki-Heck and hydrothiolation reactions in a large amount of already existing data. The workflow consists of hypothesis generation using various methods (manual, automated fragment-based, and enabled by large language models, LLM) and the search itself, which uses already existing HRMS data.

Within a routine research pipeline, HRMS-equipped laboratories produce mass spectral data every day. Within a relatively short period of time, data storage may contain tens of thousands of recorded files. Some spectra occupy several gigabytes (e.g., reaction monitoring spectra at high resolution), so that, overall, terabytes of recorded information accumulate on computer drives within a few years of experimental work. Currently, manual analysis connects experiments with MS data (Fig. 1a). This approach imposes serious limitations: because of human factors, interpretation covers only part of the analyzed data. Typically, only the desired product and a few known byproducts are examined, leaving most MS signals unattended.

Thus, many new chemical products have already been accessed, recorded, and stored with HRMS but remain undiscovered. Therefore, the development of methods that can screen terabyte-scale databases and collect molecular patterns opens the way for cost-efficient and environmentally friendly chemistry discoveries while operating on existing stored data with no new experiments needed.

In this study, we demonstrate that in the case of mass spectrometry, data analysis can be implemented as a search engine with automated ion detection algorithms (Fig. 1b). An ultimate digital tool for accelerated discovery would allow searching for automatically generated ion candidates in a vast array of existing complex mass spectra with high accuracy in a reasonable amount of time and hardware resources. The algorithm can not only search known/existing products but also comprehensively search for unknown products, transformation pathways, contaminants, etc. The proposed approach makes already existing data a perfect source for reaction discovery in advantageously Green and Sustainable ways (no chemicals are consumed; no waste). Importantly, even though we only reveal the presence of ions with specific molecular formulas, the user may supplement the study further by designing experiments to verify the structure manually using either orthogonal methods such as NMR (Nuclear Magnetic Resonance) spectroscopy, or by obtaining tandem mass spectrometry (MS/MS) data. In our examples, we show how this can be done.

A key requirement of this approach is a powerful algorithm for searching compounds in large-scale MS data. To date, search algorithms for complex mass spectrometry data (with more than one compound in the spectrum) have been actively used, mainly in metabolomics40,41,42 and proteomics43,44,45,46,47,48 studies. In proteomics, the search is primarily based on matching peaks in the experimental MS/MS spectrum with peaks in the theoretical spectrum obtained from the peptide sequence49. There are also examples of structural annotation and exploring genomics50 and metabolomics51,52,53,54,55 datasets with MS/MS data. FastEI56 software uses Word2vec to transform electron ionization spectra into embeddings for subsequent large-scale spectrum matching. However, typical workflows have limited applicability due to incomplete chemical space coverage, i.e., the narrow application scope of such engines. Moreover, although they are already implemented in some packages57,58, we would like to stress the importance of isotopic distribution59 patterns, the neglect of which leads to false detections (see SI Section S2 for the relationship between the isotopic distribution information and the false positive rate).

Furthermore, annotated training data inaccessibility in supervised machine learning (ML) for mass spectrometry continues to be a major bottleneck due to the lack of human resources, time for labeling data, and high dimensionality of mass spectra. Model learning requires up to several thousand labeled ions to achieve good performance. Synthetic data can be used to solve this problem (see SI Section S3 for details on simulated spectra use). Artificially generated spectra were previously used in ML model training and showed their applicability in MS tasks: atomic pattern recognition60, deisotoping15, and the “inverse problem” of molecular identification15. MS spectra augmentation techniques are also widely studied61,62.

Considering the disadvantages above, this work proposes an approach for searching in large stored arrays of mass spectral data with a focus on reaction discovery (Fig. 1b). The models were trained on synthetic data. The key contribution of the work is the development of a search engine, called MEDUSA Search, that finds ion isotopic distributions in a tera-scale database (in our case, more than 8 TB comprising 22 000 spectra) of multicomponent HRMS spectra with different resolutions in an acceptable time (see the “Reaction discovery approach” section for the dataset description). The engine is able to confirm basic hypotheses of the presence of ions of interest in a wide range of applications (i.e., it supports all possible ion formulas with different charges). As an illustrative example, we applied the developed algorithm to HRMS data accumulated by many research groups studying a large scope of diverse chemical transformations, including the well-known and industrially relevant Mizoroki–Heck reaction (see SI Section S1). The data were collected over a few years and then remained unused. We demonstrate that new transformations may be discovered upon automated search of archived data. Since this reaction is widely known and has been studied numerous times by various scientific groups, it is a demanding test case for demonstrating the ability of the developed computational approach to reveal “surprising” transformations that have been overlooked in manual analysis for years.

In this way, the concept of “experimentation in the past”, an approach to research when a researcher uses experimental data made earlier instead of conducting a new experiment, was achieved with the discovery of novel catalyst transformation pathways in cross-coupling and hydrogenation reactions. Importantly, data reuse and repurposing are already common in fields such as proteomics and metabolomics. However, research related to organic chemistry is quite limited21.

Results and discussion

Overview of the search engine

To proceed with the reaction discovery workflow, it is first necessary to develop a search engine, which underlies the proposed approach. The Machine-Learning-Powered search pipeline developed in MEDUSA Search consists of five overall steps, as illustrated in Fig. 2 and described in the text below. The multilevel architecture of the system is inspired by existing web search engines and is crucial to achieve satisfactory search speeds (see SI Section S4 for search speed tests).

Fig. 2: Description of the search engine pipeline.

First, the engine takes as input the molecular formulas and charges of the searched ions. They can be derived from the reaction system using a hypothesis generation method (a fragment-based or large language model, LLM, guided approach) or defined manually (A). Then, it searches for all spectra files that contain the peaks of the two most abundant isotopologues of each input ion (B). Each peak is represented by its mass-to-charge ratio (m/z). These spectra files are called the candidates. A cosine distance threshold is calculated for them (C1). Then, an algorithm that searches for the isotopic distribution of the input formula within a single spectrum is run on all candidate mass spectra (C2). Additional machine learning (ML) models attempt to decrease the number of false positive search answers (C3).

Importantly, all the ML models were trained without the use of a large number of annotated mass spectra. This was done by generating synthetic MS data: isotopic distribution patterns were constructed from molecular formulas and then augmented to simulate the measurement errors of the instrument (see SI Section S3 for details).
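The idea of synthetic data generation can be illustrated with a minimal sketch. It assumes singly or multiply charged positive ions, a small hand-written isotope table with rounded reference values, and a simple Gaussian perturbation of m/z and intensity; the function names `isotopic_pattern` and `augment` and all noise parameters are hypothetical and are not taken from MEDUSA Search (the actual procedure is described in SI Section S3).

```python
import random
from collections import defaultdict

# (nucleon number, isotope mass in Da, natural abundance); rounded reference values.
ISOTOPES = {
    "H": [(1, 1.007825, 0.999885), (2, 2.014102, 0.000115)],
    "C": [(12, 12.000000, 0.9893), (13, 13.003355, 0.0107)],
    "N": [(14, 14.003074, 0.99636), (15, 15.000109, 0.00364)],
    "Cl": [(35, 34.968853, 0.7576), (37, 36.965903, 0.2424)],
}
ELECTRON_MASS = 0.000549  # Da


def isotopic_pattern(formula, charge=1, min_rel_intensity=1e-4):
    """Return [(m/z, relative intensity)] for a positive ion.

    `formula` maps element symbols to atom counts. Isotopologues with the
    same nucleon number are merged into one peak (unit-resolution grouping).
    """
    dist = {0: (1.0, 0.0)}  # nucleon number -> (probability, mean mass)
    for element, count in formula.items():
        for _ in range(count):  # add one atom at a time
            acc = defaultdict(lambda: [0.0, 0.0])
            for nucleons, (prob, mass) in dist.items():
                for n_iso, m_iso, p_iso in ISOTOPES[element]:
                    slot = acc[nucleons + n_iso]
                    slot[0] += prob * p_iso
                    slot[1] += prob * p_iso * (mass + m_iso)
            dist = {n: (p, pm / p) for n, (p, pm) in acc.items() if p > 1e-15}
    top = max(p for p, _ in dist.values())
    peaks = [((mass - charge * ELECTRON_MASS) / charge, p / top)
             for _, (p, mass) in sorted(dist.items())]
    return [(mz, rel) for mz, rel in peaks if rel >= min_rel_intensity]


def augment(peaks, rng, ppm_sigma=1.0, intensity_sigma=0.05):
    """Perturb m/z (in ppm) and intensity to mimic measurement errors (assumed noise model)."""
    noisy = []
    for mz, rel in peaks:
        mz_shift = mz * rng.gauss(0.0, ppm_sigma) * 1e-6
        scale = max(0.0, 1.0 + rng.gauss(0.0, intensity_sigma))
        noisy.append((mz + mz_shift, rel * scale))
    return noisy


clean = isotopic_pattern({"C": 10, "H": 12, "N": 2})
synthetic = augment(clean, random.Random(0))
```

Each call to `augment` with a different random state yields a new labeled training example for the same formula, which is the reason synthetic generation removes the need for manual annotation.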

Before searching, we need to generate a list of hypothetical reaction pathways on the basis of our prior knowledge about the reaction system (Fig. 2, step A). Here, we design this system around breakable bonds and the recombination of the corresponding fragments. If a user understands which bonds may break and form, they may supply individual fragments that will be automatically combined to create a query ion. However, we also allow BRICS63 fragmentation or the use of multimodal LLMs to perform this fragmentation (see SI Section S5 for examples of generated hypotheses). The development of new hypothesis generation methods is an open problem, and any new work in the field can be easily integrated into this system.

Input information about the chemical formula and charge allows us to calculate the theoretical “isotopic pattern” of the ion. The two most abundant isotopologue peaks are searched in inverted indexes (see SI Section S6.1 for details) with an accuracy of 0.001 m/z (Fig. 2, step B). Mass spectra that contain these peaks are called candidates. The following isotopic distribution search will be performed on them.
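The coarse search can be pictured with a small sketch of an inverted index over binned m/z values. It assumes that spectra are available as lists of peak m/z values keyed by a spectrum identifier; the helper names and the in-memory dictionary are illustrative only, and the real index layout is the one described in SI Section S6.1.

```python
from collections import defaultdict

BINS_PER_MZ = 1000  # bin width of 0.001 m/z, the accuracy quoted in the text


def build_index(spectra):
    """Map each m/z bin to the set of spectrum ids that contain a peak in it."""
    index = defaultdict(set)
    for spectrum_id, peak_mzs in spectra.items():
        for mz in peak_mzs:
            index[round(mz * BINS_PER_MZ)].add(spectrum_id)
    return index


def spectra_with_peak(index, mz):
    """Look up the bin of `mz` and its two neighbours to tolerate bin-edge effects."""
    key = round(mz * BINS_PER_MZ)
    hits = set()
    for k in (key - 1, key, key + 1):
        hits |= index.get(k, set())
    return hits


def find_candidates(index, top_two_mz):
    """Candidates are spectra containing BOTH of the two most abundant isotopologue peaks."""
    first, second = top_two_mz
    return spectra_with_peak(index, first) & spectra_with_peak(index, second)


# Hypothetical toy archive: three spectra with a few peaks each.
archive = {
    "s1": [223.1230, 224.1264, 300.0000],
    "s2": [223.1231, 500.0000],
    "s3": [224.1264],
}
index = build_index(archive)
candidates = find_candidates(index, (223.1230, 224.1264))
```

Because the lookup touches only a few index entries per query peak, the expensive in-spectrum search is run on a small subset of the archive rather than on every file.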

After a coarse spectra search, an isotopic distribution search of the query ion is performed for each candidate spectrum. This step includes 1) initial ion presence threshold estimation; 2) in-spectrum isotopic distribution search; and 3) filtering false positive matches. Descriptions of each step are given below.

The in-spectrum isotopic distribution search algorithm returns the cosine distance as a metric of similarity between theoretical and matched isotopic distributions (see SI Section S6.2 for algorithm details). The automatic decision of whether there is an ion in the spectrum or not depends on the estimated maximum cosine distance (i.e., ion presence threshold), which depends on the formula of the query ion (see Figure S8d for the threshold/formula relationship). A machine learning (ML) regression model is implemented (Fig. 2, step C1) to determine the ion presence threshold with the input ion formula (see SI Sections S3, S6.3, and S6.4 for data generation, data encoding and performance evaluation, and hyperparameter tuning, respectively).

The in-spectrum isotopic distribution search algorithm (Fig. 2, step C2) matches peaks from the experimental candidate mass spectrum with peaks from the theoretical isotopic distribution; at each step, the cosine distance is calculated, which allows the selection of the most similar peaks. If no peak is found, it is replaced with a peak with an intensity equal to the median of the noise. If the final cosine distance is less than the ion presence threshold, estimated on Step C1, the ion is considered to be found (for more details, see SI Section S6.2).
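The decision rule of this step can be sketched as follows. This is a simplified illustration, not the algorithm of SI Section S6.2: it assumes a fixed m/z tolerance and simply takes the most intense experimental peak in each window, whereas the described algorithm selects peaks by evaluating the cosine distance at each step. The function names, the tolerance, and the toy numbers are assumptions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equally long intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def match_isotopic_distribution(theoretical, spectrum, noise_median, tol=0.001):
    """Match each theoretical peak to an experimental one and return the cosine distance.

    `theoretical` and `spectrum` are lists of (m/z, intensity). A theoretical
    peak with no experimental counterpart is replaced by the noise median.
    """
    matched = []
    for mz_theory, _ in theoretical:
        in_window = [i for mz, i in spectrum if abs(mz - mz_theory) <= tol]
        matched.append(max(in_window) if in_window else noise_median)
    return cosine_distance([i for _, i in theoretical], matched)


def ion_is_present(distance, threshold):
    """The ion is considered found if the distance is below the presence threshold."""
    return distance < threshold


theory = [(100.0000, 1.00), (101.0034, 0.11)]
full = [(100.0002, 2000.0), (101.0036, 220.0), (150.0000, 999.0)]
partial = [(100.0002, 2000.0)]  # second isotopologue peak is missing

d_full = match_isotopic_distribution(theory, full, noise_median=5.0)
d_partial = match_isotopic_distribution(theory, partial, noise_median=5.0)
```

In this toy case the complete pattern gives a distance close to zero, while the spectrum lacking the second peak gives a larger distance; whether the latter still counts as a hit depends on the formula-specific threshold estimated in step C1.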

An additional ML classifier (Fig. 2, step C3) detects false positive ion presence verification with information about neighboring peaks (see SI Section S3 for training data generation). This problem usually appears as selecting the searched distribution as a part of another distribution. One of the most prominent examples starts with M + 1, while M is also present (see SI Section S6.5 for performance and interpretability studies; Section S6.6 for hyperparameter tuning; Section S7 for false positive examples).
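The M + 1 failure mode can be made concrete with a rule-based stand-in. The sketch below is not the ML classifier used in step C3; it only illustrates the kind of neighboring-peak information such a classifier can exploit, under the assumption that a strong peak one 13C spacing below the query monoisotopic peak signals a shifted match. The function name, tolerance, and intensity ratio are hypothetical.

```python
C13_SPACING = 1.003355  # approximate 13C - 12C mass difference, Da


def looks_like_shifted_match(spectrum, mono_mz, charge=1, tol=0.002, ratio=1.0):
    """Flag a match whose 'monoisotopic' peak may be the M + 1 peak of another ion.

    `spectrum` is a list of (m/z, intensity). The match is flagged when a peak
    one isotopic spacing below `mono_mz` is at least `ratio` times as intense
    as the peak at `mono_mz`.
    """
    def intensity_at(target):
        return max((i for mz, i in spectrum if abs(mz - target) <= tol), default=0.0)

    mono_intensity = intensity_at(mono_mz)
    lower_intensity = intensity_at(mono_mz - C13_SPACING / charge)
    return mono_intensity > 0.0 and lower_intensity >= ratio * mono_intensity


# Toy spectrum: an ion at m/z 222.1196 whose M + 1 peak sits at m/z 223.1230.
shifted = [(222.1196, 1000.0), (223.1230, 160.0), (224.1263, 12.0)]
genuine = [(223.1230, 1000.0), (224.1264, 165.0)]

flag_shifted = looks_like_shifted_match(shifted, 223.1230)
flag_genuine = looks_like_shifted_match(genuine, 223.1230)
```

A learned classifier can weigh many such neighborhood features at once, which a single fixed rule like this one cannot.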

To facilitate the work with the search engine, the Command Line Interface (CLI) was developed using the Click Python package (see SI Section S8 for more information).

Reaction discovery approach

Having various hypotheses about the course of possible new reactions, it is necessary to cover as much chemical space as possible. In this work, combinatorial generation of molecular formulas of proposed products (i.e., rule-based generation of molecular formulas with unique structures but different substituents) was performed to connect reaction discovery with the automated mass spectral ion search in already existing data. The FAIR description data from previous experiments (Fig. 1b) are also essential for validating the search results in practice.

The search for novel reactions covered more than 20,000 mass spectra without any prior knowledge of their composition (Fig. 3b). The search procedure placed no restrictions on the filename, the name of the researcher who recorded the spectrum, or any other attribute that could narrow the search space. One commonly used method for visualizing complex data is the t-SNE dimensionality reduction technique64. To demonstrate the high diversity of the archived data set, two t-SNE plots were created. As shown in Fig. 3a, the compounds registered in the analyzed mass spectra cover the chemical space well. In Fig. 3b, each point represents a spectrum, and similar mass spectra are located close to each other on the plot (see SI Section S9 for t-SNE plot generation details and the enlarged version of the t-SNE map). It is evident that different workers record diverse spectra that differ from one another. Moreover, one can also see common projects, where multiple people record similar spectra. Instrument operator C has the most widespread distribution of mass spectra, which matches their primary role: recording data for the sample drop-off service of the entire institute.

Fig. 3: Existing data and the proposed novel product generation procedure.

a t-distributed stochastic neighbor embedding (t-SNE) map of chemical structures encoded with Morgan fingerprints. Molecules were collected via random sampling from the PubChem database and from compounds that were registered in the mass spectra used in the research. Source data are provided as a Source Data file; b t-SNE map of the archived MS data used in the research (see Figure S20 for the enlarged version). Each point represents a unique mass spectrum. Different colors indicate instrument operators (coded by letters) who recorded mass spectra. Operator C registers mass spectra for the entire institute. Source data are provided as a Source Data file; c Functional groups and ligands, which were used in the generation process; NHC—N-heterocyclic carbene, Ar — aryl group, Nu — nucleophile, EWG — electron-withdrawing group. d The generation of ion formulas involves a complete enumeration of all functional groups and ligands for each core; e Bar chart illustrating the number of detected ions, categorized by the type of transformation. Source data are provided as a Source Data file.

The discovery of intermediates in organic reactions is essential to understand the mechanism and propose new strategies for reaction design and optimization. Electrospray ionization mass spectrometry (ESI-MS) is widely used in these studies65,66,67,68. It is also used as a method to characterize synthesized products69,70. To demonstrate the applicability of the developed search engine, it was used to find new transformation pathways in Pd/NHC-catalyzed (NHC = N-heterocyclic carbene) reactions71 with the combinatorial generation of ion formulas (Fig. 3c). For each formula component (functional group or NHC ligand) contained in one of the 13 analyzed structural cores (Fig. 3d), the molecular formula was calculated. The total number of generated ion formulas was 520, of which 400 had unique masses. Importantly, HRMS without fragmentation techniques can provide information only about molecular formulas; thus, structural isomers cannot be distinguished.
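A minimal sketch of such combinatorial generation is given below. It assumes that every query ion is a singly charged cation obtained by adding one substituent fragment to one NHC core; the two cores and four substituents are illustrative choices with formulas derived here, not the 13 cores and the full fragment set of Fig. 3c,d, and the helper names are hypothetical.

```python
from collections import Counter
from itertools import product

MONOISOTOPIC = {"C": 12.000000, "H": 1.007825, "N": 14.003074, "O": 15.994915}
ELECTRON_MASS = 0.000549  # Da

# Illustrative fragments (neutral formulas): free carbene cores and substituents.
CORES = {
    "BIMe": Counter(C=9, H=10, N=2),  # 1,3-dimethylbenzimidazol-2-ylidene
    "IMe": Counter(C=5, H=8, N=2),    # 1,3-dimethylimidazol-2-ylidene
}
GROUPS = {
    "H": Counter(H=1),
    "Ph": Counter(C=6, H=5),
    "C2Ph": Counter(C=8, H=5),       # phenylethynyl
    "(CH2)2Ph": Counter(C=8, H=9),   # 2-phenylethyl
}


def hill_formula(formula):
    """Hill notation: C first, H second, remaining elements alphabetically."""
    order = [e for e in ("C", "H") if e in formula]
    order += sorted(e for e in formula if e not in ("C", "H"))
    return "".join(f"{e}{formula[e] if formula[e] > 1 else ''}" for e in order)


def cation_mz(formula, charge=1):
    """Monoisotopic m/z of a positive ion with the given elemental composition."""
    mass = sum(MONOISOTOPIC[e] * n for e, n in formula.items())
    return (mass - charge * ELECTRON_MASS) / charge


def generate_ions(cores, groups):
    """Enumerate every core/substituent pair as a [core-group]+ query ion."""
    ions = {}
    for (core, core_f), (group, group_f) in product(cores.items(), groups.items()):
        formula = core_f + group_f
        ions[f"[{core}-{group}]+"] = (hill_formula(formula), round(cation_mz(formula), 4))
    return ions


ions = generate_ions(CORES, GROUPS)
unique_masses = {mz for _, mz in ions.values()}
```

With these illustrative fragments, the BIMe-based ions fall at nominal m/z 147, 223, 247, and 251. Counting `unique_masses` mirrors the distinction between generated formulas and formulas with unique mass made in the text.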

Once the hypothesis set was generated, the method was applied to verify the hypotheses using previously collected data and retrieved laboratory notebooks. The search pipeline (Fig. 2) was run for each of the 520 generated ions through the entire tera-scale HRMS database (see SI Section S9 for data set information), with a total computational time of 3–4 days (8–11 minutes per ion). As a result, the engine detected many isotopic distribution patterns. However, most of the search engine answers could not be validated because of the lack of the FAIR description data needed to identify the initial composition of the reaction mixture. Nevertheless, some samples were checked via laboratory notebooks. The collected results (see SI Section S10 for MS spectra) included the following:

  1. The presence of the corresponding azolium salts (m/z 147) in all reactions associated with M/NHC catalysis72 (Fig. 4a);

  2. The presence of known [phenyl-NHC]+ ions73 (m/z 223) in cross-coupling reactions (Fig. 4a);

  3. The presence of a recently discovered [ethynyl-NHC]+ ion74 (m/z 247) in the Sonogashira reaction (Fig. 4a);

  4. The presence of an unknown [ethyl-NHC]+ ion (m/z 251) in the Sonogashira reaction (Fig. 4a);

  5. The presence of unknown [vinyl-NHC]+ (m/z 273) and [vinyl-phenyl-NHC]+ (m/z 591) ions75 in the Pd/NHC-catalyzed Mizoroki-Heck reaction in the spectra recorded by different researchers in different years (Fig. 4b);

  6. The presence of an unknown [vinyl-NHC]+ ion (m/z 325) in the Pd/NHC-catalyzed hydrogenation reaction (Fig. 4c).

Fig. 4: The search engine enables the discovery of novel products in old data (HRMS spectra of previously unknown reaction products are indicated by dotted purple arrows).

a MEDUSA Search has registered widely known H-NHC and Ph-NHC ions, as well as a newly discovered [NHC-ethynyl]+ ion in a Pd/NHC-catalyzed Sonogashira reaction mixture. The isotopic distribution-informed search process allows the detection of previously unknown ethyl-NHC products; b MEDUSA Search registered unknown vinyl-NHC fragments in the Mizoroki-Heck reaction (Dipp – 2,6-diisopropylphenyl, Py – pyridine); c MEDUSA Search registered phenyl-substituted vinyl-NHC ions in the Pd-PEPPSI (Pyridine-Enhanced Precatalyst Preparation Stabilization and Initiation) catalytic hydrogenation reaction.

Figure 3e presents statistics on the number of ions detected during the search procedure. All of these ions had unique masses. The most frequently detected type of transformation is phenyl–NHC coupling. Compared with other types of transformations, vinyl-NHC coupling is infrequent. The obtained results are consistent with the quantum chemical study of the transformation pathways (see SI Section S15 for full information about the quantum chemical study). Notably, the validation of the search results using laboratory notebooks was possible only for a limited number of mass spectra. For most ions, it is unclear in which reactions they were formed and whether they truly correspond to the assumed structural formula. Thus, further experimental validation is needed (Fig. 5).

Fig. 5: Experimental validation of the discovered reaction pathway.

a The formation of the [BIMe(CH)2COOBu]+ ion was proven with ESI-HRMS; b the formation of the [IPrCHC(Ph)COOBu]+ ion was proven with ESI-HRMS; c MS/MS spectrum of the [IPrCHC(Ph)COOBu]+ ion; d vinyl-NHC and ethyl-NHC reaction products (Dipp – 2,6-diisopropylphenyl, Mes – mesityl).

In addition to the main search procedure employed for the identification of previously unknown products in Pd/NHC-catalyzed reactions, an alternative example of search engine capabilities was pursued through the discovery of nickel-catalyzed hydrothiolation reaction side products76 (see Section S10.2 for details).

Experimental validation

The formation of the catalyst transformation products shown in Fig. 3d is strongly related to the corresponding reaction mechanism. Previously, we conducted several Mizoroki-Heck and cross-coupling reactions (e.g., Sonogashira, Suzuki, and Buchwald-Hartwig) catalyzed by Pd/NHC complexes with different NHC ligands and halogen substituents. During the investigation of the reaction mechanisms via ESI-MS spectra of the reaction mixtures, the coupling products [NHC-H]+, [NHC-Ph]+, [NHC-O]+, and [NHC-N]+ were found. On the basis of these observations, the key role of R-NHC coupling and M-NHC bond cleavage in the evolution of M/NHC complexes under catalytic reaction conditions was revealed73. The formation of catalytically active molecular M/NHC catalysts and “NHC-free” cocktail-type catalysts, including the formation of H-NHC salts77 and O-NHC coupling78, was first described for a number of C‒C coupling reactions.

In the Sonogashira reaction, the previously unknown ethynyl-NHC coupling product was isolated, and possible reaction pathways were described74. The ethynyl-NHC coupling product is very reactive and may undergo various transformations. Applying the described approach to hydrogenated derivatives of the product revealed the presence of the [NHC-(CH2)2-Ph]+ product in the ESI-MS spectra of the Sonogashira reaction mixtures (Fig. 4a). Presumably, the process occurs via a kind of transfer hydrogenation reaction.

Similar to the discovery of ethynyl-NHC and aryl-NHC coupling products, we envisioned the possibility of the formation of two different vinyl-NHC coupling products (Fig. 4b) before and after the insertion step in the Mizoroki–Heck reaction. Both products were observed in the experimental reaction mixtures. Here, we also aimed to perform experimental validation of the observed reaction. To do that, original lab notebooks were retrieved, and the corresponding experiments were found. Mass spectrometry analysis of the reaction mixtures of the Mizoroki–Heck reaction between p-methoxyiodobenzene and butyl acrylate, catalyzed by the Pd/NHC complex [BIMePh]+[BIMePdI3]-, revealed the formation of [BIMe(CH)2COOBu]+ (Fig. 5a). The molecular formula was confirmed with ultrahigh-resolution mass spectrometry. The experiment involving the formation of [IPrCHC(Ph)COOBu]+ was a mercury test for distinguishing between homogeneous and heterogeneous catalysis. We excluded mercury to avoid interference with reactive species79 and kept the other conditions as in the original experiment. The molecular formula was also confirmed with ultrahigh-resolution mass spectrometry (Fig. 5b); the chemical structure was verified with an MS/MS experiment (Fig. 5c).

We also conducted experiments using different NHC ligands (see SI Section S11 for experimental details). The possibility of vinyl-NHC coupling in the process of Pd/NHC transformation under Mizoroki–Heck reaction conditions was tested with five different NHC palladium complexes, as illustrated in Figure S35. We used Pd complexes with different co-ligands to prove the generality of vinyl-NHC coupling under catalytic conditions. The scope is summarized in Fig. 5d. The vinyl-NHC coupling products were registered, confirming the proposed reaction (see SI Section S12.1 “Ultrahigh-resolution mass spectra”, Figures S37–S45). The vinyl-NHC product was found in all studied cases, independent of the ligand in the complex, with a very small m/z determination error in each case. Along with vinyl-NHC, ethyl-NHC was also detected in all the investigated reaction mixtures: with m/z errors of less than 0.3 ppm for (BIMe)PdI2Py, (SIMes)PdCl(allyl), and (PIPr)PdCl(allyl), and of less than 1 ppm for the (IMes)PdCl(allyl) and (SIPr)PdCl(allyl) complexes. In all MS experiments, the instrument settings were chosen to prevent transformations during the recording of mass spectra (see SI Section S13 for more information). Pressure sample infusion ESI-MS reaction monitoring for the discussed vinyl–NHC coupling process was also performed to confirm that the ions can be observed across multiple modalities of reaction data collection (see SI Section S12.2).
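For reference, the m/z error in parts per million quoted above is the relative deviation of the measured value from the theoretical one, scaled by 10^6. The short sketch below shows the calculation; the function name and the numerical values are hypothetical and serve only as an illustration.

```python
def ppm_error(measured_mz, theoretical_mz):
    """Signed relative m/z error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


# Hypothetical example: a deviation of 0.0005 at m/z 500 corresponds to 1 ppm.
example_error = ppm_error(500.0005, 500.0000)
within_one_ppm = abs(ppm_error(273.1597, 273.1598)) < 1.0
```

At m/z values of a few hundred, an error below 1 ppm therefore corresponds to a deviation in the fourth decimal place of the measured m/z.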

Finally, in the transfer hydrogenation reaction (Fig. 4c), another type of ethynyl-NHC coupling could be observed. Indeed, the search revealed the formation of the corresponding product. The described transformation sheds light on the dynamic nature of catalytic systems and opens opportunities for the development of Pd-catalyzed imidazole ring functionalization reactions.

To gain insights into the mechanisms of the discovered transformations and additionally confirm their feasibility from a theoretical point of view, a DFT quantum chemical study was performed, which confirmed the reaction channel for the vinyl-NHC coupling and identified the possibility of this newly discovered reaction occurring (see SI Section S15 for the computational results).

In this work, a robust ML-based computational engine for reaction discovery was developed. The workflow starts with automated methods for compound hypothesis generation. The selected candidates are then passed to the search engine. The combination of an isotopic distribution-based algorithm with two additional machine learning models made it possible to reduce false positive ion detection, which was crucial to increase search performance in databases covering various objects of study. The steps of the search workflow were optimized and validated on both synthetic and experimental data. The interpretability of the models allowed us to understand how these models behave. The reduction in the ion search space achieved by accounting for isotopic distribution patterns demonstrated the advantage of the isotope-distribution-centric approach.

The ability of the engine to use a wide range of ions with different compositions showed the excellent applicability of the system. An ion search can be performed on all MS instruments with a resolution that allows the observation of the isotopic distribution. A combination of the developed system with other computational techniques (e.g., algorithms for the prediction of ion fragments by structural formula or peptide sequence, different adduct calculators) can become a powerful analytical tool for comprehensive screening, which is vital for accelerated discovery in various scientific fields.

Moreover, even though the presence of FAIR data description is a major requirement of our approach, users can perform multiple queries to reduce the false positive rate of the system. For example, searching not only for the expected product but also for the corresponding reagents will significantly narrow down the scope for experimentation. However, ultimately, we consider this work an important step in raising awareness of how critical proper data collection and description are.

As an example, the developed search engine allowed the discovery of previously unknown M/NHC-catalyzed reaction byproducts, saving the resources needed to confirm hypotheses with the concept of “Experimentation in the past”. In this approach, two degrees of novelty were achieved:

  1. Reaction pathway novelty: reactions that are unexpected for this particular process but known and reported for other catalytic processes. In this work, we showed the formation of H-NHC salts and ethynyl-NHC coupling products. For these processes, the findings of our computational approach can be validated by comparison with those of other methods, including NMR spectroscopy and single-crystal X-ray analysis. These findings are important to connect the studied reaction with other processes and enhance catalyst development principles with relationships documented for other processes.

  2. Totally new reactions/products (never reported before). Here, we demonstrated the possibility of a vinyl–NHC coupling process in the Mizoroki–Heck reaction. [BIMe(CH)2COOn-Bu]+[X]- and [IPrCHC(Ph)COOn-Bu]+[X]- are new compounds that have never been reported and are absent from the SciFinder and Reaxys databases. The discovered transformation appears probable on the basis of general chemistry knowledge. Moreover, we observed various hydrogenated products that raise questions about the mechanism of their formation.

Here, we demonstrated that analysis of unused (old or abandoned) data with the developed computational algorithm can reveal pathways and reactions that were not described previously. Both degrees of novelty were verified, and the feasibility of the computationally revealed reactions was confirmed.

The discovered reaction pathways and reactions/products were rigorously verified by independent replication of experiments with different ligands and ultrahigh resolution MS measurements with m/z errors less than 1 ppm. To ensure that not only the molecular but also the structural formulas are correct, MS/MS ultrahigh resolution mass spectrometry analysis was performed. Mechanistic considerations and DFT study have increased confidence in the discovered reactions even more (see SI Section S15 for more details).

We plan to continue to work on the problem of interpretation of mass spectra and hope that in the future, automated analysis of MS data will become a major source of discoveries in chemistry.

Methods

General considerations

All starting materials, catalyst precursors and solvents were purchased from commercial sources.

Mass spectra were measured using a Bruker maXis instrument equipped with an electrospray ionization (ESI) source and a time-of-flight (TOF) analyzer; spectra were recorded in the m/z 50–1500 range. For the positive ion mode, the capillary voltage was set to –4.5 kV and the spray shield offset to –0.5 kV. A low-concentration tuning mix solution by Agilent Technologies was used for calibration of the mass spectra. Nitrogen was applied as the nebulizer gas (0.4 bar) and dry gas (4.0 L × min−1, 250 °C). The Bruker Data Analysis 5.1 software package was used.

Ultrahigh-resolution mass spectra were recorded on a Bruker solariX XR (ICR mass analyzer, 15 T superconducting magnet) mass spectrometer equipped with an ESI source. The m/z scanning range was 100–1500. The number of scans was 256, with 8 M data points. External calibration of the mass scale was carried out using a sodium trifluoroacetate solution (0.1 mg/mL in a 1:1 acetonitrile/water mixture). The measurements were carried out in positive ion mode (grounded spray needle, 4500 V high-voltage capillary; HV end plate offset: –500 V). Nitrogen was used as the nebulizer gas (0.5 bar) and dry gas (4.0 L/min, 180 °C).

The chromatographic analysis was carried out on an Agilent 1200 chromatograph equipped with a ZORBAX SB-C18 analytical column (2.1 × 50 mm; stationary phase particle size 1.8 μm). The mobile phase was acetonitrile – 0.1% aqueous formic acid, 9:1, with isocratic elution at a flow rate of 0.25 mL min−1 and a temperature of 25 °C; the injected sample volume was 0.01 μl. The analyzed mixture was dissolved in acetonitrile (Merck, HPLC grade).

Experimental procedure for pressure sample infusion ESI-MS reaction monitoring (PSI-MS)

The Pd/NHC complexes (SIPr)PdCl(allyl) (0.015 mmol, 10 mg) and (PIPr)PdCl(allyl) (0.015 mmol, 10 mg) (see Figure S35 for structure details), n-butyl acrylate (0.042 mmol, 6 μl) and dimethylformamide (2 mL) were mixed in a Schlenk tube. Potassium tert-butoxide (0.06 mmol, 7 mg) was dissolved in isopropanol (600 μl), and the solution was added to the mixture. One side of the Schlenk tube with the reaction mixture was closed with a septum; the second side, equipped with a tap, was connected to a “double” balloon with argon. The ion source of the spectrometer was connected to the Schlenk tube by a red PEEK capillary through the septum. Reaction monitoring was carried out for 50 minutes at 140 °C. The spectra were acquired in positive ion mode, and the formation of the vinyl-NHC coupling products was observed 10 minutes after the start.

Computational details

DFT calculations were carried out in the Gaussian 16 (revision C.01) program80 via the PBE1PBE hybrid functional81. The 6-31G** (for H, C, N, O, and Br atoms)82 and Def2TZVP (for the Pd atom)83 basis sets were employed in the calculations. The empirical Grimme correction (GD3BJ)84 was used to take into account dispersion interactions. The influence of the solvent was taken into account via a polarizable continuum model (PCM)85. N,N-Dimethylformamide was used as the solvent. The geometry was optimized with subsequent calculation of vibrational frequencies and thermodynamic parameters for all the structures. All transition state structures had one imaginary vibrational frequency corresponding to the considered reaction path. The remaining structures had no imaginary vibrational frequencies and represented minima on the potential energy surface.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.