Biomedical knowledge graph embeddings for personalized medicine: Predicting disease-gene associations

Personalized medicine is a concept that has been the subject of increasing interest in medical research and practice in the last few years. However, significant challenges stand in the way of practical implementations, namely with regard to extracting clinically valuable insights from the vast amount of biomedical knowledge generated in the last few years. Here, we describe an approach that uses Knowledge Graph Embedding (KGE) methods on a biomedical Knowledge Graph (KG) as a path to reasoning over the wealth of information stored in publicly accessible databases. We built a Knowledge Graph using data from DisGeNET and GO, containing relationships between genes, diseases and other biological entities. The KG contains 93,657 nodes of 5 types and 1,705,585 relationships of 59 types. We applied KGE methods to this KG, obtaining excellent performance in predicting gene-disease associations (MR 0.13, MRR 0.96, HITS@1 0.93, HITS@3 0.99, and HITS@10 0.99). The optimal hyperparameter set was used to predict all possible novel gene-disease associations. An in-depth analysis of novel gene-disease predictions for disease terms related to Autism Spectrum Disorder (ASD) shows that this approach produces predictions consistent with known candidate genes and biological pathways and yields relevant insights into the biology of this paradigmatic complex disorder.

It is widely known that a person's genetic background contributes substantially to several diseases, and advances in DNA sequencing technologies are helping to clarify how genetic variability determines the occurrence of disease, promoting more accurate diagnosis and furthering the development of personalized medicine.
While advances in DNA sequencing methods are finally converging to a point where the integration of genomics into clinical practice is becoming a reality, the interpretation of sequencing results involves the identification of a relatively small number of disease-associated variants among the large number of common variants carried by an individual. This is often hampered by the lack of knowledge about the relationships between variants, genes and diseases, which precludes the identification of disease causing mutations, leading to low diagnostic yields.
The genetic architecture of complex diseases involves a large number of genes, and it is hypothesised that this effect is present even in seemingly simple diseases (Boyle et al., 2017), and that the same genes can play a central role in several diseases, sometimes apparently unrelated. The former implies that mutations in several different genes can all contribute to a disease, while the latter means that the same mutation in the same gene can lead to different diseases (Autism Spectrum Disorders Working Group of The Psychiatric Genomics Consortium, 2017).
Personalized medicine is an approach in which patients are stratified based on their clinical profile. Patient stratification can take into account disease subtype, prognosis or treatment response, using diagnostic tests to support medical decisions, including molecular and behavioural biomarkers (Fröhlich et al., 2018). Several molecular-level approaches have recently been developed to understand the genetic contribution to diseases. Complex diseases result from a combination of genetic and environmental factors. A large proportion of diseases follow this pattern, including congenital and adult-onset diseases and several developmental disorders. Examples of complex disorders include Autism Spectrum Disorder (ASD), Alzheimer's disease, multiple sclerosis, autoimmune diseases, and others (Hunter, 2005).
Several approaches are being applied to define sets of candidate genes that can be associated with human diseases. These range from manual curation efforts, including crowdsourcing approaches, such as the efforts of the communities contributing to PanelApp (Martin et al., 2019), ClinVar (Landrum et al., 2018) or ClinGen (Rehm et al., 2015), or more traditional curation approaches such as those from OMIM (Amberger et al., 2015, 2019), to hybrid or fully automated approaches often using data mining and machine learning to derive insights from large amounts of structured or unstructured data (e.g., Alshahrani & Hoehndorf, 2018; Himmelstein et al., 2017; Hu et al., 2021; Liang et al., 2019; Luo, Li, et al., 2019; Luo, Xiao, et al., 2019; Nunes et al., 2021; Smaili et al., 2019; Wang et al., 2019; Yu et al., 2021). The latter rely more on data obtained with text-mining methods, while the former can include a multitude of approaches using data from one or more of several publicly available biomedical, clinical or biological databases, often containing data obtained by text-mining the scientific literature. Here, we concern ourselves with approaches dealing with graph or network data, more specifically heterogeneous multi-graphs.
Increasing amounts of biological and biomedical knowledge are produced every day. Despite all the efforts to collect and organize this information, several challenges remain in integrating the information scattered throughout different databases and obtaining meaningful insights from this wealth of data. In this work, we explore the use of Knowledge Graph Embedding (KGE) methods (Wang et al., 2017) as a tool to model the relationships between biological entities such as genes and diseases, and gain valuable insights into their associations that can be of use in the area of personalized medicine.
For this purpose, we built a large-scale Knowledge Graph (KG) combining data from publicly accessible curated biological and biomedical databases and applied KGE methods as a means to extract novel information from this KG. KGE methods have seen increased use in several areas. Some of the reasons for the success of these methods lie in their broad applicability, scaling capabilities and good performance (Wang et al., 2017). In the past few years, KGs and KGE methods have seen broad application in various tasks in the biological and biomedical domains, such as drug repurposing, prediction of gene-disease associations and identification of drug side-effects (Himmelstein et al., 2017; Himmelstein & Baranzini, 2015; Liang et al., 2019; Mohamed et al., 2021; Nicholson & Greene, 2020; Nunes et al., 2021).
A KG can be described as a graph G = (V, E), in which each vertex (v ∈ V) constitutes an entity (with a given entity type) and each edge (e ∈ E) a relationship. Entities and relationships in a KG are organized in sets of triples (h, r, t), where h is the head entity, r is the relationship and t is the tail entity; h and t are vertices in the graph, while r is an edge connecting h and t. Each triple in the KG represents a fact, where the head entity (or subject) is related to the tail entity (or object) through the relationship.
Knowledge Graph Embedding (KGE) methods learn a representation of entities in ℝ^d, termed an embedding, such that the representation in the embedding space reflects their relationships with other entities in the KG. This is done by optimizing a score function f(h, r, t). Several methods have been proposed for this task, with different score functions, such as ComplEx (Trouillon et al., 2016), DistMult (Yang et al., 2015) and TransE (Bordes et al., 2013). The resulting embedding vectors can be used for downstream supervised or unsupervised machine learning tasks.
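As an illustration of what a score function f(h, r, t) looks like, the following minimal sketch writes out the commonly cited forms for the three methods named above. It is not the authors' implementation: the function names are hypothetical, the margin `gamma` in the TransE variant is an assumption borrowed from some implementations, and plain Python lists stand in for embedding vectors.

```python
import math

def transe_score(h, r, t, gamma=12.0, norm=2):
    """TransE-style score: gamma - ||h + r - t||_p (higher = more plausible).
    gamma is an assumed margin hyperparameter, not a value from the paper."""
    diffs = [abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t)]
    if norm == 1:
        dist = sum(diffs)
    else:
        dist = math.sqrt(sum(d * d for d in diffs))
    return gamma - dist

def distmult_score(h, r, t):
    """DistMult score: sum_i h_i * r_i * t_i (real-valued vectors)."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))

def complex_score(h, r, t):
    """ComplEx score: Re(sum_i h_i * r_i * conj(t_i)) (complex-valued vectors)."""
    return sum(hi * ri * ti.conjugate() for hi, ri, ti in zip(h, r, t)).real
```

In all three cases a higher score marks a more plausible triple, which is what allows candidate triples to be ranked.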
To show the feasibility of our proposed approach in clinical settings, we applied it to a highly complex and heterogeneous disorder: Autism Spectrum Disorder (ASD). ASD is a neurodevelopmental disorder defined by two main characteristics: communication deficits and repetitive behaviours (Diagnostic and Statistical Manual of Mental Disorders: DSM-5, 2013). ASD segregates in families, has a strong genetic component and is clinically heterogeneous, often co-occurring with other conditions (Lord et al., 2020). Early and efficient intervention for children with ASD is fundamental, but pharmacological therapies can only be used to treat some of the associated symptoms or comorbidities, and do not target core symptoms. ASD can vary highly in the clinical presentation and in the associated symptoms. The underlying genetic causes of ASD are unclear, except when co-occurring with genetic syndromes. Given the diversity of biological mechanisms that can be affected, the development of therapeutic approaches is challenging. In recent years, several groups, including ourselves, have developed integrative approaches based on machine learning methods to obtain insights into the genetic and phenotypic complexity of ASD beyond what can be obtained with conventional analysis methods (Asif et al., 2018; Duda et al., 2018; Krishnan et al., 2016; Martiniano et al., 2020).
Here, we report an application of KGE methods to a custom-built biological KG, relating entities such as genes, biological processes and diseases, and showcase its application in the area of personalized medicine, namely for the prediction of gene-disease associations. As a use case of the gene-disease association prediction algorithm developed, we identify and validate a set of novel genes associated with ASD.
This paper is structured as follows: first, we introduce the general area and the specific challenges we address; we then describe the methodology, including all data sources and software tools used. Afterwards, we present our results and discuss them, focusing on the validation of genes and biological pathways predicted as candidates for implication in ASD. We conclude with an overview of the study, a discussion of the potential implications of our results and some future directions for this line of work.
Data is annotated with controlled vocabularies and community-driven ontologies.

| Knowledge Graph
Using data obtained from the data sources described above, we built an integrated biomedical Knowledge Graph (KG). This KG is composed of a series of biological entities and their relationships. First, we obtained the full GO OBO file. For each relationship extracted from the GO, we kept the original semantics as much as possible. The annotation qualifier was used to build the relationship types, using the annotation files for human gene products, both for proteins and for RNA. Gene-disease associations from DisGeNET v7 were then merged. All gene names were converted to Ensembl Gene IDs. Conversion of gene symbols in DisGeNET and GO to Ensembl Gene IDs was done with Ensembl BioMart, using the pybiomart Python client. The KG contains five unique entity types: genes, diseases, molecular functions, cellular components and biological processes. Entities are represented by their codes in the various databases. Genes are represented by their Ensembl Gene IDs; diseases, phenotypes and disease groups are represented by Concept Unique Identifiers (CUI) from the Unified Medical Language System (UMLS), as obtained from DisGeNET. All GO terms for biological processes, molecular functions and cellular components are represented by their respective GO IDs.
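The assembly step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: the function name, the input row layouts, the `associated_with` relation label and all identifiers in the usage note are hypothetical placeholders, and the symbol-to-Ensembl mapping is passed in as a plain dictionary rather than fetched from BioMart.

```python
def build_triples(go_annotations, gene_disease, symbol_to_ensembl):
    """Return a sorted list of unique (head, relation, tail) triples.

    go_annotations: iterable of (gene_symbol, qualifier, go_id) rows;
        the annotation qualifier becomes the relationship type.
    gene_disease: iterable of (gene_symbol, disease_cui) rows.
    symbol_to_ensembl: dict mapping gene symbols to Ensembl Gene IDs.
    """
    triples = set()
    for symbol, qualifier, go_id in go_annotations:
        gene = symbol_to_ensembl.get(symbol)
        if gene is not None:  # skip symbols without an Ensembl mapping
            triples.add((gene, qualifier, go_id))
    for symbol, cui in gene_disease:
        gene = symbol_to_ensembl.get(symbol)
        if gene is not None:
            # "associated_with" is an assumed label for gene-disease edges
            triples.add((gene, "associated_with", cui))
    return sorted(triples)
```

For example, calling `build_triples([("GENE_A", "involved_in", "GO:TOY1")], [("GENE_A", "CUI_TOY_1")], {"GENE_A": "ENSG_TOY_0001"})` yields two triples that share the same head entity, one for the GO annotation and one for the disease association.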

| Knowledge Graph embeddings
We applied Knowledge Graph embedding methods to produce vector representations (embeddings) of the entities in the KG. In this study, we tested three KG embedding algorithms, ComplEx (Trouillon et al., 2016), DistMult (Yang et al., 2015) and TransE (Bordes et al., 2013), as implemented in the DGL-KE package (Zheng et al., 2020). Training is performed through negative sampling by corrupting triples (h, r, t) to create triples of the form (h′, r, t) or (h, r, t′), where h′ and t′ are randomly sampled from the sets of h and t. We apply filtered sampling, whereby generated negative triples that are present in the KG are discarded from the set of negatives used in the training process. Table 1 contains a summary of all KGE methods used and their respective scoring functions.
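The filtered negative sampling described above can be sketched in a few lines. This is a minimal illustration and not the DGL-KE implementation: the function name, its parameters and the 50/50 choice between corrupting the head or the tail are assumptions made for the example.

```python
import random

def corrupt(triple, heads, tails, known, rng, n_neg=4, max_tries=1000):
    """Sample up to n_neg negatives for `triple` by replacing head or tail.

    heads / tails: candidate replacement entities for each position.
    known: set of triples present in the KG; candidates found in it are
        discarded (filtered sampling).
    """
    h, r, t = triple
    negatives = []
    for _ in range(max_tries):
        if len(negatives) == n_neg:
            break
        if rng.random() < 0.5:
            cand = (rng.choice(heads), r, t)   # corrupt the head: (h', r, t)
        else:
            cand = (h, r, rng.choice(tails))   # corrupt the tail: (h, r, t')
        if cand in known or cand in negatives:
            continue                           # drop true or duplicate triples
        negatives.append(cand)
    return negatives
```

Passing a seeded `random.Random` instance as `rng` keeps the sampling reproducible between runs.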

| Training
We performed a 60/20/20 split of all the gene-disease associations in the KG into training, test, and validation sets, stratified to ensure that all genes and diseases are present in all sets in roughly equal amounts. For training and testing of the embedding step the set of go-go and go-gene triples was added to the training set only. The testing and validation sets consist solely of gene-disease associations. As the main objective is to predict gene-disease associations, this ensures that the method is explicitly trained to reproduce these as well as possible. Hyperparameter tuning was done using the training and test sets. For a more efficient exploration of the possible hyperparameter space we used the Optuna optimization framework (Akiba et al., 2019). Optuna is a hyperparameter optimization software package that implements several search strategies to achieve optimal coverage of high-dimensional hyperparameter spaces. The maximum number of evaluations was set to 30 and the default settings for

| Evaluation
To evaluate the performance of the KGE step we used standard ranking metrics, as calculated by the DGL-KE package: Mean Rank (MR), Mean Reciprocal Rank (MRR), HITS@1, HITS@3, and HITS@10 (the mean fraction of true results in the top 1, 3, and 10, respectively). These are defined as:

$$\mathrm{MR} = \frac{1}{Q}\sum_{i=1}^{Q} \mathrm{rank}_i, \qquad \mathrm{MRR} = \frac{1}{Q}\sum_{i=1}^{Q} \frac{1}{\mathrm{rank}_i}, \qquad \mathrm{HITS@}k = \frac{1}{Q}\sum_{i=1}^{Q} \mathbb{1}_{\mathrm{rank}_i \le k}$$

where Q is the number of elements in the ranked list and $\mathbb{1}_{\mathrm{rank}_i \le k}$ is 1 if $\mathrm{rank}_i \le k$ and 0 otherwise. MRR was used as the optimization target for Optuna. Performance evaluation was done with a negative sample size of 16 and a batch size of 2048.

TABLE 1 Knowledge graph embedding methods used in this study and their respective scoring functions (columns: Method, Scoring function)
To avoid test set contamination, evaluation was performed with the training and validation set, withholding the test set used for hyperparameter tuning.
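For concreteness, the ranking metrics named in this section can be computed from a list of ranks as in the sketch below. It follows the standard definitions of MR, MRR and HITS@k and is not taken from DGL-KE; the function name and the dictionary layout are assumptions for the example.

```python
def ranking_metrics(ranks, ks=(1, 3, 10)):
    """Compute MR, MRR and HITS@k from the rank of each true triple.

    ranks: list of 1-based ranks, one per evaluated triple (must be non-empty).
    """
    q = len(ranks)
    metrics = {
        "MR": sum(ranks) / q,                      # mean rank
        "MRR": sum(1.0 / r for r in ranks) / q,    # mean reciprocal rank
    }
    for k in ks:
        # fraction of true triples ranked within the top k
        metrics[f"HITS@{k}"] = sum(1 for r in ranks if r <= k) / q
    return metrics
```

With ranks `[1, 2, 4, 20]`, for instance, MR is 6.75 and HITS@10 is 0.75, since three of the four true triples fall within the top 10.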

| Prediction of disease-gene associations
The prediction of novel disease-gene associations can be framed as a link prediction problem on the KG. In link prediction, the aim is to learn a scoring function f, characteristic of the method being employed (see Table 1). The scoring function assigns a score s = f(h, r, t) to each input triple (h, r, t) ∈ G, where h, t ∈ V are the head and tail entities and r ∈ E is the relationship. In this particular case, the head entities are genes, the tail entities are diseases and the relationship is the association of a gene with a disease. This produces a ranking of genes for each disease, where the genes are ranked from higher to lower association with a given disease. Prediction of gene-disease associations was performed using the full KG with the best hyperparameters identified with the optimization procedure described above.
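The ranking step can be illustrated with a short sketch that scores every gene against one disease and keeps only associations not already in the KG. It assumes a TransE-like score (negative L2 distance) and small in-memory dictionaries of embeddings; the function name and argument layout are hypothetical and do not reflect the DGL-KE interface.

```python
import math

def rank_novel_genes(disease, gene_emb, disease_emb, rel_emb, known_pairs):
    """Rank genes not yet linked to `disease`, best score first.

    gene_emb / disease_emb: dicts mapping entity IDs to embedding vectors.
    rel_emb: embedding vector of the gene-disease relationship.
    known_pairs: set of (gene, disease) associations already in the KG.
    """
    t = disease_emb[disease]
    scored = []
    for gene, h in gene_emb.items():
        if (gene, disease) in known_pairs:
            continue  # keep only novel associations
        dist = math.sqrt(sum((hi + ri - ti) ** 2
                             for hi, ri, ti in zip(h, rel_emb, t)))
        scored.append((gene, -dist))  # smaller distance = higher score
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored
```

Repeating this for every disease term gives the genome-wide set of predictions; at the scale reported in the paper, a vectorised implementation would be needed in practice.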

| Analysis of ASD-associated genes
From the set of predicted gene-disease associations produced as described above, we selected those involving autism-related disease terms.
Genes in the first decile of novel associations were merged to create a list of ASD candidate genes. We used this gene list to produce a network of protein-protein interactions (PPI) with edge weights, using STRING (Franceschini et al., 2013, 2016; Snel et al., 2000; Szklarczyk et al., 2015, 2017, 2019, 2021), and applied the Leiden community detection algorithm (Traag et al., 2019) to the PPI network to identify network functional modules (biological communities), as implemented in the CDlib Python package (Rossetti et al., 2019). The Leiden community detection algorithm is based on modularity optimization and is able to detect partitions in the whole dataset and identify the hierarchical community structure. Using this method with the default parameters, we decomposed the network into sub-units or communities. The identification of functional protein communities in the network may uncover a priori unknown functional biological modules.
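Since the community detection step is described as modularity-based, it may help to see the quantity being optimized. The sketch below computes the weighted modularity of a given partition using the standard formula Q = Σ_c [L_c/m − (d_c/2m)²], where L_c is the edge weight inside community c, d_c the summed weighted degree of its nodes and m the total edge weight. It does not implement the Leiden algorithm itself, and the function name and input layout are assumptions for the example.

```python
from collections import defaultdict

def modularity(edges, communities):
    """Weighted modularity of a partition of an undirected graph.

    edges: iterable of (u, v, weight) tuples without self-loops.
    communities: list of node sets covering every node that has an edge.
    """
    label = {node: i for i, com in enumerate(communities) for node in com}
    total = 0.0                     # m: total edge weight
    degree = defaultdict(float)     # weighted degree of each node
    internal = defaultdict(float)   # edge weight inside each community
    for u, v, w in edges:
        total += w
        degree[u] += w
        degree[v] += w
        if label[u] == label[v]:
            internal[label[u]] += w
    com_degree = defaultdict(float)  # summed degree per community
    for node, k in degree.items():
        com_degree[label[node]] += k
    return sum(internal[c] / total - (com_degree[c] / (2 * total)) ** 2
               for c in range(len(communities)))
```

A partition that groups densely connected proteins together scores higher than one that lumps the whole network into a single community, which always has modularity 0.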
Finally, to assess the enrichment in biological pathways of each community, we used Reactome pathways (Griss et al., 2020; Jassal et al., 2020; Wu & Haw, 2017). Reactome is a manually curated, peer-reviewed pathway database, widely used for clinical research purposes.

ionet (Himmelstein et al., 2017; Himmelstein & Baranzini, 2015), contain 73% of the genes present in our KG. This maximizes the predictive capabilities of our approach, and we expect to expand the KG in the future by increasing its reach to a larger number of genetic features.

| Knowledge Graph embedding
A comparison of the performance of the KGE methods tested is displayed in Table 2. The performance metrics reported were calculated on the validation set using the optimum hyperparameter set for each algorithm, identified as described in the methods section. All methods exhibit good performance, with the TransE algorithm with l2-regularization exhibiting the best results. All subsequent analysis steps use embeddings trained using TransE with l2-regularization.

| Prediction of gene-disease associations
Prediction of new genes associated with diseases is an important task in the context of personalized medicine approaches. New case-control studies can be designed taking these new associations into account, and analysis of genetic mutations in candidate genes resulting from these associations can lead to improved diagnosis and therapeutic interventions. Using the TransE (l2) method with the optimum set of hyperparameters, we produced genome-wide predictions of gene-disease associations, that is, we predicted the scores of all possible gene-disease associations for all 28,243 genes and 21,623 diseases, producing a total of 610,698,389 predictions.
Other approaches for the prediction of disease-gene associations have explored the use of the GO as underlying source of data, either from gene semantic similarity or from embedding of GO and other ontologies (Alshahrani & Hoehndorf, 2018;Liang et al., 2019;Nunes et al., 2021;Smaili et al., 2019). Our approach offers an excellent performance and is easy to apply and to extend. To validate our approach from a biological and biomedical point of view, we apply it to the identification of novel candidate genes for ASD. The next section describes and discusses our results.

| Use case: Prediction of genes associated to Autism Spectrum Disorder
Autism Spectrum Disorder results from a combination of environmental and genetic factors, has a strong genetic component and segregates in families, and up to 1000 genes are estimated to be potentially implicated in the disease (Ramaswami & Geschwind, 2018). While several ASD-associated genes are present in the KG, this list is non-exhaustive, as the genetic diagnostic yields for ASD are usually low (Kreiman & Boles, 2020; Savatt & Myers, 2021), indicating that a larger number of genes is probably implicated. Here, we aimed to expand the list of ASD candidate genes by producing novel gene-disease association predictions for this disorder and to use the resulting genome-wide ranking to identify major biological communities.

| Prediction of ASD-associated genes
For the prediction of genes associated to ASD we selected two disease terms in the KG which correspond to general forms of ASD, 'Autism Spectrum Disorders' (C1510586) and 'Autistic Disorder' (C0004352). The scores of the associations of all genes in the KG for these two disease terms were extracted from the final prediction set, produced as described previously. Rankings for the association of all genes to these two terms were derived from the scores of the corresponding association, retaining only novel associations (i.e., those not present in the KG). For both ranked lists, we selected all genes in the first decile of the ranking (see supplementary file 1). The two gene sets were merged, resulting in a list composed of 3389 genes. This list was used for subsequent analyses.
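The selection of the first decile and the merging of the two gene sets can be sketched as follows. The function name is hypothetical, and the handling of the decile boundary (integer division, with at least one gene kept) is an assumption, since the text does not specify how ties or rounding were treated.

```python
def top_decile(scores, known):
    """Genes in the first decile of the ranking of novel associations.

    scores: dict mapping gene ID to predicted association score
        for one disease term (higher = stronger association).
    known: set of genes already associated with that term in the KG.
    """
    novel = [(gene, s) for gene, s in scores.items() if gene not in known]
    novel.sort(key=lambda pair: pair[1], reverse=True)
    cutoff = max(1, len(novel) // 10)   # assumed rounding rule
    return {gene for gene, _ in novel[:cutoff]}
```

The merged candidate list is then the set union of the per-term results, for example `top_decile(scores_a, known_a) | top_decile(scores_b, known_b)`, so genes selected for both terms are counted once.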

| Identification of biological communities
To identify biological pathways that can be shared by people with mutations in ASD-associated genes, we created a network consisting of the genes from the ASD candidate gene list and gene-gene interactions obtained from STRING. For further analysis, we retained the largest connected component of this network, containing 3221 genes.
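Extracting the largest connected component is a standard graph operation; a minimal breadth-first search sketch is shown below. The function name and the edge-list input are assumptions for the example, and genes without any interaction simply do not appear in the edge list.

```python
from collections import defaultdict, deque

def largest_component(edges):
    """Return the node set of the largest connected component.

    edges: iterable of (u, v) pairs of an undirected network.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        # breadth-first search from an unvisited node
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in component:
                    component.add(nb)
                    queue.append(nb)
        seen |= component
        if len(component) > len(best):
            best = component
    return best
```

Restricting the analysis to this component ensures that community detection runs on a single connected network.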
The interaction network containing these 3221 genes was used to perform network community detection using the Leiden algorithm. Six communities were identified (see supplementary file 2). Enrichment analysis of each community indicates that the PPI network is enriched in several biological pathways (Figure 2). From the results of enrichment analysis (supplementary file 3) we identified the communities as corresponding to six main pathways: Metabolism; Chemical synapse transmission mediated by G Protein Coupled Receptors (GPCRs); Cytokine signalling; Gene expression; Nervous system development and Signalling. All these biological pathways are likely to be affected in ASD in different patients as they are important to the nervous system and neuronal development at some stage of brain development. There is growing evidence linking these pathways to ASD and we discuss these connections and characterize the biological communities found in the next subsection.
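To make the notion of pathway enrichment concrete, the sketch below computes a one-sided over-representation p-value with the hypergeometric distribution, a common choice for this kind of analysis. It is an assumption that this matches the statistic used here; the function name and parameters are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb

def enrichment_pvalue(n_background, n_pathway, n_community, n_overlap):
    """P(X >= n_overlap) for X ~ Hypergeometric(N, K, n).

    n_background: number of genes in the background set (N).
    n_pathway: background genes annotated to the pathway (K).
    n_community: genes in the community being tested (n).
    n_overlap: community genes annotated to the pathway.
    """
    upper = min(n_pathway, n_community)
    total = comb(n_background, n_community)
    tail = sum(comb(n_pathway, i) * comb(n_background - n_pathway, n_community - i)
               for i in range(n_overlap, upper + 1))
    return tail / total
```

A small p-value indicates that the community contains more genes from the pathway than would be expected if genes were drawn at random from the background.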

Chemical synapse transmission/GPCRs
There is strong genomic and functional evidence indicating that synaptic biological processes are altered in ASD (Abrahams & Geschwind, 2008; Lai et al., 2021; Leblond et al., 2014; Lionel et al., 2013; Tromp et al., 2021). Several studies suggest that mutations in genes that encode proteins that establish the connection between two neurons and the formation of a synapse, such as the ones that encode neurexins, neuroligins or Shank proteins (genes that are also present in our gene-disease associations), share biological pathways including the synaptic pathways (Gong & Wang, 2015; Guang et al., 2018; Lai et al., 2021; Tromp et al., 2021). Synaptic transmission occurs between a presynaptic neuron and a postsynaptic cell. Neurotransmitters establish the communication between neurons and bind to ion channels on postsynaptic neurons to modulate voltage changes. Important modulators of neurotransmission are G protein coupled receptors (GPCRs), a superfamily of key proteins responsible for signal transduction across cell membranes that mediate diverse cellular responses. GPCRs mediate the regulation of synaptic transmission, binding specifically to some neurotransmitters, modifying the structure of the receptor and regulating the mechanism of neurotransmission (Betke et al., 2012; Lutzu & Castillo, 2021). One of the difficulties in pharmacotherapeutic research in ASD is the identification of effective pathophysiological targets.

FIGURE 2 Reactome pathway enrichment of the PPI network in biological communities. The PPI network is enriched in several biological pathways, with the most represented being involved in six main pathways. The chart displays (X axis) the number of times each community is enriched in a term more than expected by chance, by (Y axis) the probability of obtaining the same result by chance; circle size represents the number of genes enriched in the term and circle colours the magnitude of the enrichment p-value.

Most of the approaches developed target brain excitatory/inhibitory imbalance caused by alterations in gamma-aminobutyric acid (GABA) and glutamate receptors (DelaCuesta-Barrutia et al., 2020). However, there are other important neurotransmitter systems that are key for the proper establishment of brain excitatory/inhibitory balance, such as the ones regulating important neurotransmitters like oxytocin, serotonin or dopamine. These systems are primarily mediated through specific GPCRs (DelaCuesta-Barrutia et al., 2020; Gurevich, Gainetdinov, and Gurevich et al., 2016; McCorvy & Roth, 2015; Willets et al., 2009) and represent possible therapeutic targets (Marotta et al., 2020). The dysfunction of GPCRs potentially implicated in ASD, including those of the glutamatergic, dopaminergic, oxytocinergic or serotonergic systems, can contribute to the disorder, and new clinical directions taking these pathways into account can result in the discovery of novel treatments, as has been suggested for other brain disorders such as schizophrenia (DelaCuesta-Barrutia et al., 2020).

Gene expression
Autism research has long focused on genes involved in neuronal development and synaptic processes. Mutations in genes participating in these processes were the first to be linked to ASD and its symptomatology. However, in recent years, several studies have implicated other classes of genes, often those related to gene expression, chromatin organization and remodelling. Genes involved in chromatin regulation determine whether other genes are switched on or off according to whether their expression is needed. For a gene to be expressed at the right time, DNA needs to go through conformational changes from tightly to loosely packed coils. This process is controlled by chromatin remodelling complexes, and genes involved in such mechanisms are sometimes mutated in ASD and other neurodevelopmental disorders. Mutations in these complexes have been linked to ASD, schizophrenia, intellectual disability and other conditions (Gabriele et al., 2018).

Cytokine signalling
Although ASD pathophysiology is unclear, growing evidence also supports an important role of neuroinflammatory processes. The participation of astrocytes and microglia in ASD has been the subject of study due to their roles in the regulation of immune and synaptic pathways. Elevated levels of reactive microglia and astrocytes in postmortem tissue in ASD have been reported (Matta et al., 2019). The immune system is interconnected with the nervous system and its dysfunction impacts several biological processes, including brain function and development, and behaviour (Filiano et al., 2015). Fever occurs as a body response to fight infection and is initiated by cytokines (Dantzer et al., 2008). The brain recognizes cytokines as signals of sickness (Dantzer, 2009). Cytokines are signalling molecules that mediate the communication among cells in the immune system, and are primary regulators of inflammation. Studies involving immune system alterations and ASD, including the characterization of cytokine profiles, have been increasing in recent years (Masi et al., 2017).

Metabolism
The contribution of metabolic alterations to developmental disorders has been the subject of several studies. Metabolic alterations at different levels have also been reported in ASD, such as the ones involving biological oxidations (Bjørklund et al., 2020;Frye et al., 2013), alterations in the lipid metabolism (Luo et al., 2020;Tamiji & Crawford, 2010) and in Cytochrome P450 pathways. Oxidative stress is thought to be implicated in ASD, as shown by reports of increased levels of Reactive Oxygen Species (ROS) and increased lipid peroxidation (Bjørklund et al., 2020). Oxidative stress is an important cause of neuroinflammation and can contribute to ASD (Bjørklund et al., 2020). A significant portion of individuals diagnosed with ASD have elevated peripheral cytokines and chemokines and associated neuroinflammation (Bjorklund et al., 2016). People with ASD are considered more sensitive to oxidative stress due to glutathione imbalance (James et al., 2006), and the contribution of environmental exposure to heavy metals has also been discussed (Macedoni-Lukšič et al., 2015;Mostafa et al., 2016). Several studies have suggested that the oxidation-reduction imbalance and oxidative stress are important components of ASD pathophysiology (Yui et al., 2016). Regarding the relationship between lipid metabolism and ASD, it is known that the nervous system is enriched with important classes of lipids, thus the dysfunction of lipid metabolic pathways can play a role in the development of this disorder. Cholesterol and sphingolipids are signalling molecules with key roles in neuronal differentiation and in synaptogenesis. Cholesterol availability is essential to synapse development (Hussain et al., 2019). Several studies report abnormal levels of lipids in ASD, and some of these studies reported alterations in cholesterol and triglyceride levels in a subgroup of patients with ASD (Luo et al., 2020;Sikora et al., 2006). 
There is increasing evidence that alterations in fatty acid pathways may affect the nervous system leading to ASD. In line with these reports, there is evidence supporting the hypothesis that people with ASD have higher rates of lipid metabolism than controls, and that the dysregulation along the lipid metabolic pathway may contribute to ASD onset (Tamiji & Crawford, 2010).

Nervous system development
The 'Nervous system development' biological community is enriched in genes participating in mechanisms that are important for neuronal development and axon guidance, such as the Rho family of GTPases, which are proteins that act as molecular switches regulating important cellular processes, such as growth, migration, differentiation or adhesion. These molecules are particularly important to the nervous system, as they regulate neuronal function and morphology. Recent studies suggest that Rho GTPase dysfunction has a role in ASD, as several genes encoding Rho GTPases are candidate risk genes for ASD and are included in the ASD candidate gene list of the Simons Foundation Autism Research Initiative (SFARI) (see Guo et al., 2020 for a review of the Rho family of GTPases involved in ASD). The SFARI database (Abrahams et al., 2013; Banerjee-Basu & Packer, 2010; Wang et al., 2012; Yao et al., 2015) is an ASD-dedicated database that integrates a gene scoring module, which ranks genes according to the strength of the evidence associating a given gene with the disease, based on the analyses of several studies with ASD patients. MYO9B, OPHN1, SRGAP3, OCRL and ITPR1 are genes from the Rho family of GTPases present in the SFARI gene list that are also associated with ASD in the gene-disease associations predicted with the methodology developed in this study.

Signalling
Signalling pathways are important to ASD at diverse levels, as a complex brain and neurodevelopmental disorder, and our algorithm also identifies genes associated with ASD terms as being enriched in several signalling pathways (Signalling biological community; Figure 2), such as the Wnt signalling pathway. Interestingly, the Wnt signalling pathway is evolutionarily conserved and regulates fundamental early developmental processes, such as cell determination and migration, cell polarity, neural patterning and organogenesis, during the stages of embryonic development (Komiya & Habas, 2008). ASD is an early-onset disorder mainly impacted by embryonic development. The canonical Wnt pathway is thus fundamental for brain development and, consequently, for proper synaptic function (Mulligan & Cheyette, 2016). Mutations in genes participating in the Wnt pathway have been suggested to contribute to ASD and to other psychiatric disorders (Kalkman, 2012; Mulligan & Cheyette, 2016).

| Relevance for personalized medicine
With the present work, we show that our approach, despite being of general application, identifies plausible gene-disease associations in ASD, from which useful biological insights can be derived. The top-ranking genes associated with ASD in the KG identified in this study are involved in six main biological communities relevant to the nervous system and neuronal development, often referred to in the scientific literature as candidate pathways for the aetiology of the disease. The methodology developed in this work can be useful for patient stratification into subtypes according to the biological pathways enriched in the biological communities implicated in the gene-disease associations identified, and can provide insights for the development of guidelines for personalized medicine approaches applied to ASD.

| CONCLUSIONS
We describe an approach to integrate biological information from several data sources and predict gene-disease associations. This is done through the construction of a KG containing biological and biomedical entities and the application of KGE techniques for link prediction of the relationships of interest in the KG.
To showcase a biological application, this methodology was applied and tested on a paradigmatic complex disorder: ASD. We showed that our approach allows for data-driven detection of sub-communities, which can be useful for patient stratification. Stratification of patients is a daunting task for complex diseases such as ASD. The identification of genes and biological communities involved in ASD could provide a basis for effective patient stratification strategies.
The top decile of novel ASD-associated genes is enriched in six main relevant biological pathways (Metabolism; Chemical synapse transmission mediated by G Protein Coupled Receptors (GPCRs); Cytokine signalling; Gene expression; Nervous system development and Signalling), which are here reinforced as candidate pathways for ASD aetiology that can be important for the development of guidelines for personalized medicine approaches applied to ASD.
The major contributions of this work are, from a technical viewpoint, the use of a readily extensible and adaptable large-scale KG, with a considerable proportion of RNA gene products and, from an application viewpoint, a data-driven approach for the identification of genes and pathways relevant to human diseases, which we have shown to be reliable in the case of ASD. Most related approaches are disease-specific, deal with smaller gene sets or are aimed exclusively at protein-coding genes. In this study, we aimed to maximize the number of genes and we explicitly included RNA genes and the respective GO annotations. RNA genes are much more numerous than protein-coding genes and, despite a growing body of evidence linking non-coding RNAs to human diseases, remain under-explored when compared to their protein-coding counterparts.
This approach has the potential for impact in several areas related to personalized medicine, namely in the analysis of genetic sequencing data, in patient stratification, and in the development of novel therapeutic approaches or the identification of novel therapeutic targets.
Regarding the analysis of genetic sequencing data, one major hurdle in current practice is the establishment of reliable variant prioritization methods that can identify disease-causing genetic variants in the large amount of data generated by sequencing. Methods that associate the affected genes with a disease or phenotype have been used to address this issue, and the gene rankings produced with our approach can easily be used for this purpose. Patient stratification is one major goal of precision medicine, and the characterization of subgroups of patients according to their shared clinical profiles is of major importance. The method developed in this study has direct applicability to patient stratification through the identification of shared pathogenic burden in biological pathways or gene communities. The identification of novel therapeutic targets or therapeutic approaches is another area where we expect our approach to have an impact. The identification and ranking of disease-associated genes and pathways can be particularly helpful in prioritizing or expanding the range of targets for functional studies or for the development of gene therapy approaches.
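To make the variant prioritization use case concrete, the sketch below orders a list of variants by the disease-association score of the gene each one affects. The function name `prioritize_variants`, the variant records, and the scores are hypothetical and serve only as an example; a practical pipeline would combine such gene-level scores with variant-level evidence.

```python
def prioritize_variants(variants, gene_scores, default=float("-inf")):
    """Sort variants so those in higher-scoring genes come first.

    Variants in genes absent from the ranking receive `default` and go last.
    """
    return sorted(
        variants,
        key=lambda v: gene_scores.get(v["gene"], default),
        reverse=True,
    )

# Hypothetical variants from one sequenced individual.
variants = [
    {"id": "var1", "gene": "GENE_B"},
    {"id": "var2", "gene": "GENE_X"},  # gene absent from the ranking
    {"id": "var3", "gene": "GENE_A"},
]
# Hypothetical gene-disease scores for one disease term.
gene_scores = {"GENE_A": 0.92, "GENE_B": 0.35}

ordered = prioritize_variants(variants, gene_scores)
```

Sending unscored genes to the end of the list is one possible design choice; they could equally be reported separately for manual review.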
We conclude by noting that, although we focus on ASD, this approach is applicable to all diseases in the KG, especially the ones with a strong genetic contribution and with complex genetic architectures. In future studies, we plan to expand the size and the scope of the KG by adding information from other biological and biomedical databases and explore the use of other embedding methods. Work is under way to apply this approach to develop tools for the identification of disease-associated genetic variants in sequencing datasets and to develop methods of patient stratification in cohorts of subjects diagnosed with ASD.