Multimorbidity patterns with K-means nonhierarchical cluster analysis

Background The purpose of this study was to ascertain multimorbidity patterns using a non-hierarchical cluster analysis in adult primary patients with multimorbidity attended in primary care centers in Catalonia. Methods Cross-sectional study using electronic health records from 523,656 patients, aged 45–64 years in 274 primary health care teams in 2010 in Catalonia, Spain. Data were provided by the Information System for the Development of Research in Primary Care (SIDIAP), a population database. Diagnoses were extracted using 241 blocks of diseases (International Classification of Diseases, version 10). Multimorbidity patterns were identified using two steps: 1) multiple correspondence analysis and 2) k-means clustering. Analysis was stratified by sex. Results The 408,994 patients who met multimorbidity criteria were included in the analysis (mean age, 54.2 years [Standard deviation, SD: 5.8], 53.3% women). Six multimorbidity patterns were obtained for each sex; the three most prevalent included 68% of the women and 66% of the men, respectively. The top cluster included coincident diseases in both men and women: Metabolic disorders, Hypertensive diseases, Mental and behavioural disorders due to psychoactive substance use, Other dorsopathies, and Other soft tissue disorders. Conclusion Non-hierarchical cluster analysis identified multimorbidity patterns consistent with clinical practice, identifying phenotypic subgroups of patients. Electronic supplementary material The online version of this article (10.1186/s12875-018-0790-x) contains supplementary material, which is available to authorized users.


Background
In the first decade of the twenty-first century, tremendous effort was concentrated on surfacing data about multimorbidity patterns in order to increase the knowledge of how the diseases were clustered [1][2][3]. In everyday primary care settings, multimorbidity is more the norm than an exception, with a prevalence ranging from 13 to 95% in the global population, depending on the age group included and methodology used [2]. Therefore, establishing these clustered associations could inform Clinical Practice Guidelines (CPG) and guide decision-making in the clinical practice [4].
No consensus has been established about a standard model to determine multimorbidity patterns. Differences between studies have been observed, such as the unit of analysis selected (patients versus diseases), the statistical method for grouping diseases (factor analysis vs. cluster analysis), diseases included (chronic or all), and number of diseases included in the models [1,5].
To identify the multimorbidity patterns, methods that identify and separate certain population groups from others and study non-random associations between diseases in those sub-groups are needed [3,6]. There are basically two statistical methods for grouping diseases: factor analysis and cluster analysis. Exploratory factor analysis is based on correlations between diagnoses to identify the patterns; it is used to test hypothesised relationships between observed measures and latent constructs and allows the inclusion of a diagnosis in multiple factors. In contrast, cluster analysis obtains the patterns of multimorbidity based on dissimilarities between diseases; clusters tend to contain diagnoses that are similar to each other (in terms of Euclidean distances) and a diagnosis cannot be included in more than one cluster. Usually, factor analysis is used to study diseases and cluster analysis to study patients [7]. A recent comparison of the two methods concluded that cluster analysis is more useful than factor analysis for in-depth study of multimorbidity patterns [8].
Among cluster analysis methods, there are two main types of techniques: hierarchical (HCA) and non-hierarchical cluster analysis (NHCA) [9]. The first, often considered when choosing a clustering technique in biomedicine, attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that either agglomerates or divides entities to form clusters. HCA is organized so that one cluster can be entirely contained within another cluster, but no other kind of overlap between clusters is allowed. However, the technique is not particularly good when it comes to robust identification of patterns in data. The main limitations are that the hierarchical clusters are susceptible to outliers in the data, the final solution depends on the chosen distance measure, and the algorithms are not efficient to analyse large data sets, as they require a large distance matrix. Nevertheless, almost all studies to date have used HCA to analyse multimorbidity patterns [2,3].
Among the NHCA methods, K-means is the most frequently used. In contrast to HCA, this approach does not involve the construction of groups via iterative division or clustering; instead, patients are assigned to clusters once the number of clusters is specified. The results are less susceptible to outliers in the data, to the influence of choosing a distance measure, or to the inclusion of inappropriate or irrelevant variables. Algorithms that do not require a distance matrix, such as k-means, can analyse extremely large data sets [9][10][11].
The study of biological heterogeneity requires the identification of subgroups of populations with specific combinations of coexisting diseases. This "multimorbidity patient" approach identifies phenotypes of the subgroups, describes the patterns of diseases within each one, and facilitates the development of more targeted patient management [12].
The purpose of this study was to obtain the multimorbidity patterns in adult patients with multimorbidity attended in primary care in Catalonia (Spain), stratified by sex, using a k-means cluster analysis.

Design, setting and study population
A cross-sectional study was conducted in Catalonia (Spain), a Mediterranean region with 7,434,632 inhabitants, 81% of which live in urban municipalities (2010 census). The Spanish National Health Service (NHS) provides universal coverage, financed mainly by tax revenue. The Catalan Health Institute (CHI) manages primary health care teams (PHCTs) that serve 5,501,784 patients (274 PHCT), or 74% of the population; the remaining PHCTs are managed by other providers.
The CHI's Information System for the Development of Primary Care Research (SIDIAP) contains the coded clinical information recorded in electronic health records (EHR) by its 274 PHCTs since 2006. A subset of SIDIAP records meeting the highest quality criteria for clinical data, the SIDIAP-Q, includes 1,833,125 patients attended by the 1365 general practitioners (GPs). SIDIAP Q represents 40% of the SIDIAP population whose data recording scores contain information on the majority of the population of Catalonia, and is highly representative of the whole region in terms of geography, age, sex, and diseases. This study was limited to SIDIAP-Q, as the sample was representative of the population [13].
Prevalence of individual conditions, multimorbidity, and disease patterns varies by age. To obtain a more homogenous sample of multimorbidity, we identified 408,944 patients with multimoribidity aged 45 to 64 years [14] on 31 December 2010 (Additional file 1).

Coding and selection of diseases
Diseases are coded in SIDIAP using International Classification of Diseases version 10 (ICD-10) [15]. For this study, we selected all active diagnoses recorded in EHR as of December 31, 2010, except for R codes (symptoms, signs, and abnormal clinical and laboratory findings, not elsewhere classified) and Z codes (factors influencing health status and contact with health services). Of the 263 blocks of diagnosis in the ICD-10, excluding the R codes and Z codes yielded 241 blocks. Non-active diagnoses, based on the presence of an end date in the EHR, were excluded. These diagnoses covered a broad list of acute diseases for which the system automatically assigns an end date (e.g., 60 days after the initial diagnosis).
To facilitate information management, the diagnoses were extracted using the 263 blocks (disease categories) in the ICD-10 structure. These are homogeneous categories of very closely related specific diagnoses. For example, Hypertensive diseases include Essential (primary) hypertension, Hypertensive heart disease, Hypertensive renal disease, Hypertensive heart and renal disease, and Secondary hypertension. To obtain consistent and clinically interpretable patterns of association, and to avoid spurious relationships that could bias the results, we considered only diagnoses with greater than 1% prevalence in each sex. All patients with multimorbidity were included.

Multimorbidity definition
Multimorbidity was defined by the presence of two or more ICD-10 diagnoses in the EHR from the 241 blocks selected.

Variables
The unit of measurement was the diagnoses included in the 241 blocks (disease categories) of the ICD-10 structure (values: 1 if present, 0 if absent). Other variables recorded were number of diseases, age (in years), and sex (women, men).
No missing values were handled, as sex and age were recorded for all patients. Wrong sex-specific diagnosis codes and diagnoses with inconsistent dates were excluded during data cleaning. Any record with no disease diagnoses was considered as a disease-free individual.

Statistical analysis
Analyses were stratified by sex. Descriptive statistics were used to summarize overall information. Categorical variables were expressed as frequencies (percentage) and continuous variables as mean (Standard deviation, SD) or median (interquartile range, IQR). Two sample tests of proportions were used to assess sex-based differences between groups Mann Whitney was used to test the non-normally distributed variable of number of blocks of diagnoses by sex.
We identified disease patterns using two steps: 1) Multiple Correspondence Analysis (MCA): A data analysis technique for nominal categorical data, was used to detect and represent underlying structures in the data set. The method allows representation in a multidimensional space of relationships between a set of dichotomous or categorical variables (in our case, diagnoses) that would otherwise be difficult to observe in contingency tables and show groups of patients with the same characteristics [16]. MCA also allows direct representation of patients as points (coordinates) in geometric space, transforming the original binary data to continuous data (Additional file 2). The MCA analysis was based on the indicator matrix. Optimal number of dimensions extracted and percentages of inertia were determined by the means of scree plot. 2) K-means clustering: From the geometric space created in MCA, patients were classified into clusters according to proximity criteria by means of the k-means algorithm. The algorithm is composed of the following steps: 1) Place K points into the space represented by the patients that are being clustered. These points represent initial group centroids. 2) Assign each patient to the group that has the closest centroid. 3) When all patients have been assigned, recalculate the positions of the K centroids. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the patients into homogenous groups while maximizing heterogeneity across groups [9]. The optimal number of clusters is the solution with the highest Calinski-Harabasz index value. To assess internal cluster quality, cluster stability of the optimal solution was computed using Jaccard bootstrap values with 100 runs [17]. Highly stable clusters should yield average Jaccard similarities of 0.85 and above [9].

Statistics of multimorbidity patterns
To describe the multimorbidity patterns in patients, frequencies and percentages of diseases in each cluster were calculated. Observed/expected ratios ("O/E-ratios") were calculated by dividing disease prevalence in the cluster by disease prevalence in the sex group. A disease was considered to be associated with the multimorbidity pattern when O/E-ratio was ≥2 [18]. Exclusivity, defined as the fraction of patients with the disease included in the cluster over the total strata patients with the disease, was also calculated. To describe the relative position of the clusters, centrality defined as the distance of the cluster centroid to the origin was calculated. Descriptive statistics of age and the median number of diagnoses for each cluster were also obtained. Clinical criteria were used to evaluate the consistency and utility of the final cluster solution. To reduce the size of the tables, only groups of diseases with a prevalence higher than 10% in the cluster were shown. The analyses were carried out using SPSS for Windows, version 18 (SPSS Inc., Chicago, IL, USA) and R version 3.3.1 (R Foundation for Statistical Computing, Vienna, Austria).
Data were transformed using MCA (Additional file 2). K-means clustering using Calinski criterion to obtain six clusters was considered the optimal solution for both women and men. Average Jaccard bootstrap values for women and men were 0.98 and 0.90, respectively, showing highly stable solutions. A spatial representation of clusters is shown with a cluster plot for women (Fig. 1a) and men (Fig. 1b).
Six multimorbidity patterns were obtained for each sex. The three most prevalent multimorbidity patterns included 68.4% of women patients (Table 2) and 65.6% of men patients ( Table 3). The number of diseases included in each pattern varied by sex; women had a higher number of diseases than men, although there was a high coincidence (matching) between them in the type of diseases grouped.
The clusters were sorted in descending order by number of individuals included. The first cluster included about 40% of the population (40.7% of women and 38.7% of men) and no O/E ratio higher than 2 was observed in these first clusters. In these first clusters, the highest exclusivity value was 46.1% for Mental and behavioural disorders due to psychoactive substance use (tobacco) in women and 35.3% for Metabolic disorders in men.
The most prevalent cluster included coincident diseases in both men and women: Metabolic disorders, Hypertensive diseases, Mental and behavioural disorders due to psychoactive substance use, Other dorsopathies and Other soft tissue disorders (Tables 2 and 3).
Four other patterns were almost coincident between the sexes: 1) Cluster 4 (women) and cluster 3 (men), composed mostly of diseases of the digestive and musculoskeletal system; 2) Cluster 2 (women) and Cluster 4 (men), connective tissue diseases; 3) Cluster 5 was composed of a cardiometabolic pattern (obesity, hypertension and diabetes) in both groups; and 4) Cluster 6, infectious and injurious diseases (see Tables 2 and 3). O/ E ratios varied for each cluster, peaking at 8.99 for Other viral diseases and 8.24 for Other acute lower respiratory infections in cluster 6 (women) (Tables 2 and 3).
In both sexes, the most prevalent multimorbidity pattern in the oldest patients (Tables 2 and 3) were musculoskeletal system and connective tissue diseases in women (mean age: 57.4) and cardiometabolic pattern (obesity, hypertension, and diabetes) in men (mean age: 57.1).
Multimorbidity patterns considering only blocks of diagnoses with O/E ratio ≥ 2, ordered by exclusivity in women and men, showed that the highest exclusivity in women was observed in Cluster 6: 83.9% of the people who had a diagnosis of Other viral diseases are included in this cluster. They were followed by Cluster 5, which 77.0% of people with Diabetes mellitus belonged to. In men, 83.7% of people with Disorders of choroid and retina belongs to Cluster 5, and 77.6%, which includes Viral hepatitis, in Cluster 2 (Additional file 4).

Discussion
Non-hierarchical cluster analysis yielded an informative categorization of patients, generating reasonable multimorbity patterns from a clinical, practical perspective, and identified phenotypes for sub-groups of patients. Metabolic-circulatory-tobacco use-musculoskeletal pattern is the most common multimorbidity pattern    identified by NHCA in both sexes. This pattern would be classified as nonspecific because it had the lowest centrality value (0.8 for both sexes). It is the most common in the population with multimorbidity aged 45-65 years. This pattern seems to be consistent with other studies which obtained similar associations of diseases with other methods of analysis [2,3].
Other data of interest are the higher exclusivity values obtained in some clusters. For example, 77% of women who suffered diabetes mellitus have other associated diseases, such as forms of heart disease, obesity, and hypertension. These results are similar to the report from Hughes et al. that 71% of people with diabetes had multimorbidity [19]. Other coexisting diseases in the 84% of men who had disorders of choroid and retina (ischemic heart diseases, diseases of arteries, arterioles and capillaries, diabetes, other forms of heart disease, obesity, and hypertension) reflect a broad affectation of the vascular tree. Another remarkable observation in some patterns was the clustering of diseases of the same system or the presence of diseases, reflecting a complication. For example, one multimorbidity pattern consisted of seven diseases, of which five were diseases of the musculoskeletal system and connective tissue (Cluster 2, women). Another well-known example is the complications of diabetes mellitus such as disorders of choroid and retina (diabetic retinopathy) and renal failure (Cluster 5, men).
These results can be translated into clinical practice. When a disease is first diagnosed, we can suspect other associated diseases. Clinical practice guidelines could orient their recommendations toward these sub-groups (for example: arthritis, anxiety and depression). On the other hand, some results could be difficult to interpret in the context of current knowledge. Some patterns obtained included many diseases with no apparent connection between them.
In general, it is difficult to compare our results with the findings of other studies because of variations in methods, data sources and structures, populations, and diseases studied. However, there are some similarities between the current study and others. The first pattern is similar to the cardio-metabolic pattern reported by Prados et al. in adults aged 45 to 64 years (hypertension, diabetes, obesity, and lipid metabolism disorders) with an exploratory factor analysis [6]. In participants older than 50 years, another study found a cardiorespiratory factor (angina, asthma, and chronic lung disease) quite similar to our Cluster 5 in men and a mental-arthritis factor (arthritis, anxiety and depression) similar to our Cluster 2 in women [20].
The major strength of this study is the large, high-quality population database of primary care records that have been shown to be representative of a much larger population [13]. The analysis was stratified by sex and a patient-level perspective was used with NHCA. Admittedly, this analysis of almost all potential diagnoses may have added a complexity that will hinder interpretation of findings and comparison with other studies. Another major strength of this study was the operational definition of multimorbidity as the co-occurrence of multiple chronic or acute diseases [21] which allows the inclusion of the full range of diseases observed in any one patient. This is especially relevant because the boundaries between chronic and acute disease are not always clear [22,23]. The strengths of using K-means cluster analysis is that the results are less susceptible to outliers in the data, the influence of chosen distance measure, or the inclusion of inappropriate or irrelevant variables [10]. The method can also analyse extremely large data sets as in our study, as no distance matrix is required. Some disadvantages of the method are that different solutions for each set of seed points can occur and there is no guarantee of optimal clustering [12]. To minimize this shortcoming, we tested the internal validity of our solution using bootstrap methods, and the results were highly stable (Jaccard> 0.85) [17]. In addition, the method is not efficient when a large number of potential cluster solutions are to be considered [10]; to address this limitation, we computed the optimal number using analytical indexes like Calinski Harabasz [24].
A number of limitations need to be taken into account as well. The use of MCA can produce low percentages of variation on principal axes and make it difficult to choose the number of dimensions to retain. We assumed a 5-dimension solution using the elbow rule in the scree  plot to achieve the most accurate solution possible without including too many dimensions in the analysis [16].
In some clusters, an accumulative diagnosis belonging to the same chapter could be coded in multiple ways; however, use of the structure of ICD10 3-character codes that group diseases as the unit of analysis, rather than the more specific individual diagnosis, makes this improbable.
Few studies have focused on the MM patterns in patients rather than on diseases [25][26][27]. This methodology produced results that can be transferred to clinical practice, as they suggested that diseases are not equally associated with all phenotypes and there may be a genetic basis for patterns of multimorbidity.
Multimorbidity can present a problem for health services delivery, affecting patients, health professionals, and managers who are attempting to improve service delivery [28]. Our study offers a new methodological approach to understanding the relationships between specific diseases in individual patients, which is an essential step in improving the care of patients and health systems in organizations. Analysing patient profiles permitted the identification of subgroups of patients with different associated diseases.
This study illustrates the need to pay careful attention to the methods used to support policies and decision-making. The study results have implications for three fundamental areas of action: a) the need to change the orientation of clinical guidelines that focus on a single disease; b) the need to change health policy that is based on a disease rather than on the whole person; and c) the need to change current incentive policies that focus the health professional's attention on a disease rather than on multimorbidity, which includes not only diseases but also drug interactions, polypharmacy and the process of patient-health professional interactions.
Future studies on the current topic are therefore recommended, with a special focus on three major issues. First, the genetic typing of these multimorbidity patterns will identify genetic confluence in these patterns. Second, the delimitation of environment factors (alimentation, physical exercise, toxicity, etc.) associated with these patterns. Third, longitudinal studies should be done to establish the order of disease onset. Finally, the influence of polypharmacy, or the use of multiple drugs, could decrease treatment efficacy and cause unexpected adverse events or even the development of other diseases [29,30].
These findings suggest that multimorbidity patterns obtained using non-hierarchical cluster analysis identified clusters more consistent with clinical practice, identifying phenotypes of certain sub-groups of patients.

Conclusion
Non-hierarchical cluster analysis identified multimorbidity patterns consistent with clinical practice, identifying phenotypic subgroups of patients.