Gut microbiota composition correlates with diet and health …

ARTICLE RESEARCH

corrected with a line broadening of 0.3 Hz using the processor on Chenomx NMR suite 7 (Chenomx). The spectra were integrated at full resolution for data analysis (PCA, PLS-DA, CIA) with the water region (4–6 p.p.m.) excluded and the data was normalized to the sum of the spectral integral. For PPCCA data analysis, the spectra were integrated into spectral regions (0.01 p.p.m.). Two-dimensional 1 H– 1 H correlation spectroscopy (COSY) and total correlation spectroscopy (TOCSY) were acquired on a 600 MHz NMR spectrometer. TOCSY spectra were acquired with a spin lock of 65 ms. All two-dimensional data were recorded with standard Varian pulse sequences collecting 1,024 3 128 data points with a sweep width of 9.6 kHz and 32 scans per increment. Statistical methods and metabolome data analysis. Statistical analysis was carried out using R (version 2.13.2) or Stata (version 11) software packages. Kruskal–Wallis and Mann–Whitney tests were used to find significant differences in microbial taxa, clinical and biochemical measures, alpha diversity, and Healthy Food Diversity (HFD). Data were visualized by boxplots. Unless stated otherwise, box plots represented the median and interquartile ranges, with the error bars showing the last datum within 1.5 of the interquartile range of the upper and lower quartiles. We used least square linear regression for comparing alpha diversity and HFD. Median regression 52 was used to compare clinical measures and microbiota, while adjusting for age, gender, medications, and when appropriate residence location. For median regression, the median was modelled as a linear function of independent variables. Model parameters are estimated such that they mini- mised the sum of the absolute differences between observed and predicted values. P values were generated using the wild bootstrap method 53 to estimate variance. A linear quantile (median) regression for two variables—a response variable ( y ) and a predictor variable ( x )—is the following: median ( y ) 5 b 0 1 b 1 x where b 0 is the intercept (value when y 5 0) and b 1 is the slope (change in median of y for a unit change in x ). Together, these parameters describe the association between y and x , where x is a predictor of y . In the case of multiple predictor variables, each one is added to the regression equation and so the equation becomes median ( y ) 5 b 0 1 b 1 x 1 1 b 2 x 2 and now the slope b 1 is interpreted as the median change in x 1 after adjusting for x 2 . This can be likened to a laboratory experiment where the specific effect of one variable on another is isolated by holding all other relevant variables constant. Following statistical analysis of the taxonomic classifications, we estimated FDR values using the Benjamini–Hochberg method 54 to control for multiple testing. The exception to this were analyses at the genus level where we estimated the proportion of true null hypotheses with the Q -value function unless the estimated p 0 was less than or equal to zero 55 . Statistical analysis of the NMR data was performed using diverse software packages: PCA and PLS-DA analysis was performed in SIMCA-P 1 (Umetrics); permutation testing was performed in R and PPCCA was performed in R using the MetabolAnalyze package. The NMR data was Pareto-scaled before data analysis. Assignment of the spectral peaks (Supplementary Table 9) was performed using in-house libraries, statistical correlation analysis and two- dimensional NMR spectra (TOCSY and COSY). 42. Harrington, J. etal. Sociodemographic, health and lifestyle predictors of poor diets. Public Health Nutr. 14, 2166–2175 (2011). 43. McCance, R. A. & Widdowson, E. M. The composition of foods 6th edn (Royal Soc. Chemistry, 2002). 44. Claesson, M. J. et al. Comparative analysis of pyrosequencing and a phylogenetic microarray for exploring microbial community structures in the human distal intestine. PLoSONE 4, e6669 (2009). 45. Lilburn, T. G. & Garrity, G. M. Exploring prokaryotic taxonomy. Int. J. Syst. Evol. Microbiol. 54, 7–13 (2004). 46. Caporaso, J. G. et al. PyNAST: a flexible tool for aligning sequences to a template alignment. Bioinformatics 26, 266–267 (2010). 47. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2–approximately maximum- likelihood trees for large alignments. PLoSONE 5, e9490 (2010). 48. Hamady, M., Lozupone, C. & Knight, R. Fast UniFrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data. ISME J. 4, 17–27 (2010). 49. Namiki, T., Hachiya, T., Tanaka, H. & Sakakibara, Y. in ACM Conference on Bioinformatics Computational Biology and Biomedicine (Association for Computing Machinery, 2011). 50. Chevreux, B. etal. Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Res. 14, 1147–1159 (2004). 51. Noguchi, H., Park, J. & Takagi, T. MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res. 34, 5623–5630 (2006). 52. Koenker, R. & Basset, G. Regression quantiles. Econometrica 46, 33–50 (1978). 53. Feng, X. D., He, X. M. & Hu, J. H. Wild bootstrap for quantile regression. Biometrika 98, 995–999 (2011). 54. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. 57, 289–300 (1995). 55. Dabney, A., Storey, J. D. & Warnes, G. R. qvalue: Q -value estimation for false discovery rate control; R package version 1.24.20 (2010).

METHODS Subject recruitment and sample collection. This study was approved by the Cork Clinical Research Ethics Committee. Subjects older than 64 years were recruited and clinically investigated in two local hospitals, which serve a popu- lation base of , 481,000 in the Cork city and county region. They were defined as (1) community-dwelling (community); (2) attending an out-patient day hospital (out-patient); (3) in short-term rehabilitation hospital care (rehabilitation; under 6 weeks stay) or (4) in long-term institutionalized care (long stay; more than 6 weeks). The mean age of the subjects was 78 ( 6 8) years, with a range of 64 to 102 years. The subjects were all of Irish ethnicity. None of the faecal samples from elderly subjects from our previous study 8 were analysed in the current analysis, because we did not have food frequency data for all that cohort. Exclusion criteria were a history of alcohol abuse, participation in an investiga- tional drug evaluation or antibiotic treatment within the previous 30 days, or advanced organic disease. Informed consent was obtained from all subjects or, in cases of cognitive impairment, by next-of-kin in accordance with the local research ethics committee guidelines. Data collected included anthropometric measurements, clinical history and status and medication history. Antibiotic use before the one-month exclusion period was also recorded for each subject. Thirteen younger adult subjects of age ranging 28–46 years, which had not been treated with antibiotics within 30 days, were also recruited by informed consent. Clinical and nutritional data collection. Habitual dietary intake was assessed using a validated, semiquantitative, food frequency questionnaire (FFQ) based upon the SLAN study 42 . Food properties were determined using the UK Food Standards Agency Nutrient databank 43 . The mini nutritional assessment (MNA) was used as a screening and assessment tool to identify subjects at risk of malnutrition. Non-fasted blood samples were collected and analysed at Cork University Hospital clinical laboratories. Cytokines were measured using validated, commercial multi-spot microplates (Meso Scale Diagnostics). Anthropometric measures included height, weight, calf and mid-arm circumference. Charlson comorbidity index, mini mental state exam, geriatric depression test, Barthel score and functional independence measures were carried out on all participants. For long-term care, day- hospital and rehabilitation subjects, a research nurse reviewed the medical records for information on disease and current medication usage. Molecular methods and bioinformatics. DNA was extracted from faecal samples, and the V4 region of the 16S rRNA gene was amplified, sequenced and analysed, as described previously 44 . Briefly, V4 amplicons were sequenced on a 454 Genome Sequencer FLX Titanium platform (Roche Diagnostics and Beckman Coulter Genomics). Raw sequencing reads were quality trimmed using the QIIME pipe- line 39 according to the following criteria: (1) exact matches to primer sequences and barcode tags, (2) no ambiguous bases (Ns); (3) read-lengths not shorter than 150 base pairs (bp) or longer than 350 bp; (4) the average quality score in a sliding window of 50 bp not to fall below 25. For large-scale assignments into the new Bergey’s bacterial taxonomy 45 we used the RDP-classifier version 2.2 with 50% as confidence value threshold. This was based on what was found suitable for V4 amplicons from the human gut environment 44 . RDP classifications were imported into a MySQL database for efficient storage and advanced querying. The amplicon reads were clustered into OTUs at 97% identity level, and filtered for chimaeric sequences using ChimeraSlayer (http://microbiomeutil.sourceforge. net/#A_CS). Representative sequences (the most abundant) for each OTU were aligned using PyNAST 46 before tree building using FastTree 47 . These phylogenies were combined with absence/presence or abundance information for each OTU to calculate unweighted or weighted UniFrac distances, respectively 48 . Principal co- ordinate analysis and Procrustes superimposition were then performed from the UniFrac distances and Food Frequency data. The amplicon sequences were deposited in MG-RAST under the Project ID 154. Metagenomes were sequenced from libraries with 91 bp paired-end Illumina reads and 350 bp insert size and assembled using MetaVelvet 49 . Samples EM039 and EM173 were sequenced from libraries of 101 bp paired-end Illumina reads with a 500 bp insert size, and subsequently assembled using MIRA 50 in hybrid with 551,726 and 665,164 454 Titanium reads, respectively. Protein sequences from enzymes were screened against the assembled metagenomes using TBLASTN with an amino acid identity cut-off of 30% and an alignment length cut-off of 200 bp. We screened the metagenome data for enzymes associated with production of butyrate (butyryl-CoA transferase/acetyl-CoA hydrolase), acetate (acetate-formyltetrahydrofolate synthetase/formate-tetrahydrofolate ligase), and propionate (propionyl-CoA:succinate-CoA transferase/propionate CoA- transferase). Genes were predicted using MetaGene 51 . NMR analysis of the faecal water metabolome. Faecal water samples were pre- pared by the addition of 60 m l D 2 Oand 10 m l tri-methylsilyl-2,2,3,3-tetradeuterio- propionate to 540 m l faecal water. Spectra of samples were acquired by using a Carr–Purcell–Meiboom–Gill (CPMG) pulse sequence with 32k data points and 256 scans. Spectra were referenced to TSP at 0.0 p.p.m., phase and baseline

Macmillan Publishers Limited. All rights reserved ©2012

Powered by