Cloud-based GWAS platform: An innovative solution for efficient acquisition and analysis of genomic data
FAR Publishing Limited
The diagram illustrates a framework for cloud-based GWAS data resources, structured in a hub-and-spoke architecture with "Cloud-Based GWAS Data Resources" at its core, interconnected with six multi-omics domains. The phenomics domain contains disease-related data, physical activity metrics, and behavioral characteristics. Research purposes are classified as disease mechanism research, physiological function assessment, and molecular biomarker discovery. This domain provides clinical phenotypes, disease diagnoses, and physiological indicators, encompassing over 10,000 variables. The neuroscience domain integrates multi-source neuroimaging data, focusing on brain anatomy, cerebral cortex, and brain MRI. This domain includes cortical thickness measurements and more than 200 neuroimaging-derived features, supporting genome-wide association studies across various MRI data types (as illustrated by the accompanying bar charts). The proteomics domain provides plasma proteomic datasets quantified through multiplex immunoassay techniques, covering major studies such as the UKB-PPP, Finnish, and Icelandic deCODE cohorts. Represented ancestral groups include European and East Asian populations. The microbiomics domain covers intestinal flora, oral microbiota, and skin microbiota derived from metagenomic sequencing of over 10,000 samples. Analyses include α-diversity measurements and genus abundance profiling. The metabolomics domain incorporates a wide range of metabolic and immune profiling data, featuring multiple analytical categories including immune cells, lipoproteins, and blood biomarkers. The nutrigenomics domain contains long-term dietary habit data, including dietary composition and food consumption patterns. It covers multiple food categories including fish, meat, and other dietary components, and provides over 120 food-related features for diet-genotype interaction studies. Each domain is visually differentiated by color coding for clear identification.
Credit: Xiaohong Ke, Kailai Li, Aimin Jiang, Yasi Zhang, Qi Wang, Zhengrui Li, Jian Zhang, András Hajd, Weniie Shi, Ulf Kahlerts, Anqi Lin, Pengpeng Zhang, Peng Luo
Since 2005, genome-wide association studies (GWAS) have transformed genomic research by identifying over 50,000 disease-associated genetic variants, laying the foundation for precision medicine and drug development. Yet traditional GWAS workflows face major hurdles: acquiring large datasets (often terabytes) is slow and unreliable due to bandwidth limits, while analyzing such data demands high-performance computing (hundreds of terabytes of storage, thousands of CPU cores) that strains budgets, especially for smaller institutions. Data heterogeneity, including varying formats, inconsistent variable naming, and reference genome discrepancies (e.g., hg19 vs. hg38), complicates standardization and integration across databases, risking analytical bias and errors.
Cloud computing offers a solution. Its scalable resources remove local hardware limits, cut costs via shared pools, and accelerate processing with distributed computing. Projects like the Pan-Cancer Analysis of Whole Genomes (PCAWG) and UK Biobank have demonstrated the value of cloud technology, boosting efficiency and collaboration. Building on this, researchers developed a cloud-based GWAS platform integrating major international databases (e.g., GWAS Catalog, UK Biobank, FinnGen) and the FastGWASR R package, designed to streamline genetic analyses.
The platform’s architecture leverages Kubernetes, with 100 high-performance nodes (64-core CPU, 512GB RAM, 8TB SSD each) and hybrid storage (HDFS for raw data, object storage for intermediates). A multi-dimensional sharding strategy (by chromosome, genomic interval, project, population) and intelligent caching optimize retrieval speed and cost. Security is robust: TLS 1.3 encrypts transmissions, homomorphic encryption protects raw data during analysis, and federated learning enables secure collaboration. Access controls use role/attribute-based policies, with multi-factor authentication and JWT sessions to restrict data access. Front-end design prioritizes usability: a responsive interface (React/D3.js) adapts to mobile/desktop, with visual hierarchy guiding users to key functions. Interactive tools (Manhattan/QQ plots) and workflow templates simplify complex analyses, while guided tutorials help newcomers.
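The release does not say how the multi-dimensional sharding strategy is implemented. As a minimal sketch of the general idea only, the Python below builds a shard key from the four dimensions named above (project, population, chromosome, genomic interval). The `ShardKey` class, the path layout, and the 1 Mb interval width are all hypothetical choices made for illustration, not details of the platform.

```python
from dataclasses import dataclass

# Assumed bin width: the release does not state an interval size.
INTERVAL_SIZE = 1_000_000


@dataclass(frozen=True)
class ShardKey:
    """Hypothetical shard identifier combining the four sharding dimensions."""
    project: str
    population: str
    chromosome: str
    interval: int  # index of the fixed-width genomic bin on the chromosome

    def path(self) -> str:
        # Illustrative storage path; the real layout is not described.
        return (f"{self.project}/{self.population}/"
                f"chr{self.chromosome}/bin{self.interval:05d}")


def shard_for(project: str, population: str, chromosome: str,
              position: int) -> ShardKey:
    """Return the shard that would hold a variant at a base-pair position."""
    return ShardKey(project, population, chromosome, position // INTERVAL_SIZE)


def shards_for_region(project: str, population: str, chromosome: str,
                      start: int, end: int) -> list[ShardKey]:
    """Return every shard overlapping the inclusive region [start, end]."""
    first, last = start // INTERVAL_SIZE, end // INTERVAL_SIZE
    return [ShardKey(project, population, chromosome, i)
            for i in range(first, last + 1)]
```

Under this scheme a regional query only touches the shards whose bins overlap the requested interval, which is one plausible way sharding could reduce the amount of data read per request.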
Data resources span six omics domains (neuroscience, proteomics, microbiome, metabolomics, immunology, nutrigenomics), covering 40,000+ phenotypes from global databases (e.g., UK Biobank’s brain MRI, Finnish proteomics cohorts). A standardized preprocessing pipeline ensures quality (format conversion, metadata extraction, quality checks) with weekly updates and version control for reproducibility. Machine-learning anomaly detection and multi-level imputation address data inconsistencies. Core functionalities include millisecond-scale data retrieval via B+ tree/Bloom filter indexing and predictive caching (90% of queries resolved in <100ms). FastGWASR, the integrated R package, features modular design (data acquisition, preprocessing, analysis, visualization) with optimized algorithms: sparse matrices speed LD calculations (3× faster, 65% less memory), and parallel processing adapts to available resources. The API follows RESTful principles, with concise parameters for common tasks and DSL support for advanced queries. Security includes differential privacy for individual-level data and federated learning for collaborative analysis without raw data exposure.
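The release credits Bloom filter indexing for part of the retrieval speed but gives no implementation details. The following is a generic textbook-style Bloom filter in Python, offered only to show why such a structure helps: it can report that an identifier is definitely absent without touching storage. The class name, bit-array size, hash count, and identifiers are assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter sketch (not the platform's implementation)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive several bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Placeholder identifiers, not real variants from any dataset.
index = BloomFilter()
for variant_id in ["variant-A", "variant-B", "variant-C"]:
    index.add(variant_id)
```

A lookup would consult `index.might_contain(...)` first and fall through to the slower exact index (for example a B+ tree) only when the filter answers "possibly present".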
Performance benchmarks highlight advantages: sub-second online extraction (vs. minutes/instability in traditional platforms), 90% query efficiency, and lower hardware demands (runs on laptops). FastGWASR outperforms tools like TwoSampleMR in speed and functionality, supporting one-click MR-PheWAS and drug target workflows.
Application examples showcase real-world impact: Mendelian Randomization linked metabolites (e.g., branched-chain amino acids) to type 2 diabetes risk using “ebi-met1400” data, with findings aligning to prior studies; Drug Target Validation confirmed PCSK9’s role in coronary heart disease via co-localization and MR-PheWAS, assessing 2,408 phenotypes; Multi-Omics Integration mapped gut microbiota, metabolites, and inflammation networks in inflammatory bowel disease, demonstrating efficient cross-data analysis.
Limitations include potential bottlenecks with ultra-large datasets (>100M variants/millions of individuals) and gaps in rare disease/underrepresented population data (e.g., African/Latin American cohorts). Future work will expand data (single-cell RNA-seq, epigenomics), enhance algorithms (cell-type-specific GWAS), and improve accessibility (AI-assisted tools, community collaboration).
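The release does not state which estimator the Mendelian randomization example used. For readers unfamiliar with the approach, the sketch below computes the standard fixed-effect inverse-variance weighted (IVW) estimate from per-variant summary statistics. It is written in Python rather than R, does not use FastGWASR, and the function name and input numbers are synthetic and purely illustrative.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate from per-variant summary statistics.

    beta_exposure: variant effects on the exposure (e.g. a metabolite level)
    beta_outcome:  variant effects on the outcome (e.g. disease risk)
    se_outcome:    standard errors of the outcome effects
    """
    numerator = sum(bx * by / se ** 2
                    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    denominator = sum(bx ** 2 / se ** 2
                      for bx, se in zip(beta_exposure, se_outcome))
    estimate = numerator / denominator
    std_error = math.sqrt(1.0 / denominator)
    return estimate, std_error


# Synthetic inputs, not taken from any dataset named in the release.
estimate, std_error = ivw_estimate(
    beta_exposure=[0.10, 0.20],
    beta_outcome=[0.05, 0.10],
    se_outcome=[0.01, 0.01],
)
```

Each variant contributes a ratio of outcome effect to exposure effect, and variants with stronger exposure effects and tighter outcome standard errors receive more weight.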
In summary, this cloud-based GWAS platform and FastGWASR package democratize genomic research by overcoming traditional barriers—inefficient data access, high costs, and complex integration. They accelerate discoveries in precision medicine, benefiting institutions of all sizes and advancing global health.
Journal
Med Research
Method of Research
Data/statistical analysis
Subject of Research
Not applicable
Article Title
Cloud-based GWAS platform: An innovative solution for efficient acquisition and analysis of genomic data
Article Publication Date
11-Nov-2025
Automated high-throughput system developed to generate structural materials databases
Large-scale superalloy dataset collection accelerated from years to days
Conceptual diagram of the automated high-throughput system for generating structural materials databases.
Credit: Toshio Osada, National Institute for Materials Science; Takahito Ohmura, National Institute for Materials Science
A NIMS research team has developed an automated high-throughput system capable of generating datasets from a single sample of a superalloy used in aircraft engines. The system successfully produced an experimental dataset containing several thousand records—each consisting of interconnected processing conditions, microstructural features and resulting yield strengths (referred to as “Process–Structure–Property datasets” below)—in just 13 days. Datasets are generated over 200 times faster than when using conventional methods. The system’s ability to rapidly produce large-scale, comprehensive datasets has the potential to significantly accelerate data-driven materials design. This research was published in Materials & Design, an international scientific journal, on June 20, 2025.
Background
High-precision experimental data is essential for investigating material mechanisms, formulating theories, constructing models, performing numerical simulations and machine learning, and driving materials innovation. In particular, large quantities of accurate Process–Structure–Property datasets are indispensable for optimizing heat-resistant superalloy processing methods and the complex, multi-element microstructures of these materials. However, developing such databases typically requires years of continuous experimental work and substantial resource investment. These challenges have long hindered the development of high-performance superalloys.
Key Findings
This NIMS research team recently developed a new, automated high-throughput evaluation system capable of generating Process–Structure–Property datasets containing thousands of data points from a single sample of a Ni-Co-based superalloy developed by NIMS for use in aircraft engine turbine disks. These datasets include processing conditions (heat treatment temperatures), microstructural information (e.g. precipitate parameters) and mechanical properties (e.g. yield stress). The superalloy sample was thermally treated using a gradient temperature furnace developed by the team, thus mapping a wide range of processing temperatures across it. Precipitate and yield stress measurements were obtained at various coordinates along the temperature gradient using a scanning electron microscope, automatically controlled through a Python API, and a nanoindenter. The system then rapidly evaluated and processed the collected data. As a result, in just 13 days the system generated a volume of Process–Structure–Property data that would have taken conventional methods approximately seven years and three months to produce.
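The speed-up quoted in the release (over 200 times faster) can be checked from the two durations it gives, 13 days versus approximately seven years and three months. The short Python calculation below is a back-of-the-envelope check only; the 365.25-day year and the function name are assumptions, and the release does not say how the team derived its figure.

```python
DAYS_PER_YEAR = 365.25  # assumed average year length


def speedup(conventional_years: float, conventional_months: float,
            automated_days: float) -> float:
    """Ratio of the conventional duration to the automated duration."""
    conventional_days = (conventional_years + conventional_months / 12) * DAYS_PER_YEAR
    return conventional_days / automated_days


# Durations as stated in the release.
ratio = speedup(conventional_years=7, conventional_months=3, automated_days=13)
```

With these inputs the conventional duration is roughly 2,648 days, giving a ratio slightly above 200, which is consistent with the stated figure.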
Future Outlook
The research team plans to apply this system to the construction of databases for various target superalloys and to the development of new technologies for acquiring high-temperature yield stress and creep data. In addition, the team aims to formulate multi-component phase diagrams—essential for materials design—based on the constructed superalloy databases, and to explore new superalloys with desirable properties using data-driven techniques. The ultimate goal is to fabricate new heat-resistant superalloys that may contribute to achieving carbon neutrality.
Other Information
- This project was carried out by a research team consisting of Thomas Hoefler (Postdoctoral Researcher, High-Reliability Heat-Resistant Materials Group (HRHRMG), Research Center for Structural Materials (RCSM), NIMS), Ayako Ikeda (Researcher, High Temperature Materials Group (HTMG), RCSM, NIMS), Toshio Osada (Group Leader, HRHRMG, RCSM, NIMS), Toru Hara (Managing Researcher, Microstructure Analysis Group, RCSM, NIMS), Kyoko Kawagishi (Group Leader, HTMG, RCSM, NIMS), and Takahito Ohmura (Director, RCSM, NIMS).
- This work was conducted as part of another project entitled “Comprehensive and efficient exploration of high-temperature structural materials using multi-component compositionally graded bulk materials” (project leader: Takahito Ohmura) supported by the National Security Technology Research Promotion Fund of the Acquisition, Technology & Logistics Agency (grant number: JPJ004596).
- This research was published in Materials & Design, an open access international scientific journal, on June 20, 2025.
Journal
Materials & Design
Method of Research
Experimental study
Subject of Research
Not applicable
Article Title
Automated System for High-throughput Process-Structure-Property Dataset Generation of Structural Materials: A γ/γ′ Superalloy Case Study