Exploration of Genomics Signals
Life Sciences – Case Study
For Precise Drug Target Identification
The client is a multinational, science-led biopharmaceutical company focused on developing life-changing medicines (name withheld under NDA).
The goal of the project was to build a software solution that accelerates the real-time exploration of genomics signals within large datasets, combined with the population-based characteristics of each data point. This helps researchers identify drug targets quickly and precisely. Researchers should be able to define their patient groups quickly and compare results against control groups, generating plots that give greater insight into the specific cases under analysis.
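To make the case-versus-control comparison concrete, here is a minimal sketch of one common way to compare variant carrier counts between two cohorts: a two-sided Fisher's exact test on a 2x2 table. The source does not say which statistical tests the solution runs, so the choice of test, the function name, and the counts below are illustrative assumptions only.

```python
from math import comb


def fisher_exact_two_sided(case_carriers, case_noncarriers,
                           control_carriers, control_noncarriers):
    """Two-sided Fisher's exact test for a 2x2 case/control table.

    Illustrative only: the actual tests used in the project are not
    described in the source.
    """
    n_cases = case_carriers + case_noncarriers
    n_controls = control_carriers + control_noncarriers
    n_carriers = case_carriers + control_carriers
    total = n_cases + n_controls
    denominator = comb(total, n_carriers)

    def table_probability(x):
        # Hypergeometric probability of seeing x carriers among the cases.
        return comb(n_cases, x) * comb(n_controls, n_carriers - x) / denominator

    observed = table_probability(case_carriers)
    lowest = max(0, n_carriers - n_controls)
    highest = min(n_cases, n_carriers)

    # Sum the probabilities of all tables at least as extreme as the observed one.
    # The small tolerance guards against floating-point ties.
    p_value = sum(
        p for p in (table_probability(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Hypothetical counts: 12 of 100 cases and 3 of 100 controls carry the variant.
p_value = fisher_exact_two_sided(12, 88, 3, 97)
print(f"two-sided p-value: {p_value:.4f}")
```

A small p-value would suggest that the variant is distributed differently between the two cohorts, which is the kind of signal a researcher might then inspect on a plot.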
We assembled a team of Python developers, DevOps engineers, and a Project Manager who excel at complex projects and quickly grasped the scientists' requirements and the project domain.
The team was responsible for building the back-end solution from scratch, including:
– Architecture proposal
– Developing a REST API from Python services in a microservices architecture to handle user requests
– Implementing ETL for the data sets received from an external source
– Integration of the computational tools
– Deploying and managing a computing cluster to provide sufficient performance
– Deployment on the AWS cloud using an Infrastructure as Code approach
The main challenges included:
– Handling large and sparse datasets, which cause performance bottlenecks and significantly slow down the application when they are stored and processed conventionally
– Relying heavily on libraries that are rarely used in real-time applications, such as pandas, NumPy, and Dask
– Setting up everything from scratch on the AWS cloud while ensuring that all of it fits seamlessly into the client's large existing ecosystem
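To illustrate why conventional storage of sparse data becomes a bottleneck, the sketch below compares the memory footprint of a dense pandas DataFrame with the same data held in pandas sparse columns. The matrix shape, the roughly 1% density, and the column names are assumptions made for the example; the source does not describe how the project actually represented its sparse data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical genotype-like matrix: 1,000 patients x 50 variants,
# where only about 1% of the cells are non-zero.
dense = pd.DataFrame(
    (rng.random((1000, 50)) < 0.01).astype("int8"),
    columns=[f"variant_{i}" for i in range(50)],
)

# Sparse columns store only the non-zero entries and their positions.
sparse = dense.astype(pd.SparseDtype("int8", fill_value=0))

dense_bytes = int(dense.memory_usage(deep=True).sum())
sparse_bytes = int(sparse.memory_usage(deep=True).sum())

print(f"dense:   {dense_bytes} bytes")
print(f"sparse:  {sparse_bytes} bytes")
print(f"density: {sparse.sparse.density:.4f}")
```

The gap grows with the number of columns, which is why a wide, mostly empty matrix tends to be stored in a sparse or columnar form and not as a plain dense table.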
Our team built a powerful solution that lets scientists explore vast amounts of genomic data and linked clinical records to identify potential targets and rapidly assess genomic relationships with user-defined case-control cohorts.
It's a back-end solution whose REST API handles user requests, providing real-time search and filtering capabilities along with computational tools for data manipulation. The JSON output is then passed to the front-end service, where the data is visualized on various types of plots.
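The request-and-response flow described above can be sketched without any web framework. The function below takes a JSON filter request and returns a JSON payload that a front end could plot. The project itself used Django; the request shape (`equals`, `ranges`), the field names, and the in-memory records here are hypothetical and stand in for the real API and data store.

```python
import json

# Hypothetical records standing in for the real data store.
RECORDS = [
    {"patient_id": "P1", "sex": "F", "age": 61, "variant_a": 1},
    {"patient_id": "P2", "sex": "M", "age": 47, "variant_a": 0},
    {"patient_id": "P3", "sex": "F", "age": 35, "variant_a": 1},
    {"patient_id": "P4", "sex": "F", "age": 58, "variant_a": 0},
]


def search(request_body, records=RECORDS):
    """Apply equality and range filters from a JSON request body.

    Returns a JSON string with the matching rows and their count.
    """
    query = json.loads(request_body)
    equals = query.get("equals", {})
    ranges = query.get("ranges", {})

    def matches(record):
        for field, expected in equals.items():
            if record.get(field) != expected:
                return False
        for field, (low, high) in ranges.items():
            value = record.get(field)
            if value is None or not (low <= value <= high):
                return False
        return True

    rows = [record for record in records if matches(record)]
    return json.dumps({"count": len(rows), "rows": rows})


# Example request: female patients aged 50 to 70.
response = search('{"equals": {"sex": "F"}, "ranges": {"age": [50, 70]}}')
print(response)
```

In a real service the same filtering logic would sit behind an HTTP endpoint and query the cluster, not a Python list.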
The data is stored in Parquet files and processed on a Dask cluster, which works asynchronously with the back-end services.
The entire solution is deployed on the AWS platform using an Infrastructure as Code approach. The Dask part of the application runs on AWS ParallelCluster, which speeds up both data handling and computation.
Achievements We’re Proud of:
~0.5 million rows with 40,000 columns are searched and filtered in real time through the back-end REST API
The waiting time for the results of statistical comparisons between cases and controls was reduced from 2 days to 2 minutes
The project became a role model for other client projects, and as a result we have been assigned to new projects
The project was recognized in the client's internal awards as Highlight of the Year 2021
Impact: Revolutionizing Genomic Data Exploration
This solution is set to revolutionize the way scientists explore genomic data, and we’re excited to see the impact it will have on the field.
The solution enables rapid hypothesis testing in the early stages of drug discovery by letting researchers quickly construct cohorts of their choice for further examination. The result is a faster and less costly R&D process.
– Python, Django, Dask
– Docker, PostgreSQL
– AWS (CodePipeline, CloudFormation, CDK, KMS, SQS, Secrets Manager, Certificate Manager, CloudWatch, EC2, ECR, ECS, ELB, API Gateway, Fargate, Lambda, S3, Aurora RDS, ElastiCache Redis), IaC