Life Science Software development
Mini-ETL on AWS
Empowering ScientistS to Uncover Insights From Human-based Bio-Data
In this article, I delve into the world of Mini-ETL on AWS and how it empowers scientists in a global pharmacy company to unlock insights from complex human-based bio-data. I present the data structure we are working with, the challenges we encountered, and our approach to overcoming them. Additionally, I provide performance comparisons and ideas for enhancing such solutions.
Our work involved a substantial 50-gigabyte CSV file with half a million rows, each describing one participant in the study. The file boasted an extensive 27,000 columns, signifying a vast volume of data. (see picture 1)
Objectives: what do We Want to Accomplish?
Our primary goal is to enable scientists to derive meaningful insights and make fascinating discoveries from the data. To achieve this, we aimed to simplify the scientists’ interaction with the dataset, as working with a 50-gigabyte CSV file with 27,000 columns would be a formidable challenge.
How can we do IT?
Our strategy involved transforming the data into a format that could be easily queried using the SQL language, well-known to scientists. By adopting this approach, scientists could effortlessly extract valuable information from the extensive dataset. Each column in the file represented a specific data point from the study and included additional numbers indicating the study part and the number of possible values for that measurement. To make data processing more manageable, we can divide the large file into smaller ones based on column categories (see picture 2). This way, we could extract study-related data and the value order in each column.
AWS Glue vs. AWS ECS
AWS Glue is a really powerful tool. It’s powered by PySpark under the hood, which uses a distributed architecture for querying and transforming the data. Still, it’s also less elastic in terms of what can be run and resource allocation.
We also considered AWS ECS, which requires more effort to set up, but it offers greater elasticity regarding what can be run and resource allocation. In our case, AWS ECS proved to be the optimal choice, as AWS Glue seemed to be an overkill for our needs. ECS also has better resource management that could lead to some cost reduction of running the processing, and it’s definitely easier to write the desired code in pandas or polars – packages for Python.
Pandas and polars are similar data science packages for Python. Still, we compared both, and polars obtained much better results: the same code for transforming and splitting up the big CSV file took 30 minutes for polars and three hours for pandas. That’s a big difference – six times boost in time performance.
If you can try polars for a real use case, I recommend that because it’s a lot faster; although it’s in the early development stage, we didn’t encounter any bugs.
Data Transformation and Storage
After splitting and transforming the data, we stored it in an AWS S3 bucket. AWS’s import feature made the data ingestion into the database easy. The final database size was significantly reduced from 50 gigabytes of input CSV file to 15 gigabytes of stored data. This was possible due to filtering of empty rows that occured in transformed data, because of the input file structure.
We are considering the development of the web UI to facilitate SQL query executions for scientists, streamlining the data exploration processes. Additionally, we are exploring the possibility of converting the large CSV file into Parquet files and querying them with AWS Athena. This approach could save costs by eliminating the need for database servers, as we could store the data in S3 and query it using AWS Athena.
We embarked on a challenging journey with a large dataset of 50 GB, half a million study participants and 27,000 columns, posing a considerable hurdle for scientists. Through solution we implemented, which utilized AWS ECS with the polars package and RDS Postgresql, we empowered scientists to query the data effortlessly with SQL. We also uncovered the significant performance advantage of polars over pandas. If possible, we recommend trying polars and moving towards a complete serverless architecture to avoid the costs of running database servers.
Software Engineer at Sigma IT Poland
Member of life science project team
LEARN MORE ABOUT OUR PROJECTS AND SERVICES FOR LIFE SCIENCE INDUSTRY: Software development for Life Sciences