Let me recount the journey that sparked the creation of this article—an inspiring meetup in the vibrant city of Cambridge, a hub for pioneering research and the real-world implementation of life sciences, health, and medicine. Together with our partners at Nexer Digital from the UK, on May 11, 2023, we organized a gathering titled “Digital Solutions for Health and Life Sciences.”
Before we delve into my tech journey, allow me to introduce myself. My roots are in chemical engineering, where I spent four years honing my skills in membrane process modeling. I also have over six years of hands-on experience in life sciences. I sharpened my coding skills on projects such as Synthia’s retrosynthesis SAS and the Enterprise Science Platform. That is not the full extent of my work, though: I have also been deeply engaged in building robust lab notebooks and devising solutions that bridge the digital and scientific domains.
Today I am working at Sigma IT Poland, where I contribute to developing a drug development platform for a prominent biotech player. In this article, I will share my insights about smart strategies for Python parallelism and address common issues like data synchronization, memory management, and resource allocation optimization. Let’s dive into practical techniques that boost Python’s performance and explore ways to optimize your code.
Python has gained prominence in the scientific community, particularly in life sciences, because of its simplicity, versatility, and extensive collection of libraries dedicated to scientific computation and analysis. The language’s intuitive and readable syntax appeals to seasoned developers and newcomers alike.
The article delves into how Python utilizes parallelism to harness the potential of multicore processors, allowing concurrent task execution and optimizing data processing and computation. This capability equips scientists to handle complex problems that demand substantial computational resources, including simulations, data analysis, and large-scale computations.
The straightforward reason for Python’s popularity in science is its extensive selection of scientific libraries. Pandas, NumPy, and SciPy cover general-purpose needs. For data visualization and presentation, researchers can rely on Matplotlib and Seaborn. For cheminformatics, Python offers the powerful OpenEye toolkit, while for bioinformatics, there is Biopython.
Additionally, Python provides a range of options for machine learning, including scikit-learn, TensorFlow, Keras, and many more. And that is just scratching the surface: as you delve deeper into Python’s scientific ecosystem, you will discover numerous alternatives and helper tools. This abundance raises a question: why are there so many resources tailored specifically for scientific applications in the first place?
The answer is that Python is a very good fit for the science community. Its simple and forgiving syntax makes it approachable for people who are not professional programmers. While not perfect for enterprise solutions, Python’s duck typing is excellent for prototyping and for writing one-shot scripts that address specific issues. The garbage collector handles memory management, freeing you to focus on the problem. Moreover, Python ships with an extensive standard library that streamlines mundane tasks such as file system access, text parsing, and mathematical operations.
One feature of Python that is often overlooked when listing its pros is how easily it interfaces with low-level data and functions. This makes it possible to connect Python directly to various lab equipment, and to write high-performance libraries in low-level languages and wrap them in easily accessible Python code.
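As a rough illustration of that low-level access, the sketch below uses the standard `ctypes` module to map raw bytes onto a C-style structure. The 12-byte frame layout, the `Reading` class, and its field names are assumptions made up for this example, not the protocol of any real instrument.

```python
import ctypes
import struct


class Reading(ctypes.LittleEndianStructure):
    """Hypothetical 12-byte frame sent by a lab instrument."""

    _fields_ = [
        ("sample_id", ctypes.c_uint32),
        ("temperature_c", ctypes.c_float),
        ("absorbance", ctypes.c_float),
    ]


def parse_reading(frame: bytes) -> Reading:
    """Map raw bytes onto the structure without manual slicing."""
    if len(frame) != ctypes.sizeof(Reading):
        raise ValueError("unexpected frame size")
    return Reading.from_buffer_copy(frame)


# Simulate the bytes a device might send over a serial port or socket.
frame = struct.pack("<Iff", 7, 36.5, 1.25)
reading = parse_reading(frame)
print(reading.sample_id, reading.temperature_c, reading.absorbance)
```

The same module can also load shared libraries and call their functions, which is one common way to wrap code written in C.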
Like any other language, Python has its own set of limitations. One of the most significant is the GIL, which is particularly relevant for life science projects.
GIL stands for Global Interpreter Lock. It is a simple mutex that ensures only one thread at a time can execute code in the Python interpreter. Because CPython tracks object lifetimes with reference counts, this lock lets the interpreter determine quickly and accurately which objects can be deleted and which should be retained.
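The bookkeeping that the GIL protects can be observed directly. This is a small sketch that assumes the standard CPython interpreter, where `sys.getrefcount` reports how many references point to an object; the variable names are illustrative only.

```python
import sys

payload = [1, 2, 3]
before = sys.getrefcount(payload)

alias = payload                      # one more reference to the same list
after = sys.getrefcount(payload)
print("references added:", after - before)

del alias                            # dropping the name removes that reference
restored = sys.getrefcount(payload)
print("back to the starting count:", restored == before)
```

Without a lock, two threads updating such a counter at the same moment could corrupt it, and an object could be freed while still in use.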
At first glance, this is a beneficial feature, and in certain scenarios it is. However, the advantage comes at a cost for parallelism. The GIL makes parallelization with threads challenging because only one thread can execute Python code at any given time. This limitation is relatively manageable when a program primarily deals with input-output operations: while one thread waits for a response, it releases the lock and other threads continue their work.
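The input-output case can be sketched with a few threads that only wait. Here `time.sleep` stands in for a blocking network or disk call (an assumption of this example); like those calls, it releases the GIL while waiting, so the waits overlap.

```python
import threading
import time


def fake_io(delay: float) -> None:
    """Stand-in for a blocking read; sleeping releases the GIL."""
    time.sleep(delay)


def run_in_threads(n_threads: int, delay: float) -> float:
    """Start n_threads waiting threads and return the total wall time."""
    threads = [
        threading.Thread(target=fake_io, args=(delay,))
        for _ in range(n_threads)
    ]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start


elapsed = run_in_threads(4, 0.2)
print(f"4 waits of 0.2 s finished in {elapsed:.2f} s")
```

Run one after another, the four waits would take about 0.8 s. If the sleep were replaced by a pure-Python calculation, the threads would take turns holding the GIL and the speed-up would disappear.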
However, in life sciences, parallel processing is fundamental. Activities such as high-throughput screening, protein folding, and retrosynthesis demand substantial computational resources. Whether you have four cores on your university laptop or thousands of cores in a cloud cluster, the objective is to utilize them all efficiently.
The alternative to threads is processes. Processes are not without flaws: they are resource-heavy and do not share memory. Still, they allow for effective parallel computing. The problems that arise in practice vary, but most of them fall into four classes: multithreading, multiprocessing, memory-bound problems, and serialization-bound problems.
Multithreading is the simplest of the classes, and if feasible, every other problem should be refactored to utilize multithreading. It is primarily designed for input-output operations, where the bulk of the heavy lifting occurs elsewhere. This could involve workers processing a queue, microservices accessed via HTTP, or resource-intensive database queries. The key idea is that the heavy work is performed outside the Python interpreter, whose threads only wait for the results.
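A minimal sketch of the "workers processing a queue" pattern, using only the standard library. The `fetch` function is a hypothetical placeholder for an HTTP call or database query, and the worker count of four is an arbitrary choice.

```python
import queue
import threading
import time


def fetch(job: int) -> int:
    """Hypothetical stand-in for an HTTP request or a database query."""
    time.sleep(0.05)
    return job * job


def worker(jobs: queue.Queue, results: list, lock: threading.Lock) -> None:
    """Pull jobs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            return
        value = fetch(job)
        with lock:
            results.append((job, value))


def run(job_ids, n_workers: int = 4) -> dict:
    jobs: queue.Queue = queue.Queue()
    results: list = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(jobs, results, lock))
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for job in job_ids:
        jobs.put(job)
    for _ in threads:
        jobs.put(None)        # one sentinel per worker
    for thread in threads:
        thread.join()
    return dict(results)


squares = run(range(1, 9))
print(squares[3])
```

For many cases `concurrent.futures.ThreadPoolExecutor` gives the same behaviour with less code; the explicit queue is shown here because it mirrors the worker model described in the text.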
The most significant advantage of multithreading is its inherent scalability. The system can efficiently handle larger workloads by increasing the number of workers. This flexibility makes it an attractive choice for scenarios where scalability is a critical requirement.
Another avenue to explore is multiprocessing, a class that opts for processes over threads. In its simplest form, a separate process is started for each piece of input data. However, this approach can backfire. Processes are heavyweight: creating them takes substantial resources and time. Furthermore, competition for resources among these processes can lead to suboptimal utilization.
A workaround involves segmenting data into manageable batches, each entrusted to a distinct process, thereby ensuring a dedicated processing unit for each. This strategy proves effective for many scenarios. Nevertheless, when data processing times diverge, an issue arises wherein some processes are burdened while others remain dormant.
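The batching strategy can be sketched as follows. Prime counting is just an illustrative CPU-bound workload, and `make_batches`, `count_primes`, and the default of four workers are assumptions of this example.

```python
from concurrent.futures import ProcessPoolExecutor


def is_prime(n: int) -> bool:
    """Trial division; deliberately simple and CPU-bound."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def count_primes(numbers) -> int:
    """Work done by one process on one batch."""
    return sum(1 for n in numbers if is_prime(n))


def make_batches(items, n_batches: int) -> list:
    """Split items into n_batches contiguous chunks of near-equal size."""
    items = list(items)
    size, extra = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        end = start + size + (1 if i < extra else 0)
        batches.append(items[start:end])
        start = end
    return batches


def count_primes_parallel(items, n_workers: int = 4) -> int:
    """One batch per worker process; results are summed in the parent."""
    batches = make_batches(items, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(count_primes, batches))


if __name__ == "__main__":
    print(count_primes_parallel(range(1, 10_001)))
```

Note that the batches here are equal in length, not in cost: larger numbers take longer to test, so the last batch finishes last. This is a small instance of the imbalance described above.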
The optimal solution entails a refactoring endeavor, transforming the problem into multithreading. By reimagining processes as dynamic workers or microservices fueled by a continuous stream of data, equilibrium is achieved. This orchestration ensures uniform workload distribution, culminating in enhanced efficiency across the board.
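One way to approximate such dynamic workers without leaving a single machine is to let each process pull one task at a time from a shared stream. The sketch below uses `multiprocessing.Pool.imap_unordered` with `chunksize=1`; the `simulate` function and its sleep-based cost are placeholders for uneven real work.

```python
import multiprocessing as mp
import time


def simulate(task):
    """Hypothetical job whose run time depends on its input."""
    task_id, cost = task
    time.sleep(cost)              # placeholder for uneven CPU work
    return task_id, cost


def run_dynamic(tasks, n_workers: int = 4) -> dict:
    """Hand out tasks one by one so that no worker sits idle."""
    results = {}
    with mp.Pool(processes=n_workers) as pool:
        # chunksize=1: a worker asks for the next task only when it is free.
        for task_id, cost in pool.imap_unordered(simulate, tasks, chunksize=1):
            results[task_id] = cost
    return results


if __name__ == "__main__":
    # One slow task and many fast ones.
    tasks = [(i, 0.2 if i == 0 else 0.01) for i in range(12)]
    done = run_dynamic(tasks)
    print(sorted(done))
```

While one worker is busy with the slow task, the others keep draining the remaining fast ones. A message queue with independent worker services follows the same idea across machines.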
The third class is the memory-bound scenario, another challenge that comes with processes. With contemporary cloud solutions, reserving hundreds or even thousands of cores for a computation is feasible, especially when a financial incentive exists. However, problems emerge when each spawned process needs tens or even hundreds of gigabytes of memory, which quickly makes memory the limiting factor. Unfortunately, there is no universally applicable out-of-the-box remedy for these situations.
Recently, I engaged with a library comprising two distinct models: one characterized by CPU-intensive demands and the other by memory-intensive requirements, employed for cross-referencing outcomes from the former. In such instances, a viable approach involves partitioning these models and independently scaling each component.
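A toy sketch of that partitioning, not the actual library: the CPU-heavy model runs in a pool with many workers, while the memory-heavy model runs in a separate pool with a single worker that loads its reference data once. All function names, the pool sizes, and the tiny stand-in data set are assumptions.

```python
from concurrent.futures import ProcessPoolExecutor

_REFERENCE = None  # filled once in each memory-heavy worker


def load_reference() -> None:
    """Initializer: stand-in for loading a multi-gigabyte lookup structure."""
    global _REFERENCE
    _REFERENCE = set(range(0, 1000, 7))


def cpu_model(x: int) -> int:
    """Hypothetical CPU-heavy model producing a candidate."""
    return sum(i * i for i in range(x)) % 1000


def memory_model(candidate: int):
    """Hypothetical memory-heavy model: cross-reference a candidate."""
    return candidate, candidate in _REFERENCE


def pipeline(inputs, cpu_workers: int = 4, memory_workers: int = 1) -> list:
    """Scale the two stages independently of each other."""
    with ProcessPoolExecutor(max_workers=cpu_workers) as cpu_pool:
        with ProcessPoolExecutor(
            max_workers=memory_workers, initializer=load_reference
        ) as mem_pool:
            candidates = cpu_pool.map(cpu_model, inputs)
            return list(mem_pool.map(memory_model, candidates))


if __name__ == "__main__":
    print(pipeline(range(1, 9)))
```

Only the memory-heavy pool pays the memory cost, so adding CPU workers no longer multiplies it. In a cloud setting the two stages would more likely be separate services on differently sized machines.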
The final class is the serialization-bound problem. The previous classes assume that input and output are easy to serialize, but reality paints a different picture for the data-rich objects often found in scientific contexts. Here, serialization might be slow or not implemented at all. What steps can be taken when faced with such a predicament?
One strategy involves outsourcing the issue to a different environment. I recall one of my first cheminformatics assignments, where the task was to cross-reference 10,000 reaction templates against a staggering 10 million reactions. Initially, the task appeared formidable in vanilla Python due to the serialization complexities. However, an elegant solution emerged once we moved these operations into a database.
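To give a feel for the database approach, here is a deliberately tiny sketch with the standard `sqlite3` module. The schema, the `signature` column, and the exact-match join are assumptions for illustration; real template matching relies on chemistry-aware search rather than string equality.

```python
import sqlite3


def build_database() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical toy data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE templates (id INTEGER PRIMARY KEY, signature TEXT);
        CREATE TABLE reactions (id INTEGER PRIMARY KEY, signature TEXT);
        CREATE INDEX idx_reactions_signature ON reactions (signature);
        """
    )
    conn.executemany(
        "INSERT INTO templates VALUES (?, ?)",
        [(1, "amide_coupling"), (2, "suzuki"), (3, "wittig")],
    )
    conn.executemany(
        "INSERT INTO reactions VALUES (?, ?)",
        [(10, "suzuki"), (11, "amide_coupling"), (12, "suzuki"), (13, "grignard")],
    )
    return conn


def match_counts(conn: sqlite3.Connection) -> list:
    """Count matching reactions per template inside the database engine."""
    query = """
        SELECT t.id, COUNT(r.id)
        FROM templates AS t
        LEFT JOIN reactions AS r ON r.signature = t.signature
        GROUP BY t.id
        ORDER BY t.id
    """
    return conn.execute(query).fetchall()


conn = build_database()
rows = match_counts(conn)
print(rows)
conn.close()
```

The Python side never touches individual reaction objects: the join runs inside the database, and only small result rows cross the boundary.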
An alternative approach is to use Python’s interfaces for direct access to low-level data and serialize objects straight into binary form. This is attainable even for objects that are mere C++ wrappers.
If these approaches fail, viable options remain. Firstly, the GIL is a constraint only in the predominant Python interpreter, CPython. Other Python implementations and supersets, such as Jython, Mojo, or Codon, are free of this limitation. If the libraries you use are compatible, it is worth experimenting with them. Alternatively, if challenges persist, implementing the solution in C++ or Java and interfacing with it from Python can be an effective workaround.
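The binary-serialization idea mentioned above can be sketched with `struct` and a custom `__reduce__`. The `Molecule` class is a hypothetical pure-Python stand-in for a wrapper around a native object, and the byte layout (a count followed by doubles) is an assumption of this example.

```python
import pickle
import struct

_HEADER = struct.Struct("<I")  # number of coordinates, little-endian


def rebuild_molecule(blob: bytes) -> "Molecule":
    """Module-level helper so that pickle can locate it by name."""
    return Molecule.from_bytes(blob)


class Molecule:
    """Hypothetical wrapper; imagine the coordinates living in a C++ object."""

    def __init__(self, coords):
        self.coords = [float(c) for c in coords]

    def to_bytes(self) -> bytes:
        n = len(self.coords)
        return _HEADER.pack(n) + struct.pack(f"<{n}d", *self.coords)

    @classmethod
    def from_bytes(cls, blob: bytes) -> "Molecule":
        (n,) = _HEADER.unpack_from(blob, 0)
        return cls(struct.unpack_from(f"<{n}d", blob, _HEADER.size))

    def __reduce__(self):
        # Tell pickle, and therefore multiprocessing, to ship compact bytes.
        return (rebuild_molecule, (self.to_bytes(),))


original = Molecule([0.0, 1.5, -2.25])
clone = pickle.loads(pickle.dumps(original))
print(clone.coords)
```

Because `multiprocessing` uses pickle to move arguments and results between processes, defining `__reduce__` like this is enough to make such an object usable with process pools.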
Finally, just as I was writing this article, the Python Steering Council voted to remove the GIL completely in the long term. The authors of the proposal expect the GIL-less version to be available as an experimental build as soon as the end of 2024 and as the only option in about five years. So, if everything else fails, maybe you just need to wait.
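For readers who want to check their own interpreter, here is a small sketch. It assumes CPython 3.13 or later, where `sys._is_gil_enabled()` is available; on older versions the attribute does not exist, which the helper treats as "the GIL is on". The function name `gil_status` is made up for this example.

```python
import sys


def gil_status() -> str:
    """Report whether the running interpreter currently uses the GIL."""
    check = getattr(sys, "_is_gil_enabled", None)
    if check is None:
        # Attribute missing: this interpreter has no free-threaded mode.
        return "GIL enabled (no free-threaded support)"
    return "GIL enabled" if check() else "GIL disabled"


status = gil_status()
print(status)
```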
In conclusion, let’s recap the essential takeaways from our exploration. Due to the Global Interpreter Lock (GIL), Python threads lack true parallelism; however, parallel processing can be achieved through processes. It’s advisable to reframe challenges whenever feasible, transforming them into multithreading scenarios by assigning tasks to queues, microservices, AWS lambdas, or analogous tools.
Furthermore, should all else falter, feel free to venture beyond the confines of Python. Consider alternative solutions like leveraging databases or utilizing C++ wrappers to overcome hurdles.