The study of (meta-)genomic data has provided scientists with valuable insights into the functioning and composition of microbial communities. Recent advances in high-throughput sequencing technologies have led to significant growth in the volume of data produced and made available for further research.
However, analysis of such data requires multiple preprocessing and computational steps to interpret the genetic composition of samples. For most researchers, configuring these steps and their respective tools is a challenging task due to the complexity of the setup procedure.
Our aim is to present an overview of common metagenomic analysis approaches in environmental biotechnology, compare them, and identify possible reasons why most suffer from a lack of reproducibility.
For this purpose, three main methods were used. First, a literature survey was performed on metagenomic analysis approaches, methodologies, and tools. Next, researchers and scientists with different educational backgrounds who are active in this field were interviewed. Lastly, the process of pipeline construction and its bottlenecks were evaluated through hands-on experience.
Through this research, several common pitfalls and shortcomings of metagenomic analysis practices were identified. Since most researchers in this field lack a fundamental background in computer science and programming, very few would attempt to develop a pipeline from scratch. If they instead opt for "ready-made" general-purpose pipelines, they face various difficulties in setting these up and configuring them to their needs. Finally, it was observed that many of the existing metagenomic tools are not developed and maintained according to the code production standards of computer science. As a result, even the more popular tools can suffer from serious bugs that render them unusable and, consequently, deprecated. These issues, along with many others, can all be categorized under the shorthand "Lack of SOPs (Standard Operating Procedures)" for developing, maintaining, and using metagenomic analysis pipelines.