John Snow Labs Now Provides Curated Data in Apache Parquet Format – For Blazing Fast Big Data Analytics on Hadoop & Spark

All datasets are now also available in the new, highly optimized format, which delivers order-of-magnitude faster queries and substantial storage savings, according to multiple industry benchmarks.

Lewes, Delaware – August 1, 2016 – John Snow Labs now delivers all datasets in Apache Parquet format. The new format drastically accelerates queries on common benchmarks and reduces disk space, bandwidth and CPU usage. It is available alongside the existing CSV and JSON data formats on all subscriptions.

Apache Parquet is an efficient, general-purpose columnar file format. It is self-describing and language-independent, and it supports multiple compression algorithms, partitioning for big data sets, and nested data structures. John Snow Labs is the first to deliver a data repository in Parquet format in the healthcare space, which is experiencing fast-growing adoption of big data analytics technologies.
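To illustrate why a columnar layout can speed up analytic queries, here is a minimal Python sketch. It is a toy model of row-oriented versus column-oriented storage, not the Parquet format itself, and the field names (`patient_id`, `state`, `charge`) and sample values are hypothetical, invented only for this example.

```python
# Toy illustration of row-oriented vs. columnar layout.
# This is NOT the Parquet format; it only models the general idea.
import json

# Hypothetical sample records; the field names are invented for this sketch.
rows = [
    {"patient_id": i, "state": "DE" if i % 2 else "NY", "charge": 100.0 + i}
    for i in range(1000)
]

# Row-oriented layout: each record is stored whole, one after another.
row_store = [json.dumps(r).encode("utf-8") for r in rows]

# Columnar layout: all values of one field are stored together.
column_store = {
    name: json.dumps([r[name] for r in rows]).encode("utf-8")
    for name in rows[0]
}


def avg_charge_from_rows(store):
    """Decode every full record just to read one field."""
    bytes_read = sum(len(b) for b in store)
    values = [json.loads(b)["charge"] for b in store]
    return sum(values) / len(values), bytes_read


def avg_charge_from_columns(store):
    """Decode only the single column the query needs."""
    blob = store["charge"]
    values = json.loads(blob)
    return sum(values) / len(values), len(blob)


row_avg, row_bytes = avg_charge_from_rows(row_store)
col_avg, col_bytes = avg_charge_from_columns(column_store)
print(f"row layout:      avg={row_avg:.2f}, bytes read={row_bytes}")
print(f"columnar layout: avg={col_avg:.2f}, bytes read={col_bytes}")
```

Both layouts return the same average, but the columnar query reads only the bytes of the one column it needs. Real Parquet files add per-column compression and encoding on top of this idea.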

Parquet was designed for Apache Hadoop and has been adopted by Apache Spark, Cloudera Impala, Hive, Presto and Apache Drill. The majority of big data analytics platforms now recommend it as the most efficient, highest-performing data format. Here are recent publicly available benchmarks:

IBM evaluated multiple data formats for Spark SQL and found that Parquet:

• Was 11 times faster to query than text files
• Reduced data storage by 75% thanks to built-in compression
• Was the only format able to query large files (1 TB in size) with no errors
• Delivered higher scan throughput on Spark 1.6

Another benchmark examined different queries and found that Parquet was:

• 2 to 15 times faster than Avro, and far faster than CSV
• 72% smaller on a wide table and 25% smaller on a narrow table

United Airlines also published that Parquet was:

• 10 times faster than CSV on Presto and 3 times faster than CSV on Hive

According to the founding team, “Our customers expect us to optimize and test the data we provide for whichever analytics platform they use – often for multiple ones. For big data platforms, Apache Parquet is emerging as the gold standard, and we are thrilled to be the first to support it across our entire data catalog. Our customers benefit in two ways. They get turnkey data in an optimized format and do not need to spend time and effort on reformatting, plus they get the day-to-day productivity boost from screaming fast query performance.”

About John Snow Labs:

John Snow Labs provides turnkey data for scientists across 15 areas of healthcare. Its service helps in the analysis of healthcare data. The company specializes in data engineering to optimize storage, bandwidth and data access performance. John Snow Labs also invests in optimizing and testing clean, current and enriched datasets on the latest big data platforms. Its current partners include Cloudera and Hortonworks in big data, Atigeo and Turi in data science, and the open-source projects Spark, Presto and ElasticSearch.

The John Snow Labs team believes that data science will be a major driver of progress for 21st-century medicine. The company contributes by providing quality DataOps and by finding, cleaning, formatting, updating and publishing turnkey data for technology companies, healthcare providers, research, government and non-profit organizations.

