Rationale

Why Arbalest?

  • Lightweight library over heavyweight frameworks that can be composed with existing data tools
  • Python is a de facto lingua franca for data science
  • Configuration as code
  • Batteries included, for example, strategies for ingesting time series or sparse data (arbalest.pipeline), or integration with an existing pipeline topology (arbalest.contrib)

Use Cases

Arbalest is not a MapReduce framework, but rather designed to make Amazon Redshift (and all its strengths) easy to use with typical data workflows and tools. Here are a few examples:

  • You are already using a MapReduce framework to process data in S3. Arbalest could make the results of an Elastic MapReduce job queryable with SQL in Redshift. You can then hand off to Arbalest to define additional ETL in plain old SQL.
  • You treat S3 as a catch all data sink, perhaps persisting JSON messages or events from a message system like Kafka or RabbitMQ. Arbalest can expose some or all of this data into a data warehouse using Redshift. The ecosystem of SQL is now available for dashboards, reports, ad-hoc analysis.
  • You have complex pipelines that could benefit from a fast, SQL queryable data sink. Arbalest has support out of the box (arbalest.contrib) to integrate with tools like Luigi to be part of a multi-dependency, multi-step pipeline topology.