Launch HN: Dataform (YC W18) – Build Reliable SQL Data Pipelines as a Team
110 points by G2H on July 15, 2019 | 36 comments
Hi HN!

We’re Guillaume and Lewis, founders of Dataform, and we're excited (and nervous) to be posting this on HN.

Dataform is a platform for data analysts to manage data workflows in cloud data warehouses such as Google BigQuery, Amazon Redshift or Snowflake. With our open source framework and our web app, analysts can develop and schedule reliable pipelines that turn raw data into the datasets they need for analytics.

Before starting Dataform, we managed engineering teams in AdSense and led product analytics for publisher ads. We relied heavily on data (and data pipelines!) to generate insights, drive better decisions and build better products. Companies like Google invest a lot in internal data tools that let analysts manage data and build data pipelines. In 5 minutes I could define a new dataset in SQL that would be updated every day and then use it in my reports.

Most businesses today are centralising their raw data into cloud data warehouses but lack the tools to manage it efficiently. Pipelines run manually or via custom scripts that break often. Or the company decides to invest engineering resources to set up, maintain and debug a framework like Airflow. But that only solves scheduling, and the technical bar is often too high for analysts to contribute.

We saw a need for a self-service solution for data teams to manage data efficiently, so that analysts can own the entire workflow from raw data to analytics. We built Dataform with two core principles in mind:

1. Bring engineering best practices to data management. In Dataform, you build data pipelines in SQL, and our open source framework lets you seamlessly define dependencies, build incremental tables and reuse code across scripts. You can write tests against your raw and transformed data to ensure data quality across your analytics. Our development environment also facilitates the adoption of best practices: analysts can develop with version control, code review and sandboxed environments (there's a short sketch after these two points).

2. Let data teams focus on data, not infrastructure. We want to bring a better, faster and cheaper alternative to what businesses have to build and maintain in-house today. Our web app comes with a collaborative SQL editor, where teams develop and push their changes to GitHub. You can then orchestrate your data pipelines without having to maintain any infrastructure.
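
To make the first point concrete, here's a minimal sketch of what a couple of Dataform definitions can look like. The dataset and column names are invented for illustration, and the exact file layout and syntax may differ slightly from the current framework:

    -- definitions/daily_orders.sqlx
    config { type: "table" }

    -- ${ref(...)} declares a dependency on another dataset,
    -- so Dataform builds raw_orders before this table
    select
      order_id,
      customer_id,
      date(created_at) as order_date
    from ${ref("raw_orders")}
    where status = 'complete'

    -- definitions/daily_orders_unique.sqlx
    config { type: "assertion" }

    -- the run fails if this query returns any rows,
    -- i.e. if an order_id appears more than once
    select order_id
    from ${ref("daily_orders")}
    group by order_id
    having count(1) > 1

Dataform compiles definitions like these into a dependency graph and runs the SQL directly in your warehouse.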

Here's a short video demo where we develop two new datasets, push the code to GitHub and schedule their execution, all in under 5 minutes.

https://www.youtube.com/watch?v=axDKf0_FhYU

You can sign up at https://dataform.co. If you're curious how it works, here are the docs: https://docs.dataform.co and the link to our open source framework: https://github.com/dataform-co/dataform

We would love to hear your feedback and answer any questions you might have!

Lewis and Guillaume




I take it everything is based on schedules, and there's no runtime magic to handle incremental updates in a timely or efficient way? I've been watching @frankmcsherry's Materialize.io with interest, for example.


Possibly controversial, but I don't believe there are that many situations in data analytics where you really need realtime/streaming analytics. It usually comes at the cost of significant additional complexity and maintenance that ultimately hurt reproducibility and agility.

Having said that, we are working on triggering table builds by watching for changes to consumed datasets, which, paired with incremental table builds, would let you reliably achieve latency of under a few minutes.


I disagree.

Near-realtime (say, sub-1-minute) feeds can be leveraged to extract a lot of value out of the data you're collecting. Maybe not for your typical SaaS startup, but for free-to-play video games you end up leaving so much value on the table if you can't quickly (automatically) respond to spend or churn indicators.

I think a good rule of thumb for this is looking at how fast your decisions are made. If you're making decisions daily then refreshing your data every 24 hours is probably good enough. If you're making decisions every hour, then every 60 minutes is probably good enough.

Another factor is scale. When you're dealing with 6-figure CCUs and are trying to optimize conversion or retention through split-testing, quickly figuring out which variants are performing anomalously poorly can save you a whole lot of money.

I reckon there are at least 50 titles that can benefit from streaming analytics, which immediately affects anywhere from 100 to 500 employees (analysts, engineers) and likely influences over 1000 (broader company). That's a significant portion of the field.


That’s why I don’t like the term real time. Decision time should be the criterion.


Also, a true data warehouse has its data frozen on a set recurring schedule, say every hour or every 24 hours. This is so that users running related jobs at different times throughout the day see the same results and are working with the same data.


1. What do you think of NoETL and SQL++? Ref: https://www.couchbase.com/products/analytics

2. Do you think adopting a tool like Dataform makes more sense at mid-stage companies? Ref: https://thinkgrowth.org/the-startup-founders-guide-to-analyt...


1. I think NoETL products are complementary solutions. NoETL can be a great option for ad hoc queries and when performing a one-off analysis.

But for your company's analytics, you want to build a suite of well defined, reliable and documented datasets that power your dashboards and your regular analytics.

Another way to say it is that with NoETL, you write one query to answer one complex question. With ETL (actually ELT), you define core datasets with which you can answer many questions.

2. In terms of adoption, products like Dataform are helpful from the day you start having data in your data warehouse and have a full-time analyst. It's basically giving superpowers to your analysts.


> well defined, reliable and documented datasets

Can you elaborate on this?

NoETL targets the semi-structured and schema-less data models prevalent in NoSQL/HTAP data stores, so if "well defined" relates to well-defined schemas, it maybe doesn't apply here.

As far as reliability (aka reproducibility) and documentation are concerned, I guess it's more about building tools that promote sound engineering processes like version control, modules (aka reusable snippets) and documentation (could be Jupyter notebooks, etc.), and these tools could be built for NoETL systems too.


Congrats on your launch! How is this different from DBT?


Our framework (and the CLI) is pretty similar, to be honest, except it's in JS. We also spent quite a bit of time focusing on performance.

The main difference is that we provide an all-in-one web platform for teams to develop, test and schedule their SQL pipelines. With one web interface, and in under 5 minutes, you can write a query, test it, add it to a schedule, push the changes to GitHub (or submit a PR) and monitor your jobs.

We make it easy to enforce software engineering best practices like version control and isolated deployments. For example, our version control works similarly to Looker's: analysts can work simultaneously from different branches without having to switch tools or use the command line.


What’s the rationale behind using JS instead of Python/Jinja? Any insights on that?


Great question :)

1. Speed. We wanted compilation to be really, really fast.

2. In all honesty, we just weren't big fans of Jinja, having used it for quite a while. JS templating is OK out of the box, and we are considering React-like syntax in the future.

3. Love it or hate it, NPM packages are pretty easy to work with, and we are working on a number of packages at the moment.

4. When you start to look at things like UDFs and cloud functions, which enable some really cool use cases, JS seems to be prevailing (in Snowflake and BigQuery at least).
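
As a rough illustration of how the JS side gets used (the file and function names here are hypothetical), a snippet in includes/ can be written once and referenced from SQL:

    // includes/country_groups.js
    // a reusable snippet, exported so SQL files can call it
    function inEurope(countryField) {
      return `${countryField} in ('FR', 'DE', 'GB', 'ES')`;
    }
    module.exports = { inEurope };

    -- in a SQL definition
    select session_id, country_code
    from ${ref("sessions")}
    where ${country_groups.inEurope("country_code")}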

I will admit though that we do usually get a bit of a shocked reply when people hear it's not Python!


I see your point, but I think data analysts are usually familiar with Python, not JS. The UI is cool and intuitive; it looks like you copied most of the concepts except Jinja from DBT. That's perfectly OK though. Let's see where it goes!


>In Dataform, you build data pipelines in SQL, and our open source framework lets you seamlessly define dependencies, build incremental tables and reuse code across scripts.

Do you have any plans to add a visual interface? Unfortunately, many analysts I work with are not really proficient in SQL, or don't fully grok SQL features.

Also, how do you plan on creating a moat against cloud companies putting the functionality of your project into their own proprietary solutions? For example, in Azure you can already use the Azure Data Lake stack of ADL Storage + Analytics + Data Factory to do what you are doing. These have IDE integrations, CI/CD integrations, git integrations, are first-class citizens in their cloud interfaces, etc.


A UI for building data transformations (SQL queries) isn't on our immediate roadmap, but we have thought about it.

We are taking a bit of a bet here that the industry is moving towards SQL, and we see a growing interest among analysts in learning it.

Moat: always a concern! Short term, we think most of our unique value comes from the collaborative web development experience, which is not OSS currently. Staying platform agnostic is also important to us. Longer term, we have plans that mostly focus on building value around a community and our SQL ecosystem, but we're not ready to get into details just yet.


Check out the demo. There’s a significant UI available.


Apart from being cloud-focused, what's the difference between dataform and a classic ETL engine like SQL Server Integration Services?


Great question. Beyond being on the cloud and everything that goes with that, I'd say that there are two fundamental differences from a product like SSIS.

1. Dataform focuses on transformations happening _inside_ the cloud data warehouse. We don't move data between systems. More and more businesses are moving from traditional ETL processes to ELT and centralising their raw data in their warehouse. Dataform helps businesses manage the T in ELT.

2. Dataform is built with software engineering best practices in mind. In Dataform, all your transformations are written in code (mostly SQL) instead of a GUI like SSIS. The code can be version controlled and edited, and it scales better to the large number of transformations teams have to deal with.


Is there a tutorial to run it on-prem? For example for development or testing purposes?

What data warehouses are currently supported, and which are on the roadmap? Is ClickHouse somewhere in your plans?


On-prem: right now our IDE is only available as SaaS, although we will be looking at this in the near future. You can develop and test projects with the CLI and deploy them yourself, but there are no tutorials for setting this up beyond the basics yet: https://docs.dataform.co/guides/command-line-interface/

Warehouse support: Athena/Presto and Azure are top of mind. I've not come across ClickHouse before but I'll definitely add it to our tracker!


What's the core transformation technology running the SQL transforms? Are you relying on the cloud data warehouse itself (i.e. BigQuery, Amazon Redshift or Snowflake) to run the SQL transforms? Asking because the transform workload usually dominates the query workload if they're not properly separated.


Just from the docs, it appears to rely on the warehouse itself. It's basically SQL + JS templates to make data engineers' and analysts' jobs easier.


Yup, the warehouse or query engine does the heavy lifting; we do the orchestration and make it easier to develop these pipelines.


Any success getting through the VRM process at a financial services firm? Your product looks very exciting.


Thank you! We haven't been through VRM yet but that's about to come up. If you have any tips, would love to hear your thoughts. I am at ghh@[domain]


Does it work across databases? I.e., if data is stored in multiple databases and needs to be combined for analytics? Currently, we are getting all data to S3 via DMS and running a script on top of this to combine data.


Dataform still requires you to move your data to a single place where it can be queried together, whether that's S3 with Presto, Redshift Spectrum, or something else.

There are lots of great products for moving data like this. Stitch and Fivetran are both great and we've used them ourselves, but they don't help you manage complex data transformation processes.


Looks like yet another ETL/ELT tool.

We have been using composable with quite a bit of success: https://dev.composable.ai


I'm routinely confused about the role these sort of tools play. I conceptually understand what they're doing but weren't these problems solved a long time ago with materialized views and foreign data wrappers?

The hard part of ETL for me has always been gracefully handling the outliers: the data that you need to look at within the context of multiple rows (e.g. duplicate rows that aren't exactly the same, so it's not a simple SELECT DISTINCT) in order to make the correct decision, or the entity matching/mapping/categorization that often is necessary. Or the lookup tables that often need manual oversight when new entries show up that haven't yet been properly tagged. Or, if you're ingesting address data and you want to normalize it through a geocoder, where exactly does that occur in these SQL-templating "pipelines"?
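
(To make the duplicate-row case concrete: the kind of thing I mean usually ends up as a window-function query along these lines, with made-up column names, rather than a DISTINCT:)

    -- keep only the most recent row per logical key
    select *
    from (
      select
        t.*,
        row_number() over (
          partition by customer_id, order_ref
          order by updated_at desc
        ) as rn
      from raw_events t
    ) deduped
    where rn = 1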

I feel like these are basically focused on moving data between different DBs and generating group-by queries to populate some rollup tables?


Indeed. Dataform, however, only does the T (transformation) in ELT. Most businesses are centralising an exploding diversity of raw data in the data warehouse, and Dataform helps them manage the data there.

One other main difference from a product like Composable is that Dataform is built with software engineering best practices in mind. The transformations are written in code (SQL) rather than in a GUI. This enables version control and reusable components, for example, and makes it easier to manage a large number of interdependent transformations.


To be fair, these other tools do this as well.


Does this add any value beyond what PDTs in Looker provide?


Definitely!

We see that PDTs don't scale well with the amount of data complexity you usually have to deal with in the warehouse.

On top of that, the results of PDTs are only accessible to Looker. As your data team matures, there will be many other users and applications that need to take advantage of your transformed data in the warehouse (e.g. modelling in Python).

Dataform also brings other features not available in PDTs. For example:

- Data validation tests (assertions) to ensure data quality in all your data applications (including Looker)

- Incremental tables: you can define tables that update incrementally instead of being rebuilt from scratch, delivering better performance and saving costs (see the sketch after this list)

- Reusable code. You can define reusable code snippets across your entire project
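
Here's a rough sketch of what an incremental definition can look like (the dataset and column names are invented, and the exact helper syntax may vary by version):

    -- definitions/events.sqlx
    config { type: "incremental" }

    select user_id, event_name, event_timestamp
    from ${ref("raw_events")}
    -- on incremental runs, only pick up rows newer than what's already in the table
    ${when(incremental(),
      `where event_timestamp > (select max(event_timestamp) from ${self()})`)}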


I very much agree with this response. The term I've heard to summarize all of the above is "semantic layer". That layer will run in the warehouse, vs. some external dashboard tool.


Is there a roadmap for adding other data sources?


CTO here. What are you after? :)

Athena/Presto and Azure are next in our sights.



