Fivetran has tons of native connectors and is adding new sources and destinations all the time. As of this writing, they have 157 in their directory.
Even though they have so many connectors, and most of the main data sources companies need are represented, there are thousands of SaaS applications out there and a seemingly infinite number of legacy and home-grown systems. This is why a lot of Fivetran customers still need to build out some custom connectors to ensure they are capturing all the data they need for analytics.
In this article, you’ll learn:
- The basics of how Fivetran custom connectors work
- How to work with less-sophisticated APIs, and
- How Fivetran billing works for custom connectors
The Basics of Fivetran Custom Connectors
So your company has decided to use Fivetran as your ELT solution. You’ve hooked up many of the +100 native connectors provided by Fivetran, but then you find out you need to ingest some data from an API that isn’t supported by Fivetran. Lucky for you, Fivetran has provided their users a way to take data from any source and load it into your data warehouse! By utilizing Fivetran connectors for AWS Lambda, Azure Functions, or Google Cloud Functions, you can ingest any data you want just by sending the proper JSON response. No matter what cloud function you go with, Fivetran has a specific request and response cycle that your function must be able to handle in order for Fivetran to ingest your data.
When Fivetran makes an HTTP call to your custom connector, it send a json request with the two keys, state and secrets:
- state is a JSON object that typically will keep track of where your connector left off. This could be a date in which you last queried, the last page your connector synced, or any other data required for your custom connector to run. On the initial sync, Fivetran sends a blank state, and your connector's response should set the state after each run.
- secrets is an optional JSON object configured in the Fivetran UI. This can be used to pass sensitive information needed for your connector to run, such as a username and password or an API key.
In order for Fivetran to be able to ingest your custom connector data, Fivetran expects a returned result with the following format:
- state is REQUIRED and must be a JSON object with the updated state.
- insert is OPTIONAL and contains the tables and the records to be updated or inserted to Fivetran
- delete is OPTIONAL and contains the records to be deleted from their tables
- schema is REQUIRED and must contain each table with the primary key(s) for that table.
- hasMore is REQUIRED and signifies whether there are more records to add during this sync. If set to `true`, Fivetran will immediately call your custom connector again with the updated state.
Now that you know what exactly Fivetran needs to ingest your data, there are some items to consider when writing connector code to ingest data from an API you don’t control. First off, due to the philosophy of Fivetran, there is no checking between the data within your connector’s response and your destination warehouse.
This means whatever your connector has within your connector’s insert key will be considered as a row to insert or update. If the key already exists within your data warehouse, then it will simply overwrite the data currently in that row even if the data is exactly the same.
Since Fivetran charges credits based on how many rows you ingest, it is up to you as the developer to determine what rows actually need to be updated, deleted, or inserted.
What if There’s No Updated Timestamp?
API maturity plays a key role in this design. A well designed and mature API will have a way for their users to query by some sort of updated_at field to only give you records that have been updated since a given date. This makes your life as a developer much easier since all you need to do is get any new records, query for any records that have been updated since the last time your custom connector function was called, and then return those records to Fivetran within your insert response.
When the API doesn’t have some sort of way to get only updated records, there are a couple of options.
- You could design a system that constantly is scanning the API and keeping track of records that update, and then have your custom connector function ask for that data, which could be data intensive, or
- You can simply re-ingest all of the API’s data again and let Fivetran reinsert all of the data. Reinserting all the data again is more costly and takes more time for your sync to finish, so it’s better to only update the tables you actually need to update.
What if There’s No Primary Key?
Another possibility you may have to deal with when trying to ingest an API’s data is an endpoint with no primary key for a record. If the endpoint supplies an ID, then your life is easy, and you can just use that as that table’s primary key. If you don’t supply a primary key, Fivetran will supply its own primary keys of _fivetran_batch and _fivetran_index.
If you allow Fivetran to assign these keys, every sync will be a new row within your data warehouse, and thus will rack up credits quickly. Additionally, this means you may get a lot of duplicated data. Unfortunately if the API you’re using doesn’t have a primary key, there’s no solution that fits every API to ensure a primary key. Since the schema allows you to have multiple primary keys, you can make a primary key of multiple columns within the table. It’s up to you to determine which columns in a table would make a record unique.
How Fivetran Billing Works for Custom Connectors
Once you’ve followed proper version control with git, thoroughly tested your code to ensure it operates the way you expect it, and have set up your desired cloud function on Fivetran, it’s time to figure out how Fivetran charges via the Fivetran Logs native connector as well as the fivetran_audit table. Between these two, you should be able to figure out your Monthly Active Rows (MAR), to figure out which tables are costing you the most credits, and how to reduce your MAR by table.
Like the native connectors, Fivetran provides a fivetran_audit within your function schema. There are two important columns in this generated table:
- table the table that was modified. This is useful to know which tables are taking up the most of your credits
- rows_updated_or_inserted tells you the number of rows within the table that were updated or inserted. This corresponds with the number of records your connector had in the insert key of the response.This is useful if you notice your connector is adding too many or too few rows. Additionally, you can use this table when trying to test when you’re figuring out the primary key based on multiple columns. If you have too few records being inserted, it could be that your primary key combination is not unique across all records.
Fivetran Logs Connector
This free connector provides several tables that are helpful with debugging. It is suggested you use this connector when designing your connector to ensure everything is working properly and you don't get charged for broken design. The important table to look at is active_volume, which contains the MAR for your instance of Fivetran. Within this table, you will find the column monthly_active_rows which corresponds to your credit usage for the month. Somethings to note about this column
- It does not represent the number of rows for the day, but it is in fact the current total for that month.
- It resets to zero at the end of the month.
- It is the number of unique rows that have been inserted, updated, or deleted during the month measured. This is measured on a primary key basis, and this primary key is only charged once per month. Thus, you can insert the same row a million times in a month and only get charged for that row once during the month.
- It is on a table by table basis, so your total MAR is the sum of your end of the month MAR for every table in the active_volume table
If you’ve made it this far by reading from top to bottom, congrats! I expect most will use this as a reference, and will hopefully refer back to it several times in the future.
- Any record in your custom connectors insert key will be inserted or updated even if the data is the same in your destination.
- In order to optimize your MAR, you have to design the logic on which records need to be updated, and that logic is highly dependent on the APIs you consume.
- Always assign a primary key for your tables if you can so you don’t duplicate data in your destination and don’t get charged for extra rows.
- Both the Fivetran audit table and the Fivetran Logs Connector are your friends for debugging and for figuring out what tables are costing you the most.
- A primary key that is inserted, updated, or deleted will only count as 1 row towards your MAR even if that same key is inserted a million times that month, and this is on a month-to-month basis.
Things change quickly, and we learn new things every day. We will be sure to update this blog and/or post as we learn more tips and tricks, and as things develop with the Fivetran platform.
Need help with a custom connector?
Check out our custom connectors for some of the connectors we’ve built or are in the process of building.
If you need one of those connectors, or need one that’s not listed, contact us for a free quote on your custom data source.