Fivetran has tons of native connectors and is adding new sources and destinations all the time. As of this writing, they have 157 in theirdirectory.
Even though they have so many connectors, and most of the main data sources companies need are represented, there are thousands of SaaS applications out there and a seemingly infinite number of legacy and home-grown systems. This is why a lot of Fivetran customers still need to build out some custom connectors to ensure they are capturing all the data they need for analytics.
In this article, you’ll learn:
The basics of how Fivetran custom connectors work
How to work with less-sophisticated APIs, and
How Fivetran billing works for custom connectors
The Basics of Fivetran Custom Connectors
So your company has decided to use Fivetran as your ELT solution. You’ve hooked up many of the +100 native connectors provided by Fivetran, but then you find out you need to ingest some data from an API that isn’t supported by Fivetran. Lucky for you, Fivetran has provided their users a way to take data from any source and load it into your data warehouse! By utilizing Fivetran connectors for AWS Lambda, Azure Functions, or Google Cloud Functions, you can ingest any data you want just by sending the proper JSON response. No matter what cloud function you go with, Fivetran has a specific request and response cycle that your function must be able to handle in order for Fivetran to ingest your data.
When Fivetran makes an HTTP call to your custom connector, it send a json request with the two keys,stateandsecrets:
stateis a JSON object that typically will keep track of where your connector left off. This could be a date in which you last queried, the last page your connector synced, or any other data required for your custom connector to run. On the initial sync, Fivetran sends a blank state, and your connector's response should set the state after each run.
secretsis an optional JSON object configured in the Fivetran UI. This can be used to pass sensitive information needed for your connector to run, such as a username and password or an API key.
In order for Fivetran to be able to ingest your custom connector data, Fivetran expects a returned result with the following format:
stateis REQUIRED and must be a JSON object with the updated state.
insertis OPTIONAL and contains the tables and the records to be updated or inserted to Fivetran
deleteis OPTIONAL and contains the records to be deleted from their tables
schemais REQUIRED and must contain each table with the primary key(s) for that table.
hasMoreis REQUIRED and signifies whether there are more records to add during this sync. If set to `true`, Fivetran will immediately call your custom connector again with the updated state.
Now that you know what exactly Fivetran needs to ingest your data, there are some items to consider when writing connector code to ingest data from an API you don’t control. First off, due to the philosophy of Fivetran, there is no checking between the data within your connector’s response and your destination warehouse.
This means whatever your connector has within your connector’sinsertkey will be considered as a row to insert or update. If the key already exists within your data warehouse, then it will simply overwrite the data currently in that row even if the data is exactly the same.
Since Fivetran charges credits based on how many rows you ingest, it is up to you as the developer to determine what rows actually need to be updated, deleted, or inserted.
What if There’s No Updated Timestamp?
API maturity plays a key role in this design. A well designed and mature API will have a way for their users to query by some sort ofupdated_atfield to only give you records that have been updated since a given date. This makes your life as a developer much easier since all you need to do is get any new records, query for any records that have been updated since the last time your custom connector function was called, and then return those records to Fivetran within yourinsertresponse.
When the API doesn’t have some sort of way to get only updated records, there are a couple of options.
You could design a system that constantly is scanning the API and keeping track of records that update, and then have your custom connector function ask for that data, which could be data intensive, or
You can simply re-ingest all of the API’s data again and let Fivetran reinsert all of the data. Reinserting all the data again is more costly and takes more time for your sync to finish, so it’s better to only update the tables you actually need to update.
What if There’s No Primary Key?
Another possibility you may have to deal with when trying to ingest an API’s data is an endpoint with no primary key for a record. If the endpoint supplies an ID, then your life is easy, and you can just use that as that table’s primary key. If you don’t supply a primary key, Fivetran will supply its own primary keys of_fivetran_batchand_fivetran_index.
If you allow Fivetran to assign these keys, every sync will be a new row within your data warehouse, and thus will rack up credits quickly. Additionally, this means you may get a lot of duplicated data. Unfortunately if the API you’re using doesn’t have a primary key, there’s no solution that fits every API to ensure a primary key. Since theschemaallows you to have multiple primary keys, you can make a primary key of multiple columns within the table. It’s up to you to determine which columns in a table would make a record unique.
How Fivetran Billing Works for Custom Connectors
Once you’ve followed proper version control with git, thoroughly tested your code to ensure it operates the way you expect it, and have set up your desired cloud function on Fivetran, it’s time to figure out how Fivetran charges via theFivetran Logsnative connector as well as thefivetran_audittable. Between these two, you should be able to figure out yourMonthly Active Rows (MAR), to figure out which tables are costing you the most credits, and how to reduce your MAR by table.
Like the native connectors, Fivetran provides afivetran_auditwithin your function schema. There are two important columns in this generated table:
tablethe table that was modified. This is useful to know which tables are taking up the most of your credits
rows_updated_or_insertedtells you the number of rows within thetablethat were updated or inserted. This corresponds with the number of records your connector had in theinsertkey of the response.This is useful if you notice your connector is adding too many or too few rows. Additionally, you can use this table when trying to test when you’re figuring out the primary key based on multiple columns. If you have too few records being inserted, it could be that your primary key combination is not unique across all records.
Fivetran Logs Connector
This free connector provides several tables that are helpful with debugging. It is suggested you use this connector when designing your connector to ensure everything is working properly and you don't get charged for broken design. The important table to look at isactive_volume, which contains theMARfor your instance of Fivetran. Within this table, you will find the columnmonthly_active_rowswhich corresponds to your credit usage for the month. Somethings to note about this column
It doesnot represent the number of rows for the day, but it is in fact the current total for that month.
It resets to zero at the end of the month.
It is the number ofuniquerows that have been inserted, updated, or deleted during the month measured. This is measured on a primary key basis, and this primary key is only charged once per month. Thus, you can insert the same row a million times in a month and only get charged for that row once during the month.
It is on a table by table basis, so your total MAR is the sum of your end of the month MAR for every table in theactive_volumetable
If you’ve made it this far by reading from top to bottom, congrats! I expect most will use this as a reference, and will hopefully refer back to it several times in the future.
Any record in your custom connectorsinsertkey will be inserted or updated even if the data is the same in your destination.
In order to optimize yourMAR, you have to design the logic on which records need to be updated, and that logic is highly dependent on the APIs you consume.
Always assign a primary key for your tables if you can so you don’t duplicate data in your destination and don’t get charged for extra rows.
Both the Fivetran audit table and the Fivetran Logs Connector are your friends for debugging and for figuring out what tables are costing you the most.
A primary key that is inserted, updated, or deleted will only count as 1 row towards yourMAReven if that same key is inserted a million times that month, and this is on a month-to-month basis.
Things change quickly, and we learn new things every day. We will be sure to update this blog and/or post as we learn more tips and tricks, and as things develop with the Fivetran platform.
Need help with a custom connector?
Check out ourcustom connectorsfor some of the connectors we’ve built or are in the process of building.
If you need one of those connectors, or need one that’s not listed,contact usfor a free quote on your custom data source.