Re-mapping Data Sources

Part of your Continual Improvement Activities in Power BI data modelling will always involve remapping data sources. If you’re using Database sources, for example, you may have been able to write a better source script to develop a view (or get one written for you) that will greatly speed up your data refresh. In our example, we are using Power BI dataflows, which means that by using parameters in a URL ([RelativePath=…]), it is possible to use a function to download the data from Amazon AWS directly and then execute the ETL steps.

YouTube player

Our initial work managed to validate that the URL remained unchanged. Initial sampling had shown that there was a difference in the URL’s across years, so the expectation was that there might have been a requirement to make further changes.

https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-11.csv
https://nyc-tlc.s3.amazonaws.com/trip+data/yellow_tripdata_2019-11.csv

It appears that this change is cosmetic, with the subdomain also mapping to the folder (nyc-tlc. = /nyc-tlc/). There is an anomaly within the data in that September to December 2019 data shows near double the taxi volumes of other months. When you identify anomalies like this within your data, the task is the same. Is the issue something that has been unintentionally introduced by your modelling activities, or is it “real”? From my experience, it is almost 50:50. Never be afraid to take the time to investigate anomalies. The challenge is, of course, that while these show you a point to investigate, these steps should be used regularly even when everything seems perfect. Trying to find your failings is a major part of the role.

In this case, a simple investigation of the individual files and their size made it clear that there was a mistake in the combination step. The file sizes were too similar to contain double the journey count, so the combination step was identified and it was clear that the “2019 9-12” flow was being combined twice.

The final step that will be completed before the next video will be to add three parameters to each flow, these will allow people to put a Year, Start Month and End Month, this will then be used to build the file list to then download the files for each flow.

Leave a Reply