Don’t use union in Spark when merging DataFrames — The story of null values when using union
TL;DR: when merging two DataFrames in Spark, do not use union but unionByName for extra safety.
We recently had an issue in our team that was hard to figure out. Everything looked like it was working as it should, and we did not know what was wrong. Only then did we realize that union does not actually do what we thought it would.
When combining or merging two DataFrames in Spark, you might have stumbled upon union. We had, and we were using it quite frequently. Recently, another team that depends on our data noticed a lot of null values in one particular column (far more than usual).
When investigating this, we found the behavior was flaky: sometimes the column contained a lot of nulls, sometimes it contained what we expected. And it only happened in the resulting DataFrame after we had unioned two DataFrames; the original DataFrames we were unioning were exactly as they should be.
After going through the documentation it was clear why this was happening:
The columns of the two DataFrames had the same names and data types, but not the same order within each DataFrame. Since union resolves columns by position rather than by name, values silently ended up in the wrong columns, which led to the symptom we saw. So if you are facing something similar, or want to avoid it before it happens, always use unionByName, as it is much safer and will keep you from having sad days:
It will also let you know if your schemas do not match up at all:
So for extra safety when combining two (even strongly typed) DataFrames, always use unionByName!