Tracking users across websites: Where’s my data going?
On the 10th of November 2015, I was a guest (together with ICT Law expert Matthias Dobbelaere) on the news/talk show De Afspraak on Canvas, a Belgian TV station. You can watch the interview (which is in Dutch) – here. This was in response to the news that the Belgian Privacy Commission (also known as CBPL) won a case against Facebook in which the social network site has to stop tracking users who are not logged in or do not have a Facebook profile. If Facebook does not comply within 48 hours, they have to pony up some petty cash.
Yes, I should have combed my hair, but that’s not the issue here. The case itself seems like a high-profile attempt of little Belgium standing up against big, evil Facebook, but we argue that FB is just one of the many, many offenders. Thought it was time for a blog post explaining the problem and sharing my small data set of tracking behavior on Belgian news websites.
Tracking: What’s the issue?
When you browse to a website, this website might have an interest in tracking what you do. And it can do this in various ways:
- It can look at which pages you are requesting. This is quite inevitable: you’re requesting information, and the server giving it to you will register that. However, is it okay for that server to remember forever who it gave that information to, when and how? Much like a library remembering forever which books you borrowed – it builds a profile of you, as a person.
- It can save a small file on your computer (they’re called cookies), and store information in it. Cookies are part of the web by design, and can be used for good things (remember preferences, …). Unfortunately, they are mostly abused by storing, for example, a unique identification number. This information can be used to identify you when you return to the website.
- This is where you say: “Aha, mister Baert, I’ll just ignore cookies then! Or clean them out often!” Think again, companies have found ways to sneak them through, and keep them stored even after a wipe, thanks to ever-changing web standards and clever tricks like EverCookie (That’s just evil. Bad programmer! Bad!).
- Storing a unique ID is just one of the many, many ways tracking companies try to mark you.
- Trying to link / correlate data & information about you: If you’re also logged in to a social network site in the same browser, it can try to couple your current fingerprint (technical info about your system: browser, system clock, time, device …) to a specific name, which it can find in other cookies or in links you click.
- Check this EFF site to see how recognizable and unique basic technical info about your computer really is: Panopticlick. Not that anonymous now, are we? (Another good tester is this one)
- Finding your general location and internet provider is also trivial by looking at your IP.
- Bugs in browsers / web standards can leak other info about you, like your web history.
- Tracking what you type and delete, where your cursor hovers on the screen, how fast you scroll, … is all trivial to register using JavaScript, a web technology embedded in every web browser.
- Read more about device fingerprinting on Wikipedia
- And much, much more, but you get the basics.
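To make the fingerprinting idea from the list above a bit more concrete, here is a minimal sketch in Python. It is not how any particular tracker works: the attribute names and values are made up for illustration, and a real tracking script would collect far more signals in the browser itself. It only shows the principle: hash a handful of "boring" technical details, and you get an identifier that stays the same on every visit, no cookie needed.

```python
import hashlib


def fingerprint(attributes: dict[str, str]) -> str:
    """Hash a set of browser attributes into a short, stable identifier."""
    # Sort the keys so the same attributes always give the same string.
    canonical = "|".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical values; a real script would read these from the browser.
visitor = {
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:42.0) Gecko/20100101 Firefox/42.0",
    "timezone_offset": "-60",
    "screen": "1920x1080x24",
    "language": "nl-BE",
}
same_visitor_later = dict(visitor)                   # cookies wiped, same machine
other_visitor = dict(visitor, screen="1366x768x24")  # one attribute differs

print(fingerprint(visitor) == fingerprint(same_visitor_later))  # True
print(fingerprint(visitor) == fingerprint(other_visitor))       # False
```

The more attributes go into the hash, the fewer people share your exact combination, which is exactly what Panopticlick measures.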
The most serious offenders of this tracking misbehavior are online tracking companies who are out there to monitor your clicks, searches and reading habits as you move around the Internet. These companies embed their tracking code into other, well-known websites. They’ve spread like wildfire during the last few years. Lots of money to be made, lots of ethics to ignore!
A recent report said Google has the ability to track you across 92% of the world’s top 1000 websites. For a practical example, when you browse to CNN.com, you will not only get served images and text from the CNN.com server, but your computer will also connect to 50+ other sites. Automatically. Some of these connections are harmless and useful (grabbing a video from YouTube, sound, …), others are very questionable. The point is that the average internet user does not know this is happening. For him/her, browsing to a website is what it is: you type in the URL, the right stuff appears on the screen, happy panda. In the case of Facebook in Belgium, people (not logged in to FB!) visiting a Facebook page got a cookie on their computer which was used to identify them, and learn their interests to provide targeted ads.
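What counts as one of those "other sites"? Here is a rough sketch in Python, with made-up URLs, of how you could separate first-party from third-party requests for a single page load. Big assumption: I treat the last two labels of a hostname as "the site", which is a naive shortcut (it breaks for domains like example.co.uk; real tools use the Public Suffix List for this).

```python
from urllib.parse import urlsplit


def site_of(url: str) -> str:
    """Naive 'site' of a URL: the last two labels of its hostname."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def third_party_sites(page_url: str, request_urls: list[str]) -> set[str]:
    """Sites contacted while loading page_url that are not the page's own site."""
    page_site = site_of(page_url)
    return {site_of(url) for url in request_urls} - {page_site, ""}


# Hypothetical requests made while loading one page.
requests_seen = [
    "http://www.example-news.be/css/main.css",
    "http://static.example-news.be/img/logo.png",
    "http://cdn.tracker-one.example/pixel.gif",
    "http://ads.tracker-two.example/ad.js",
]

print(sorted(third_party_sites("http://www.example-news.be/", requests_seen)))
# ['tracker-one.example', 'tracker-two.example']
```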
A side effect of tracking is that your device has to waste bandwidth, battery power and CPU cycles to execute all this tracking code, which makes pages load slower and drains batteries quicker.
Good references for further reading on tracking on or across websites are this HowToGeek article, EFF Do Not Track, this Lifehacker article, and Wikipedia: Behavioural tracking.
The Cookie Law
The good people at the EU have passed what’s called the Cookie Law. Since the 11th of May 2011, every website that wants to track information of a user using one or more cookie(s) has to inform the user, and allow that user to refuse these cookies.
Problem is, the offer usually comes down to: “accept our cookies, or you don’t get the content and should get the f*ck out”, which results in users quickly and carelessly clicking away the “nag screen”. Very often the user agreement and/or privacy policy are dense, non-transparent works of epic proportions, in a very small font and hardly readable color, tucked away in the deepest depths of the website. Much like License Agreements, people don’t read this stuff, because they’ve got shit to do. These texts were created by well-paid lawyers to be as foggy, impenetrable and ambiguous as possible:
Check this actual example out of an actual privacy policy, which basically says: “We will adapt the site to your age, location, personal interests … without collecting or saving your personal information!”
Other sites just bend or break the rules: For example in Belgium, a lot of sites simply ignore the Cookie Law, or interpret it as: “We’ve got to warn for first-party (read: our OWN) cookies, but not for third-party (read: not our own) embedded (Facebook, Google, …) cookies”. The problem is of course, that these embedded code snippets don’t just turn up on a website: the webmaster of that website puts them there, willingly.
Why? What’s in it for a website (for example, a news site) to allow this? They get money from advertising (pay per view / pay per click) and get some interesting info about how their site is used, but there’s really no way of knowing how these external companies handle this data. They can store it for weeks, months, years or indefinitely. These companies might sell this to other advertising companies. To insurance companies. To criminals. To whoever has the money / bitcoin. The data might not be stored securely and get stolen. These companies might go bankrupt, what happens to the data? Et cetera …
A small test: Belgian Newspaper Websites
As a small test, we’ll browse to some well-known Belgian newspaper websites, and see what services they connect to when we open their home page. Let it be known that I don’t intend to target or single out these websites, I just want to visualize / show what happens when a user connects to a popular, well-known website.
<TECHNICAL INFO> Testing was done on a fresh installation of the latest Firefox, version 42.0, on a Windows 10 64-bit operating system. DNS cache was flushed and all internet cache was manually cleared, so I would see all the connections a new user would see. I was not logged in to any social network service. The browser was literally used for the first time since installation. Logging connections and graph visualization was done using the excellent Lightbeam Firefox plugin. The plugin website has some nice graphs and explanation, too! </TECHNICAL INFO>
In all of the following graphs, the round element is the original site we’re browsing to, the triangular elements are the sites contacted (white lines). If the site contacted also stores one or more cookies, the line gets colored purple. If the service provides a handy custom icon, it gets displayed too, but keep in mind: all of these elements are third-party websites, even the empty triangles.
De Morgen – www.demorgen.be
- Cookie warning? NO. Cookies stored: 7.
- Connection to 43 third-party websites.
We see the usual suspects: Google, Facebook, Twitter, Spotify. Some more disturbing tracking services are contacted too: The KRUX Next Gen Data Management Platform, for example. Chartbeat is another one. What data do these sites get? What can they get out of correlation? It’s pretty easy to use JavaScript to read out browser info, local time, … (see Panopticlick) and get info about me.
De Standaard – www.destandaard.be
- Cookie warning? NO (update: Fixed on 26/11). Cookies stored: 22.
- Connection to 32 third-party websites.
Usual suspects, with some added services like Gemius (‘Knowledge That Supports Business Decisions‘).
UPDATE 26/11/2015: Since publishing this article and appearing on De Afspraak, this site started showing a cookie warning and a detailed listing of what first and third-party cookies they use. Amount and type of cookies seem to be unchanged.
Het Nieuwsblad – www.nieuwsblad.be
- Cookie warning? NO. (update: Fixed on 26/11) Cookies stored: 23.
- Connection to 24 third-party websites.
A whopping 23 cookies stored, and connection to a lot of the same services as the previous pages, allowing for cross-site tracking.
UPDATE 26/11/2015: Since publishing this article and appearing on De Afspraak, this site started showing a cookie warning and a detailed listing of what first and third-party cookies they use. Amount and type of cookies seem to be unchanged.
Het Laatste Nieuws – www.hln.be
- Cookie warning? NO (update: Fixed on 26/11). Cookies stored: 9.
- Connection to 38 third-party websites.
UPDATE 26/11/2015: Since publishing this article and appearing on De Afspraak, this site started showing a cookie warning. Amount and type of cookies seem to be unchanged.
PUTTING IT ALL TOGETHER
So what happens if you would – perhaps as your morning routine – visit each news site, one after the other? What data can be shared between services tracking you – and knowing which articles you read?
You’ll end up with 61 cookies on your computer (which you weren’t asked permission for, which is required by the EU Cookie Law!) and will have connected (in addition to the 4 news websites) to 63 unique external services. Your personal data will have crossed over to different countries, because a lot of these external services have servers in the US. Good luck figuring out what happens to your data once it crosses that border. Your device will have wasted bandwidth, battery and CPU cycles on doing all this. A lot of this also happens unencrypted, since none of these websites support HTTPS connections (but that’s a story for another time!).
You can see in the following visualization how the four news websites connect to the same external services. These external services know what you do on all these sites. Interesting and probably scary stuff, if you ask me.
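If you want to do this kind of counting yourself, it boils down to two questions: how many unique external services were contacted in total, and which of them show up on more than one site? Below is a small Python sketch of that logic. The site and service names are placeholders, not my actual measurements, and I'm not assuming anything about the Lightbeam export format here: you would have to fill the `contacts` dictionary from your own data.

```python
from collections import Counter

# Hypothetical data: for each visited site, the third-party services it contacted.
contacts = {
    "news-a.example": {"tracker-one.example", "tracker-two.example", "video.example"},
    "news-b.example": {"tracker-one.example", "tracker-three.example"},
    "news-c.example": {"tracker-one.example", "tracker-two.example"},
}


def unique_services(contacts: dict[str, set[str]]) -> set[str]:
    """All external services contacted across every visited site."""
    return set().union(*contacts.values())


def shared_services(contacts: dict[str, set[str]], min_sites: int = 2) -> dict[str, int]:
    """Services seen on at least min_sites sites, with the number of sites."""
    counts = Counter(service for services in contacts.values() for service in services)
    return {service: n for service, n in counts.items() if n >= min_sites}


print(len(unique_services(contacts)))  # 4
print(shared_services(contacts))
```

Every service in the second result is in a position to follow you from one site to the next.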
Don’t take my word for it, try it out for yourself! Download Firefox here, and the excellent Lightbeam plugin here. You can also download all my data (in .JSON format, one for each website) in this ZIP file: news_sites_data_lightbeam_jeroenbaert.zip.
FIDDLESTICKS! Is there nothing I can do?
Luckily, there are some user-side options to make sure you don’t connect to third-party trackers, and to make it much harder for these companies to track your behavior online.
But before we go into that: it’s important to raise awareness about these issues and be vocal about this to the services you use. Send a friendly e-mail / tweet to get more explanation on why they use certain third-party services, and how your data is treated. Demand answers. I’m not saying all of this is done with malicious intent, some might simply be oversight, misconfiguration or being uninformed about the effects. Only changing this problem at the core will make the internet a better place.
Like I said in the interview, it has become commonplace to see privacy as the new currency on the internet. If the product is free, you are the product.
Personally I’d much rather pay up a couple of euros to use a service instead of paying for it with my privacy. I’d like to have the option, at least. For example: I quite like a lot of Google services (Gmail is great! I’m using an Android phone!), but I hate their privacy policies. I would not hesitate for a second to fork over, let’s say €25 / year to get the legally binding agreement that they will not collect any data on me. Same goes for Facebook. A handy system to stay in touch with friends? Fine, it’s a good product. Just let me pay for it on a regular basis, stop showing me ads, and let me keep control over my data.
Now, some ways you can prevent being tracked!
Basically, it comes down to equipping your browser with an add-on/plugin, a little piece of software that prevents you from contacting certain third-party trackers. Some very cool people have made what’s called a “blacklist“, a long list of all sites/URLs which have been reported to be in the tracking business. You can think of it like Santa’s list of children who’ve been naughty.
These plugins download a blacklist, keep it up to date, and as soon as they see that a connection is being set up to one of these sites, they abort the connection. In addition, most of these plugins also block ads on websites. You can selectively enable ads again for sites you wish to support. Then again, if a website is okay with tracking you without consent, why would they deserve any money by you viewing ads?
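The core of that blacklist check can be sketched in a few lines of Python. This is a simplification under my own assumptions: real blockers like uBlock Origin use much richer filter rules than a plain set of domain names, and the domains below are made up.

```python
def is_blocked(hostname: str, blocklist: set[str]) -> bool:
    """True if hostname, or any parent domain of it, is on the blocklist."""
    labels = hostname.lower().rstrip(".").split(".")
    # For "cdn.tracker-one.example", check "cdn.tracker-one.example",
    # then "tracker-one.example", then "example".
    return any(".".join(labels[i:]) in blocklist for i in range(len(labels)))


# Hypothetical blocklist entries.
blocklist = {"tracker-one.example", "ads.tracker-two.example"}

print(is_blocked("cdn.tracker-one.example", blocklist))  # True
print(is_blocked("www.example-news.be", blocklist))      # False
```

Checking the parent domains matters: trackers love to spread their code over many subdomains, and you don't want to list each one separately.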
All of these are available for all popular desktop browsers (Firefox, Chrome, Safari). If you’re still using Internet Explorer, it might be time to switch to a more feature-full browser.
- Privacy Badger – A tool developed by the excellent Electronic Frontier Foundation, a group dedicated to defending your online rights.
- uBlock Origin – My personal favorite. Lots of customization. The default settings should be fine for 99% of end users, and protect you from a lot of web nasties. For advanced users: I use it in strict “nothing third party” mode. This breaks some websites, but I can selectively see which third-party content I want to load then.
For mobile browsers, the options are more limited:
- Firefox for Android recently added support for add-ons, and I recommend the excellent uBlock Origin there too. Chrome for Android does not (and probably never will) support third-party extensions.
- If you root your Android Phone, you can use the open-source AdAway tool to edit your /etc/hosts file and block ads.
- Most recent iOS versions (for iPhone, iPad, …) allow people to install extensions to the Safari browser. I’m no iUser, but the Apple Store seems to have a lot of options already.
- I can recommend Better for iOS devices. It’s made by some very awesome people!
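For the curious: the hosts-file trick mentioned for rooted Android phones works on any system with a hosts file. You point the tracker's hostname at an address that goes nowhere, so the connection never leaves your device. A minimal Python sketch that generates such entries (the domains are placeholders, and `0.0.0.0` as the "nowhere" address is a common convention I'm assuming here, not something specific to AdAway):

```python
def hosts_entries(domains: list[str], sink: str = "0.0.0.0") -> str:
    """Build hosts-file lines that send each domain to a dead-end address."""
    # De-duplicate and sort so the output is stable.
    return "\n".join(f"{sink} {domain}" for domain in sorted(set(domains))) + "\n"


# Hypothetical tracker hostnames.
print(hosts_entries(["cdn.tracker-one.example", "ads.tracker-two.example"]), end="")
# 0.0.0.0 ads.tracker-two.example
# 0.0.0.0 cdn.tracker-one.example
```

Keep in mind that a hosts file only matches exact hostnames, so every subdomain needs its own line.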
In the past, you might have seen recommendations for Adblock Plus or Ghostery. These tools have recently been exposed to have shady deals with the adtech industry they’re supposed to protect you against. For more info, check out these articles. So at the moment, I would strongly advocate against using them, unless you literally have no other ad/tracker blocking option.
A more general option is to use a browser / network that is built around protecting your identity and hiding your real location, like the TOR Browser Bundle. Also a good option for mobile devices. Just install and surf away! I’ve talked about this on De Afspraak as well, in a previous interview. Hair was combed.
Using a VPN service might also hide your location, but a profiling process will still happen, and it doesn’t protect you from actual tracking when you also use this VPN to log in to your social network sites.
You can also move away from using services that stomp their big corporate feet all over your privacy. Use DuckDuckGo instead of Google, et cetera. I can understand that this is not for everyone, though.
There is also an initiative of having your browser raise a flag called “Do Not Track” when it contacts a website. The problem here is, there currently is no legal or technical way to force websites to obey this order. Still, nice effort, and in my opinion, it’s not a bad idea to enable this flag. You can read about how you can do that here. Just don’t rely on it.
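Technically, that flag is nothing more than an extra HTTP header, `DNT: 1`, that your browser sends along with every request. Honoring it is entirely up to the website. Here is a sketch in Python of what a well-behaved server could do; the function names and the `visitor_id` cookie are hypothetical, made up for this example.

```python
def prefers_no_tracking(headers: dict[str, str]) -> bool:
    """True if the request carries a 'DNT: 1' header."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return normalized.get("dnt") == "1"


def response_cookies(headers: dict[str, str], visitor_id: str) -> list[str]:
    """Set-Cookie values a polite server would send back (hypothetical cookie)."""
    if prefers_no_tracking(headers):
        return []  # respect the flag: no identifying cookie
    return [f"visitor_id={visitor_id}; Max-Age=31536000; Path=/"]


print(response_cookies({"DNT": "1"}, "abc123"))  # []
print(response_cookies({}, "abc123"))
```

Nothing stops a server from skipping the `if` and setting the cookie anyway, which is exactly the weakness of the whole initiative.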
END
Would love to hear comments/tips on how to improve this article in the comments! You can contact me on Twitter or using the comment form here.
Update: In a (kind of lame) response, Facebook as of 03/12 blocks non-logged in users from viewing public Facebook pages. After solving a Captcha, the .datr cookie, identifying users, is still placed.
Updates:
- Added Better and warnings about Adblock Plus and Ghostery, thanks to @aral
- Added Firefox For Android
- Added Privacy Badger suggestion, thanks to @haploc
- Added section about 'Do Not Track' initiative
- Another revision, added option for mobile users
- Added updates of Nieuwsblad.be / Standaard.be showing cookie warning
- Spelling & Grammar errors fixed, thanks @verhoevenben
- Updated with Facebook response of 03/12
- Updated with Google study of 15/12