Concerning SteamSpy’s numbers on Dirty Bomb: just think logically about where they could come from if you developed a service like that yourself.
To compute yesterday’s peak concurrent players, you just have to record the current number of players for the most popular games at regular intervals and keep the highest value. The current number is available from Steam’s own stats page. It has a margin of error of 0% because Steam computes it from its own internal database, which covers all Steam players in the world. Currently Dirty Bomb is on this list.
To get the numbers for the audience over the last 2 weeks, you just need to look at your own profile page. It tells you which games you played in the last 2 weeks and for how many hours. All you need to do now is collect this information for as many players as possible. And this is where the margin comes from. Some Steam profiles are private, so you won’t get any information about their games. Also, you cannot look at all player pages at the same time because this would send millions of requests to Steam every second. So in reality one would update the information about each profile every couple of days, similar to how web search engines repeatedly visit web sites.
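To make the collecting step concrete, here is a rough sketch of the tallying, assuming the per-profile data has already been fetched and looks like the JSON returned by the Web API method IPlayerService/GetRecentlyPlayedGames (a `response` object with a `games` list whose entries carry `appid` and `playtime_2weeks`). The function name and the rule "an empty response means a private profile" are my own assumptions, not something SteamSpy has documented.

```python
from collections import Counter


def tally_recent_players(profile_responses):
    """Count, per appid, how many scanned profiles played it in the last 2 weeks.

    Returns (players_per_app, visible_profiles).
    """
    players_per_app = Counter()
    visible_profiles = 0
    for payload in profile_responses:
        response = payload.get("response", {})
        # Assumption: a completely empty response means the profile is private.
        if "games" not in response and "total_count" not in response:
            continue
        visible_profiles += 1
        for game in response.get("games", []):
            if game.get("playtime_2weeks", 0) > 0:
                players_per_app[game["appid"]] += 1
    return players_per_app, visible_profiles
```

The second return value matters as much as the first: it is the sample size you actually have, as opposed to the number of profiles you tried to scan.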
Steam’s own stats page gives you the total concurrent number of players for the last 48 hours (currently peaking at about 11 million). Compare this to your own number of players that you got from scanning the public player profiles. Now you just have to apply common statistical methods to compute the margin of error.
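As an illustration of what "common statistical methods" could mean here, this is a minimal sketch of a textbook margin of error for a sampled proportion, scaled up to the whole population. It assumes the scanned profiles behave like a simple random sample, which is optimistic because private profiles may differ from public ones. The function name, the 95% level (z = 1.96) and the population figure in the usage note are placeholders I picked, not SteamSpy's actual method.

```python
import math


def estimate_players(sample_hits, sample_size, population, z=1.96):
    """Scale a sampled share of players up to the whole population.

    sample_hits: scanned profiles that played the game in the last 2 weeks
    sample_size: scanned profiles that were readable (not private)
    population:  assumed total number of active accounts
    Returns (estimated_players, margin_of_error), both in players.
    """
    share = sample_hits / sample_size
    std_error = math.sqrt(share * (1 - share) / sample_size)
    # Finite population correction: the margin shrinks as the sample
    # approaches the full population and is zero when everyone is scanned.
    fpc = math.sqrt((population - sample_size) / (population - 1))
    margin = z * std_error * fpc
    return share * population, margin * population
```

For example, `estimate_players(500, 100_000, 10_000_000)` gives roughly 50,000 players with a margin of roughly 4,350 either way. Scanning more profiles shrinks the margin, which matches the idea that the margin comes from incomplete coverage.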
I highly doubt that SteamSpy’s margin includes smurfs. Identifying a smurf is a challenge even for the game developer, who has access to much more information than hours played. Even if someone devised heuristics like “a free-to-play shooter has x% smurfs”, it’s doubtful that applying them would improve the estimated player numbers. Just consider that the number of smurfs in Dirty Bomb has reportedly increased since the Humble Bundle made it cheap to get a starter pack.
BTW to actually implement most of this, there is no need to write fancy web crawlers. Steam already provides a Web API and there are libraries like [url=https://github.com/SteamRE/SteamKit]SteamKit[/url] to make it even easier. The main challenge lies in providing the infrastructure for a publicly accessible and up-to-date web site collecting this information about all available Steam games and a large number of their players.
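To show how little code the Web API route needs, here is a small sketch using only the Python standard library. It asks the ISteamUserStats/GetNumberOfCurrentPlayers method for a game's current player count and derives a peak from repeated samples. The helper names are my own, the polling schedule is left out, and I am assuming the reply has the shape `{"response": {"player_count": ...}}`.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.steampowered.com"


def current_players_url(appid):
    """Build the request URL for one game's current player count."""
    query = urllib.parse.urlencode({"appid": appid})
    return f"{API_ROOT}/ISteamUserStats/GetNumberOfCurrentPlayers/v1/?{query}"


def parse_player_count(body):
    """Extract player_count from a JSON reply, or None if it is missing."""
    response = json.loads(body).get("response", {})
    return response.get("player_count")


def fetch_current_players(appid, timeout=10):
    """Perform one live request; needs network access."""
    with urllib.request.urlopen(current_players_url(appid), timeout=timeout) as reply:
        return parse_player_count(reply.read().decode("utf-8"))


def peak(samples):
    """Highest count among the samples, ignoring failed (None) readings."""
    counts = [count for count in samples if count is not None]
    return max(counts) if counts else None
```

Calling `fetch_current_players` with a game's numeric app ID every few minutes and feeding the stored results to `peak` would give a daily peak. Doing the same for every game and millions of profiles is where the infrastructure effort goes.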