  • #1721134

    Maybe someone can help me out. Has anyone in the WGA attempted an automated way to gather stats for WI geocachers? I’ve been pulling some long nights finishing up some scripts, when I realized that someone may already have a quick way to get these stats.

    If not, I’ll keep coding for the next few nights. (I’ve got a command line tool now that gets details on caches found by username…added the database…but I still have a fairly large TODO list).

    #1746167

    I hate to dim your hopes, but I do hope you are more successful than the others before you who have tried and failed.

    Groundspeak has protected the site rather well from people who try automated gathering of stats. IP blocking and other methods of prevention are in place. Part of the reason is that the tactics used in the past required too much of the server. We have all noticed the server is already overloaded each weekend.

    If you want to compare manually entered stats for WI cachers, Cheesehead Dave has a good site HERE.

    There is also a national database HERE, but again, it is manually entered information provided by each individual cacher.

    I wish you luck!

    #1746168

    Yep, I had a PHP script that ran under cron to collect find data. It lasted about two days before it was IP banned. I wish you luck in trying, but don’t be surprised if you end up the same way…

    #1746169

    Well, that sure stinks. Just when I got the script working, too! If sites like geocaching.com would just offer RSS feeds, people like us wouldn’t have to write scrapers. I have an email in to the main site, but I haven’t heard a response yet. Thanks for the warning, though. I may have to rethink some of my strategy.
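    Just to illustrate the point: if they published a stats feed (the URL below is invented; no such feed exists), reading it would take a few lines of PHP instead of a whole scraper:

    // Invented feed URL, purely for illustration
    $xml = simplexml_load_file('http://www.geocaching.com/feeds/wi-stats.rss');

    // Each <item> might carry a cacher's name and find count in its title
    foreach ($xml->channel->item as $item) {
        echo $item->title . "\n";
    }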

    #1746170

    OK, so I took server load into account… it waits several minutes between each cacher update, and only uses one web hit per cacher. I’ll let it run for a few days, and I’ll let everyone know.

    #1746171

    Another tactic you could try is bouncing the request through a number of public proxies to avoid the IP ban.

    For example: http://www.publicproxyservers.com/page1.html

    The script could just cycle through the list.

    #1746172

    Something like this:

    function sendToHost($proxy, $host, $method, $path, $data, $useragent = 0)
    {
        // Supply a default method of GET if the one passed was empty
        if (empty($method)) {
            $method = 'GET';
        }
        $method = strtoupper($method);

        // Connect to the proxy, not the target host (port 8080 is assumed;
        // public proxies listen on a variety of ports)
        $fp = fsockopen($proxy, 8080);
        if (!$fp) {
            return false;
        }

        // For GET, the data rides along as the query string
        if ($method == 'GET') {
            $path .= '?' . $data;
        }

        // A proxied request puts the absolute URL in the request line
        fputs($fp, "$method http://$host/$path HTTP/1.0\r\n");
        fputs($fp, "Host: $host\r\n");
        if ($useragent) {
            fputs($fp, "User-Agent: MSIE\r\n");
        }
        if ($method == 'POST') {
            fputs($fp, "Content-type: application/x-www-form-urlencoded\r\n");
            fputs($fp, "Content-length: " . strlen($data) . "\r\n");
        }
        fputs($fp, "Connection: close\r\n\r\n");
        if ($method == 'POST') {
            fputs($fp, $data);
        }

        // Read the whole response, headers included
        $buf = '';
        while (!feof($fp)) {
            $buf .= fgets($fp, 128);
        }
        fclose($fp);
        return $buf;
    }
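    For what it’s worth, the cycling itself could be as simple as the sketch below. The proxy names, the stats path, and the query parameter are all made up; you’d swap in the real proxy list and the real geocaching.com URL:

    // Hypothetical proxy list; rotate to the next one on every request
    $proxies = array('proxy1.example.com', 'proxy2.example.com', 'proxy3.example.com');
    $i = 0;

    // $cachers is assumed to hold the usernames you want to poll
    foreach ($cachers as $cacher) {
        $proxy = $proxies[$i++ % count($proxies)];
        $html = sendToHost($proxy, 'www.geocaching.com', 'GET', 'stats.aspx', 'user=' . urlencode($cacher));
    }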

    #1746173

    Actually, I’m using curl (http://curl.haxx.se/) and lynx, which does some nice web browser emulation. I’m using Perl to manage everything, and I’ve set a sleep of 60 seconds between grabbing each user’s stats. Now that I think of it, I’ll add a random number between 1 and 30 to that, so from an access.log read it’ll look like I’m just browsing geocaching.com randomly. I’m already grabbing the users’ stats in a random order each time (don’t want consistent logs).

    I feel this is also fair to the geocaching.com web server. If I had the time, I’d be hitting all of these pages by hand each night and updating a spreadsheet. This just lets me do that without being at my computer, while emulating normal website usage as closely as possible. I’m a paying member of geocaching.com, and I’d even consider paying more if I had easier access to this kind of information.
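    In rough PHP terms (mine is Perl, and fetchStats() below is just a stand-in for the actual grab), the pacing amounts to:

    // Placeholder usernames; the real list comes from the database
    $cachers = array('alpha', 'bravo', 'charlie');

    // Randomize the order each run so the logs show no fixed pattern
    shuffle($cachers);

    foreach ($cachers as $cacher) {
        fetchStats($cacher);      // stand-in for the single web hit per cacher
        sleep(60 + rand(1, 30));  // 60s base delay plus 1-30s of jitter
    }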

    #1746174

    OK guys, I know it’s cold outside, and digging through the snow may not seem like fun, but I really think you need to get out more.

    #1746175

    If I got out more, I wouldn’t have time to add owned stats!

    Here they are:
    http://www.igotsomestuff.com/cgi-bin/showstats.pl

    Anyone have a nice big list of WI geocacher usernames? Until I get the script that automatically finds WI cachers and adds them to the database, I’ve been digging through logs and adding people in by hand.

    Thanks!

    #1746176

    quote:

    Originally posted by houseofbrew:
    Anyone have a nice big list of WI geocacher usernames? Until I get the script that automatically finds WI cachers and adds them to the database, I’ve been digging through logs and adding people in by hand.

    Thanks!
    Yup: http://wi-geocaching.com/membership/list.php?sort=created

    #1746177

    Maybe it would be good to write a script to pull user names off the WGA’s “recent logs” page (something like the sketch below). This would keep you from tracking inactive teams and also build a record of new cachers. The Beast had great records, until the number of new cachers became so great that handwritten records were way too much of a task.
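    The URL and the link pattern here are guesses; the real recent-logs page would need its own regex:

    // Guessed URL for the WGA recent-logs page; adjust to the real one
    $html = file_get_contents('http://wi-geocaching.com/recent-logs.php');

    // Guessed markup: assume each log links to a cacher profile with the
    // username as the link text
    preg_match_all('/<a href="[^"]*profile[^"]*">([^<]+)<\/a>/i', $html, $matches);

    // Deduplicate so an active team is only picked up once per run
    $usernames = array_unique($matches[1]);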

    A link to e-mail you on the site would be good too.

    #1746178

    Great idea Cathunter.

    See, the biggest problem I’m running into is that you can’t look up someone’s stats by username. You have to use their user number, which you find by visiting the user’s page and copying the portion of the URL that contains it. I’ve got a script now that can obtain user numbers, but you have to feed it the exact geocaching.com username.

    I’m pretty sure that in the next few nights I’ll be able to connect to the WGA page, use those names to look up the numbers, and set everyone up for recurring stats.
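    The lookup step itself is just string surgery on the profile URL. A sketch, assuming the number shows up as an “ID=12345” query parameter (the actual parameter name on geocaching.com may well differ):

    // Assumed URL shape: the user number rides in an "ID=" parameter.
    function extractUserNumber($profileUrl)
    {
        if (preg_match('/ID=(\d+)/', $profileUrl, $m)) {
            return $m[1];
        }
        return false;
    }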

    Thanks for all the advice everyone’s given. I hope everyone will be able to enjoy these stats once they’re more complete.

    #1746179
    Ray

      I can provide a list of about 800 names belonging to Wisconsin cachers that have appeared on gc.com over the past two years. This represents about 2/3 of the Wisconsin cachers active during that time, and includes >80% of those who are currently active.

      #1746180

      great!
      jcb at integralpro.com

      If you could please send me an email at the above address (notice that I spelled out the @, so that spiders don’t pick up my address and spam me).
