These are the blog posts I've written on API Evangelist about low hanging fruit, exploring the subject with a variety of clients. I'm always looking for new ways to introduce people to the concept of APIs.
13 Apr 2016
At the request of folks on campus, I'm helping to identify the low hanging fruit when it comes to APIs at Davidson College. These are the spreadsheets, tables, and XML files I have identified so far as part of the spidering of the www.davidson.edu domain.
I will keep the spider script running, looking for other opportunities for APIs. Not all of these machine readable elements will be worth reworking as an API, but the process of finding them, publishing them, and having a conversation about them on Github will help move the conversation forward.
I get approached by folks all the time who are looking to do APIs at their company, organization, institution, or government agency. The reasons behind these desires to do APIs vary widely. Some want to do APIs so that they can deliver a specific web or mobile app, while many others just understand they need to get started somewhere, but are unsure of exactly where to begin with this daunting, never-ending journey.
In these situations I always tell people to start with the low hanging fruit--which means, if it's already on your website, it should also be available as an API. If you are publishing data or content to your website, as HTML, CSV, XML, XLS, or JSON, you should also have it available via an API. The average company has a mess of information made available via their website, and the API journey is all about untangling this mess, and making it available not just for use on the web, but also via mobile, messaging, bot, voice, and other emerging channels in which people are getting their information.
I've been asked to help identify the low hanging fruit for enough groups now, that I will be formalizing it as a service. My low hanging fruit process involves spinning up a web spider instance on AWS, giving it a domain as a target, and letting it slowly spider all the HTML pages, looking for the following low hanging fruit, over the course of a couple of weeks:
- Tables - Identify an HTML page with a table on it that has more than 5 rows.
- CSV - Identify a CSV file that is linked to from any HTML page within the domain.
- XLS - Identify an XLS or XLSX file that is linked to from any HTML page within the domain.
- XML - Identify an XML file that is linked to from any HTML page within the domain.
- Forms - Identify any page that has a form on it, and also index any fields available with it.
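The spider script itself isn't shown here, but the checks in the list above can be sketched for a single page. This is a rough sketch only, assuming Python's standard library `html.parser`. The names `FruitFinder`, `find_fruit`, and `FILE_TYPES` are made up for illustration, and file types are guessed from the URL extension alone.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: linked files are recognized by their extension only.
FILE_TYPES = {".csv": "csv", ".xls": "xls", ".xlsx": "xls", ".xml": "xml"}
MIN_TABLE_ROWS = 5  # a table counts only if it has more than this many rows


class FruitFinder(HTMLParser):
    """Collects tables, linked data files, and forms from one HTML page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.findings = []
        self._row_counts = []  # one running row count per open <table>
        self._form = None      # the form currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            self._row_counts.append(0)
        elif tag == "tr" and self._row_counts:
            self._row_counts[-1] += 1
        elif tag == "a" and attrs.get("href"):
            url = urljoin(self.page_url, attrs["href"])
            path = urlparse(url).path.lower()
            for ext, kind in FILE_TYPES.items():
                if path.endswith(ext):
                    self.findings.append({"type": kind, "url": url})
        elif tag == "form":
            self._form = {"type": "form", "url": self.page_url, "fields": []}
        elif tag in ("input", "select", "textarea") and self._form is not None:
            # Index only the fields that carry a name.
            if attrs.get("name"):
                self._form["fields"].append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "table" and self._row_counts:
            rows = self._row_counts.pop()
            if rows > MIN_TABLE_ROWS:
                self.findings.append(
                    {"type": "table", "url": self.page_url, "rows": rows}
                )
        elif tag == "form" and self._form is not None:
            self.findings.append(self._form)
            self._form = None


def find_fruit(page_url, html):
    """Return the low hanging fruit found in one page's HTML."""
    finder = FruitFinder(page_url)
    finder.feed(html)
    finder.close()
    return finder.findings
```

Calling `find_fruit(url, html)` on each page the spider pulls gives back a list of small dictionaries, one per table, file, or form, which is easy to collect across a whole domain.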
After I've spun up the spider server instance, and let it run for a few days, I can output a JSON list of any tables, CSV, XLS, XML, and forms that the spider has found. I publish each of these as JSON files, which can then be reviewed for possible API candidates. Ideally each entry gets fleshed out more: giving it a human readable title and a description, applying some tags, and hopefully identifying the source of the data and better understanding some of the goals around why it was published in the first place.
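To make the review step a little more concrete, here is one way the JSON output could be shaped. It is only a sketch: the field names `title`, `description`, `tags`, `source`, and `goals` are assumptions based on the paragraph above, not the actual format of the published files, and `candidates_json` is a hypothetical helper.

```python
import json

# Hypothetical review fields, left empty for a human to fill in later.
REVIEW_FIELDS = {"title": "", "description": "", "tags": [], "source": "", "goals": ""}


def to_candidates(findings):
    """Wrap raw spider findings with blank fields for a reviewer."""
    candidates = []
    for finding in findings:
        entry = dict(finding)
        for key, blank in REVIEW_FIELDS.items():
            # Copy list values so entries never share the same list object.
            entry[key] = list(blank) if isinstance(blank, list) else blank
        candidates.append(entry)
    return candidates


def candidates_json(findings):
    """Serialize the candidate list as indented JSON text."""
    return json.dumps(to_candidates(findings), indent=2)
```

The output of `candidates_json` can be written to a file and committed to a Github repository, where each entry can be discussed and filled in over time.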
Once the low hanging fruit bot is released, I follow up with a manual review of the website being targeted, looking for common entities and objects that emerge from the top level, and sub navigation -- reverse engineering the site's information architecture, so it can also be considered alongside the other low hanging fruit that is defined. I would say this process helps identify many of the top level organizational and business motivations for sharing information, where the spidering, and targeting of low hanging fruit represents the under-currents, or less obvious organizational and business motivations behind making information available.
That is my low hanging fruit approach. It is pretty crude, yet it can be a very valuable way to help jump-start API efforts at any company, organization, institution, or government agency. If a formal approach has not already emerged out of existing IT, and developer groups within your organization, you might consider a more grass roots, low hanging fruit approach to identifying the next steps for your API effort(s).
With this low hanging fruit target list, and website review in hand, we can start to talk about what is needed next when it comes to designing, defining, deploying, and managing APIs. If you need help with this, let me know, I'm now doing it as a service for more companies and organizations--just contact me via one of my channels, and I'll see what I can do.
P.S. Please make sure you have the legal right to spider the website in this way. I'm only looking to consider the low hanging fruit of your own organization, and better understand how to jumpstart an official API effort -- we are just doing this from the outside-in.
I am hyper aware of where the ethical line exists, when it comes to being a hacker. I'm not a hacker that penetrates systems, or finds exploits, I am a hacker that provides quick and dirty solutions to problems, using technology, which in my case happens to be via APIs.
Through my work as the API Evangelist, I am evangelizing that companies should consider an API approach to help them be more consistent in how they operate online. Helping companies be more transparent in their operations, in a way that encourages participation from trusted partners, and even the public, through sensibly designed, and secured APIs.
When people from enterprise, institutions, and government agencies approach me and ask how they should jump start APIs within their group(s), I have a pretty standard response, which I call low hanging fruit (LHF). LHF is simply this: if there is a spreadsheet, CSV, XML, or JSON file located on your website, or data is available in a table format across your site, it should be available as an API.
If something already exists on your website as HTML, it should also be available in CSV, XML, RSS, and JSON formats, for direct integration into other systems, applications, and devices. This is the difference between your resources being available to humans in a browser, and them being available to humans via the thousands of other Internet connected devices that are becoming ubiquitous in our worlds.
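As a small illustration of HTML becoming JSON, an HTML table can be turned into a list of records with very little code. This is a minimal sketch under stated assumptions: the page holds a single simple table, its first row is the header, and there are no nested tables. `TableReader`, `table_to_records`, and `table_to_json` are hypothetical names, built on Python's standard `html.parser`.

```python
import json
from html.parser import HTMLParser


class TableReader(HTMLParser):
    """Reads table cells into rows of text (assumes no nested tables)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def table_to_records(html):
    """Return one dict per body row, keyed by the first (header) row."""
    reader = TableReader()
    reader.feed(html)
    reader.close()
    rows = [row for row in reader.rows if row]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def table_to_json(html):
    """Serialize the table's records as indented JSON text."""
    return json.dumps(table_to_records(html), indent=2)
```

Real tables are messier than this (merged cells, missing headers, layout tables), so the records usually need a human review before they are worth publishing.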
For me, this is a pretty fundamental way to help people understand API. To help bring my point home, I have a website spider that will crawl any URL I give it, and return the existence of any CSV, XML, RSS, JSON, or table with over five rows. I then publish this list of resources to a Github project, which I call the "low hanging fruit" for any domain (aka company, organizations, agenc(ies), or institution). If it is available on your website, it should exist as an API. If it is available on your website it has already been deemed valuable, and ok for publicly sharing. Right?
Well, this isn't always the case. You see, a lot of people publish data and content on the open Internet, and believe it is secure, if they are the only ones who know the URL. Many folks are unaware of how things work on the web, and accept security through obscurity as a sensible way of operating in our digital worlds, simply because they just do not know any better. The Internet has pushed its way into our personal and business lives so fast, many folks just do not fully grasp what it is, and how to properly protect themselves, their children, their jobs, co-workers, businesses, organizations, constituents, and customers.
As I read yet another story of a security breach, this one at children's toy manufacturer V-Tech, I'm reminded of the line I walk in my API Evangelist world. The hacker in question shared the fact that V-Tech's online security was pretty superficial, and shared that he was able to get at 5M parents' and children's accounts. He didn't do it for profit, and sell to the black market, he shared the details with Motherboard, to apply pressure on V-Tech to tighten down security.
I will make clear, I do support folks poking around for security holes like this, even though it is not something I personally do. I will scan the public surface area of any company's site, or mobile application(s), but I understand the public / private line that exists, and will never intrude beyond the line I am closely walking. The problem is, while I have a good grasp of the line I walk, I do not have the faith that others will share the same understanding of where this line exists, or even that it exists at all.
When I publish a list of low hanging fruit for any domain on Github, one can easily perceive what I did was hacking (it is). The slippery slope in the process of scanning a public website, following every link within the domain, and indexing the available data sources, is that you can uncover loose privacy and security practices, and the ignorance and incompetence that exists at any business, organization, institution, or agency. The not so fun part of all of this exists in the current climate, where we go after the person who uncovers the problems, not usually the folks who created the problem.
In my low hanging fruit process, I'm not using SQL injection, or other common security exploits, I am simply spidering what is already publicly available on a website. The problem comes when people in power do not understand the difference, and in this current, very lopsided environment, there is a huge chance of getting swept up on the wrong side of the perception and understanding of exactly what hacking is, or isn't. I acknowledge these dangers exist, but will be pushing back, hoping to change the perceptions that exist, whenever I possibly can.
I strongly believe in APIs, and when done right, they can benefit businesses, organizations, institutions, agencies, and the public and markets that they exist in. I think low hanging fruit is a great way to help individuals effect change from the bottom up, by demonstrating the API potential, using safe, already available resources. Groups that embrace a domain-wide API first strategy will have their houses in order, and will be more respectful of privacy and security than those that do not. The problem comes when we apply a low hanging fruit process in some of the more disorderly households, where ignorance, incompetence, or even straight up corruption exists--the power will bite back hard.
I write this post to help me set a stage, that will hopefully keep me out of trouble, as I continue to help groups understand where to begin with their API journey. As I do this, the security, surveillance, and privacy world seems to be becoming much more volatile around me, which tells me my work is all the more important, but also runs the risk of being misunderstood. What a crazy digital world we are creating for ourselves, I worry about our future.
I wrote a story a couple of weeks ago, about how to kickstart APIs at the University of Oklahoma (OU). I ended the post, saying I would find some easy targets for generating the initial APIs, and publish a basic developer portal using Github Pages. After some work I think I have enough done to initiate another conversation with my friends at OU.
In any company, organization, government agency, or institution, where you are trying to decide where to start with APIs, the public website is the place to start. If data and content is already published to the website, it should also be available in a machine readable format via an API—this is the obvious place to start at OU.
To help find the low hanging fruit, when it comes to data and content at OU, I wrote a simple script that would slowly spider every page of the ou.edu website, looking for the opportunities, providing some potential targets for deploying APIs. So far my script has identified 183,094 pages across university sub-domains, and processed 52,281 of these pages looking for any HTML tables on pages, with over 20 rows, as well as .xls spreadsheet files, and XML files.
I’m running the script slowly in the background, as I don’t want to put any burden on OU sites, and just pulling pages every few seconds. To date I’ve identified 252 pages that have large HTML tables, 330 spreadsheets, and 68 XML files. While the script will still be running and looking for more, I think this represents a pretty good start when it comes to finding the low hanging fruit for open data and content at OU--resources that should be APIs.
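The slow, stay-within-the-domain crawl described above could be sketched like this. It is not the actual script, just a hedged outline in Python: `crawl` and `in_domain` are hypothetical names, the `fetch` and `extract_links` hooks are supplied by the caller, and the three second delay is an assumed stand-in for "every few seconds".

```python
import time
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def in_domain(url, domain):
    """True for http(s) URLs on the domain itself or any of its sub-domains."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    if parts.scheme not in ("http", "https"):
        return False
    return host == domain or host.endswith("." + domain)


def crawl(start_url, domain, fetch, extract_links, delay=3.0, max_pages=100):
    """Visit pages breadth-first, pausing between requests.

    fetch(url) returns the page's HTML text or None, and
    extract_links(html) returns a list of hrefs. Both are hypothetical
    hooks passed in by the caller.
    Returns (processed, seen): pages pulled so far, and pages identified.
    """
    queue = deque([start_url])
    seen = {start_url}
    processed = []
    while queue and len(processed) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        processed.append(url)
        if html:
            for href in extract_links(html):
                # Resolve relative links and drop any #fragment.
                link = urldefrag(urljoin(url, href)).url
                if link not in seen and in_domain(link, domain):
                    seen.add(link)
                    queue.append(link)
        if queue:
            time.sleep(delay)  # keep the load on the target site low
    return processed, seen
```

The two return values line up with the numbers quoted earlier: `seen` is the set of pages identified so far, and `processed` is the smaller list of pages actually pulled. In practice `fetch` might wrap something like `urllib.request.urlopen`, with error handling for pages that fail to load.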
As I do with all my projects, I published my work for the OU API effort on Github, as an open repository. Using this repository, I’ve published the low hanging fruit as JSON, and HTML pages to allow for easy browsing:
I’ve also created another page which is driven from the Github issues for the repository, to help handle discussion around which tables, and files should be targeted. I will put some thought into how to improve the process, better using Github to move the conversation forward. I’m not going to be able to do this alone, ultimately we need the students and faculty at OU to get interested in participating, but this will take time--something which Github is well suited to help manage asynchronously.
The next step for this project is to have another conversation with Mark and Adam from OU, and look through some of the low hanging fruit I’ve identified. After that I will put together a strategy for using Kimono Labs to help us easily publish APIs from the targets I’ve identified. I also suspect that as part of this process there will be a lot of de-duping, and other data janitorial work before we can publish some usable APIs, and move things forward.
If you have a story you think I should be telling, I recommend you write up the thought on your own blog and share a link with me, or feel free to submit a Github issue on this research project's repository, or use one of the other channels available on the contact page.