20 Years of Websites Evolution: Part 4 – Data Integration

This is part 4 in the series: 20 Years of Websites Evolution.
Previous part: Part 3 – Dynamic Pages.

As we learned in the previous chapter, CMS products are great tools for content editors to create and publish site content. But that is not the only kind of information sites need to display.

Besides page content, sites usually show data produced inside the company – for example, a product catalog, customer information, or statistical data. That data can come from another internal system or even be entered manually into a file, but some kind of processing is often required in order to transfer it into the site database.

Finding out how to automatically move data from one system to another is the topic of this chapter.

 

Internal Systems Integration

Batch:

The first attempt to automate data entry was to use batch operations.
A batch operation is an automated process which runs on specific inputs (usually files), processes them, and inserts the result into another place, usually a database.

In the early days, developers created those processes by writing scripts and used them to import site content and data into the site database. In later years, dedicated tools were created for that purpose. Those tools are called ETL (Extract, Transform, Load) tools, and using them usually falls into the developer's hands.
Since ETL operations had to occur on a daily basis (sometimes even hourly), an automated scheduling system was also required.

There are many ETL tools – Oracle Warehouse Builder, SAP Data Services, IBM InfoSphere, Microsoft SSIS, and many more – which all share 3 main characteristics: they are all expensive, slow, and complicated to use.
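To make the ETL idea concrete, here is a minimal sketch of such a batch job in plain Python (the file name, columns, and table are made up for illustration): it extracts rows from a nightly CSV export, transforms them, and loads them into the site database.

    import csv
    import sqlite3

    # Extract: read the nightly export produced by an internal system
    # (products.csv and its columns are hypothetical).
    with open("products.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    # Transform: normalize the records before they reach the site database.
    cleaned = [
        (row["sku"].strip().upper(), row["name"].strip(), float(row["price"]))
        for row in rows
        if row["price"]  # skip records with no price
    ]

    # Load: insert the processed result into the site database.
    conn = sqlite3.connect("site.db")
    conn.execute("CREATE TABLE IF NOT EXISTS products (sku TEXT PRIMARY KEY, name TEXT, price REAL)")
    conn.executemany("INSERT OR REPLACE INTO products (sku, name, price) VALUES (?, ?, ?)", cleaned)
    conn.commit()
    conn.close()

A scheduler (cron, Windows Task Scheduler, or the ETL tool's own scheduler) would run a job like this every night or every hour.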

Live – Messages:

A different approach to ETL is, instead of batch-processing the entire data set once a day, to transfer each data update as it occurs. This is called an "event-driven" or "message-based" operation and is usually preferred over batch processing.

The first generation of message-driven tools was the ESB (Enterprise Service Bus), which is used to transfer live information from one system to another. Since systems are different and can't understand each other without translation (just as a German speaker can't talk directly to a Spanish speaker – they need a translator who understands both languages), an ESB comes with data translation and processing built in. ESB products can be described as "real-time ETL" which operates on individual records instead of an entire data set.
There are many ESB products; some notable examples are Oracle Service Bus, TIBCO, Mule ESB, and the free WSO2.

The problem with ESB products is that they are very complicated (usually requiring a dedicated team) and cost a lot (every year). This created space for a range of smaller products that can't do everything an ESB does but are usually less complicated and free to use (open-source based). Those products are called message brokers, and they can sometimes be used instead of a full ESB product.
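As an illustration of the message-based approach, here is a minimal sketch that publishes a single "product updated" event to a message broker, using the Python pika client and assuming a RabbitMQ broker running locally (the queue name and message fields are made up):

    import json
    import pika

    # Connect to the broker (assumption: RabbitMQ running on localhost).
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="product-updates", durable=True)

    # Publish one update as it happens, instead of waiting for a nightly batch.
    event = {"sku": "ABC-123", "price": 19.90, "action": "price_changed"}
    channel.basic_publish(exchange="", routing_key="product-updates", body=json.dumps(event))
    connection.close()

The site (or an integration layer in front of it) would consume messages from that queue and apply each update to its own database as it arrives.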

Live – Services:

As an alternative to relying on 3rd-party products to transfer data between two systems, sometimes it is better to just let them talk directly to each other. This is where services come in. After adding service support to both systems, developers can create a "communication line" between them and use that line to transfer data from one system to the other.
The use of services in a system is called SOA (Service-Oriented Architecture) and is the preferred way today to transfer messages between different systems (although not without its difficulties and limitations). The latest iteration of this approach is called microservices and follows the "smart endpoints, dumb pipes" principle.

Integrating two systems with services is never easy. Web service technologies have changed drastically during the last two decades, beginning with XML-based SOAP messages in the early 2000s and moving to the JSON-based REST services which are now the standard. But whichever technology is used, connecting two different systems with services still requires a lot of time and effort.
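For example, a site can pull customer data from an internal system with a simple JSON-over-HTTP call. A minimal sketch using the Python requests library (the URL and field names are hypothetical):

    import requests

    # Call a hypothetical internal CRM service and consume its JSON response.
    response = requests.get("https://crm.internal.example.com/api/customers/42", timeout=5)
    response.raise_for_status()

    customer = response.json()
    print(customer["name"], customer["email"])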

 

External Sites Integration

Sometimes the data does not come from an internal system but from an external company's site – for example, stock data, weather data, news, or lookup dictionaries (like cities and streets). In these cases, we need to be able to consume the data directly from another site.

Content – Client Side

The easiest way to achieve that was inside the browser itself, by embedding widgets – fragments of another site that sit on your site. Widgets were used for a range of purposes, for example:
– clocks, event countdowns, auction-tickers, stock market tickers, flight arrival information, daily weather, phone books, pictures, calculators
– Public Profile (Gravatar), Badges, Signatures
– Sponsored content, content rings
– YouTube video, Facebook post
– Facebook Page Plugin
– Maps & Location Tools
– Web banners – placing them via widgets is also the primary way online advertising works

Embedding widgets can be done with several techniques, like IFrames, JavaScript snippets, images, and Flash objects.
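For illustration, here is a minimal sketch of what embedding looks like from the site's point of view: the widget provider hands you a snippet, and your page template simply drops it into the HTML (the widget URL and dimensions below are made up):

    # The provider's snippet is pasted as-is; the browser then loads the widget
    # directly from the external site (hypothetical URL).
    widget_snippet = '<iframe src="https://weather.example.com/widget?city=London" width="300" height="200"></iframe>'

    page = f"""<html>
      <body>
        <h1>My Site</h1>
        {widget_snippet}
      </body>
    </html>"""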

Content – Server Side

Web widgets were very popular because they were easy to implement – just add some code to your HTML and you are done. But that came with two major problems:
1) They were limited to a single container inside the page
2) The webmaster did not have any control over the displayed content – they couldn't adjust its look to match the site, they couldn't process the content or enrich it with information from internal systems, and if the external site was down or compromised it would affect their site as well.

In order to have more control over the external data and how it is displayed on the site, developers had to move the integration point to the server side. This way they could use the power of server-side scripting languages to process the data before it is presented to the user.
This move had 3 major benefits:

  • Presenting the external data with the same look and feel as your site
  • Preprocessing the data – filtering, sorting, enriching it with data from internal systems, etc.
  • Caching the data – for example, we can perform a query to get a stock quote at the beginning of the day and use that data in our site for the rest of the day (a minimal sketch follows below)
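Here is a minimal sketch of that caching benefit, using the Python requests library against a hypothetical quote service (the URL, field names, and one-day lifetime are assumptions):

    import time
    import requests

    _cache = {}                 # symbol -> (fetched_at, quote)
    ONE_DAY = 24 * 60 * 60

    def get_quote(symbol):
        """Return a stock quote, hitting the external site at most once a day."""
        now = time.time()
        if symbol in _cache and now - _cache[symbol][0] < ONE_DAY:
            return _cache[symbol][1]    # serve the cached copy
        response = requests.get(f"https://quotes.example.com/api/{symbol}", timeout=5)
        response.raise_for_status()
        quote = response.json()         # e.g. {"symbol": "ACME", "price": 99.5}
        _cache[symbol] = (now, quote)
        return quote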

But before they could do that they had to overcome one major problem – extracting data from other sites was a very complicated and unstable operation.

In the previous chapters, we learned that sites are primarily made of HTML, which is used by the browser in order to format the data. But that same HTML was the biggest problem developers faced when wanting to retrieve data from another site – the data was "buried" inside the HTML and they had to work hard in order to extract it. This process of data 'reverse engineering' was called web scraping, and it was complicated, had to be tailored for each specific site, and produced unstable results (breaking whenever the source site's design changed).
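To give an idea of what scraping involves, here is a minimal sketch using the Python requests and BeautifulSoup libraries; the URL and the CSS class are hypothetical, and the code silently breaks the moment the source site changes its markup:

    import requests
    from bs4 import BeautifulSoup

    # Download the other site's page and dig the value out of its HTML.
    html = requests.get("https://stocks.example.com/quote/ACME", timeout=5).text
    soup = BeautifulSoup(html, "html.parser")

    # Fragile: relies on the source site keeping this exact element and class name.
    price_element = soup.select_one("span.last-price")
    price = float(price_element.get_text(strip=True))
    print(price)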

It was obvious scraping was not good enough and better ways had to be created – ways which would allow transferring only the data between sites, without any HTML formatting. In other words, site-to-site communication.

In the last 20 years, 4 technologies have succeeded each other in an attempt to build better site-to-site communication methods. Their usage overlapped for a while (and some are still in use today), but in general the timelines were:

Those communication methods are the data "roads" which connect our sites, and with each iteration the roads became bigger and more robust. And the evolution has not stopped – newer technologies like GraphQL are now being developed in order to continue upgrading our digital infrastructure.

Business Data

Besides sharing content between sites, businesses started to understand that the Internet had more to offer. They could use it to send and receive business data like purchase orders, invoices, shipping notices, product and stock data, and even health records.

But before they could start integrating their systems together, a unified specification had to be created. This spec is called EDI (Electronic Data Interchange), and there are many different implementations targeted at different businesses like health care, insurance, transportation, finance, government, and supply chain. Some of the better-known protocols are X12, EDIFACT, and HL7.
The move from physical to digital data exchange saved a lot of operational costs and started the entire B2B age.

Implementing B2B integration is complicated and is usually done using enterprise-level products like Oracle B2B and IBM Integration Bus, operated by specialized personnel.

Functionality

Besides sending transactional data (as in the B2B scenario), sites are now starting to depend on each other in order to perform complicated tasks. By using web services and APIs, sites can rely on other sites to perform actions like sending emails (SendGrid), SMS, push notifications, and voice messages (Twilio), mapping & location services (Google Maps), making online payments (PayPal), etc.
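As a sketch of how such delegation usually looks, here is a call to a hypothetical email-sending provider over its Web API (the endpoint, authorization header, and JSON fields are made up – each real provider such as SendGrid or Twilio documents its own):

    import requests

    # Hypothetical email provider; real providers expose similar JSON-over-HTTPS APIs.
    response = requests.post(
        "https://api.mail-provider.example.com/v1/send",
        headers={"Authorization": "Bearer MY_API_KEY"},
        json={
            "to": "customer@example.com",
            "subject": "Your order has shipped",
            "body": "Tracking number: 12345",
        },
        timeout=5,
    )
    response.raise_for_status()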

Even things that were traditionally implemented inside the company can now be moved to other companies' services, like log analytics (Amazon Kinesis, Elastic ELK as a Service), data storage (Database-as-a-Service, Firebase), and identity management (Azure AD, Okta).

 

Conclusion

Finding out how to bring data into sites, especially from different companies, was not easy. From the start with manual batch scripts and ETL tools, through more advanced live-messaging products like ESBs and message brokers, to 4 generations of direct site-to-site integration technologies – connecting sites together required 20 years of evolution to finally mature.

The end result is the latest generation of services, called Web APIs. Sites implementing Web APIs can communicate directly with each other over the internet, which enables them to do things which were previously not possible. Those hybrid web applications are known as mashups.
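In its simplest form, a mashup is just a page that combines data from several Web APIs into one view. A minimal sketch with hypothetical weather and places endpoints:

    import requests

    # Combine two external Web APIs into a single view (both URLs are hypothetical).
    weather = requests.get("https://api.weather.example.com/v1/current?city=London", timeout=5).json()
    places = requests.get("https://api.places.example.com/v1/cafes?city=London", timeout=5).json()

    print(f"It is {weather['temperature']} degrees in London - nearby cafes:")
    for cafe in places["results"][:3]:
        print("-", cafe["name"])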

Web APIs are the digital roads which connect sites together – they are the infrastructure of our digital reality.

Diving Deeper

If you want to find out more about the topics discussed in this chapter, I've created the following table, where each topic is broken down into the requirements developers need to know in order to fulfill it.

In the next chapter, we will learn how to create interactive sites for the modern age.

Next part: Part 5 – User Interactivity

What do you think?