Which Javascript framework to choose in 2019?
http://blog.deyvos.com/which-javascript-framework-to-choose/ | 23 Nov 2018

Most startups and product development companies nowadays struggle with choosing the right Javascript framework for their products. In a few years Angular has jumped from v2 to v7 (released in October 2018) and React has reached version 16. Javascript frameworks are developing at a blazingly fast rate, so the question often perplexing the newbie architect or lead UI developer is: "Which Javascript framework to choose?"

What capabilities does one look for in a Javascript framework?

Are you looking for MVC capabilities?

"When all you have is a hammer, everything looks like a nail, but then, there are more than one ways to skin a cat!"

Ok! Enough said!

Breaking an application down into layers has been the way server side code has been written. MVC pattern has been a popular way to write server side code. The same practice was also carried over to the client side. But smart people find ways to do things differently and efficiently. While your chosen framework may do things the way you want them to, it is worthwhile to look at the non-MVC way of doing things for other benefits which a framework may provide.

Is the framework performant and is it the right tool for your job?

While all frameworks perform well (as their creators claim), when you take an application to scale, the quirks and idiosyncrasies begin to surface. Just because the framework behaves well on a single developer machine is no guarantee that the performance will replicate on a larger system.

How popular is the framework?

While this may look a little strange, when it comes to getting the business' buy-in from a technology perspective, a popular framework is much easier for the business to say yes to than an exotic one. Albeit, in some cases, an exotic one may be all you need. Deciding on this question is more an art than a skill and is best left to the team to decide.

To assess this question, one may want to have a look at the Github contributors, framework downloads and the number of questions answered on stackoverflow to get a feel for how popular the framework is with the developer community.

Does it make you productive?

This boils down to the question: is the framework easy to use, to write code with and to maintain? The framework in question should make you "productive" enough, else it can turn out to be "counter productive" (ok, you get the point!).

In a nutshell, the development "experience" should strike a balance between ease of writing and maintaining code, and having a framework which helps you churn out code quickly. If the framework (however feature rich) slows you down on either front, it is not doing its job.

The point is akin to the urban legend about Python being "slow and non performant" a decade back, yet turning out to be one of the most sought after skill sets among developers in 2018. This was primarily because of its "productivity", which helps you dish out production ready code much faster than statically typed languages like C# and Java (apart from the fact that there are tons of libraries available in Python).

Are you looking to make an SPA or an MPA?

This decision may affect your SEO and digital marketing strategy. Also, if you are trying to get a pre-MVP product churned out quickly, or if you have a team which still needs to come up to speed with the SPA paradigm, then you may want to go with MPA frameworks and may choose JQuery or other such libraries. However, just plugging in a router from most of the SPA frameworks can quickly get you set up and help you migrate to an SPA framework. (Ok, it's not that easy either!)

Most of the frameworks we are going to compare below, however, are based on the single page application architecture.

Is the documentation friendly enough?

Very important. This is going to affect your productivity, especially if your framework is being actively upgraded over time and the way to do things is going to change as new releases are made. You do not want to get into a frustrating experience of coding in the trenches for prolonged periods of time and get battle weary!

Does the framework have features like templating, form processing & validation, HTTP callbacks and routers?

While this may seem an obvious requirement, not all frameworks have these by default. You may need third party libraries in addition to the base framework in such cases.

Final word!

Ultimately you have to choose your own poison. The primary factors you want to bear in mind are:

What makes most sense for your project?

What intuitively makes the most sense to you as a decision maker?

What skill set of developers are readily available to you?

Having said that...

Here are the major frameworks in the Javascript world today and some of their salient features, keeping the above points in mind.

Angular

Salient features

  • When you need to build an SPA (single page application), Angular aims to be the go to framework for developers.
  • It uses bi-directional data binding and dirty checking for data update/propagation
  • Highly event driven framework which can take some serious client side computation at times.
  • It uses the "digest" cycle to check object scope changes. These changes when detected are pushed to all places where they are referenced in the object model. The view updates based on changes in the object model.
  • Angular has a non-trivial learning curve.
  • The documentation though extensive is not that trivial to understand.
  • A powerful feature is the directive, which syntactically looks like an HTML element decorator (see the sketch after this list).
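For illustration, here is a minimal sketch of what directive driven, two-way bound Angular code looks like. It assumes a standard Angular CLI project with FormsModule imported in the application module; the component and property names are made up.

```typescript
import { Component } from '@angular/core';

// A tiny Angular component: the [(ngModel)] directive sets up two-way
// binding between the input field and the `name` property, so the
// greeting re-renders as the user types.
@Component({
  selector: 'app-greeter',
  template: `
    <input [(ngModel)]="name" placeholder="Your name" />
    <p>Hello, {{ name }}!</p>
  `,
})
export class GreeterComponent {
  name = 'world';
}
```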

Pros

  • Very quick to put up application or view prototypes.
  • Angular uses directives which are a unique and powerful feature. They can provide powerful two way data binding and even create a fully functional web-part component.
  • Angular filters allow data formatting and parsing at runtime which reduces custom JS code significantly.
  • Angular (dependency injected) services allow common "service elements" to be abstracted into a single "service" component. These can talk to each other and provide a "service layer" on the client side itself.

Cons

  • If there are too many watchers, a high number of object scope changes can affect the user experience, because Angular does dirty checking and digest re-evaluation of all scope changes. Usually that is not an issue, but it can take time to stabilize if some watcher keeps triggering updates.
  • At large scale, a framework like Angular has certain idiosyncrasies which need to be dealt with.
  • Compared to Vue or even React, it needs more effort from a beginner to get the code set up. The learning curve is comparatively steeper.
  • It uses Typescript which is different from Javascript and hence increases the learning curve. This may be a challenge if you have Javascript purists in your team.
  • Of late, the Google developers have been pushing a major version release every 6 months or so which means you may always keep falling behind the technology curve.

Ember

Salient Features

  • It uses the MVC architectural pattern and, unlike Angular or React, there are no multiple ways of doing things here. You can call it an "opinionated" framework which makes sure developers are "guided" to do things the one right way.
  • Allows two way data binding, which is done via "accessor" methods on objects (see the sketch after this list).
  • Uses route and route handlers for matching the URL to the right template and setting up the application state.
  • Uses templates with "handlebar templates" syntax for producing HTML output.
  • Route handlers render models using classes which incorporate the Ember Data library.
  • Uses components, each of which incorporates a template and a (JS based) source file which defines the component behaviour.
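As a rough illustration of the accessor based binding mentioned above (and the computed properties discussed in the pros below), here is a sketch in classic, pre-Octane Ember object syntax; the model and property names are made up.

```typescript
import EmberObject, { computed } from '@ember/object';

// A computed property "glues" firstName and lastName together; it is
// re-evaluated only when one of its declared dependencies changes.
const Person = EmberObject.extend({
  firstName: '',
  lastName: '',
  fullName: computed('firstName', 'lastName', function () {
    return `${this.get('firstName')} ${this.get('lastName')}`;
  }),
});

const person = Person.create({ firstName: 'Ada', lastName: 'Lovelace' });
person.get('fullName');            // "Ada Lovelace"
person.set('firstName', 'Grace');  // accessor method: the change propagates immediately
person.get('fullName');            // "Grace Lovelace"
```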

Pros

  • A relatively compact library compared to a full framework like Angular.
  • Like Angular, it's an excellent framework for rapid prototyping.
  • Because of the accessor methods on objects used for data binding, data checking (property checks) becomes efficient.
  • Ember provides "computed properties" which allow gluing together multiple models, making it cool to work with.
  • Unlike an Angular style digest cycle, the validations are done instantaneously as the property change happens. This makes debugging easier and the data flow easier to visualize.
  • Ember like Angular uses a router, albeit much simpler syntactically and logically to understand. Ember routers are a simple yet powerful feature which very few other frameworks have.
  • Ember has pretty good documentation.

Cons

  • Ember does not do the dirty checking which frameworks like Angular do. You need to call the getters and setters to access and re-evaluate the properties and have them propagate onto the models.
  • Some of the native Javascript functions are not available immediately in Ember - like map etc.
  • At large scale, the framework supposedly has certain idiosyncrasies one has to deal with.
  • Everything in Ember needs to be wrapped inside Ember objects. All dependencies for computed properties need to be declared manually.
  • The Handlebars syntax is much more limited compared to something like Vue's use of full Javascript expressions.

Backbone

Salient features

  • An MVC framework which came out in Oct 2010 as an alternative to JQuery when you want to create an application with better design and less code.
  • Uses routers, models, events, views and collections as the building blocks of the framework (a minimal sketch follows this list).
  • Typically used when you are not sure which framework to use but have inherited a legacy app which has some messy, loose Javascript code and you need to clean things up.
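Here is a minimal sketch of the model/event building blocks (the endpoint and element ID are hypothetical):

```typescript
import * as Backbone from 'backbone';

// A Model wraps a REST resource; a change handler keeps a DOM element in
// sync with it. In a real app this handler would live inside a Backbone.View.
const Todo = Backbone.Model.extend({
  urlRoot: '/api/todos',                  // hypothetical REST endpoint
  defaults: { title: '', done: false },
});

const todo = new Todo({ id: 1 });

todo.on('change', () => {
  const el = document.querySelector('#todo');
  if (el) el.textContent = `${todo.get('title')} (${todo.get('done') ? 'done' : 'open'})`;
});

todo.fetch();                // GET /api/todos/1
todo.set({ done: true });    // fires the change handler
todo.save();                 // PUT /api/todos/1
```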

Pros

  • The library is very compact (8kb gzipped) unlike React or Angular which may go into the 30-65kb (gzipped) range.
  • Models are very easy to work with when dealing with RESTful APIs. The framework automatically propagates changes to the HTML when models change.
  • Good for organizing JS code
  • Minimalistic framework and hence performant, since it gives control totally to the programmer in terms of organizing code.
  • Extensible framework which can be used to put in ReactJS views.
  • Lots of plugins available on top of Backbone for specific requirements.
  • Performs well for even large applications

Cons

  • Has a soft dependency on JQuery and a hard dependency on Underscore.js
  • Not as easy to prototype applications when compared to Angular or even React
  • Sometimes the amount of code one has to write is much more than say one would have to write in Angular or React. However, the code is performant and gives more control to the developer.
  • Two way data binding and all associated wiring has to be done by oneself and is not supported out of the box by Backbone.
  • If you do not know what you are doing, you could inadvertently develop memory leaks in Backbone. Some level of experience and expertise may be needed here.

React

Salient Features

  • React is not an MVC framework. As the documentation mentions, it is a library for creating composable user interfaces!
  • React does not use templates and instead approaches view creation by breaking the view into components where individual component has its own corresponding view logic.
  • The library allows two way data binding by "providing helpers"
  • Allows virtual DOM creation, updating only the "diff" with the actual DOM.
  • Uses the JSX approach where the "Javascript code" is replaced by an HTML and CSS like syntax. React is for developers who like readability of HTML over raw Javascript.
  • Also, a point to remember is this - React was written with "functional programming" in mind.
  • React uses "Flux" instead of an MVC approach to data binding. This allows data to flow in only "one way" on the client side.

Pros

  • Updating the DOM with only the diff makes it very performant by minimizing DOM view re-painting. Effectively, the paints and re-paints are very efficient.
  • It is agnostic to the data layer. React portrays itself to be the "V" in MVC frameworks.
  • Code is relatively easy to read (and hence code!)
  • React has first class support for server side rendering. It can take "pre-rendered" markup from the server side and work with it (cache, render etc.), whereas many other frameworks have to work in a much more involved way when rendering data received from the server on the client side.
  • Architecturally, data in a React application flows in "one way" (does not use two way binding because of Flux) and hence it is easy to catch the data operations and view the data flow. It is easier to find where the data could have gone messed up.
  • Has a component like approach and hence is easy to aggregate them on multiple levels and layers.
  • React is easy to set up (unlike, say, Angular) and quick to prototype with.
  • Functional programming aficionados will feel very familiar and comfortable working with React.
  • Easier to debug since there are not many places where you could go "wrong" with your code (as compared to Angular).
  • A huge advantage of learning React is the native rendering of the React component model onto the iOS and Android mobile platforms for natively rendered apps. The React Native framework built on top of ReactJS, along with the native modules which one can write for iOS and Android, makes it a unique skill set to learn and be productive with from a full stack engineering perspective.

Cons

  • React has a steeper learning curve compared to something like a Vue because of the need to learn DSLs like JSX and even ES2015+ to understand React's class syntax.
  • One also needs to understand the build system for production deployment. It is not as simple as just including a JS file in the <script> tag like for Vue.
  • React has a high pace of development updates. One needs to "re-learn" the way to do things since updates could change old ways quite frequently. Not every developer is comfortable with that kind of an ever changing working environment.
  • Documentation is another challenge with React and could be lagging behind the development as new updates are released.
  • Frameworks like Redux and Reflux are an additional challenge which developers need to learn, apart from JSX, which is a DSL needed to code in React. This can seriously slow down developer productivity unless one has already mastered one's tools.
  • Some technology decision makers did not like the React "patent clause", which they argued made React not truly open source. Facebook relicensed React under the MIT license in late 2017, but you may want to keep this licensing history in mind.

Vue JS

Salient Features

  • The library was created with the aim of creating "interactive" interfaces.
  • Uses the virtual DOM concept, where only the final diff between the virtual and the real DOM is propagated to the real DOM, similar to React's architecture.
  • A very "component"esque framework allowing one to make custom elements. These can then be used inside the HTML.
  • Allows only one way data flow between components unlike Angular.
  • Uses templates for binding DOM with Vue instance data
  • Uses directives on HTML elements for data / formatting manipulation (like v-bind, v-model, v-if, v-else)
  • Uses watchers to figure out and fire any events when data changes
  • Uses routers for navigation between pages
  • Uses computed properties for performing actions and calculations when changes are made to HTML/UI elements (see the sketch after this list)
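Here is a minimal sketch (Vue 2 options API, names made up) showing the directives, a computed property and a watcher mentioned in this list:

```typescript
import Vue from 'vue';

// v-model gives two-way binding between the input and `query`;
// `trimmed` is a computed property derived from it; the watcher fires
// every time `query` changes.
new Vue({
  el: '#app',
  template: `
    <div>
      <input v-model="query" placeholder="Search..." />
      <p v-if="query">Searching for: {{ trimmed }}</p>
    </div>
  `,
  data: { query: '' },
  computed: {
    trimmed() {
      return this.query.trim();
    },
  },
  watch: {
    query(newValue: string) {
      console.log('query changed to', newValue);
    },
  },
});
```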

Pros

  • Relatively lightweight compared to Angular and performant too. Not as heavyweight a framework as, say, Angular (though who knows, it may evolve into one in future).
  • Performance comes from the fact that like React, Vue also uses a similar approach of a virtual DOM whose final diff is updated into the real DOM.
  • Unlike React, uses the HTML, CSS and JS approach and hence is easier to pick up initially. No DSL learning requirements along the way.
  • Uses directives like Angular, which allow two way data binding, and also supports server side rendering.

Cons

  • Vue is not as popular as React or Angular and hence the developer community and resource support is still building up.
  • For the same reason, Vue also does not have the rich set of plugins which frameworks like Angular and React already have.
  • Another oft cited issue with Vue is the documentation which quickly becomes outdated - again for the same reasons above!
Selling IOT services and solutions
http://blog.deyvos.com/selling-iot-services-and-solutions/ | 15 Oct 2018

There seems to be a slew of solutions in the market for IOT products. However, as Deyvos engages more and more with startups selling IOT services and solutions, we keep running into the same challenges. These are some of the typical hurdles such product startups face along the way.

The customer journey or discovery process

There are two approaches to the IOT products marketplace. You either have a product out of the box, in which case you showcase all your capabilities, proofs of concept, specifications, compliance, certifications and the industrial requirements to which the product conforms. A potential customer goes through your catalog of products and connects if something has piqued his interest. Suddenly he may call and ask over the phone, "Oh, so what all does this smart temperature sensing device do? Does it also work for cold storages? To how many sensors does your product scale? Does it work for confectionery and processed food markets which have stringent temperature and humidity control requirements?"

After the initial set of questions, the customer then gets to the core of his requirements, "but do you have something which ... ", and then you hear all the special set of requirements which the customer has in mind. Once you hear the requirements, you know (and the customer knows) that it needs some customization.

This brings us to the dichotomy of selling IOT products. Most startups create POCs and working products which they try to sell in the market. They showcase it online and create inbound marketing campaigns to attract customers. However, what some startups claim is this - creating out of the box solutions in an IOT market does not work very well and can be counterproductive to their sales actually. We need to know what the customer really wants. Why so?

The reasons they cite are as follows...

Every customer has special needs

"Customers have specific requirements and your product needs to be customized. Out of box seldom works unless it is industry specific. "

Can we make a completely "generic", all purpose IOT device? Something, say, as simple as a temperature or humidity sensing device? The more generic and general purpose the product, the higher the cost! Size, shape and form factor may still need some tweaking, since they depend on the site/location where the customer intends to use the device. Any IOT device which is generic enough may not be the cheapest or best option available to the customer.

Customers do not want to commit to anything which they have not seen operating as per their needs

Now this seems like a contradiction of sorts. When it comes to the software market, customers are happy sharing their requirements and in turn be provided a customized or turnkey solution! They are willing to pay and this is a fairly well accepted mode of working.

Then why not the hardware or IOT market? Why are customers hesitant collaborating with an IOT product company, share their requirements and get a working prototype made? The answer to this question lies in the way the consumer hardware and electronics market has evolved over the last century. When customers look for an IOT product or solution, which is a hardware solution with an electronic sensor part to it, they have been trained to get "out of the box" solutions - very similar to other consumer electronic devices and contraptions. The perception that hardware can be "customized" just like software is still not a widespread notion.

The perception a typical customer has is, "Someone must have already cracked this problem and found a solution to it. It does not seem right that I have something unique, or that I should engage with someone to build a hardware solution catered to my needs, since I would not be the first one asking for such a device! Can I get something working out of the box? I do not want to spend too much time waiting for all the customizations I may need."

"Also, if I ask them to build something tailored to my needs, it may turn out to be an expensive affair or be practically unusable. In fact, I am not even sure what requirements I may have in future or discover along the way. And how am I supposed to test this device and figure out that it is the right product for me and my business needs?" There are too many questions buzzing in the customer's mind.

Surprisingly, the software industry over the last few decades has "thrived" upon the very fact that customization and turnkey solutions are just as important as out of the box products. The same is not a prevalent notion in the hardware, embedded or electronics industry, and this is a perception which the industry leaders will have to change. Industrial and commercial grade IOT products will need to be sold as solutions which are "customizable" and "programmable" to the common man. Customers will have to be educated, and the perception moulded, that they are not limited to out of the box solutions, that such solutions may or may not be the most suitable for their business, and that it is ok to ask for customization.

But then there are customers who are willing to partner with potential IOT solution providers.

The common perception when selling IOT products has been this: customers want the product owner to "create" a POC before they commit to a purchase. The consumers who are willing to fund or pay for the customization are few and far between. They just want to get their hands on a working solution and have an experience very similar to what they get at a local electronics store - talk to the salesman or shop owner, be guided to the best fit solution literally "out of the (cardboard) box" and walk off after ordering N units of the same.

As experience shows, it is only the people who have some idea about the IOT/embedded space who partner and show willingness to get the customization done. Although it is prevalent in the software industry to get customized solutions, the problem which IOT product manufacturers face when asking customers for customization is a different one. The problem is...

Customers do not know what they want!

Customers want to see something working without having to burn any money for it. They want to see something fool proof and workable for their requirements. They are jittery about trusting their own instincts and are not sure that they know every condition under which this IOT device or contraption is supposed to work. They want the "specialists" to figure it out on their own and give them the product which (they need to be educated on) is the best fit for their requirements.

The above issues come from the way the consumer has been trained to look for hardware or electronic devices. The Radioshacks, Home Depots and other stores have spoiled the average customer. The customer is trained to expect a "solution" which the experts have already figured out and which suits their needs. They expect the solution to already be sitting on some aisle and some shelf, assuming that the experts would have figured out a solution and produced a working product for it. It is only a matter of inquiring about it, and they will be guided to the store and the aisle number from which it can be picked up, ready to work. "I am not really supposed to ask them to make something specially for me! That does not sound right! After all, I am not an expert and I may not even know what all the product is really supposed to do."

So how can the industry perception be molded?

This is where the industry leaders need to spread awareness at many levels along the following key activities.

  • Showcase products which have been very successful after hardware customization - products which have been more successful than their "off the shelf" counterparts.
  • Make industry leaders and key decision makers aware of the "hardware studios" which have created successes for their customers. Educate users on how partnering with such studios can actually make their products more successful.
  • Since there is a common perception that hardware solutions have higher turnaround times, customers need to be educated that this is true of all competitors too and that product timelines need to incorporate this point. The sales and marketing teams should plan accordingly.
  • Promote hardware consultancies (like software consultancies) and highlight successful stories
  • Educate people more on the "next level" of IOT platforms available. After the customer has toyed around with Arduino and Raspberry Pi based solutions for an introductory or POC kind of application, he is often not aware of the channels or resources which can help him take this POC to the industrial grade IOT product or hardware that he envisions.

Data Pipelines and the Big Data World
http://blog.deyvos.com/data-pipelines-in-the-big-data-world/ | 12 Oct 2018

Introduction

According to wikipedia,

A pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion; in that case, some amount of buffer storage is often inserted between elements.

Data pipelines are today an essential part of the big data world. While some of the challenges of large data stores, typically seen in data warehousing systems, have carried over, the big data world, with its streaming data architectures and near real time analytics and visualization capabilities, has given rise to some interesting data pipeline implementations. A lot of thought has gone into them, and we try to summarize it in this article.

Data pipelines - history and origins

The concept of a pipeline began with the good old Unix "pipe" symbol (|). What was the symbol used for? For sending the output of one command to another on the command line. Effectively, the output of one "process" (on the left side of the pipe) was given as "input" to another process (on the right side of the pipe). The underlying concept was that of a pipe and filter.

"Input (Source) >> Pipe >> Filter >> Pipe >> Filter >> Output (Sink)"

Pipes are connectors which send data from one component (filter) to another.

Filters do actual data "processing" (transformation/cleansing/scrubbing/munging... whatever)

Input or Source is the actual data source (database output/text file/SQL resultset/raw text)

Output or Sink is the final output at the end of this chain.
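The same pipe-and-filter idea shows up in Node.js streams. Here is a small TypeScript sketch (the log file name is made up): the read stream is the source, each Transform is a filter, and stdout is the sink.

```typescript
import { createReadStream } from 'fs';
import { Transform } from 'stream';

// Filter 1: keep only lines containing "ERROR".
// (A production filter would buffer partial lines across chunk boundaries.)
const onlyErrors = new Transform({
  transform(chunk, _encoding, callback) {
    const lines = chunk
      .toString()
      .split('\n')
      .filter((line: string) => line.includes('ERROR'));
    callback(null, lines.length ? lines.join('\n') + '\n' : '');
  },
});

// Filter 2: upper-case whatever flows through.
const toUpperCase = new Transform({
  transform(chunk, _encoding, callback) {
    callback(null, chunk.toString().toUpperCase());
  },
});

createReadStream('app.log') // source
  .pipe(onlyErrors)         // filter
  .pipe(toUpperCase)        // filter
  .pipe(process.stdout);    // sink
```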

However, this simple architecture has its own drawbacks. How do you handle data or memory overflow errors? What if the filter process died in between? You see the room for errors here...?

Data pipelines in the Big Data world

The world has moved on from there, and now, with the rise of "big data", developers talk in terms of data pipelines. We often hear keywords like data pipeline, analytics pipeline, process pipeline and, nowadays, "big data" pipelines. How do we differentiate between them? Let us begin with the basic concept of a data pipeline.

A data pipeline is made up of intermediate tasks, each of which encapsulates a process. Data pipelines open up the possibility of creating "workflows" which can help reuse, modularize and componentize data flows. As complexities increase, tasks can be made to work in series or in parallel as part of the workflow.

In a nutshell, data pipelines perform these operations

  • store schema and access information of different data sources
  • extract discrete data elements of the source data
  • correct errors in data elements extracted from source data
  • standardize data in data elements based on field type
  • transform data using rules driven framework
  • copy extracted, standardized and corrected data from a source data source to destination data source
  • join or merge (in a rule driven way) with other data sources
  • move and merge this data (on-demand basis) and save it to a storage system typically called a data lake

Data pipeline components

A data pipeline would typically include

Serialization Frameworks

There are plenty of serialization frameworks in the market like Protocol Buffers, Avro, Thrift etc in the open source world. Systems (in this case data pipelines) need to be able to serialize and de-serialize data when sending from a data source to a destination data source. The systems should understand a consistent data packing and unpacking mechanism.

Message Bus

Message buses which are typically rules based routing systems are the actual workhorses of the data pipeline, helping move chunks of data (sometimes at blazing speeds) from one system to another. The Bus can understand multiple protocols, serialization mechanisms and intelligently route data between systems using a rules engine.

Event processing frameworks

Data pipelines need event processing frameworks to identify trigger events, which in turn help identify and generate the necessary data. This data, which needs to be routed to systems using a rule driven framework (typically inside a message bus), is identified by the events generated by these "event frameworks".

Workflow management tools

These are different from the rules based routing frameworks working inside a message bus. Typically these are "orchestrating" or "choreographing" systems which supervise the processes which run inside your data pipelines

Storage layer

These are file systems or data storage systems (called the persistence layer), which allow data to be saved. The data could be entering the storage layer in a stream form or a batch form.

Query layer

This is the layer where queries are made on the storage layer.
Storage layers nowadays typically support polyglot persistence. Polyglot persistence refers to a persistence architecture which uses multiple database instances, where the instances could be of different type. There is flexibility in terms of using a mix of languages to query the databases and merge the results (using mapreduce etc) to obtain the data expected from the query output.

Tools which allow this kind of querying include Apache Hive, Spark SQL, Amazon Redshift and Presto.

Analytics layer

This is the layer where actual analytical models are made from the data obtained from the query layer. In the analytics layer, the parameters/variables for the predictive model are extracted and tuned. These variables and their parameters may change as the model changes with new data being ingested. It is in the analytics layer where algorithms like k-means, random forests and decision trees are used to create machine learning models.

What are stream operators, data nodes, scheduling pipelines and actions in a data pipeline?

Before we understand the concept of stream operators, we must understand field names, field values and records. A record is an immutable data structure containing data of various field types; each field has a name and a value. A data pipeline consists of data readers and data writers. A data reader streams record data into the pipeline by extracting the data from the source data location. Data writers stream record data out of the pipeline and into the target data locations. Stream operators select, using filters, the data which needs to be transformed, altered or enriched while in flow inside a pipeline. Stream operators could implement the decorator pattern on top of readers and writers to provide them with such filtering capabilities.
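A rough TypeScript sketch of those roles (all names are illustrative, not from any particular framework); note how the stream operator decorates a reader without the downstream writer having to know:

```typescript
type FieldValue = string | number | boolean | null;
type PipelineRecord = Readonly<{ [field: string]: FieldValue }>; // immutable record

interface DataReader {
  // Streams records into the pipeline from the source data location.
  read(): AsyncIterable<PipelineRecord>;
}

interface DataWriter {
  // Streams records out of the pipeline into the target data location.
  write(records: AsyncIterable<PipelineRecord>): Promise<void>;
}

// A stream operator: decorates a reader, selecting (filtering) records
// in flight while leaving the reader and writer untouched.
function filterOperator(
  source: DataReader,
  predicate: (record: PipelineRecord) => boolean,
): DataReader {
  return {
    async *read() {
      for await (const record of source.read()) {
        if (predicate(record)) yield record;
      }
    },
  };
}

// Usage: writer.write(filterOperator(reader, (r) => r.status === 'ACTIVE').read());
```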

A data node is typically a data location storing a certain type of data. The pipeline uses the data node as an input source data or an output target data location.

Pipelines are logical software components or artifacts which can be created, viewed, edited, cloned, tagged, deactivated or deleted on demand basis. Scheduling a pipeline means creating a Scheduled event or scheduling the creation of a pipeline.

Activities in a data pipeline are the units of work defined to be performed as part of the pipeline. These could include data transfer activities, execution of custom scripts, and querying or transformation of data inside the data pipeline.

A data pipeline could have preconditions, which are conditional statements that determine whether an activity should be executed or not. These conditions could be scripted or defined using some rules engine.

Actions in a data pipeline are steps to be taken after a successful or unsuccessful execution of an activity. This may mean sending notifications or messages to the intended recipients who need to be notified in case of success, failure or an exception having been reached. The job is to alert the intended parties about the final activity completion status. These actions could in turn trigger further activities which have been defined as part of the data pipeline.

What are the data pipeline architectural patterns?

The data pipeline is an architectural pattern which defines the software components in a big data system through which data flows, in a combination of stages which include data acquisition, processing, transformation, storage, querying and analytics.

Some of the architectural patterns which have become popular over time are

  • Lambda architecture
  • Kappa architecture
  • Polyglot persistence

Lambda architecture

Lambda architecture is a data processing architecture which takes advantage of both batch and stream processing methods to provide comprehensive and accurate views. The Lambda architecture splits the data stream into two streams: one goes for batch processing in what is called the batch layer, and the other goes for real time stream processing in the speed layer. The batch layer holds massive amounts of data and hence can provide accurate insights when generating views, since it contains "all" data over eternity. The speed layer contains the most recent data, which provides inputs to the "view layer", also called the serving layer. Since the speed layer has only the most recent data, its accuracy is lower; this is an architectural tradeoff where you sacrifice accuracy for speed. The serving layer contains a "joined" output of the batch layer and speed layer which responds to ad-hoc queries. This layer uses fast NoSQL databases for generating quick pre-computed or on-demand computed views.
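A toy sketch of the serving layer "join", with made-up page view numbers: the batch view is accurate but slightly stale, while the speed view covers only the window since the last batch recomputation.

```typescript
interface PageViewCounts {
  [pageId: string]: number;
}

// Recomputed from the full dataset, e.g. nightly (accurate, but stale).
const batchView: PageViewCounts = { home: 10000, pricing: 2500 };

// Built from the stream, covering only the last few minutes (fresh, approximate).
const speedView: PageViewCounts = { home: 42, pricing: 7 };

// Serving layer: answers ad-hoc queries by joining both views.
function servePageViews(pageId: string): number {
  return (batchView[pageId] ?? 0) + (speedView[pageId] ?? 0);
}

console.log(servePageViews('home')); // 10042
```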

The downside of a Lambda architecture is the need to maintain two different layers of data - the speed and batch layer which can increase infrastructural costs and development complexity since both layers need to be "joined" for the "serving layer" to provide useful outputs.

A typical example of the batch layer is Apache Hadoop.

Examples for the speed layer are Apache Spark, Storm, Kafka and SQLStream.

Examples of the serving layer for speed layer output are Elasticsearch (ELK stack), Apache HBase, Cassandra and MongoDB.

Examples of the serving layer for batch layer output are Apache Impala, Hive and ElephantDB.

Kappa architecture

The Lambda architecture has the challenge of maintaining two different storage layers - the speed and batch layers. The Kappa architecture solves this by keeping only a single layer, the "speed layer", on which all processing is done. This architecture typically needs very fast data stream "replay" and "processing" capability. Results of this processing are then kept in the "serving layer", which gets its feed from only one layer - the speed layer. The architecture is simpler in that there is only one code base or architectural silo to maintain. However, the Kappa architecture has its own set of challenges. One challenge is that event data may arrive out of order, and "replaying" data for "re-querying" has additional cost and complexity. Finding events which are duplicates, or which "cross-reference" each other because they are part of a larger transaction or "workflow", makes processing a stream more complex.

Polyglot persistence

While this may not look like an "architectural pattern" to the purists, keeping different types of persistence systems (SQL, NoSQL and NewSQL based) can help solve a data ingestion, processing and pipelining problem. Since different databases are designed to solve different classes of "data problems", no single "database system" can solve upfront all the problems which "big data" platforms can have. In most cases, the architecture evolves as the business finds out what new types of data to capture and engineers figure out new ways of capturing it technically. One effective or "quick fix" way is to use polyglot persistence to capture different kinds of data and just "save" it for future usage. Capture whatever you can, however you can. The systems which do the actual data processing and provide business insights can be built on top of this polyglot persistence layer at a later stage.

How does a data pipeline handle errors?

Create an Error Pipeline

Creating an error pipeline and joining it with the parent pipeline is one way of handling errors. Filters can be used to identify events or data in the pipeline which should be marked as "exceptions" or "errors"; these can then be routed to the error pipeline. Care should be taken that parent pipelines, after execution, also clean up the error pipelines once the errors have been addressed appropriately.

Conditional timeouts and troubleshooting

When pipelines are doing the grunt work, their status can be monitored for changes. A timeout mechanism, whereby they are monitored for whether they finished their jobs/tasks successfully or not, can be used to figure out the final "status". Their execution details and summary can be used to ascertain all the "errors" during job/task execution, and appropriate actions or events can be fired to take care of the error(s) or error data. Depending on the error generated, a whole pipeline can be failed by design.

Error views

In either case, whether having a separate error pipeline or handling errors generated in a pipeline while cleaning it up, it is advantageous to have an "error view". These views can carry the original event documents, the exception or error message, the event ID (if the document was generated as part of a larger event lifecycle), saga ID, parent process ID, timestamp, source system ID and any other metadata which could help identify the event source.
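A sketch of how an error view entry and an error routing step could be wired up (the field names mirror the list above; everything else is illustrative):

```typescript
interface ErrorViewEntry {
  originalEvent: unknown;   // the original event document, captured verbatim
  errorMessage: string;
  eventId?: string;
  sagaId?: string;
  parentProcessId?: string;
  sourceSystemId: string;
  timestamp: string;
}

// Runs a transform over incoming events; successes go to the main sink,
// failures are wrapped with metadata and routed to the error sink.
async function processWithErrorPipeline<T, R>(
  events: AsyncIterable<T>,
  transform: (event: T) => Promise<R>,
  mainSink: (result: R) => Promise<void>,
  errorSink: (entry: ErrorViewEntry) => Promise<void>,
  sourceSystemId: string,
): Promise<void> {
  for await (const event of events) {
    try {
      await mainSink(await transform(event));
    } catch (err) {
      await errorSink({
        originalEvent: event,
        errorMessage: err instanceof Error ? err.message : String(err),
        sourceSystemId,
        timestamp: new Date().toISOString(),
      });
    }
  }
}
```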

One common design error often seen is pushing massive amounts of data through a single pipeline, leading to large volumes of error data which is challenging to process and troubleshoot. Keeping data pipelines scoped to smaller, "manageable" tasks has often been found to be a more maintainable architecture.

What are the typical challenges implementing a data pipeline or choosing a data pipeline framework?

  • A distributed stream processing framework needs performant "in memory computational" capabilities to run rules and filters on the data. The stream should also be immutable, so as not to corrupt the source data.
  • Data pipeline frameworks have storage complexities for large data sets. This means performant (disk) IO capabilities especially in a distributed computing environment.
  • Stream processing may need both Synchronous and Asynchronous processing capabilities since data may come out of order or may have complex rules based filtering / processing criteria.
  • A data pipeline should have the capability to process data as per schedule or in an on-demand way.
  • Streaming data comes from Multiple sources and can get routed to Multiple targets. Data pipeline frameworks should have resilient pub-sub models for complex data routing requirements.
  • Since the data itself may have rules of processing and persisting them in series or parallel, frameworks should have the capability of processing in batch/series/parallel
  • Any data pipeline framework should allow custom or even complex processing of data. It should have the capability to support rules based engines or filtering rules which may even have more complex "state management" needs for processing data.
  • Data pipelines should be performant whether the needs are compute intensive or data intensive.

Are there any architectural differences between compute intensive vs data intensive data pipelines?

The data tunneling needs for a compute vs data intensive pipeline are being researched by different groups of people. Typical solutions are

  • using multi-core architectures to scale up computationally,
  • reduce object sharing and increase parallel execution models for high speed processing of streaming data
  • reduce disk I/O and increase in-memory processing
  • use network accelerators for high speed data transfer

Which all (open source) frameworks are used to implement data pipelines?

There is a flood of open source tools in the market but we will have time only to cover the behemoths among them which are listed down here.

Apache Hadoop

The grand daddy of all data processing frameworks, Hadoop is a collection of open source projects which together help solve problems involving massive amounts of data. It uses the MapReduce programming model to rapidly compute queries over data stored on multiple systems, by running the queries in parallel and combining their results. Hadoop is built on top of HDFS (the Hadoop Distributed File System), which is the distributed storage part of Hadoop, and Hadoop MapReduce, which is the querying and large scale data processing part.

Apache Spark

Spark is a "cluster computing framework" which allows the capability to program entire clusters in parallel. Spark includes a cluster manager and distributed file storage system. The core of the system includes Spark Core which provides features like task scheduling, dispatching, APIs based on the RDD (resilient distributed data set) abstraction. The RDDs were in response to the limitations of the MapReduce computing model which came as part of the Apache Hadoop project. The RDD functions' latency could be reduced by multiple times compared to the MapReduce computation model by allowing repeated database-style querying of data. RDDs which are an abstraction which provide convenience in terms of working with distributed in-memory data. Spark provides an interactive shell (REPL ) which is great for development and helps data scientists quickly get productive with their models.

Further components of the Spark framework are

  • Spark Core - An API based on the RDD abstraction which allows a functional, higher order programming model to invoke "map, filter and reduce" functions in parallel. It also allows RDDs to be cached in memory across operations.
  • Spark SQL - A DSL (domain specific language) which allows manipulation of DataFrames (data resultsets crudely speaking) for structured and semi-structured data
  • Spark Streaming - A framework for performing streaming analytics
  • Spark MLlib - A distributed machine learning framework considered faster than Vowpal Wabbit and Apache Mahout
  • GraphX - A distributed graph processing framework built on top of Spark based on RDDs

Apache Storm

Apache Storm is a distributed stream processing framework written in Clojure. The framework allows both distributed real time and batch processing of streaming data. A Storm application treats a graph of data pipelines as a DAG (directed acyclic graph) topology: the edges are the streams and the vertices are the data nodes, with spouts and bolts acting as the vertices. A topology is thus a network made of streams, spouts and bolts. A spout is a data stream source; its job is to convert data into a stream of tuples and send them to bolts on a need basis. A stream is an unbounded pipeline of tuples.

Apache Airflow

Apache Airflow was developed at Airbnb and later became part of the Apache Software Foundation. The official definition of Airflow describes it as a platform for programmatically authoring, scheduling and monitoring workflows. It is a workflow management system which inherently uses data pipelining mechanisms to intelligently route data and create workflows on top of them. Similar to Apache Storm, the entire workflow is treated as a DAG (directed acyclic graph). Since Airflow ships with a rich UI, it becomes easier to create, monitor, visualize and troubleshoot these workflow pipelines. On top of the visualization, there is rich CLI (command line interface) support to perform complex operations on the DAG.

Apache Beam

Apache Beam is a programming model built to define, create, execute and monitor data pipelines. It has support for ETL (extract, transform, load) operations, batch operations and stream processing operations. Apache Beam is an implementation of the Google Dataflow model paper. Once you define a pipeline in Apache Beam, its execution can be handled by any of the distributed processing back ends like Apache Apex, Flink, Spark, Samza or Google Cloud Dataflow.

Apache Flink

Apache Flink is an open source distributed stream processing framework which executes dataflow programs in both a data-parallel and a pipelined manner. The pipelines enable execution in both bulk and batch processing as well as in stream processing. It also provides additional features like event-time processing and state management. Flink does not provide its own storage system; it provides source and sink connectors to other systems like HDFS, Cassandra, Elasticsearch, Kafka or even Amazon Kinesis.

Apache Tez

Apache Tez is a framework which allows a complex DAG of tasks to be run for processing streaming data. It provides a faster execution engine for MapReduce than the one provided by Hadoop when using Hive and Pig. Tez provides APIs to define dataflows with flexible input-output runtime models, and dataflow decisions can be optimized by allowing the DAG to be changed at runtime. While, compared to Spark, it may not have the advantage of in-memory dataset processing or immutable datasets (RDDs) to play around with, its faster implementation of the MapReduce model allows it to be used with Apache Hadoop and YARN based applications for speed and performance. A comparison of the performance metrics can be seen here.

Apache Samza

Another Apache open source framework, Samza is a near real time, asynchronous framework for distributed stream processing. A unique feature of Samza is the idea of immutable streams: upon receiving a stream, Samza allows the creation of immutable streams which, when shared with other processing engines, do not allow the original stream to be altered or affected in any way. Samza works along with Apache Kafka clusters (whose nodes are called brokers). Like a typical streaming application, Kafka contains topics to which producers "write" data and from which consumers "read" data. Samza is written in Scala and Java and was developed in conjunction with Apache Kafka; both were originally developed at LinkedIn before becoming Apache incubation projects.

Apache Kafka

Apache Kafka is an open source stream processing platform. It has become popular by providing a unified, high throughput, low latency stream processing platform which works on the principle of a massively scalable pub/sub queue implemented as a distributed transaction log (also called a commit log). Kafka provides a Connect API to import and export data from other systems, and the Kafka Streams API, a stream processing library written in Java which allows the development of "stateful" stream processing applications.
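For a feel of the pub/sub model from application code, here is a small producer/consumer sketch using kafkajs, a community Node.js client (not the Java Streams API described above). It assumes kafkajs v2 and a broker on localhost; the topic and group names are made up.

```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'pipeline-demo', brokers: ['localhost:9092'] });

async function run() {
  // Producer: append a message to the distributed commit log.
  const producer = kafka.producer();
  await producer.connect();
  await producer.send({
    topic: 'sensor-readings',
    messages: [{ key: 'sensor-1', value: JSON.stringify({ tempC: 21.5 }) }],
  });

  // Consumer: read messages back as part of a consumer group.
  const consumer = kafka.consumer({ groupId: 'pipeline-demo-group' });
  await consumer.connect();
  await consumer.subscribe({ topics: ['sensor-readings'], fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ topic, partition, message }) => {
      console.log(topic, partition, message.value?.toString());
    },
  });
}

run().catch(console.error);
```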

Can we have API based usage of data pipelines?

There are systems (mostly SAAS based) like Amazon Data Pipeline, Parse.ly, Openbridge, etc. which allow API-like usage of data pipelines. All you have to do is create a pipeline on the fly (declaratively or using some cloud configuration) and start ingesting and routing data. A sample usage is Amazon Redshift, which allows integration with Amazon's Data Pipeline and SQS services to ingest petabytes of data from multiple sources. All you need is to create the pipeline (programmatically or declaratively) and then consume it like an API.

Enterprise Service Bus (ESB) vs a data pipeline - how are they different?

As is often said in programming circles, "buses don't fly in the cloud". When it comes to high throughput data which can run into petabytes, ESBs don't scale up. While open source frameworks like Kafka, Samza, Storm, Spark and Flink are designed to handle high speed streaming data, conventional ESBs like ActiveMQ, Mule etc. may not be the right choice to handle such a deluge of streaming data; scaling up will be a challenge. This is even though, functionally, both ESBs and data pipelines perform similar tasks: routing event data, allowing rules based routing and transformation, and allowing pub-sub models for data subscription and publishing.

How do you manage security for data pipelines?

Physically, most data pipelines today reside on cloud infrastructure. Handling data pipeline security is therefore akin to handling the security of any cloud based infrastructure. This entails creating the right security groups, securing and encrypting communication between cloud instances, securing the server ports, and having rigorous authorization and authentication mechanisms for the agents accessing the cloud infrastructure.

Using secure transport layer (SSL/TLS) on the data pipeline is an additional way of securing data streaming between cloud instances.

How can data pipelines be debugged?

Debugging streaming data is a tough ask. Marking event data as "healthy" (if it got ingested) or as "error" in case it met an exception is one way of "marking" the source data which is causing exceptions.

Marking "error" data and routing it through the right pipelines or "logging" it to make it accessible for debugging is the right way. Unlike throwing an exception where the actual payload data may have to be figured out, explicitly dumping "unhealthy" data which failed in a pipeline (the way it is mentioned in the section "Error views" above) is the right way to design your system. Capture everything when logging errors.

What are the different data storage and modeling solutions available when it comes to polyglot persistence in big data solutions?

Common data storage solutions get classified in one of these categories of databases

  • In-memory datastores (HSQL, Redis)
  • File system based databases (MySQL, Oracle)
  • OLAP or OLTP/TDS (Oracle, Postgres, MySQL)
  • Distributed file systems (HDFS etc)
  • Data marts or warehouses or master data stores (as traditional warehousing solutions) (Oracle DW)
  • ODS or operational data stores (different from traditional data warehouses)
  • Data lakes (data from multiple data stores - sql/nosql merged into one massive data lake) (Amazon or Azure data lakes)

Traditionally, the OLAP databases have been used for "bulk" data used for warehousing solutions. OLTP databases are used for event based or transactional processing on real time basis.

Is Javascript eating Java?
http://blog.deyvos.com/is-javascript-eating-java/ | 08 Oct 2018

Developments in Javascript till 2018

When it comes to employing full stack engineers, we normally look at having a separate team for the mobile division, the browser application team and the backend or server side team. With Javascript, especially NodeJS, this is changing. All you now need to know is Javascript, and you can have a team with the same skill set working on the backend, the front end and the mobile app. In a Javascript world, that would mean NodeJS on the server side, Javascript frameworks on the client side and JS frameworks for mobile (with cross platform frameworks like React Native). Javascript has become ubiquitous, getting adopted by the web, mobile, embedded, server, devops and database worlds. The force is rising! You could say Javascript is eating the world. At least it seems so, with the hype the language is getting.

Of late, you even have TensorFlow in Javascript, which brings machine learning capabilities inside the browser. NodeJS is very popular on the server side, and its LTS (long term support) releases make it especially attractive to large enterprises.

So is Java getting eaten up by Javascript? Going by the trends which most major players are showing, it seems like a yes. Some of the major players to move to Javascript (especially NodeJS) are Netflix, LinkedIn, Uber, PayPal, AOL, eBay and Zappos.

So here are some of the metrics under which we will try to make a more objective comparison in this debate.

Developer Productivity

Simple server side applications can be created quickly in NodeJS. One can put together prototypes pretty quickly and it works well for teams which work in client side frameworks to dabble in server side frameworks and quickly get productive.

Java is NOT a front end framework and does not provide anything on the client side the way Javascript does - in terms of frameworks like Vue, Angular, React, JQuery etc. Javascript is the de facto language of choice for web and browser development.

Performance

Both platforms, Java and Javascript, are (infinitely) scalable. It is, however, the threading model which Java application servers take, versus the process model of NodeJS servers, which can impact the architectural choice.

Java application servers create a multithreaded application, but all the threads belong to one (operating system) "process". In case of a thread hangup, if it impacts a shared resource which the process shares amongst its other threads, it can block all the threads (and as a result hang the process running all the threads in the application server). However, this multithreaded paradigm allows Java to have shared state amongst threads and enables concurrent, multi-threaded applications where thread concurrency, synchronization and parallelization can be controlled programmatically. Such applications need to be debugged for deadlock or thread starvation situations.

The NodeJS approach is different. It follows a single thread per process architecture where no resources are shared between processes. Even if one worker fails, the others are not impacted (since one thread is equivalent to one server process): the failed process can die in the background while the other processes keep serving user requests. This, however, makes sharing variables between threads (in this case processes) difficult; there is no shared state, or at least it is difficult to implement.
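A small sketch of that model using Node's built-in cluster module (Node 16+ API; older versions use cluster.isMaster instead of isPrimary): each worker is a separate process with its own single-threaded event loop and no shared memory, and a crashed worker is simply replaced.

```typescript
import cluster from 'cluster';
import http from 'http';
import os from 'os';

if (cluster.isPrimary) {
  // Fork one worker process per CPU core.
  for (let i = 0; i < os.cpus().length; i++) cluster.fork();
  // If a worker dies, the others keep serving; just start a replacement.
  cluster.on('exit', () => cluster.fork());
} else {
  let handled = 0; // local to this process only; nothing is shared between workers
  http
    .createServer((_req, res) => {
      handled++;
      res.end(`Handled by worker ${process.pid} (${handled} requests)\n`);
    })
    .listen(3000);
}
```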

Skills and Experience

Finding Java guys with a few decades of experience is not a problem. It is a fairly mature, old and stable platform with developers having rich experience. Finding mature Javascript developers is still a tough ask. The language has had an upsurge of late and finding developers with as much experience is a challenging requirement.

Even a decade back, basic Java topics had fairly comprehensive documentation (thanks to Sun Microsystems promoting Java at all levels), but Javascript documentation was still obscure and not that well organized. Developer forums had to be consulted to learn the different tricks, tweaks and language semantics which were known only to a select set of gurus. There was no single "organized" and "well documented" source of truth for learning all the Javascript nuances.

Over the last few years, there has been an explosion in the Javascript developer community, whose participation in online forums like stackoverflow and stackexchange has led to a lot of documentation and language nuances becoming available. Javascript best practices have evolved and become more mature than they were a decade back.

Resource availability

Traditionally, Javascript resources have been UI developers who had not delved into server side Javascript much. The transition from a UI developer to a full stack developer is a fairly recent development. Just like Python, Javascript has seen a surge in demand in the last five years because of the popularity of frameworks like React and NodeJS. However, there is still a dearth of good resources in the Javascript world, and this is a gap which is still getting filled. Over the next few years, because of the availability of rich online resources, the supply of skilled resources should increase. Java being a mature platform, resources are fairly easily available, since over the last two decades the language has been promoted well in school and university level programming courses. The same cannot be said for Javascript.

Features and capabilities

  • Exception handling and debugging in JavaScript/Node.js are still evolving. Java handles both in a much more robust and mature way, having been positioned from the start as a server-side technology of choice. So while Java gives you structured log files and IDEs with full debugging support, JavaScript developers still resort heavily to console.log() style debugging, which hurts productivity. Tools like node-inspector (and the built-in node --inspect flag) are gradually resolving some of these problems.
  • Since Java is a statically typed language, it is easier to debug than a dynamically typed language like JavaScript. Because variable types can change at runtime in JavaScript, fixing defects can be more challenging. TypeScript is bridging many of these gaps, but it is still far from where Java is in its evolution.
  • Node.js and other server-side JavaScript frameworks do not natively support multi-threading. Java offers a rich set of APIs for multithreaded applications, such as thread pools, executors and the concurrency utilities. Node.js works around this limitation with its event-driven architecture, in which a single-threaded event loop consumes events as soon as they are produced (see the sketch after this list).
  • Thanks to Eclipse and JetBrains, Java has more robust and richer (plugin-based) IDE support than what we currently see for JavaScript, where the IDEs are still evolving.
  • For mobile platforms, JavaScript has frameworks like React Native which boost productivity by allowing cross-platform apps to be built for both iOS and Android. Java / Kotlin can only cater to one of those platforms - Android.
  • Java has comprehensive build and dependency-management tools like Gradle and Maven which are very mature. npm for JavaScript / Node.js is still evolving, and errors generated during the build or package-installation stages are often cryptic and difficult to troubleshoot.
  • Type safety in Java is assured because it is a statically typed language. The lack of it can become a problem in JavaScript, especially when consuming APIs or third-party packages: if the documentation is lacking, figuring out the right usage or the data type a third-party library returns can be a big pain!
  • While the object-oriented programming paradigm is strongly encouraged in Java and other statically typed languages like C#, the same is not true of many JavaScript frameworks, which prefer a more functional, event-driven or procedural style of coding. Although this is largely a matter of personal taste, the OOP wave of the 90s and early 2000s came about for a reason - managing complexity in very large projects - which parts of the Node community choose to overlook.
  • When it comes to heavy-duty, industrial-grade features like transactional logic, class loaders and integration with external services, the Java ecosystem is considerably more mature than what the Node.js ecosystem currently offers.
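The event-driven sketch referred to above: in Node, I/O calls return immediately and the single-threaded event loop delivers the results through callbacks or events when they are ready (file names here are purely illustrative).

    const fs = require('fs');
    const EventEmitter = require('events');

    const jobs = new EventEmitter();
    jobs.on('file-read', (name, bytes) => {
      console.log(`${name} finished: ${bytes} bytes`);
    });

    // Both reads are dispatched back to back; neither blocks the thread.
    ['./a.log', './b.log'].forEach((file) => {
      fs.readFile(file, (err, data) => {
        if (err) return console.error(`Could not read ${file}: ${err.message}`);
        jobs.emit('file-read', file, data.length);
      });
    });

    console.log('Reads dispatched; the event loop will deliver results later.');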

Deployment speed

Java has powerful build tools like Maven and Gradle which can model some of the most elaborate build processes imaginable. The build process in the Node.js world can be as simple as installing the right packages with the npm package manager and copying the scripts onto the server; configure an Apache or nginx front end and you are good to go. Server-side JavaScript still has a long way to go in terms of heavyweight build-management tooling, but indirectly this also means it takes considerably less time and effort to deploy code to a production server!

While a Java web application needs to be deployed into an application server, which then has to be fronted by a web server, Node itself provides the runtime environment (built on Chrome's V8 JavaScript engine) to execute JS code directly.
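As a minimal sketch of that, the following is a complete Node server that needs no separate application server; running node server.js (with nginx optionally in front as a reverse proxy) is enough. The file name and port are assumptions for the example.

    // server.js - deployable as-is with: node server.js
    const http = require('http');

    const server = http.createServer((req, res) => {
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ status: 'ok', path: req.url }));
    });

    server.listen(8080, () => {
      console.log('Listening on http://localhost:8080');
    });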

Support and Maintenance

From a code-maintenance perspective, statically typed languages like C# and Java are far less painful than dynamic languages like JavaScript, for the following reasons.

  • JavaScript code is not exactly "modular" by design; deliberate effort is needed for the code to stay modular, readable and maintainable.
  • Function scoping in JavaScript is tricky to master and can lead to bugs that escape the eye of a less experienced developer (see the small sketch after this list).
  • For very large teams, code maintenance and bug fixing are relatively easier in Java because static typing, the compilation and build process, and rich IDE-based debugging support quickly surface the code areas impacted by a change.
  • JavaScript features like promises, closures, callbacks and event binding can make code hard to debug. For smaller teams (say 4-5 members) the code remains maintainable, but for very large teams and codebases JavaScript becomes difficult to debug and maintain because of these language nuances; it can be hard to understand the full impact of even small changes.
  • Package versioning and compatibility issues are more prevalent in JavaScript because of the rapid pace of development across frameworks. Java packages are relatively more stable, and platforms like Spring resolve binary dependencies at project-configuration time. A package upgrade is therefore more likely to cause unpredictable breakage in JavaScript than on the Java platform.
  • Upgrading packages can be troublesome in JavaScript / Node.js when they use native (C/C++) bindings. In the Java world, replacing binaries (jars) usually just works and no native builds need to be triggered, but with native bindings in Node you have to recompile them (e.g. npm rebuild) for the upgrade to work. That is an extra step in the build process, and forgetting it can make the upgrade fail.
  • For many problems, the Java community has more thoroughly researched and readily available solutions than the JavaScript community; a Google search tends to surface answers for Java issues more easily than for JavaScript ones.
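The scoping sketch referred to above - the classic var-in-a-loop pitfall and its block-scoped fix:

    // `var` is function-scoped: every callback closes over the same variable.
    for (var i = 0; i < 3; i++) {
      setTimeout(() => console.log('var loop:', i), 0); // prints 3, 3, 3
    }

    // `let` is block-scoped: each iteration gets its own binding.
    for (let j = 0; j < 3; j++) {
      setTimeout(() => console.log('let loop:', j), 0); // prints 0, 1, 2
    }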

Security

Compiled languages ship binaries or bytecode, and even interpreted languages like Python emit compiled bytecode (.pyc files). Node, being JavaScript-based, generates no such artifacts: the code being executed is the code "as is" on the file system. Many see this as a potential for code to be compromised, since it can simply be read and there is no extra reverse-engineering effort required of an attacker. The best you can do is obfuscate the code before deployment, which makes it unreadable at best; there are, however, tools that can beautify obfuscated code and make it readable again.

To be fair, bytecode files can also be reverse engineered using decompilers and disassemblers.

Summary and afterthoughts

Which language you picked up first (Java or JavaScript), and how you approach the second one, often determines how comfortable you are switching between them. Familiarity makes you productive, and even though every language has its nuances, if you find your way to being productive you can make most languages work. There is no single way to strictly determine which language is best. However, with the slew of companies adopting JavaScript frameworks on the server side, it seems more and more of them are betting on this stack and feel that the time for Java is behind us - or at least that is how it seems!

]]>
<![CDATA[Address cleanup for logistics companies using ML]]>http://blog.deyvos.com/address-cleanup-using-machine-learning/6587fd1f21c3587be7b08a8aSat, 06 Oct 2018 17:12:00 GMT

Address cleanup using machine learning is a sought-after solution for Indian logistics companies. The problem is much more complex than it initially looks. While companies like Flipkart and Snapdeal have had their fair share of success thanks to access to vast amounts of user address data, there are standard ways in which early-stage logistics companies can build in-house solutions to the same problems.

Beginning with the data itself

AI and ML data scientists spend more than half of their time cleaning up data; the actual modeling and number crunching is far less time consuming. This is standard industry experience for almost any machine learning problem. The more accurate the data, the more accurate the models, and the more accurate the output.

So for the address "cleanup" problem, how do you get the cleaned up data? Here are some of the problems which the data will throw in your face...

Data does not follow a particular format and is out of "order" (e.g. landmark data inserted randomly inside the actual address)

There is no "standardization" of the format for addresses in the Indian diaspora. E-commerce sites would provide consumers with standard form based inputs in a particular wizard like form sequence, but then people have the freedom to choose what kind of data they insert into it.

Addresses shared with logistics companies are not exactly "standardized", and this is a practice which cannot really be policed. All the address data may have been stuffed into a single text field, or the street address may be mixed up with the locality - there is more than one way to "ruffle up" the cat, proverbially speaking. This is a real problem for address standardization at logistics companies. Using machine learning, however, algorithms can learn "key phrases" and associate them with the correct context, though this may require a lot of manual data "annotation" or "labeling".

Data is spelled incorrectly

Take the typical case of a place like Bangalore, a.k.a. Bengaluru: there are many ways to spell a locality with a complex pronunciation, and long-tail address keywords run straight into this problem. People in the same locality will spell it differently!

Let us take the case of Kadugondanahalli, a locality in Bengaluru. This locality can be spelled as...

  • Kadugondanahalli - original
  • Kadugondanhalli - "a" missing
  • Kadu gondan halli - split into three words
  • Kadu gondanhalli - split into two words
  • Kadugondan Halli - split into two words at a different location
  • Kadgondanhalli - "u" missing
  • Kadagondanahalli - "u" misspelt as "a"
  • Kadugondanhali - single "l" instead of double

Most of the time your conventional RDBMS will not be able to figure out an "exact" match. Databases like MySQL and SQL Server do offer SOUNDEX-style matching which can return "similar sounding" words, but this gets complicated by the fact that the keyword itself may be split into multiple words depending on user input.
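One simple way to catch such variants - a minimal sketch, not a production matcher - is to normalize case and whitespace and then compare spellings with an edit-distance threshold. The function names and the threshold below are illustrative.

    // Collapse "Kadu gondan halli" and "Kadugondanahalli" towards one token.
    function normalize(locality) {
      return locality.toLowerCase().replace(/\s+/g, '');
    }

    // Classic dynamic-programming edit distance (Levenshtein).
    function levenshtein(a, b) {
      const dp = Array.from({ length: a.length + 1 }, (_, i) => [i]);
      for (let j = 1; j <= b.length; j++) dp[0][j] = j;
      for (let i = 1; i <= a.length; i++) {
        for (let j = 1; j <= b.length; j++) {
          const cost = a[i - 1] === b[j - 1] ? 0 : 1;
          dp[i][j] = Math.min(
            dp[i - 1][j] + 1,        // deletion
            dp[i][j - 1] + 1,        // insertion
            dp[i - 1][j - 1] + cost  // substitution
          );
        }
      }
      return dp[a.length][b.length];
    }

    function isSameLocality(input, canonical, maxDistance = 2) {
      return levenshtein(normalize(input), normalize(canonical)) <= maxDistance;
    }

    console.log(isSameLocality('Kadu gondan halli', 'Kadugondanahalli')); // true
    console.log(isSameLocality('Kadgondanhalli', 'Kadugondanahalli'));    // true
    console.log(isSameLocality('Koramangala', 'Kadugondanahalli'));       // false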

The problem in a nutshell: there is more than one way to "spell" the cat!

So what exactly was the zip code? (Scratch scratch...)

Yes, this is a more prevalent problem than it seems on the surface. Many people in India do not remember their postal or PIN codes - perhaps because they recently migrated to the city, or because they are at an age where they simply forget. This complicates matters, since towns with the exact same name exist in different states. In a geography as large as India, this article could be an eye opener: there are 32 Rampurs in India - some in the same state! This is where the PIN code really matters. Within a large city, something like a "ghanta ghar" or a "hari nagar" could exist at multiple places. Sounds interesting?

Google! Why did the reverse geocoding change the address completely? I just added the flat number.

A lot of times, users are in the habit of adding prefixes like these

  • house number, house#, house no., house num, h.no., ho. no. ...
  • flat number, flat no., flat num.
  • room number, room num, room no, room #
  • shop number, shop num, shop no, shop #
  • flat no H-74, flat number H74, flat number H 74, flat H-74

So why is that a problem?

Append this additional description to an address, query Google for a lookup, and you will see the results sometimes go haywire. The extra tokens confuse the address-lookup engine: with API-based queries, the top responses can land far away (sometimes by many kilometres) from the address that was actually meant. This has a direct impact on your shipment costs and on route management and scheduling algorithms; for a startup, it can add unnecessary burn.
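A minimal sketch of one mitigation, assuming the unit-level details are kept aside for the delivery agent and only the rest of the address is sent to the geocoder. The patterns below are illustrative, not exhaustive.

    // Peel "house/flat/room/shop number" style prefixes off the address.
    const UNIT_PREFIX = /\b(?:house|flat|room|shop)\s*(?:number|num|no\.?|#)?\s*[:\-]?\s*([A-Za-z]?[\s\-]?\d+)\b/i;
    const ABBREV_PREFIX = /\bh(?:o)?\.?\s*no\.?\s*[:\-]?\s*([A-Za-z]?[\s\-]?\d+)\b/i; // h.no., ho. no.

    function splitUnitFromAddress(rawAddress) {
      let unit = null;
      let cleaned = rawAddress;

      for (const pattern of [UNIT_PREFIX, ABBREV_PREFIX]) {
        const match = cleaned.match(pattern);
        if (match) {
          unit = match[1].trim();
          cleaned = cleaned.replace(pattern, '');
          break;
        }
      }

      // Tidy up stray commas and whitespace left behind by the removal.
      cleaned = cleaned.replace(/\s{2,}/g, ' ').replace(/^[\s,]+|[\s,]+$/g, '');
      return { unit, geocodableAddress: cleaned };
    }

    console.log(splitUnitFromAddress('flat no H-74, 5th Cross, Kadugondanahalli, Bengaluru'));
    // { unit: 'H-74', geocodableAddress: '5th Cross, Kadugondanahalli, Bengaluru' }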

Now wait - is that the phone number inside the address?

Yes, this happens too. Data can arrive in formats where a phone number sits inside the address fields. A human can figure that out, but it confuses address parsers about what the field means. Customers shipping gifts to friends often do not have the friend's complete address; asking would look odd, so they put the friend's phone number into the shipping details instead. Not that rare, since the customer "expects" the delivery person to call before delivering the package. Who said life is fair?

The problem gets more complex with deliveries to office addresses. Surprise! Surprise! You could see deliveries addressed to people whose badge/employee number, desk number or phone extension is included in the address.

Did I mention which floor?

Many customers will share floor information too - and it can confuse your poor parser! "Which floor do I want this shipped to" could be stated as...

  • first floor, 1st floor, Ist floor,
  • mezzanine floor
  • top floor
  • ground floor
  • bottom floor

Floor information like this is a great aid to humans but leaves address parsers unsure how to qualify it; it is genuinely hard to classify. Nevertheless, the parser needs to handle these keywords intelligently.

Are we "in front of" the landmark? Or "to the side"? Or "behind" it? or simply adjacent? I need to be accurate!

Ideally we should qualify the landmark or the address in the fewest keywords that make the most sense - clarity and brevity are key. As human beings, however, we want to give the other person as much information as possible to find the actual place. We try to be informative, and that is a problem to handle algorithmically.

Here are some keywords you will find inside addresses, especially when specifying landmark data, which need to be handled:

  • [Adjacent to], [adj to], [adj. to], [adj]
  • [Behind], [behind]
  • [next to], [next]
  • [in front of], [in front]
  • [above]

Very informative to a human being but confusing to a subroutine or intelligent parser.

It is not Calcutta stupid! It is Kolkatta! Or is it Kolkata?

With more than 25 Indian cities having changed their names (Baroda to Vadodara, Cochin to Kochi, Benaras to Varanasi; for the complete list, see here), keeping the latest and best copy of the data (often called the golden copy) up to date takes serious data-cleansing effort. This is not that difficult a problem to solve, provided you have "pro-active users" or a support-center team that keeps the tech team updated with complaints, changes and reported updates from customers whenever the tech team has not kept up with recent geopolitical changes on its own.

Another recent challenge has been the creation of new states (Uttarakhand, Jharkhand, Chhattisgarh and Telangana), after which some state codes, PIN codes and other geopolitical data simply changed. Old addresses need to be mapped to the new ones. Keeping geographical and GIS data current is a challenge companies like Google handle by buying data from third-party providers; the harder part is updating your own data, or expecting customers to update theirs. With "use last known address" or "work/home" address options for shipping packages, you could be asking for trouble unless you get customers to update their data in the regions that actually saw a state or PIN code change!

Hey UIDAI, did you put the GIS-based PIN code on the Aadhaar card?

Here is an interesting news article about three people in one family who got different PIN codes because UIDAI printed GIS-derived PINs on their Aadhaar cards. India Post is an autonomous organization with sole ownership of, and responsibility for, assigning PIN codes to regions in India. Yet there have been cases of people receiving PIN codes derived from GIS data, as in the article above; although UIDAI authorities deny any role, such confusion has surfaced repeatedly over the last few years. Thanks UIDAI! This was all the help we needed!

Address standardization in the Indian context

In the US, address standardization is an industry-wide practice; it is yet to take strong roots in countries like India. Here are some steps being taken by the Indian postal service. A standardized address mandates a particular sequence of fields - number, locality, street, state, country, PIN and so on - but there is no such mandate from the government or the postal service in India. Under different governments and development plans, some cities have been divided into blocks and streets while others have been divided into sectors and phases.

Can we standardize the format of addresses in India, whose nomenclature and legacy date back to British times?

Now that we have "spelt" the cat, how do we skin the cat using machine learning?

Here is a multi-pronged approach which can be taken to get to better-looking data.

Cleaning up the house

The first step in the entire process is creating the "golden copy" of the data, with clearly defined, manually labeled records. Without this "golden reference" it is hard to reach a useful level of accuracy. Processes should be in place to "enrich" the data with new updates to this golden copy, with manual review before the golden copy is changed.

This would become an essential step when creating a supervised model, where the data would need manual interventions and labeling for learning purposes.

Modeling the "postman's" mental models

One crucial step is modeling the way a courier agent identifies, or "maps", a given address to the final location. This has to be "simulated", which requires close interaction with the logistics team. These mental models are specific to the demography, and the logic may change from region to region; they form the core of the engine for address parsing and geolocation mapping. While companies like MapMyIndia do maintain data using deep-mapping technology, translating free-form user address input into a pinpointed geolocation is a different problem to solve.

Learn, learn and learn...

In the end, you have to play with the data to learn from it. Ultimately you will have multiple models from which to choose, optimize and build a recommendation system of candidate locations for a given address. Each recommendation carries a probability score expressing the confidence with which your system believes it is the right match.
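Illustratively, the output for one raw address might look like the sketch below. Field names, scores and the threshold are made up for the example; the idea is a ranked list of candidates with confidence scores, auto-assigning only above a cut-off and routing the rest to manual review.

    // Hypothetical ranked output of the address-resolution models.
    const recommendations = [
      { placeId: 'loc-001', label: 'Kadugondanahalli, Bengaluru 560045', confidence: 0.91 },
      { placeId: 'loc-014', label: 'Kadugodi, Bengaluru 560067',         confidence: 0.34 },
      { placeId: 'loc-203', label: 'Kadubeesanahalli, Bengaluru 560103', confidence: 0.21 },
    ];

    const AUTO_ASSIGN_THRESHOLD = 0.85; // illustrative cut-off
    const best = recommendations[0];
    const decision = best.confidence >= AUTO_ASSIGN_THRESHOLD ? 'auto-assign' : 'manual-review';
    console.log(decision, best.label);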

While there is more than one way to "skin" the cat, you have to find the top ones for every cat that's thrown your way.

]]>
<![CDATA[Progressive Web Applications]]>http://blog.deyvos.com/progressive-web-applications/6587fd1f21c3587be7b08a84Wed, 03 Oct 2018 09:14:00 GMT
What are progressive web applications?

As per Google,

“Progressive web applications use modern capabilities to deliver app-like experience”.

Progressive web apps bridge the gap between native and web-based applications. They do so by using new Web APIs as envisioned in the Extensible Web Manifesto; these low-level APIs let developers write their own libraries and frameworks through which progressive web apps interface with the platform directly.

Native vs progressive web apps

Native apps, the ones you download from the Play Store or App Store onto your phone, have historically had capabilities like working offline, caching content, updating content in the background and sending push notifications by default. Mobile web apps, by contrast, have restricted access to the phone's resources since they run inside the mobile browser. Hybrid apps have access to resources similar to a native app, and look like one too. Progressive web apps can deliver a seamless user interface and functionality similar to a native app, and their biggest advantage is that they load almost instantly thanks to Service Workers. Unlike native experiences, whose performance degrades on limited connectivity, PWAs keep working even on limited bandwidth. Offline web apps have previously tried solutions like AppCache, LocalStorage and IndexedDB, but with Service Workers, which intercept every network request and can serve responses from a local cache, PWAs get far more fine-grained control than the built-in browser cache allows.
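A minimal sketch of such a Service Worker (cache name and asset list are illustrative): it pre-caches an app shell on install and then answers requests from the cache before falling back to the network.

    const CACHE_NAME = 'app-shell-v1';
    const APP_SHELL = ['/', '/index.html', '/styles.css', '/app.js'];

    self.addEventListener('install', (event) => {
      // Pre-cache the application shell when the service worker is installed.
      event.waitUntil(
        caches.open(CACHE_NAME).then((cache) => cache.addAll(APP_SHELL))
      );
    });

    self.addEventListener('fetch', (event) => {
      // Intercept every request: serve from cache when possible, else network.
      event.respondWith(
        caches.match(event.request).then((cached) => cached || fetch(event.request))
      );
    });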

What is so special about progressive web apps?

PWAs can be added to the home screen quickly via install banners, making them easy to launch. Since PWAs use HTTPS by default, data transfer is secure. Being responsive, they work across screen sizes, including tablets and smartphones. Another unique aspect is that although PWAs run inside the browser, they can receive push notifications even when the browser is closed, which helps re-engage users through web push.

And unlike a conventional app, a PWA does not need to be listed on Google Play or the Apple App Store at all. Some other supported features are clipboard access, launching in full screen, persistent auto-login using the Credential Management API, receiving intents, file-system access, reading user-selected files, and hardware-accelerated 2D/3D graphics (using Canvas/WebGL).

How does it work?

PWAs are "app-like" web sites hosted on a web server. They include a manifest file (manifest.json) containing metadata such as icons and orientation. You can point your browser (currently Opera, Firefox and Chrome for Android support PWAs) at the web site and "save it" to your home screen; browsers that do not support PWAs simply show a web site. Simple! The manifest can define whether the PWA opens without the browser UI, while a Service Worker lets it work offline.
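Wiring this up is small. A minimal sketch, with file names and manifest values assumed for the example: the page links a manifest and registers a Service Worker from its main script.

    // In the page's HTML:  <link rel="manifest" href="/manifest.json">
    // A typical manifest.json might carry (values are illustrative):
    //   { "name": "My App", "short_name": "App", "start_url": "/",
    //     "display": "standalone", "orientation": "portrait",
    //     "icons": [{ "src": "/icon-192.png", "sizes": "192x192", "type": "image/png" }] }

    // In the page's main script, register the service worker (path assumed):
    if ('serviceWorker' in navigator) {
      navigator.serviceWorker.register('/sw.js')
        .then((reg) => console.log('Service worker registered, scope:', reg.scope))
        .catch((err) => console.error('Service worker registration failed:', err));
    }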

Pros for PWAs

  • No need to “download” and install an app!
  • No need to make multiple apps across multiple platforms (huge saving of time and effort)
  • Reduced development time
  • Quicker reach to market
  • Quick to download and execute
  • Conserves space and resources
  • Safe – always uses HTTPS (secure)
  • Uniform customer experience across devices
  • Seamless cross-app functionality – sharing information between apps and switching between them
  • Simple to update. In fact, no “need” to update. Happens seamlessly behind the scenes.
  • Increased user engagement because of simplicity and ease of access
  • Can share apps using a LINK. No need for an app store
  • Search engines can “find” the app using the manifest.
  • Works on low quality networks!
  • Less intrusive than native apps! (Good for “privacy” concerned users)

Cons of PWAs

  • Does not provide complete access like native apps (like access to sensors, bluetooth, NFC and other hardware functions)
  • Competition from Hybrid model apps which give features similar to PWAs and native apps
  • Inexperience architecting PWAs. Few players have the execution and delivery capability
  • Limited Open Source frameworks supporting PWAs, makes it more challenging
  • Offline mode makes it difficult to capture analytics
  • Difficult maintaining updates
  • Not all browsers support it
  • Some native device functionalities not supported
  • iOS, which is roughly 50% of the US market, still does not support the entire PWA feature set; Safari's Service Worker support is recent and limited, and web push notifications are still missing.
  • Native features not supported – telephony, flashlight, atmospheric pressure, alarms, contacts, calendar, browser bookmarks, logs, system settings, registering app to be able to handle custom URLs, protocol or file types.
  • Unlike a google play or app store native app which undergoes an audit/inspection, PWAs may not arouse a similar confidence in terms of legitimacy of the app

Progressive vs Native apps

PWAs do not need to be "downloaded"; for Android users it is instant accessibility, unlike native apps, which lose customers to the "download, click install, then open app" sequence. PWAs are also more "discoverable" by search engines. As of now,

Progressive vs Instant Apps

Instant apps provide much the same proposition as progressive apps: click and use the app instantly. Both are backed by Google, but unlike a progressive app, an instant app doesn't run inside a browser; it is built entirely on the Google Android platform, although you can still use the app without downloading it. Given the similarity in behaviour, the questions most tech heads tend to ask themselves are:

  • What is it that an Instant App provides us, which a PWA does not?
  • Does an instant app run on Android and iOS? Does a PWA?
  • What if you go mobile web?

Should your business make a switch? What are the benefits?

Alibaba, Flipkart and Twitter are some major players who have already adopted the progressive web app paradigm. Features available to PWAs today include geolocation, battery status, screen orientation, compass and gyroscope, vibration, and camera and microphone access via the MediaStream / Image Capture APIs. For most applications this is a rich enough set of native features. Some browsers are also working on support for NFC, accelerometer, magnetometer, light sensor, proximity sensor and Web Bluetooth; in short, we should be there soon!

A PWA-first strategy is usually the more rewarding choice, unless you have the skills, budget and intent to build native apps. In that case you end up building native apps for iOS and Android plus a browser-based web app, and supporting all three interfaces; the challenge is keeping them in sync through every release cycle. Existing app-development companies may make the switch, but a huge majority of current traffic still comes from native apps (as of 2016), so there is no urgency for these companies to move to the PWA paradigm.

Companies have adopted cross-platform development methodologies, and for newer products developing a PWA is a no-brainer! Some facts to remember: Android users form the majority, and if you also need a desktop (browser) based app, building a PWA makes even more sense. Even iOS users saw an 82% increase in conversions when Alibaba switched AliExpress to a progressive web app. Building a separate native iOS app alongside a PWA can also work, while a native Android app may add little value if you do not need native features that PWAs lack.

From a digital marketing perspective, a PWA also makes more sense given its discoverability by search engines and the sheer reach of mobile browsers. The ability to share apps as links makes them more "shareable" and a more marketable proposition.

]]>