Jumia Tech Blog

Discovering Behavior with BDD

2022-12-19T00:00:00+00:00

Discovery, BDD’s foundation

Behavior Driven Development is an approach to software development that relies on creating a common understanding of the requirements through the use of real use case examples, thus ensuring that we are building the right thing.

Arguably the most important practice of BDD to reach this shared understanding is the Discovery part. In our last post, we presented what is BDD, how it works, and how we implemented it with our team at Jumia. This time around, we will deep dive into the Discovery practice and how we apply it, what are the benefits, and what are the problems that it solves.

People occasionally tend to compare BDD with TDD, leading to the common misconception that as long as you are implementing (business-oriented) tests before actual coding, then you are practicing BDD. That’s actually a small part of it. Before you start formulating and implementing your acceptance tests, you have to understand the needs of your end user and how you are going to deliver them the best solution. This is basically the core of BDD: the bridge between business requirements and technical specification starts here.

As with all Agile practices, there are many ways to apply BDD Discovery depending on the reality of your business, company, team, etc. While we based ourselves mostly on The BDD Books, we will explain what is our approach to Discovery within our context, namely, what tools we use, how we organize and name the sessions, who participates, etc.

To sum up, Discovery should be a collaborative activity that is capable of picking a user story and exploring it until a shared understanding is reached. The main technique used to achieve this is called Example Mapping.

Example Mapping

The Example Mapping consists in getting the team together to discuss the business requirements of a story using rules and concrete examples represented by cards of different colors placed in a visual board:

User story: Represented by a yellow card, can be split and reorganized during the course of the example mapping.
Rules: Represented by blue cards, they are statements that bring a specific business requirement, similar to Acceptance Criteria.
Example: Represented by green cards, they illustrate each rule by describing a concrete case of a user interacting with the application.
Loose Ends: Represented by red cards, any questions or assumptions that appear during the discussion and need to be addressed later.

To start, have the Product Owner (or someone that was involved in the topic analysis) give a brief explanation of the user story and guide the discussion. It doesn’t matter the order to achieve the final goal, sometimes the rules can be defined first or even pre-defined, and sometimes the story is vaguer and it makes sense to start exploring the examples of usage first, and then identify the rules based on that. After a consensus is reached for each rule you should move to the next one, there is no need to explore a lot of different examples and corner cases (there will be time for that), just enough for everyone present to be on the same page.

In the above example, a seller trying to create a consignment order has to follow specific rules regarding the products that can be included in his request.

You can see that the first example of the first rule is a simple success scenario where they import a valid product and everything goes well.
After this first example is discussed, it also becomes clear to everyone that in case it was an invalid product, there would be an error and the consignment wouldn’t be created, so the team finds no need to write this example now.
However, one of the team members ask: “What if they import a lot of valid products and only one invalid, should we fail the entire request?”. The team decides it’s better to still create the consignment with the valid products, and generate a report to check the failed items, so they write the second example.
This behavior is replicated for the other rules as well.
Then another person says: “But let’s imagine the other way around, they import a file with lot’s of invalid products and only one valid, in this case, I think it wouldn’t be worth it to create the consignment with just one product, the seller would prefer instead to review the errors and try again after validating their products.”
Someone then suggests: “What if the seller was able to choose an import mode, one that would succeed with any valid product and another that only succeeds if all products are valid?”. Seems like a good idea, but the team doesn’t have a definitive answer for it now, and it also seems to be out of scope, so the team writes a red card with this loose end to be discussed in the next session.

How Discovery was incorporated in our process?

Before BDD, we already had a well-established process in place based on Scrum, but that didn’t mean that we had to abandon the benefits of any of our previous practices, we just needed to adapt a little bit.

At the beginning of a topic, instead of having a big meeting in order to define, discuss and estimate all the user stories, we broke it down into smaller sessions where we would perform the example mapping for one or a couple of stories. We called them Business Workshops. We found that Business Workshops also replaced the need for “Grooming” when combined with a Tech Workshop, which is also story-oriented, a moment when you can discuss more technical aspects and solve loose ends that came from the Example Mapping if need be. Those workshops are done ideally several times a week, on-demand, and they are not meetings but rather working sessions as the name suggests.

Who participates?

To enjoy Discovery benefits the most, the whole team should ideally participate, but if not possible, we should have at least what we call the “Three Amigos” (business, development, test). Having people with different roles in the session brings different perspectives to the table. For example, a Product Owner might be more focused on fulfilling the business goals, while a developer will think about the technical implications of the feature, while a tester will challenge the feasibility of the rules. That doesn’t mean that those 3 are the only roles that can participate in those sessions, if there are more roles in your team (ex. UX/UI) for sure they will bring value to the discussion and also have more context to do their work correctly afterward. Remember that BDD is a bridge between the business requirements and the software itself, so it’s fundamental that someone from the business is involved at least on the Discovery part.

What tools do we use to accomplish the final result?

For Example Mapping, we use Miro, a platform for visual collaboration where you can build a virtual board. If your team doesn’t work remotely, you can even opt for a physical board with post-its. The result is then moved to each user story on JIRA with the following format:

This will serve as the base for the next step of the BDD process, the Formulation of acceptance tests.

Results / what changed / what benefits we observed

After some time of practicing Example Mapping, we observed these main benefits:

Shared Understanding: First of all, it’s faster to reach a shared understanding. It’s easy to get everyone on the same page about your requirements when you direct the conversations using concrete examples and have your team thinking about the goal from the user’s perspective. Not only that, but shared understanding also means that your team will be progressively creating an ubiquitous language, multiple business terms that might mean the same thing will start to be standardized, while terms that have multiple meanings will be broken down into different ones. This will help eliminate confusion as everyone can quickly understand what we are talking about from a specific term.
Quality of Conversations: Another nice benefit that we noticed was that the quality of our conversations got better, and not only during Example Mapping, but every other conversation/meeting tends to have a clearer scope and less entropy, mostly because we learn to separate business conversation from technical conversation.
QA Shift Left: Last but not least, there is the shift left on QA. This is a movement where testing should be thought and done at the early stages of development, because the sooner you detect problems with your code, or even with your requirements, the sooner you can react to it and prevent it from reaching later stages of the lifecycle, where fixing them would cost more. Our team already had this mindset from a tech perspective, but BDD Discovery takes it further by involving the QA in business discussions from the very beginning of the process.

Tips

To close up, here are some tips that we find very helpful to follow while doing Example Mapping:

Keep the sessions short (max. 30 minutes) to maintain engagement.
The Example Mapping for a specific story should obviously be done before the development starts, but not so much in beforehand to avoid forgetting what was discussed. If you use Scrum, try to be one sprint ahead (Works well with 1 week sprints).
Write the examples in a free language, don’t use Gerkhin at this point or any other way to structure the written language. The goal is throughput, i.e., reach the shared understanding quickly. The Formulation step is where the scenarios will be transformed into a structured language.
Avoid discussing technical aspects.
Every loose end should have an owner to follow it up afterward.
Bring some pre-defined candidate rules. During the discovery they can be changed, new ones can be found, and the story can be split into multiple ones, but it helps to have a base for the discussion.
Do not bring pre-defined examples though, it takes away the goal of collaborative thinking.

Resources

The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World

2022-10-19T00:00:00+01:00

“The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World”

Data is everywhere, and it opened the opportunity for the usage of advanced analytics and data based processes, through Data Science, with the adoption of algorithms, frameworks, technologies based on Machine Learning.

Even, the adoption of ML increased only in the last years, the term was firstly used in 1959 by Arthur Samuel, an IBM employee and pioneer in the field of computer gaming and artificial intelligence.

Companies are increasing and accelerating the adoption of machine learning, using it in different areas such as agriculture, astronomy, banking, health, climate science, etc, in a wide range of applications.

As a data driven company Jumia is also doing a lot of projects of Data Science, lead by Adrien Devaux, in areas such as recommendation, delivery times prediction, and many other applications, supported by other teams namely Data Engineering and Big Data team.

Taking this into consideration, my very first book proposal is: “The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World”, written by Pedro Domingos, and published in 2015.

Pedro Domingos is Portuguese, and he is a Professor Emeritus of University of Washington, in the Paul G. Allen School of Computer Science & Engineering.

This book was recommended by Bill Gates, as one of the two books that we should all read to understand AI.

“The Master Algorithm” is composed by 10 chapters: The Machine Learning Revolution, The Master Algorithm, Hume’s problem of induction, How does your brain work, Evolution: natures learning algorithm, In the church of Reverend Bayes, You are what you reason, Learning without a teacher, The pieces of the puzzle fall in place, This is the world on Machine Learning.

Pedro Domingos states that “All knowledge — past, present and future — can be deduced from data by a single, universal learning algorithm”. This universal learning algorithm is what the author refers to as The Master Algorithm.

The main approaches in ML are supervised, semi-supervised, unsupervised and reinforcement learning, using algorithms such as Artificial Neural Networks, Random Forests, Regression… So the idea of having an universal learning algorithm is disruptive.

“The Master Algorithm” is a book for anyone, really for anyone. This is an exploration of concepts and ideas that will help you understand what your data is used for and how to better control what you share, and the ones that recently arrived in the field will learn about new algorithms, projects and problems.

Immutable Infrastructure & GitOps

2022-09-21T00:00:00+01:00

Introduction

Mutable Infrastructure: Mutable simply means ‘changeable’ or ‘customizable’. This means that you can login into the server and update configurations in place. Engineers and administrators working with this kind of infrastructure can SSH into their servers, upgrade or downgrade packages manually, tweak configuration files on a server-by-server basis, and deploy new code directly onto existing servers.

Immutable Infrastructure: Servers are never modified after they’re deployed. If something needs to be updated, fixed, or modified in any way, new servers built from a common image with the appropriate changes are provisioned to replace the old ones. After they’re validated, they’re put into use and the old ones are decommissioned.

Benefits and Limitations (Mutable Infrastructure)

Benefits:

Totally customizable and mutable = Precisely fits the application specific needs
No need to deploy new servers with new changes
Faster updates

Limitations:

Manual changes = More error prone + More time consuming
Possibility of the existence of undocumented changes = Inability of version tracking and debugging
Servers have inconsistent configurations = Configuration Drift
Harder to troubleshoot/reproduce errors
More risks during update/upgrade

Benefits and Limitations (Immutable Infrastructure)

Benefits:

No room for configuration drift = Easier testing
More consistency and reliability in infrastructure = Higher up-time
Simpler and more predictable deployment process = Lower risks during update/upgrade
Discrete Versioning and Easier Tracking
Easier rollback to older versions

Limitations:

Infrastructure can’t be modified in place
Necessity of externalizing data

Why does DevOps use Immutable Infrastructure?

Immutable infrastructure is used most by DevOps. It is affordable and easy to create new servers using modern DevOps. If there is any room for updates or improvements in immutable infrastructure, you need to replace the whole server.

Immutable infrastructure is strongly correlated with the concept of infrastructure as a code. Infrastructure as a code allows you to do all the planning of the components like instances, networking, and security. And then you can push it into your dev environment. As you promote those to the dev environment, test and prod, it is quite easy to repeat these steps in a consistent manner with immutable infrastructure. This will also make sure that no matter what environment the app developers might be in, there will always be a consistent environment. They don’t need to worry about it while deploying applications.

For all these reasons, immutable infrastructure is highly associated with the DevOps practice. DevOps includes culture and tools that work to achieve agile development with continuous delivery. And continuous delivery drives immutable infrastructure.

Immutable Infrastructure Tools

There are plenty of tools that leverage the concept of Immutable Infrastructure. These tools include:

Packer
Terraform
Docker/Kubernetes

Packer: free and open source tool for creating machine images for multiple platforms from a single source configuration. A machine image is a single static unit that contains a pre-configured operating system and installed software which is used to quickly create new running machines. Machine image formats change for each platform.

Terraform: infrastructure as code tool that lets you define both cloud and on-prem resources in human-readable configuration files that you can version, reuse, and share. You can then use a consistent workflow to provision and manage all of your infrastructure throughout its lifecycle. Terraform can manage low-level components like compute, storage, and networking resources, as well as high-level components like DNS entries and SaaS features.

Docker/Kubernetes: containerizing tools that help provision your infrastructure in the form of containers instead of VMs. This helps more in enabling an immutable infrastructure as creating/destroying containers is much easier and faster than spanning new VMs.

What we have been using at Jumia:

Packer to build our AWS AMIs that we use for our EC2 instances and EKS nodes as a base OS. These AMIs contain the initial steps and data to bootstrap our servers.
Terraform to build our AWS infrastructure like EC2 instances, RDS, Route53 records, and so on. This helps us avoid any manual changes to our infrastructure, and makes it documented.
EKS to build our AWS Kubernetes clusters to host our applications for different environments and verticals. This helps us more in managing our applications lifecycle and debugging/troubleshooting them in case a problem occurs.

GitOps

GitOps is code-based infrastructure and operational procedures that rely on Git as a source control system. It’s an evolution of Infrastructure as Code (IaC) and a DevOps best practice that leverages Git as the single source of truth, and control mechanism for creating, updating, and deleting system architecture. More simply, it is the practice of using Git pull requests to verify and automatically deploy system infrastructure modifications.

GitOps takes full advantage of the move towards immutable infrastructure and declarative container orchestration. In order to minimize the risk of change after a deployment, whether intended or by accident via “configuration drift” it is essential that we maintain a reproducible and reliable deployment process.

GitOps ensures that a system’s cloud infrastructure is immediately reproducible based on the state of a Git repository. Pull requests modify the state of the Git repository. Once approved and merged, the pull requests will automatically reconfigure and sync the live infrastructure to the state of the repository. This live syncing pull request workflow is the core essence of GitOps.

To achieve a full GitOps install, a pipeline platform is required. Pipelines automate and bridge the gap between Git pull requests and the orchestration system. Once pipeline hooks are established and triggered from pull requests, commands are executed to the orchestration piece. The GitOps methodology has several advantages:

Single source of truth. All code is on Git, so there is a single source of truth for the infrastructure or platform.
Easy rollback and fast recovery. All changes to the infrastructure are version controlled and reviewed so they are easy to roll back if necessary.
Enhanced security. All changes are pulled and applied by automation; this limits manual access to the infrastructure and thus improves security.

What we have been using at Jumia:

Atlantis to migrate from old self-managed Kubernetes clusters to new AWS EKS clusters. Atlantis makes use of Terraform to compare the infrastructure state of the current infrastructure and the infrastructure changes on Github. It can then show the differences and apply them if the user approves them.

Behavior Driven Development at Jumia: How it started!

2022-06-27T00:00:00+01:00

This is not what I asked for!

Verifying that software does what was intended to do in the first place is one of the main challenges in software development. Indeed, several things can go astray while developing software — requirements may be interpreted in different ways by different people, use cases that were not thought of may appear and, as new features are implemented and older ones are changed, it may become difficult to verify that everything keeps working as it should. Furthermore, in modern software development, everything keeps changing very fast, that is, today’s priorities are not the same as tomorrow’s and what’s valuable today may not be valuable tomorrow. If teams are not prepared to cope with fast change, they can easily become overwhelmed, inefficient and, ultimately, frustrated.

Agile software development helps reducing the risk of building the wrong thing (and increases the chances of building the right and most valuable thing first!) by shortening the feedback cycle. Normally this is done by slicing a project into a fixed set of cycles called iterations. These, produce feedback that we can use to steer and adapt the project towards the right direction. However, more often than not, this is implemented disguised as waterfall, for example, teams start by developing a feature and, after the development is done, they test it to make sure they got it right. When a mistake is discovered they have to go back to development, fix the mistake and test the whole thing again. Yes, most of the times these tests are part of a manual regression that can take hours or even days!

TDD (Test Driven Development) speeds-up the feedback loop since tests are written before the implementation and can be used to promptly validate that it is working as expected. This is great! However, these tests are technical-facing, that is, they cannot be understood by most business people. Hence, there is no easy way for them to validate the behavior before the implementation — the risk of mis-implementing requirements is still quite present.

BDD (Behavior Driven Development) is an Agile approach to software development that was originally created to make tests readable by business, i.e., as a mechanism business people could use to verify the expected behavior is implemented. The authors of the fantastic book The BDD Books: Discovery [1] explain that BDD was the missing link between requirements and the software itself: It acts as a bridge, connecting requirements to software. The bridge is made out of concrete examples and represent parts of the required system behavior. The thing about these examples is that both people and machines can understand them, i.e., they are written in plain English but, at the same time, they can be used as automated checks. Each example is a check and is often written using Given When Then keywords. These are called scenarios.

In BDD, the requirements are defined collaboratively by Business and Tech team using the aforementioned examples that can be understood by anyone and reduce the risk of ambiguity and misinterpretations. Furthermore, since these can be used as automated checks, they can continuously verify that the behavior of a system stays the intended one as the it goes through required changes.

In sum, BDD greatly helps delivery teams build the right thing and making sure it continues to behave as intended as the system evolves.

Interesting! How does BDD work then?

Well, the BDD cycle starts when the delivery team discusses the behavior of a determined user story for the first time. The team gathers and, collaboratively, starts discovering the behavior of a functionality by collecting examples of how the system should work. E.g., let’s say the team needs to implement a feature that automatically suggests tunes for the listener in a music streaming platform. Examples (Green cards) to describe that behavior could look like:

After the examples are collected the team formulates the examples into scenarios. We want these scenarios to drive the development, i.e., the implementation of the behavior. These are executable specification that belong to the application codebase and are used to continuously verify the system behavior. The scenario for the examples above could look like:

The above scenario is written in Gherkin (Given, When, Then keywords) which is one of the most used languages for behavior specifications.

Finally, these scenarios can be implemented as automated checks by the delivery team using a cycle similar to TDD. In TDD, we write a failing test first and then we make it pass by implementing the application code. In BDD, we implement a failing scenario first, representing a portion of behavior, and then we gradually implement the application behavior until the scenario passes. BDD and TDD can work alongside. In fact, the BDD cycle includes the TDD cycle inside as shown by the following picture.

In sum then, BDD is composed by three different practices done in order [1]:

Discovery: Structured and collaborative activity that uses concrete examples to uncover the ambiguities and misunderstandings that traditionally derail software projects. The goal here is for the team to reach a shared understanding of the behavior to implement.
Formulation: Process that turns concrete examples produced during discovery into business-readable scenarios. Subsequent review of the scenarios gives confidence that the team really understood what business asked for.
Automation: Code is written that turns scenarios into tests (automated checks). These give confidence that the application is doing what business intended to do, works as a safety net when code needs to be changed and act as living documentation.

Sounds good! How did you start with BDD anyway?

One of Jumia main businesses is the marketplace platform. This platform has two main clients: the sellers — who sell goods — and the customers — who buys goods. One of the seller facing projects at Jumia is SellerCenter, a platform where sellers can manage their catalog, orders, and finance.

This is a complex and legacy monolith that had already passed through many teams during its lifetime. When our team first started working on it, several problems were faced. For example, the software had a lot of defects requiring intensive support, the documentation was extensive but confusing, and outdated on many points. Most of the times, when we needed to give support to an existing feature, or even implement a new one, there was a lack of business and tech knowledge about what already existed, and we spent a lot of time and effort in investigations, trying to backtrack the business behavior by looking at existing code.

To try to tackle all these problems, we started covering every new development with Acceptance Tests, as well as catching up on the more important already existing features. We were using an internal test framework based on Gherkin that had several pre-built steps for things like payload manipulation, performing requests, asserting responses, and many other useful actions. These were valuable because we were able to effectively reduce the number of defects and develop with more confidence, but they were still too technical facing. Along with the fact that they were “formulated” during the implementation itself, without collaboration from the whole team, they eventually became hard to read and maintain.

Another problem that we would constantly face was: meetings. We followed a structured and organized process based on Scrum, with all the common practices and ceremonies, and most of the times we found ourselves stuck in long meetings, with out-of-scope conversations where communication easily went off-road. We would constantly blame this on the size of the team, which sometimes reached up to almost fifteen people, but slowly we started to realize it was not only that.

To promote continuous learning, one of the common practices some teams use at Jumia, is the book club. Teams choose a book to read and, chapter after chapter, they take notes, share and discuss what they have learned. Typically, this learning is then incorporated in the team process and routine. We came across BDD in one of our book clubs and saw it had a lot of potential to help us overcome the problems we were facing at the time. We started with The BDD Books: Discovery [1] followed by The BDD Books: Formulation [2] and slowly started to incorporate what we learned into the team’s practices and routines. E.g., we rearranged the team routine so that user stories could be discovered using the example mapping technique (a collaborative activity where a user story is explored through concrete examples), and added the formulation step where a subset of the team (three amigos) would formulate the examples discovered by the team into test scenarios. Finally, we also started slowly converting our existing acceptance tests into business-facing ones.

So, what have you observed with these changes?

After a while of implementing BDD practices in our team, these are the main points that we noticed improvement on:

Structured Conversations

We saw a huge improvement in the quality of our meetings. They became much more objective, following a logical progression between the moments when we wanted to discuss a high-level business view as opposed to the moments when we wanted to deep dive into technical details. This was mainly due to the inclusion of the example mapping technique which helped the team focus on the business or behavior details while deferring technical details for later.

Business Facing Tests

As mentioned, one of the things we started doing was refactoring our acceptance tests in a way that business people could understand them. The following images show the before and after. The team was able to leave behind all the technical terms and think like someone who will be using the application without knowing any technical details in the simplest possible way.

Before:

After:

Although we still didn’t know much about formulation practices (for example using “I” instead of a role or persona), we already saw a huge improvement in readability and in the general understanding of the requirements.

As you can see, the tests are much cleaner, allowing the readers to understand what is the expected behavior of the feature, while all technical details are hidden behind the implementation.

Living Documentation

Traditional documentation usually has some big problems with becoming outdated, complex readability, and not being accessible to non-tech people. By writing feature files in a natural language that describes the requirements of how the application should function, we can have dynamic documentation that not only provides technical knowledge for the engineers but also brings business value to the current system behavior. With the new approach, stakeholders can now review these feature files anytime they want to assert a current feature behavior, which is very useful when giving support or when thinking about new features.

Only implement what is really needed

Another improvement that we were not really expecting to happen was the amount of work that ended up NOT being done. This might seem strange, but with the way that we started slicing the stories and understanding each rule that came with them, we were able to isolate the most important rules and develop them first, while deferring less important rules. They stood in our backlog to be tackled after we finish an MVP, but most of them ended up never being needed again. Taking one feature as an example, 22% of the total user stories were not implemented at all.

Reduced Escaped Defects

Although we already had a low ratio of defects due to the high test coverage with both unit and acceptance tests (tech-oriented), using BDD practices brought even more improvement in that matter. We shifted left QA by involving them in the early stages of discovery and formulation, and we also had the product owner review the scenarios during the formulation stage, guaranteeing that we were implementing the expected behavior. As an example, we saw a whole feature composed of more than 60 user stories with only 1 minor bug reaching production.

Conclusion

BDD is a great way to explore the intended behavior of a feature in a collaborative fashion, greatly reducing the chances of accidental discovery while also ensuring that your software keeps the desired behavior as it evolves. Although full benefits are only achieved if the three BDD practices are adopted, each one gives a set of unique benefits on their own. Hence, if you want to give BDD a try, adopting Discovery is already a good first step.

In the next stories we will zoom in on the first BDD practice: Discovery. Stay tuned!

References

[1] The BDD Books: Discovery
[2] The BDD Books: Formulation

Authors

Software testing: the seven truths you have to deal with!

2020-02-01T00:00:00+00:00

Introduction

Humans develop software and they do it with a purpose. Software will do what it is programmed to do, through coding. However, with the increasing complexity of the systems developed, testing activities have become inevitable and an essential piece in software development.

Testing is not a new activity, it is a creative and intellectual challenging task like code programming and they should walk side by side.

From real practice and research for testers it was written “The seven fundamental principles of software testing”. They are intended to be the main guidelines for you (tester!) or person that performs testing (that is not the tester!). They help to effectively use your time and effort to discover defects on a software which is under testing.

As an experienced tester, you will understand every principle as true and, certainly, will recognize yourself in them on your daily job. Sometimes, you might get trapped on testing in some way. Refreshing and learning new concepts will help you to do better your job.

Each time you revisit these principles you tend to see them with your acquired experience and learn something new. I invite you to follow me on this article where we will visit the “The seven fundamental principles of software testing”, context them with my experience and learn how to deal with them on your daily job!

1. Testing shows presence of bugs

Testing activity has two main purposes:

check if code built by developer makes what the requirements asked by stakeholders;
check that what was developed did not break the expected behaviors already implemented.

On this process, one of our goals while testers (there are too many, I can write another article about this topic) is to find bugs, report them, so they can be fixed before software’s release. However, not finding bugs does not tell that software is bug-free.

Imagine we have a website which has two features implying popup display. One opens an informative popup; the other one shows a newsletter subscription popup each time no one subscribed users make three actions on our page. These two features were tested without any issue to report. Some weeks later, after some exploratory testing, we have found that if the two popups are shown simultaneously the user is not able to close the first popup. Here, we have an example that the fact that bugs are not found doesn’t mean our software is bug-free.

2. Exhaustive testing is impossible

The second principle tells us that it is impossible to perform exhaustive testing. But what is exhaustive testing? Exhaustive testing means that you will test all the functionalities using all valid and invalid inputs and preconditions. So, is this easy to happen on your daily job? Yes, for the trivial cases; not for the majority.

I give you an example: imagine you have to test the email’s field on an e-commerce registration’s form. To perform an exhaustive testing you would have to test with many combinations like: latin/chinese/cyrillic/other letters, with @, without @, special characters like “,). #$%&/()=, and so on. If you want to go deeper on this topic there is an organization called IETF (Internet Engineering Task Force) that develops and maintains the Request for Comments (RFC). These documents specify the standards that will be implemented and used through the Internet. For this specific case of email address validation, the RFC is the RFC 5322.

Even if you have some available time, it would not be possible to validate all possibilities because of:

project deadlines
available time
human resources
the amount of outputs
too complex design

To overcome the exhaustive testing, your tests should be designed and executed according to the involved risk, priorities, time and business context. Imagine, on previous example, that the software is going to be used in the United Kingdom. Some of the combinations for email address input form will be discarded like Chinese and Cyrillic characters and testing will be less exhaustive. Be effective, brief and precise, and in the end, imagine that you are an end-user of your software.

3. Early Testing

More than only testing, after developer delivers his job, you can make something more valuable: to anticipate incoherence/failure/information missing.

After product stakeholders and product owners, you are the person that knows your product better. With that knowledge you can anticipate incoherence/failure/information missing since the first product development phase, in other words, product owner has a clear vision about business needs. Groomings/refinements meetings are supposed to prepare and aggregate as much information as possible, refine the task’s information before developers start their activities. One incoherence/failure/missing is not anticipated here and it is only discovered after development and/or release, the cost will be highly increased. To avoid this you can:

get in touch with the task you will test later and prepare in advance your test strategy;
check if any existing feature can be affected by the new one;
try to locate the lack of information about the feature;
leave some notes to refresh the developer’s mind in what he should not forget to validate when he starts to code;
prepare a set of test data for test designed test conditions.

So, it is always easier and cheaper to fix the issues from the beginning rather than changing the whole system if that incoherence/failure/information is missing and bugs are found later.

4. Defect Clustering (Aggregation)

Normally, bugs are not often distributed evenly throughout an application. Defect clustering (or aggregation) on software testing is based on Pareto principle, that tell us that 80 percent of bugs are due to 20 percent of code!

There are some facts that can contribute to this principle, such as:

legacy code
different teams touching the same part of code
features that may suffer many changes
fickle 3rd-party integration

Whatever the reason is, the whole team should always try to improve the ability to identify the defect-prone areas of the application without forgetting other areas.

A note here: on purpose, I above referred whole team because I consider that this principle evokes something I truly believe while tester: a product quality is a team’s responsibility, not only a tester’s responsibility.

5. Pesticide Paradox

On agriculture, when the farmer applies a pesticide he intends to combat a type of plague. However, if he always applies the same pesticide he will combat the same plague and another one will remain there or will appear new ones.

The same approach can be applied on software testing, calling “Pesticide Paradox”. If your test suite is not updated with new tests or if you repeat the same set of tests, eventually they will no longer be able to find new bugs. To overcome this “Pesticide Paradox”, you can use some tips such as:

Regularly review and, if necessary, correct the tests already existent;
Refresh yourself with a pause and certainly new ideas will appear on your mind to create new tests; every time you have new ideas, write them down no matter where you are; it has already happened to me, having a task on a day and, at home, appeared in my mind new ideas in order to deal with a problem;
Ask another tester’s opinion, even if he/she is a less experienced tester than you, to review your tests or ideas. He/she can give you new ideas or only talk about something that will awake and refresh your brain;
The opposite is also valid: if a tester requests your opinion do not refuse it: you will learn together;
In case you are a tester dedicated to a specific area of a product, dare you to explore other areas in order to discover new bugs; you will know more about the product too.

6. Testing is context dependent

Products and projects have different features and requirements. Hence, testers can’t apply the same test approach for different projects.

For example, a critical system software, such as medical or transports, needs more and different testing than an e-commerce website or a videogame. A critical system software will require risk-based testing, be demanding with industry regulators and specific test design techniques.

On the other hand, an e-commerce website needs to go through rigorous functional and performance testing; for example, it should not have errors on calculations on order’s placement and should load and work quickly on a promotional campaign.

Some days ago, I made an order on one popular Portuguese online store specialized in household appliances and technology products. It was not the first time that I used that online store, until now without major issues. After ordering with success,the message of the order was not appearing on the my order list. After some days without any news about it I decided to get in touch with the Help Center that told me, by unexplained reasons, the order had been lost on their order processing system. Then, it was made a refund. Imagine if this order was an airplane! The damage caused by a lost order is incomparable with a lost airplane.

7. Absence of errors — fallacy

Imagine you are in a project that has a rigorous and complete test procedure with all test levels, the software is good, technically speaking, no bugs are found and it is released to the production environment. If that software doesn’t meet the user’s needs it will never be used by anyone. It is crucial to know our users and build the software based on their needs. This work should start very early, on requirements definition stage, involving the business people that knows better the user’s needs.

Conclusion

These are “The seven fundamental principles of software testing” I hope this reading helped you to think about them and learn something new. Each time I visit them gathering with my experience I get a better understanding about these guidelines. I invite you to make the same exercise more often!

Not your kind of automation

2018-09-17T00:00:00+01:00

Introduction

Oh no, another post about automation? Well, kinda.

When we talk about automation in our industry we first think about deploys, continuous integration, delivery and so on. But, I won’t be talking about that today, it’s, let’s say, another level, important part of it. How many times you’ve done a very boring repetitive task that takes your attention for quite some time and you get frustrated by not having time for doing the cool stuff?

…

Yeah, that’s right, I know the feeling.

Background

Inside Jumia, everyone knew that I had a huge set of scripts that did a lot of different things. Although that is good, the scripts were for my usage “only”. So the idea of sharing them was born. To make it happen, we thought of implementing a basic but powerful UI, which allows, execution of such scripts easily and be very user-friendly. And so, we created JDash.

(At the end I share JDash stack for the curious ones.)

Why does it matter

You can improve everyone’s work by doing so, from Developers to Top Management. The sky’s the limit. Also, you motivate your team by doing something different, you empower them, you make them more noticeable, and everyone can be involved in the tool/product evolution. So, how do I say this, our industry is full of talent, but, we still do not automate our most common tasks. And, honestly, we all are the ones to blame. We do tech, so why don’t we take the time to improve our workflow by creating something that eases your and everyone’s life?

Real example

Let’s show exactly how JDash began.

As you all know, we build e-commerce for several countries in Africa. For each, we have a backend and it’s own database, thus it’s specificity, like configurations. The first real scenario and feature of our tool was a simple Configuration Search. Imagine, having 12 countries, so, 12 databases/backends, where you must log into, search for the configuration, and get the value, versus, selecting a configuration for a given set of countries at a click of a button in which within less than 5 seconds you have your results. All at once.

Ridiculous!!

Just think of how much time everyone saves with just this one feature.

Currently it also has some basic statistics, automate a hell load of features, it’s performant, has some easter eggs to be found and brings new technologies that can serve for something else, and, it won’t stop growing.

Below I share current major features:

Basic statistics;
Amount of new customers, new orders, sum of value of orders;
Average orders per minute;
Orders placed by device and payment method;
Configuration search and audit (like a history of changes);
Moving order status on test environments;
Several syncs and resyncs between systems, like items, orders, stock, suppliers and others;
Export of vouchers, customers, vendors and others;
Overview of what is deployed on each test environment with several details like, when, git revision and others;
Debugger for Android/iOS apps requests;
And much more…

Impact and how we made it possible

With such initiatives you get the attention of everyone, all the teams participate in order to improve what you’ve built, give feedback and suggestions. These kind of tools, empower people and make their life’s easier since every task is performed with efficiency and speed. On my team I do promote a simple Hackathon on every first day of a sprint. It’s here where new ideas come to daylight, just like this one did. So, I do encourage you to do the same. Even with multidisciplinary teams, keeping high performance, without losing productivity and not abandoning your roadmap, it’s possible to do it, just plan it, organize your days, delegate and lead, trust me, everyone can do it.

Conclusion

Apart from the obvious benefits of automating your daily, repetitive tasks, it has a huge impact on your team and your peers, and below are some key points:

Involvement;
Empowering;
Peer recognition;
Motivation;
Inspiration;

With it, other benefits also arise. The team is able to work on something new, imagine, eating the same thing every day, yeah, you will be tired of it and may never eat it again. The same happens with technologies, and, developers are thirsty for new things. But that’s not all, in this blog post, as Team Leader, I do want you, that have a leader role to involve your team members in the creation of tools that ease everyone’s job, do regular small Hackathons with them, use this opportunity to use new technologies, motivate along the way, let them grow, give them space to experiment, take risks and do mistakes. You’ll see that only good things will show up.

About Me

I’m Nuno Santos, started as a developer at Jumia in 2012 and have been invited to be a Team Leader in 2015 and founded Ninja Bytes along with it’s members. A team that I’m proud of, being bold and innovative, brings innovation all the times and is always one step ahead leading Jumia to success.

JDash stack

As promised here is the tech stack of JDash:

PHP
Laravel
Bootstrap 3
Vue.js
MariaDB
Redis (for queuing as well)
Algolia
NodeJS
Socket.io

Tsuru PaaS

2017-10-17T00:00:00+01:00

Tsuru PaaS

This article is about how a Platform as a Service, named Tsuru, allowed us to handle an increasing flow of applications, widely scale our infrastructure and reduce both costs and workload caused by our growth. With a short learning curve, Tsuru set a high and reproducible quality in production.

Scaling problem

Let’s say you are a new systems engineer at Jumia Travel in charge of site reliability. What would you do if you had to run in production 10 new applications every month? every week? every day? Would you cope with that amount of change without degrading production quality or jeopardizing your budget? A Platform as a Service (PaaS) may be an adequate solution, however finding the right trade-off in terms of cost, feature and learning curve is not trivial.

“Production-ready” is not just a matter of having your application running and reliable in a production environment. “Production-ready” implies you have defined processes and tools to deploy, run, monitor and change your application according to real world use-cases. For instance, for each new service you implement, you probably need at least the following:

continuous integration/delivery pipeline;
deployment process (e.g. continuous deployment);
system configuration management recipe;
system/middleware/application monitoring;
system/middleware/application logging;
high availability;
horizontal scaling or auto-scaling.

Implementing all these is a lot of work you may not have planned up front… Worse, you may even have to adapt those components according to the type of service - starting with the programming language. Overall, it won’t scale for a increasing flow of new services you wish to run in production.

Solution: increase service level

The table below compares service levels in cloud computing. It was the starting point to solve our scaling problem.

Most cloud-based companies now fully operate on IaaS. But, as you can see in the comparison table, IaaS misses a lot of automation in terms of service level that could be beneficial to scale. PaaS is the next and obvious step after IaaS to abstract away the underlying infrastructure so you can focus solely on your application.

You may be considering CaaS (Container as a Service) as a halfway solution between IaaS and PaaS. CaaS is merely a container-based IaaS enhanced with richer features. However, container-based PaaS usually includes CaaS capabilities, so it makes sense to go up to the PaaS level.

After reviewing PaaS solutions, we noticed that most of them support the previous list of services surrounding production environment (and not only production). But which PaaS should I get?

To choose a PaaS, you must decide first if you go for a public or a private one. To keep it simple, public means the provider manages the PaaS, private means you’re in charge. A public PaaS sounds nice, but it obviously implies a higher cost. In the end, we went for a private PaaS as we had to be sensible about our budget. However, since we would manage our own system, the PaaS solution had to be simple, with a short learning curve, and it had to also include all the features a modern PaaS should have. For these reasons we chose Tsuru.

Why tsuru?

Because it’s a simple and complete solution. Tsuru PaaS is what we installed and configured in our infrastructure to radically scale service integration to production at Jumia Travel, and we got started in a matter of minutes! Besides, it’s free and open source. So let’s dig a little into Tsuru.

Platforms

Tsuru supports many platforms, check this list, you’ll find most of the popular programming languages. Note that you can also instantiate backing services like databases (e.g. MySQL, MongoDB) along with your application. If you’re not yet satisfied with that current list you can either use Heroku buildpacks or easily create your own custom platform with a simple Dockerfile.

Workflow

Tsuru workflow is very straight forward. A single git push triggers the entire workflow to build, release and deploy application changes for users to enjoy. Have a look at the following figure:

IaaS/CaaS

Probably the most remarkable thing about Tsuru is that it aims to be technology agnostic, so that it can run on top of most current IaaS solutions and use different CaaS technologies. In our case we used AWS EC2 and Docker Engine, this means Tsuru’s own scheduler manages Docker containers. Other options for container orchestration are Docker Swarm (Tsuru 1.2) and Kubernetes (Tsuru 1.3). Even though Swarm or Kubernetes increase service level to CaaS, you can still benefit from Tsuru PaaS for higher level tasks (such as application deployment) that are not easily performed by CaaS.

Horizontal scaling

With Tsuru you can scale horizontally twice: at the host level (Docker node) and at the guest level (Docker container). This means each piece of application scales independently while the entire cluster scales according to the overall demand. This makes Tsuru extremely flexible and scalable.

Let’s get started!

To test the waters of Tsuru, we decided to begin by only migrating our increasing number of new, private microservices. By postponing the migration of our public applications and databases we could minimize the customer impact in the event of a Paas failure. Besides, we knew we would get into trouble to go to production on time, with very little budget allocated for each service’s infrastructure, due to service size.

Today, all of Jumia Travel’s private microservices run steadily on Tsuru, and until now we haven’t experienced any service failures. Performance-wise we have lost less than 5% application response time (primarily due to Docker virtualization overhead) but have gained many benefits that make up for it.

Jumia infrastructure is mainly hosted in AWS. For this reason, we tried to integrate as much as possible with AWS services, and the next sections reflect this strategy. Note that Tsuru integrates very well with other cloud providers that have similar services.

Tsuru components

Tsuru core components are written in Go. These are the main components and services:

Tsuru server
Gandalf (git server)
Archive server (git archive)
PlanB router
MongoDB
Redis
Docker Registry
Docker Engine

What makes Tsuru highly scalable is that each component of this list can scale independently of the others. For instance, Tsuru core components and Docker Engine naturally scale horizontally; Redis and MongoDB support master/slave replication, and you can directly use AWS Elastic Container Registry as a Docker registry to manage your application images. With Tsuru you can start with a very small infrastructure - in AWS, all Tsuru components can run on a t2.micro instance (1 CPU, 1 GB of RAM) for test or development purposes. You may then progressively scale each component according to its usage.

As we began managing more services with Tsuru we noticed that Docker Engine and then PlanB router consume most of the allocated resources. This is the reason why they should be the first components to scale. The figure below shows the overall infrastructure.

AWS auto-scaling can directly provision and retire PlanB routers, whereas Docker nodes require Tsuru auto-scaling integration into AWS. This is critical for service delivery as Docker nodes aren’t aware of container health, whereas Tsuru is. This why Tsuru itself handles auto-scaling to safely provision and retire Docker nodes, while initiating or draining container connections from PlanB routers via Redis.

Logging

Best practice for logging in containers is to redirect any log to stdout and stderr. This way, Tsuru captures container logs that may be freely processed. However, if you use AWS instance roles you may actually configure Docker (and Tsuru) to send all logs directly to Cloudwatch. This gives you a nice interface to browse your application logs.

--log-driver awslogs --log-opt awslogs-region= \
--log-opt awslogs-group=my.tsuru.cluster --log-opt tag='{{.Name}}'

Service discovery

Service discovery is a complex problem that Tsuru solves for you: it uses the platform, Planb router and Redis to manage routes and traffic. When you create an application in Tsuru, you get two endpoints: a Git endpoint to push your application code and a second endpoint for HTTP access. As deployment occurs, Tsuru creates a new container and waits until it’s healthy to retire the old one. This way, users seamlessly switch to the new version of the application. In the event that the container never becomes healthy, Tsuru aborts the deployment. Tsuru healthcheck is customizable for each application in a tsuru.yaml file:

healthcheck:
    path: /healthcheck
    method: GET
    status: 200
    match: .*OK.*
    allowed_failures: 3

12 Factor App

As we were implementing Tsuru, we realized it helped us reach most, if not all, of the 12 factor app principles that make cloud-based software more manageable, portable and scalable. Have a look at those best practices if you don’t know them yet, there are a cornerstone for applications running on PaaS, SaaS and even CaaS.

Give it a try

Want to test Tsuru? Download Tsuru Now follow this simple video example to create, customize and scale a hello world PHP application in Tsuru.

Some references

Check out the following links for additional information:

A journey to Progressive Web Apps

2017-05-23T13:10:00+01:00

INTRODUCTION

The internet has evolved a great deal since the first time I’ve access it with my 24 kbps modem. All flashy GIF are gone. Rarely I find these days, weird mosaic backgrounds or midi music playing jingles in Christmas time while snowy javascript animations falls from the top of the screen. This evolution came mostly from the creativity of developers who tried without rest to experiment technics and technologies to overcome the browser limitations. From Flash to Dynamic HTML ,Iframes to AJAX, innovation was always the goal to reach success.

But with the growth of the mobile access from cellphones and tablets, consumers created this new need of a more performant and reliable platform. Globalization also brought to emergent countries a new type of user thanks to the mobile. Those users from basic devices to flaky internet connection made developers to rethink how we build for the internet. The progressive web apps user experiences were born and today let me tell you how in Jumia Travel we went from our standard mobile web site to a more progressive web approach to bring to our users a rich, reliable, fast an engaging experience.

Jumia Travel

Jumia Travel is a leading traveling company with more than 25’000 hotels in Africa only. In our main market, sub-Saharan Africa, it’s important to know that 75% of the mobile connections are 2G networks. Many of our users have intermittent connectivity which make the “standard” user experience a little hard to love. Also the majority of users have a low-end device with data limitations at very high costs. This two factors make difficult to induce the download of our native apps leading to steep drop-off rates and high customer-acquisition costs.

We knew we could do better, we could find a way to both save data consumption and improve the user experience by delivering a faster web application. To achieve just that we start working last year around the concept of Progressive Web App.

Progressive web apps

PWA are user experiences that have the reach of the web and are:

Reliable

Load instantly and never show the downasaur, even in uncertain network conditions.

Fast

Respond quickly to user interactions with silky smooth animations and no janky scrolling.

Engaging

Feel like a natural app on the device, with an immersive user experience.

We found this definition on Google developer’s website. And what does it mean? It should be an application that always works in any network condition, should be fast for users and let them interact as soon as possible even if the page was not fully loaded. Should feel like a native application, rich in user experiences and capabilities. It sounded good! It sounded like it solved all our issues for our users. But how to achieve just that.

How we build our PWA

When building a PWA we faced a series of challenges. I could say the first one was the fact there was no much where to look at. In our perspective PWA’s are still a very young concept or idea and we didn’t know where to go and how to get there.

The search for the values

Somewhere around February 2016 I started to investigate more on how I could build a web application that felt like a natural application. I was suggested to look at a new Google technology called Polymer. For me, it was my introduction to Web Components. Web Components looked nice, they are a set of four technologies, shadow dom, templates, html imports and custom elements and the latest caught my attention. I knew if I wanted to create a good performance application I needed to make a mentality change. This was going to be a total change of methodology.

The custom way of building quality

Custom Elements are basically a native API for you as a developer to create custom HTML tags. We can apply a default css to style the element, attach javascript to define the behaviors of the element and set custom properties to react with the element. With those custom elements I was able to orchestrate a single page application. Polymer not only introduced me to web components but also make it more productive. With polymer attached to my custom elements I had a set of features available to manage elements bindings, listeners or observers. I was able to compose elements with custom classes called behaviors, this extended the possibility to reuse code in order to avoid duplicated code. From a single custom element to a full functional application seemed to be the next logical step and there was much more from the Polymer team to use in our advantage. Web components from the Polymer catalog like the app-layout elements provided an application shell for us, and the Polymer Cli all that we needed to generate our service worker but also to build and deploy the application.

class HelloElement extends HTMLElement {
  // Monitor the 'name' attribute for changes.
  static get observedAttributes() {return ['name']; }

  // Respond to attribute changes.
  attributeChangedCallback(attr, oldValue, newValue) {
    if (attr == 'name') {
      this.textContent = `Hello, ${newValue}`;
    }
  }
}

// Define the new element
customElements.define('hello-element', HelloElement);

Example from mozilla

We felt we were ready to go.

Go hands on

In four weeks I build a prototype based in our Android Native App. Users were able to go from the home screen (home page) to a hotel list, hotel detail and hotel room page. It was a little more than half of the main funnel. We this prototype I was able to understand the effort in building Jumia Travel’s PWA but also I was able to compare some metrics in term of performance and data consumption, after all those were the main reason behind this project. The numbers were promising:

2x faster in 2G connection
6x less data compare to the native app

A new challenge was ahead. How to deliver this app project to any of the development teams knowing that none had any experience in Web components or Polymer. But also how to transmit the idea and new mentality that building this web application was closer to build a native application than actually the standard client to server requests. This PWA acts more like a native app, with state management, highly cached data and a complete new type of architecture which brought to us as a new development model, the Demo driven development. With this model we create custom elements that are atomic systems, self maintaining and testable by unit. This challenge was very important to me and I understood it was the only way this project could be successful. Internally I’ve promoted trainings for each team. It was both a theoretical and practical knowledge sharing.

Bigger may not be better

As soon as the application began to grow beyond the prototype and closer to the expected final result, our tests started to show that we were failing in performance. The Progressive Web App was slower compared to the current mobile experience. One of the major risks to build a Single Page Application is to add too much code to load in order to start the application. An action was to be taken to take us on the right course again. We broke down the app in small pieces and make sure that we lazy loaded features only when needed. We used the PRPL pattern, which stands for Push, Render, Pre-cache, Lazy-load. The idea behind PRPL is to make sure that the first render happens as soon as possible, and to do that the developer should serve by priority elements that are visible and useful for the first interaction, using lazy-load techniques through html import, service workers or by HTTP2 push. The result is that some applications features were downloaded only when requested or by the service worker in the background if the application was idle. This allowed to slim down the app shell by 200% and to get a faster first render.

Unexpected Roadblocks

With performance on track, we were using caching with the service worker for files and indexDB for data to make sure we were doing the best use of the user data consumption. Now it was time to face other new challenges, URL compatibility, SEO and Tracking are some of the examples.

URL’s are important, even more if your current website is already indexed in most of the search engines. To build a single application that knows how to respond from any request (organical or paid), we had to build custom elements that were able to understand the URI and to load the correct page fragment. Users were able then, to share URL that worked correctly between any of our websites or client applications. In this way we started to worry more about maintaining compatibility with our current system and thus with the SEO value that we already had. To make sure we kept the app rich in current features and searchable, we added all the SEO content that we had in our previous mobile experience. All the meta data, title, micro data any semantic value, everything was there. But at this stage all this content was given after the load of the page from API calls and not by the server response. To tackle this issue we decided to reuse what we already had. Since Jumia Travel has a basic feature phone website, we used this experience to feed all the search crawlers with our SEO content. This way we were sure that any crawler without Web components support could “see” our SEO content. Since we wanted to serve our PWA into a serverless environment inside our CDN for faster download response, it was easier for us to avoid to worry so much about SEO in the PWA by using our current platform. For tracking it was important to understand that in this architecture model, events like document ready or window load happend only once, and any pageviews should be conceptually considered fired when we were actually switching each page fragment since the entire application is running in a single page. In PWAs, Analytics aren’t easy.

The results

One year after the prototype was started we launched the PWA for a limited set of users with specific browser versions. And while results were arriving we opened the restrictions to other browsers or browsers versions.

In May of 2017, Google published our results, we had

33% higher conversion compared to our previous mobile site
50% lower bounce rate
12X more users versus native apps (Android & iOS)
5X less data used
2X less data to complete first transaction
25X less devices storage required

Find more information about Jumia Travel show case

Think back

For me, working in this PWA was a unique experience and a lesson on how to think and recreate the concept on building application for the web. Developers have been asking for a more centric technology and tools to build modern applications that breaks all the conceptual rules of the old web. With web components and service workers we were given the tools to extends our creativity beyond our limits. I was able to reorganize all the vendors tools that I’ve been using for the last decade and to focus more on the user experience without searching too much if there’s something already done to boost my productivity. Well designed PWA are in fact faster, engaging and reliable, but they are also Hard, and to build one is a culture of taking good experient or data driven decisions, since there’s no real guidelines or good practices to be found. PWAs experiences are still very new and I believe there’s is a lot of terrain undiscovered.

Some references

User Authentication: Look ma, no servers!

2017-03-31T11:18:00+01:00

*This is part of a guest post on Amazon Web Services Compute Blog.

Want to secure and centralize millions of user accounts across Africa? Shut down your servers!

Jumia is an ecosystem of nine different companies operating in 22 different countries in Africa. Jumia employs 3000 people and serves 15 million users/month.

Jumia unified and centralized customer authentication on nine digital services platforms, operating in 22 (and counting) countries in Africa, totaling over 120 customer and merchant facing applications. All were unified into a custom Jumia Central Authentication System (JCAS), built in a timely fashion and designed using a serverless architecture.

The challenge

A initiative was started to centralize authentication for all Jumia users in all countries for all companies. But it was impossible to unify the operational databases for the different companies. Each company had its own user database with its own technological implementation. Each company alone had yet to unify the logins for their own countries. The effects of deduplicating all user accounts were yet to be determined but were considered to be large. Finally, there was no team responsible for managing this new project, given that a new dedicated infrastructure would be needed.

With these factors in mind, we decided to design this project as a serverless architecture to eliminate the need for infrastructure and a dedicated team. AWS was immediately considered as the best option for its level of service, intelligent pricing model, and excellent serverless services.

The goal was simple. For all user accounts on all Jumia websites:

Merge duplicates that might exist in different companies and countries
Create a unique login (username and password)
Enrich the user profile in this centralized system
Share the profile with all Jumia companies

Requirements

We had the following initial requirements while designing this solution on the AWS platform:

Secure by design
Highly available via multi-master replication
Single login for all platforms/countries
Minimal performance impact
No admin overhead

Chosen technologies

We chose the following AWS services and technologies to implement our solution.

Amazon API Gateway

Amazon API Gateway is a fully managed service, making it really simple to set up an API. It integrates directly with AWS Lambda, which was chosen as our endpoint. It can be easily replicated to other regions, using Swagger import/export.

AWS Lambda

AWS Lambda is the base of serverless computing, allowing you to run code without worrying about infrastructure. All our code runs on Lambda functions using Python; some functions are called from the API Gateway, others are scheduled like cron jobs.

Amazon DynamoDB

Amazon DynamoDB is a highly scalable, NoSQL database with a good API and a clean pricing model. It has great scalability as well as high availability and fits the serverless model we aimed for with JCAS.

AWS KMS

AWS KMS was chosen as a key manager to perform envelope encryption. It’s a simple and secure key manager with multiple encryption functionalities.

Envelope encryption

Oversimplifying envelope encryption is when you encrypt a key rather than the data itself. That encrypted key, which was used to encrypt the data, may now be stored with the data itself on your persistence layer since it doesn’t decrypt the data if compromised. For more information, see How Envelope Encryption Works with Supported AWS Services.
Envelope encryption was chosen given that master keys have a 4 KB limit for data to be encrypted or decrypted.

Amazon SQS

Amazon SQS is an inexpensive queuing service with dynamic scaling, 14-day retention availability, and easy management. It’s the perfect choice for our needs, as we use queuing systems only as a fallback when saving data to remote regions fails. All the features needed for those fallback cases were covered.

JWT

We also use JSON web tokens for encoding and signing communications between JCAS and company servers. It’s another layer of security for data in transit.

Data structure

Database design is somehow simple in this scenario. DynamoDB records can be accessed by primary key and allow access using secondary indexes as well. We created a UUID for each user on JCAS which is used as a primary key on DynamoDB. Most data must be encrypted so we’ll use a single field for those data. On the other hand, there’s data that needs to be stored in separate fields because they need to be accessed from the code without decryption. The user Id, phone number, email, the ctime for the document, the user’s status and the user’s passwords are all out of the main data field. Users existed on the companies’ systems before JCAS came along so they already had passwords created. These are stored in a Map for the rare cases where JCAS needs to authenticate the user with one of those passwords. For faster accesses, we decided to hold the user’s password for JCAS on a separate field considering that most authentications will be performed against this field.

Id: UUID generated by this system
- Purpose: Primary Key, unique identification for Jumia companies
mec, mpn: Main email and phone number respectively hashed
- Purpose: Searchable fields used when a company doesn’t have an Id
old_hashes: DynamoDB Map containing legacy password hashes
- Purpose: This Map contains the password hashes for each pair (company, country) on the companies’ systems – Example on image below
revision: Timestamp for last item changes
- Purpose: Avoids invalid operations and reduces a number of API calls to Key Management Service (KMS)
Stat: Active/Inactive toggle
- Purpose: Defines whether user data should be considered valid and worth decrypting or just skip it to save a KMS call and Lambda CPU time
Info: Map containing encrypted data and its KMS key for decryption
- Purpose: Read on for more details on multiple KMS keys usage
Secret: User’s password in bcrypt
- Purpose: Hold the user’s password in JCAS. When set it prevails over any password in old_hashes.

Security

Security is like an onion, it needs to be layered and that’s what we did when designing this solution

Embedded in each layer of this solution, our design makes all of our data unreadable whilst easing our compliance needs.

The field data stores all personal information from our customers that is encrypted using AES256-CBC with a key generated and managed by AWS KMS – A new data key is used for each transaction.

The field keys stores the Ciphertextblob generated by KMS and it only can retrieve the keys to decipher the data field above.

Our solution uses KMS from multiple regions, therefore data can be decrypted within any region that is present in keys field.

KMS supports two kinds of keys — master keys and data keys. Master keys can be used to directly encrypt and decrypt up to 4 kilobytes of data and can also be used to protect data keys. The data keys are then used to encrypt and decrypt customer data.

Master keys only allow 4kB of data to be encrypted/decrypted, this was not good for us and also making it future proof for growth needs, we used Data keys to encrypt our customer data, this process is called envelope encryption

Here’s a snippet off code to perform envelope encryption. We just need to encrypt the key with the KMS in that region.

#Generate new key
k = kms.generate_data_key(KeyId=CONFIG['kmsKeyA'], KeySpec='AES_256')
#Cipher plaintext of Region A with key from Region B
kmsB = sessionB.client('kms')
kB = kmsB.encrypt(KeyId=CONFIG['kmsKeyB'], Plaintext=k['Plaintext'])
#Random IV/Salt, even though we use a new datakey per row
iv = Random.new().read(AES.block_size)
#Padding the message for the correct block size
secret = pad(secret)
#Generate ciphered text with AES256 CBC Mode
cipher = AES.new(k['Plaintext'], AES.MODE_CBC, iv )
#Data to save in the field data
csecret = iv + cipher.encrypt(secret)
del secret #Garbage collect secret ASAP
del k #Release plaintext key from memory

Here’s what is happening:

1 - Ask KMS in the current region to generate a Data Key

2 - Use the plaintext key to cipher the data field with AES256-CBC

3 - Request KMS from another region to encrypt the plaintext used to cipher the data (here we use a Master key from other regions)

If one the KMS is unavailable the process will not be affected by the key field will be left empty for the time being. The trigger we have in place that helps us achieve cross-region replication will check if any of the KMS key fields are empty, if true it will write the customer Id to an SQS queue called KMS.

If KMS on EU-CENTRAL-1 is offline, write customer Id to KMS SQS on all regions

A cron style lambda function will get the customer Id from the SQS queue, check if the needed KMS regions are available, encrypt the data and store the new keys in all the regions. The second region will do exactly the same, again last write wins in case of race condition. (for the sake of diagram clarity arrows were left out) In case that during this process other KMS region need is down or when writing the to DynamoDB is unsuccessful the customer Id will stay in the SQS KMS queue.

Schedule Lambda function will check if KMS SQS have data on all regions, check which regions we have empty in the key field and try to contact KMS, if successful, update data and keys in DynamoDB on all regions

For communication between our companies and the API we use API Keys, TLS and JWT in the body to ensure the post is signed and verified.

Data Flow

Our second requirement on JCAS is system availability

Companies always call the JCAS API when they authenticate a user. Despite this, business always comes first so if JCAS is down or inaccessible the company’s systems have to continue working just as well, even if in disconnected mode.

We have no reason to believe that an entire AWS region may be down at some point but that’s something we couldn’t risk, therefore we wanted a multi-region system.

Given our requirements for multi-master replication, we decided to use a system where last write wins, taking advantage of the revision field and have it to control all writes. With this in mind, we designed a system that replicates data and accepts writes in all regions.

Design-wise our systems are CAP-available. Replication to other regions happens synchronously but falls back to asynchronous should the former fails.

In order to provide low latency to the application, we decided not to do replication in the AWS Lambda function that answers to client requests. Instead, we use DynamoDB Streams to start the whole process.

A Lambda function called trigger is responsible for replicating writes across regions via DynamoDB Streams. This Lambda function will then check the revision field beforehand on the destination (last-write-wins), should this write fails though we fall back to asynchronous mode.

The Lambda function will write the item to a SQS queue named SQS__, for example if it failed a write on DynamoDB in region B we would write to two SQS named SQSAB and SQSBB, and will succeed if at least one SQS queue accepts the write. A cron-style lambda consumes from these SQS queues in all regions and writes to DynamoDB also using check-and-set on revision.

In order to be available across all regions, info field also contains the KMS keys needed to decrypt the user’s data on each region with its own KMS. The same queueing paradigm applies to KMS key reference contained in the info field on the DynamoDB table.

Wrapping all up

We have 3 Lambda functions and 2 SQS queues where:

1) A trigger Lambda function for DynamoDB stream – Upon changes it tries to write directly to another DynamoDB table in the second region * We insert the full item into 2 SQS queues, SQS_A_B and SQS_B_B (if we are present in only two regions, with the third it would add another one SQS_C_B) * This allows us to have the data available for 14 days against only 24h of the DynamoDB stream

2) A scheduled Lambda (cron-style) that will check a SQS queue for items and try writing them on the DynamoDB that potentially failed

This way we can replicate all DynamoDB data making it consistent,
Unless DynamoDB on the replica region is down for more than 14 days

3) The last Lambda function is another cron-style that will check the SQS queue called KMS for any items and fix the issue

Full infrastructure (for diagram clarity, missing only KMS recover topic #3 above)

Integration

In JCAS we authenticate all our customers and only save shareable data between our companies like emails, phone numbers and addresses.

JCAS is the Single Source of Truth so our companies have to update their local information based on the information provided by JCAS. Each company will have a set of specific information that is stored locally.

In the event of JCAS unavailability, these very same companies can still store data locally and subsequently call JCAS async API bulk sync with latest changes when JCAS is back up.

Regardless of how data is updated in JCAS, the highest timestamp will always prevail and although it greatly simplifies the system it may cause some discrepancies in user data flow – Here’s a situation to exemplify this:

1 - A user, disconnected from JCAS, connects to companyA and change his password
2 - He might be able to connect to companyB using old credentials stored in JCAS because companyB may not have synced yet
3 - When companyA manages to bulk sync its local DB to JCAS, the password previously set on companyA will be updated on JCAS, therefore, affecting all companies
4 - This new password is stored in old_hashes clearing the secret as it was an asynchronous process and we will not send the plaintext password in the bulk sync method. We can match any password hash from our companies
5 - Upon next login, JCAS will match the password in old_hashes, but since this time it’s synchronous we take the plaintext password create a new hash store it in secret and clear the old_hashes
6 - Some corner cases are created by this design but distributed systems are all about tradeoffs.

Results

Upon going live we noticed a minor impact in our response times – Note the brown legend in the images below

This was a cold start, as the infrastructure started to get hot, response time started to converge. On 27 April we were almost at the initial 500ms

It kept steady on values before JCAS went live (≈500ms)

As of the writing of this article, our response time kept improving (dev changed the method name and we changed subdomain name)

Customer login used to take ≈500ms and it still takes ≈500ms with JCAS. Those times have improved as other components changed inside our code.

Lessons Learned

Instead of the standard, creating your own DynamoDB cross-region replication might be a better fit for you, if your data and applications allow it
Take some time to tweak the Lambda runtime memory, remember it’s billed per 100ms, it will save you money if you have it run near a round number
KMS takes away the problem of key management with great security by design, it really simplifies your life
Always check the timestamp before operating on data, if it’s invalid save the money by skipping further KMS, DynamoDB and Lambda calls. You’ll love your systems even more
Embrace the serverless paradigm, even if it looks complicated at first it will save your life further down the road when your traffic bursts or you want to find an engineer who knows your whole system

Next steps

We are going to leverage everything we’ve done so far to implement SSO in Jumia. For a future project, we are testing OpenID connect with DynamoDB as a backend.

Conclusion

We did a mindset revolution in many ways. Not only we went completely serverless as we started storing critical info on the cloud but we also moved all user data to be decoupled between local systems and central auth silo. Managing all of these critical systems became far more predictable and less cumbersome than we thought possible. For us this is the proof that good and simple designs are the best features to look out for when sketching new systems.

If we were to do this again, we would do it in exactly the same way.

Scrumming Agreements

2017-03-16T14:08:17+00:00

Storming Teams

In the early stage of team formation, normally shortly after a team is put together, it is normal to start noticing some conflicts or frictions between team members, maybe due to different personalities, different ways in working, inexperience in collaborating, unhealthy personal habits, etc.

To help a team get through a more storming stage (forming > storming > norming), nothing better than creating a common ground: simple, direct team working agreements that can ease daily interactions, help resolve differences, and promote more appreciation or respect between team members – foster a good working environment.

Scrumming the Agreements

At the time some teams at my organization (POs included) were still trying “get the hang” out of writing users stories and grooming the backlog, so what came to my mind:

Why not do this team working agreements slightly differently?
Let’s “Scrum” these team working agreements and get more out of the exercise then the actual agreements!

So I prepared the exercise writing a user story (new functionality) for our product (teams) that would be groomed (high priority backlog item) and implemented in the next sprint (retrospective).

I wrote a “poor” user story deliberately (value/purpose, format/style), so that the teams could use their recent learnings, references and practices in getting a backlog item “ready”, as well as reassuring it is actually “done”.

Backlog item before grooming (non-ready)

User Story

As a manager I want to create a set of rules So that I can make teams follow and avoid conflicts

Acceptance Criteria

Rules will be followed by all teams
Rules must be detailed as much as possible
There is no limit of rules
No cell phones in events
No laptops in events (stand-up/retro events)
No chatting/messaging in events (e.g. skype)
Events must start on time
Rules must be handed over to the manager so he can use them when necessary

Backlog item after grooming (ready)

User Story

As a facilitator I want to establish a set of team working agreements So that we create a better, trustful and safer working environment

Acceptance Criteria

Team agreements are rules or behaviors that team members should agree and follow
The agreements must be effective:
simple and direct – that fit on a post-it
limited in number – more than 10 less than 20
enforceable – change/impact on behavior
The agreements should be made available for all team members so they can be:
referred daily or in ceremonies
shared with “new” team members
used as recommendations for other teams

Sprint backlog item (done) – Team A

Sprint backlog item (done) – Team B

Outcome

I came out to be an interesting way in achieving the goal and using some Scrum principles and practices:

getting backlog item “ready”: grooming session with the team (role played PO) creating clarity, understanding the “who”, “what”, “why”, questioning and re-assuring value, confirming acceptance criteria is feasible and testable,
getting the sprint backlog item “done”: collaboratively identifying agreements and confirming the acceptance criteria is met.