Technology evolves so quickly that anyone with the title of Architect can feel overwhelmed deciding what to choose and how long the choice will last. That isn't going to change anytime soon. Here are some of the technology and process components that we took to production this year, out of the many experiments we ran.
Hybrid Data Model
With JSON being the de facto standard for REST and microservices, we depend heavily on document-oriented stores like Apache CouchDB. But while some data is best modelled as rows, other data is best modelled as JSON documents, so we started moving to hybrid data models (relational + JSON). We adopted MariaDB's JSON type and its comprehensive set of SQL functions for storing, indexing, and querying JSON documents.
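A minimal sketch of the hybrid pattern, using SQLite (from the Python standard library) in place of MariaDB so it runs anywhere; the table and column names are illustrative. Stable, always-queried fields live in relational columns, while flexible attributes live in a JSON document column:

```python
import json
import sqlite3

# Hybrid (relational + JSON) model sketch. Production used MariaDB's
# JSON type; SQLite stands in here so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE candidate ("
    "  id INTEGER PRIMARY KEY,"
    "  name TEXT NOT NULL,"      # relational column: always queried
    "  profile TEXT NOT NULL)"   # JSON document: flexible attributes
)
profile = {"skills": ["python", "sql"], "years_experience": 7}
conn.execute(
    "INSERT INTO candidate (name, profile) VALUES (?, ?)",
    ("Asha", json.dumps(profile)),
)

# Filter on the relational column, then parse the JSON document.
row = conn.execute(
    "SELECT name, profile FROM candidate WHERE name = ?", ("Asha",)
).fetchone()
name, doc = row[0], json.loads(row[1])
print(name, doc["skills"])  # Asha ['python', 'sql']
```

In MariaDB itself, functions such as JSON_EXTRACT and JSON_VALUE let the same filtering and projection happen inside SQL, with indexing over extracted values.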
Explainability with graph databases
We also focused on bringing explainability to the data we consume and to model predictions. Adopting a graph database (Neo4j) gave context on which features and weights were used in predictions by our talent-matching engine, and gave the customer the confidence to approve production. We also applied graph algorithms in our reputation-risk solution, where they provide the context behind each risk alert.
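The explainability idea can be sketched in memory: each prediction is a node linked to the features and weights that produced it, so an explanation is just a traversal. This is a hypothetical illustration; node names, relationship names, and weights are made up, and in production the same shape would live in Neo4j and be queried with Cypher.

```python
# Hypothetical in-memory sketch of prediction explainability as a graph.
# Production used Neo4j; all identifiers here are illustrative.
from collections import defaultdict

edges = defaultdict(list)  # node -> [(relation, node, properties)]

def link(src, rel, dst, **props):
    edges[src].append((rel, dst, props))

# A match prediction points back to the features and weights it used.
link("match:42", "USED_FEATURE", "feature:skill_overlap", weight=0.6)
link("match:42", "USED_FEATURE", "feature:title_similarity", weight=0.3)
link("match:42", "USED_FEATURE", "feature:location", weight=0.1)

def explain(prediction):
    """Return (feature, weight) pairs behind a prediction, heaviest first."""
    used = [(dst, props["weight"])
            for rel, dst, props in edges[prediction]
            if rel == "USED_FEATURE"]
    return sorted(used, key=lambda fw: -fw[1])

print(explain("match:42")[0])  # ('feature:skill_overlap', 0.6)
```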
Declarative data pipeline definition
Traditionally, data pipelines were written as Python scripts, and we were no exception. Each script contained the logic of the individual steps as well as the code chaining the steps together. As we built more complex data engineering pipelines, their maintenance and reliability became a challenge. This year we started building declarative data pipelines for our data lifecycle, from ingestion to cleanup to transformation, leveraging Apache NiFi. With declarative pipelines, data engineers can rapidly branch, edit, and deploy pipelines with existing DevOps tools for CI/CD. As we end the year, we have stable data pipelines built and run using Apache NiFi.
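NiFi flows are composed visually and exported as flow definitions, but the core declarative idea, the pipeline as data rather than imperative glue code, can be sketched in a few lines. The step names and functions below are illustrative, not NiFi APIs:

```python
# Declarative pipeline sketch: the pipeline itself is data (an ordered
# list of step names), and a generic runner executes it. Because the
# pipeline is data, it can be versioned, diffed, and deployed via CI/CD.

def ingest(record):
    return dict(record, source="crm")

def clean(record):
    return {k: v.strip() if isinstance(v, str) else v
            for k, v in record.items()}

def transform(record):
    return dict(record, name=record["name"].title())

STEPS = {"ingest": ingest, "clean": clean, "transform": transform}
PIPELINE = ["ingest", "clean", "transform"]  # the declarative part

def run(pipeline, record):
    for step in pipeline:
        record = STEPS[step](record)
    return record

out = run(PIPELINE, {"name": "  ada lovelace "})
print(out)  # {'name': 'Ada Lovelace', 'source': 'crm'}
```

Reordering or dropping a step means editing the `PIPELINE` list, not the chaining code, which is what makes branching and review with standard DevOps tooling practical.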
Continuous delivery for machine learning (CD4ML)
CD4ML is a term coined by ThoughtWorks that goes beyond the much-discussed MLOps: it calls for versioning not just models, but also training data and hyperparameters. We decided to build our large-scale AI/ML pipelines on MinIO as the model and data store, chosen for its performance characteristics and its ability to scale seamlessly to petabytes and exabytes of data. By separating storage and compute, we built a custom framework that does not depend on local resources but runs in containers on Kubernetes. This adds considerable flexibility to our deployment and meets our scalability demands.
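One way to picture the CD4ML requirement is a versioned object layout in which a model, its training data, and its hyperparameters are published together under one version prefix. The key scheme below is hypothetical; in production the objects would be written to MinIO through its S3 API, and a plain dict stands in for the bucket here:

```python
import hashlib
import json

# Hypothetical CD4ML versioning sketch: model, training data, and
# hyperparameters are stored together under one version prefix.
# A dict stands in for the MinIO bucket so the sketch is runnable.
bucket = {}

def publish(version, model_bytes, train_data, hyperparams):
    prefix = f"cd4ml/{version}"
    bucket[f"{prefix}/model.bin"] = model_bytes
    bucket[f"{prefix}/train.json"] = json.dumps(train_data).encode()
    bucket[f"{prefix}/hyperparams.json"] = json.dumps(hyperparams).encode()
    # A manifest of content hashes makes every run auditable/reproducible.
    manifest = {
        name: hashlib.sha256(bucket[f"{prefix}/{name}"]).hexdigest()
        for name in ("model.bin", "train.json", "hyperparams.json")
    }
    bucket[f"{prefix}/manifest.json"] = json.dumps(manifest).encode()
    return manifest

manifest = publish("v3", b"\x00weights", [[1, 2], [3, 4]], {"lr": 0.01})
print(sorted(manifest))  # ['hyperparams.json', 'model.bin', 'train.json']
```

With all three artifacts hashed and versioned together, any deployed model can be traced back to exactly the data and hyperparameters that produced it.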
Another experiment that moved from trial to adoption was the change from Nginx to Traefik as our reverse proxy. Nginx has not lost its charm, but it didn't meet our needs for monitoring and observability, protocol support, and ease of use. Traefik is a Docker-aware reverse proxy that includes its own monitoring dashboard, supports automatic discovery of services, metrics, and tracing, and has Let's Encrypt support out of the box.
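The Docker awareness is what makes Traefik convenient: services are exposed through container labels rather than a central proxy config. An illustrative docker-compose fragment (hostnames, service names, and the email address are placeholders) might look like:

```yaml
# Traefik discovers the app via Docker labels and obtains TLS
# certificates through Let's Encrypt automatically.
services:
  traefik:
    image: traefik:v2.4
    command:
      - --providers.docker=true
      - --entrypoints.websecure.address=:443
      - --certificatesresolvers.le.acme.email=ops@example.com
      - --certificatesresolvers.le.acme.storage=/acme.json
      - --certificatesresolvers.le.acme.tlschallenge=true
      - --api.dashboard=true
    ports:
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro

  app:
    image: example/app:latest
    labels:
      - traefik.enable=true
      - traefik.http.routers.app.rule=Host(`app.example.com`)
      - traefik.http.routers.app.entrypoints=websecure
      - traefik.http.routers.app.tls.certresolver=le
```

Adding another service to the proxy is a matter of labelling its container; no proxy restart or shared config edit is needed.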
Streamlit turned out to be a good pick of the year. Streamlit is a framework for building ML and data science web apps. It was quick and fun building prototypes using Streamlit.
What is still in the assess phase and did not make it to adoption this year is micro frontends. A couple of POCs have been built and are being tested, and we expect them to reach mainstream adoption by Q1 2021.
Chandrasekhar Somasekhar is the Chief Architect at Cleareye.ai. He is responsible for product development, strategic technology direction, and implementing and governing solution architecture methodologies at Cleareye.ai. Chandra defines and governs the enterprise solution architecture strategy and provides architectural direction. He is passionate about developing, mentoring, and motivating a high-performance team.