Architecture V1 in an Early Stage Startup

Although there are tons of articles on software architecture on the internet, they are mostly from late stage startups or public companies. At Trace Data, a seed stage startup, we built our product from the ground up over the past 15 months. In this article, I would like to share our experience and a few learnings from forming the Architecture V1 at an early stage startup. I hope it is an interesting read for entrepreneurs, CTOs, architects and other engineering leaders facing similar challenges.

(Note: the architecture decisions were mostly made by Mark Lu and me together.)

What is Architecture V1

  • Infrastructure platform and programming languages are decided.
  • Major components of the system are clearly identified and put into place.
  • The engineering team can independently develop any component without worrying about stepping on each other’s toes.
  • The system is secure.
  • The system is scalable for all major business use cases, commonly including a Read/Query and a Write path.
  • The workflow is streamlined end to end: publish code, review, merge, build, test, deploy, roll back, and promote staging to production.
  • Monitoring and alerts are set up.
  • Running the above system is cost efficient.
  • The architecture has enough room to evolve, such as introducing new business cases or adding new microservices. In short, Architecture V1 should be built to run for at least 1 year without an urgent need for a V2.

We will discuss the choices one by one.

What did we have to start?

A prototype of the product was developed by a contractor company for Trace Data on Heroku between August and September 2019. I joined Trace Data in September 2019 and Mark joined in October 2019. We then started to think about and work on Architecture V1.

As the picture shows, it was very basic and far from a V1.

Choices, Decisions and Retrospection

Platform: AWS vs GCP

Retrospection: I think we made a great choice. GCP is scalable, reliable and cost efficient. I have heard people complain that GCP is hard to use, but we did not run into anything major.

Platform: Serverless vs K8s

  • Java is not a good fit for Serverless because of its slow startup, yet the majority of our engineers are Java developers. JS or GoLang will take time for the team to ramp up on.
  • The team has tremendous knowledge of operating microservices at scale. Picking Serverless gives up that strength.
  • Our data ingestion pipeline will have complex logic to process, aggregate and join data. We will need a local cache and access to a remote cache, which conflicts with the standalone nature of Serverless functions.
  • Based on our not-so-thorough research, Serverless may not be as performant as a regular Java app.
  • If we choose Serverless, we will have to make sure it is portable from Google Cloud Functions to AWS Lambda.
  • Testing will be different, and so will deployment. It means we move into a completely new world.
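To make the caching concern concrete, here is a minimal Python sketch of a two-tier lookup: an in-process cache in front of a remote cache. Everything in it is hypothetical (the class names, the dict standing in for a remote cache, the loader) and it is not Trace Data's code. The local tier only pays off when the same process handles many requests, which a long-running service gets by default and a short-lived function instance cannot count on.

```python
from typing import Callable, Dict, Optional


class RemoteCache:
    """Stand-in for a networked cache; a plain dict here (assumption)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self.calls = 0  # counts simulated network round trips for reads

    def get(self, key: str) -> Optional[str]:
        self.calls += 1
        return self._data.get(key)

    def put(self, key: str, value: str) -> None:
        self._data[key] = value


class TwoTierCache:
    """Checks process memory first, then the remote cache, then the loader."""

    def __init__(self, remote: RemoteCache, loader: Callable[[str], str]) -> None:
        self._local: Dict[str, str] = {}
        self._remote = remote
        self._loader = loader

    def get(self, key: str) -> str:
        if key in self._local:
            return self._local[key]
        value = self._remote.get(key)
        if value is None:
            value = self._loader(key)
            self._remote.put(key, value)
        self._local[key] = value
        return value


remote = RemoteCache()
cache = TwoTierCache(remote, loader=lambda key: f"metadata-for-{key}")
first = cache.get("service-a")   # misses both tiers, so the loader runs
second = cache.get("service-a")  # served from process memory, no remote read
```

In this sketch the second lookup never touches the remote cache, which is the benefit a process that is torn down between invocations would lose.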

Mark recommended K8s instead. It quickly won our hearts. K8s has a few advantages that we love.

  • Cloud independent. We can easily port our system from GKE to EKS or AKS, thanks to K8s.
  • Every service is dockerized. It means we can build each piece with the language that we choose: Java, GoLang, React.js… A lot of flexibility.
  • The concepts of services, pods, load balancing and autoscaling are very familiar.
  • K8s is backed by Google, is gaining strong momentum in the industry, and is battle tested.
  • The system is also more compact than traditional EC2 + LB style microservices.

Retrospection: I think we made a great decision. K8s probably improved our productivity by at least 20% to 30%. It was a great choice by Mark.

Serverless might have worked out too. But I am happy with our K8s choice.

Programming Language: consolidate to one?

I argued that if we did that, dropping Java and learning GoLang, the team would probably lose 30% of its productivity, if not more. It did not sound like the right choice to me. Our CEO, James, agreed.

In the end, we use GoLang to develop the agent and tracer, basically the apps that are deployed on the customer's side. We use Java to develop the data ingestion, query and API layers on our SaaS platform side.

The UI is built on Node.js, React and TypeScript. It stays that way.

Retrospection: I think we made a great decision. We fully leveraged the team’s expertise in Java. This became even clearer when another principal engineer, also a strong Java developer, joined the team in 2020.

Separate the major components

In reality, there are many other components, such as a caching layer, that I purposely skipped. For the above, the focus should be:

  • Read and write are separated.
  • Each component can scale out horizontally.
  • The DB is configured for HA.
  • The API is the layer that hides the actual implementation. For instance, we can replace the datastore without changing the APIs.
  • Most components are stateless by design.
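The bullets above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `SpanStore`, `InMemoryStore`, `IngestService` and `QueryApi` are hypothetical names, and the in-memory backend stands in for whatever datastore sits behind the API. The write path and the read path are separate classes, neither keeps state of its own, and both depend only on the storage contract.

```python
from typing import Dict, List, Protocol


class SpanStore(Protocol):
    """Hypothetical storage contract shared by the write and read paths."""

    def append(self, trace_id: str, span: str) -> None: ...

    def spans(self, trace_id: str) -> List[str]: ...


class InMemoryStore:
    """One possible backend; any class with the same methods can replace it."""

    def __init__(self) -> None:
        self._rows: Dict[str, List[str]] = {}

    def append(self, trace_id: str, span: str) -> None:
        self._rows.setdefault(trace_id, []).append(span)

    def spans(self, trace_id: str) -> List[str]:
        return list(self._rows.get(trace_id, []))


class IngestService:
    """Write path: keeps no state of its own, so replicas can be added freely."""

    def __init__(self, store: SpanStore) -> None:
        self._store = store

    def ingest(self, trace_id: str, span: str) -> None:
        self._store.append(trace_id, span)


class QueryApi:
    """Read path: callers see this API, never the datastore behind it."""

    def __init__(self, store: SpanStore) -> None:
        self._store = store

    def get_trace(self, trace_id: str) -> Dict[str, object]:
        spans = self._store.spans(trace_id)
        return {"trace_id": trace_id, "span_count": len(spans), "spans": spans}


store = InMemoryStore()
IngestService(store).ingest("t1", "GET /orders")
IngestService(store).ingest("t1", "SELECT orders")
result = QueryApi(store).get_trace("t1")
```

Swapping `InMemoryStore` for another backend would leave `QueryApi.get_trace` and its response shape untouched, which is the property the API bullet describes.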

The system is secure

Retrospection: This worked decently well.

CI/CD

We leverage CircleCI's Context feature to inject different environment variables per Context (staging vs production), and use its “approval” feature to promote staging to production.

Note that the Docker image is the same after the build. The differences among dev, staging and production are applied at deployment time through config. This ensures that the exact same build can be promoted from staging to production.
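On the application side, "same image, different config" usually means the code reads its settings from the environment at startup. The Python sketch below illustrates that idea only. The variable names (`APP_ENV`, `DB_URL`, `LOG_LEVEL`) and the URLs are hypothetical, not the ones we use.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    environment: str
    db_url: str
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read deploy-time settings; nothing here is baked into the image."""
    return Settings(
        environment=env["APP_ENV"],             # hypothetical variable name
        db_url=env["DB_URL"],                   # hypothetical variable name
        log_level=env.get("LOG_LEVEL", "INFO"),  # optional, defaults to INFO
    )


# The same code (and therefore the same image) under two deployments.
staging = load_settings({"APP_ENV": "staging", "DB_URL": "db://staging-host/app"})
production = load_settings(
    {"APP_ENV": "production", "DB_URL": "db://prod-host/app", "LOG_LEVEL": "WARN"}
)
```

Because a missing required variable raises `KeyError` at startup, a misconfigured deployment fails immediately instead of running with the wrong settings.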

Retrospection: I believe we did an outstanding job here. Our CI/CD is clean, efficient, and easy to understand, operate and extend. The integration among GKE, GitHub and CircleCI is seamless. This is far better than any other CI/CD I have worked with before: Jenkins, TeamCity etc.

I will write another article to share more details.

Monitoring/Alerts

It is worth noting that we moved to DataDog from New Relic for two major reasons:

  • Cost: NR pricing is CPU based. Since we do not have a large number of metrics but do have a high CPU count in our K8s cluster, the cost is too high in comparison with DataDog. DD charges about $18 a host, no matter how many CPUs the host has.
  • Retention: somehow, our NR data retention was only 1 or 2 days, which is too short. DataDog allows a year.
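The cost argument is simple arithmetic, sketched below in Python. Only the roughly $18 a host figure comes from the text above. The cluster size, node size and per-CPU price are placeholders chosen for illustration and are not real vendor prices.

```python
def per_host_cost(hosts: int, price_per_host: float) -> float:
    """Bill that depends only on the number of hosts."""
    return hosts * price_per_host


def per_cpu_cost(hosts: int, cpus_per_host: int, price_per_cpu: float) -> float:
    """Bill that grows with every CPU in the cluster."""
    return hosts * cpus_per_host * price_per_cpu


HOSTS = 10             # hypothetical cluster size
CPUS_PER_HOST = 16     # hypothetical node size
PRICE_PER_HOST = 18.0  # the roughly $18 a host quoted above
PRICE_PER_CPU = 5.0    # placeholder only, not a real vendor price

host_bill = per_host_cost(HOSTS, PRICE_PER_HOST)
cpu_bill = per_cpu_cost(HOSTS, CPUS_PER_HOST, PRICE_PER_CPU)
# Moving to nodes with twice the CPUs doubles the per-CPU bill,
# while the per-host bill stays the same.
doubled_cpu_bill = per_cpu_cost(HOSTS, CPUS_PER_HOST * 2, PRICE_PER_CPU)
```

Whatever the real per-CPU rate is, the shape is the same: with a few large nodes, a per-host bill stays flat as nodes get bigger, while a per-CPU bill scales with them.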

Retrospection: We are very happy with the DataDog choice. Its pricing model is very friendly to a K8s system. It has proven itself under both regular data volume and performance testing.

Integration with Slack makes life easy.

Cost Efficiency

Architecture V1 has enough room to evolve

It turned out that the changes naturally fit into the Read path, the Write path, or both. The architecture stays clean and strong.

How long did it take to reach Architecture V1

Conclusion

I would love to hear about your experience and thoughts.

VP Engineering