Architecture V1 in an Early Stage Startup

Tao Wang
Feb 26, 2021

Although there are tons of articles on software architecture on the internet, they are mostly from late stage startups or public companies. At Trace Data, a seed stage startup, we built our product from the ground up over the past 15 months. In this article, I would like to share our experience and a few lessons learned from forming architecture V1 at an early stage startup. I hope it is an interesting read for entrepreneurs, CTOs, architects, and other engineering leaders facing similar challenges.

(Note, the architecture decisions were mostly made by Mark Lu and me together.)

What is Architecture V1

At the beginning, there is nothing. When architecture V1 is in place, we should have the following:

  • Infrastructure platform and programming languages are decided.
  • Major components of the system are clearly identified and put into place.
  • The engineering team can independently develop any component without worrying about stepping on each other's toes.
  • The system is secure.
  • The system is scalable for all major business use cases, commonly including a Read/Query and a Write path.
  • The workflow to publish code, review, merge, build, test, deploy, roll back, and promote staging to production is streamlined.
  • Monitoring and alerts are set up.
  • The cost of running the above system is efficient.
  • The architecture has enough room to evolve, such as introducing new business use cases or adding new MicroServices. In short, architecture V1 should be built to run for at least 1 year without urgency for a V2.

We will discuss the choices one by one.

What did we have to start?

A prototype of the product was developed for Trace Data by a contractor company on Heroku between August and September 2019. I joined Trace Data in September 2019 and Mark joined in October 2019. We then started to think about and work on Architecture V1.

It was very basic and far from a V1.

Choices, Decisions and Retrospection

Platform: AWS vs GCP

We chose GCP because we had more expertise with it on the team at that time. The reasoning was velocity.

Retrospection: I think we made a great choice. GCP is scalable, reliable, and cost efficient. I have heard people complain that GCP is hard to use, but we did not notice anything major.

Platform: Serverless vs K8s

We were quite interested in choosing Serverless for its benefits. There were a few hurdles, though:

  • Java is not a good language for Serverless because of slow cold starts, yet the majority of our engineers are Java developers. JS or GoLang would take time for the team to ramp up on.
  • The team has tremendous knowledge of operating MicroServices at scale. Picking Serverless gives up that strength.
  • Our data ingestion pipeline would have complex logic to process, aggregate, and join data. We would need local caching and access to a remote cache, which conflicts with Serverless's standalone nature.
  • Based on our (admittedly not thorough) research, Serverless may not be as performant as a regular Java app.
  • If we chose Serverless, we would have to make sure it was portable from GCP Cloud Functions to AWS Lambda.
  • Testing would be different, and so would deployment. It would mean moving into a completely new world.

Mark recommended K8s instead. It quickly won our hearts. K8s has a few advantages that we love:

  • Cloud independent. We can easily port our system from GKE to EKS or AKS, thanks to K8s.
  • Every service is dockerized, which means we can build each piece with the language that we choose: Java, GoLang, React.js… A lot of flexibility.
  • The concepts of services, pods, load balancing, and auto scaling are very familiar to us.
  • K8s is supported by Google, is gaining strong momentum in the industry, and is battle tested.
  • The system is also more compact to operate than traditional EC2 + load balancer style MicroServices.

Retrospection: I think we made a great decision. K8s probably improved our productivity by at least 20% to 30%. A great call by Mark.

Serverless might have worked out too, but I am happy with our K8s choice.

Programming Language: consolidate to one?

On the backend side, we were using GoLang. With many Java developers on the team, the question arose whether we should use both GoLang and Java or consolidate into one language: GoLang.

I argued that if we did that, the team would probably lose 30% of its productivity, if not more, by dropping Java and learning GoLang. It did not sound like the right choice to me. Our CEO, James, agreed.

Eventually, we settled on GoLang for the agent and tracer, basically the apps that are deployed on the customers' side, and Java for the data ingestion, query, and API layers on our SaaS platform side.

The UI is built on Node.js, React, and TypeScript, and it stays that way.

Retrospection: I think we made a great decision. We fully leveraged the team's expertise in Java. This became even clearer when another principal engineer, also a strong Java developer, joined the team in 2020.

Separate the major components

We separated the Read (Query) path from the Write path, following the CQRS pattern.

In reality, there are many other components, such as a caching layer, that I purposely skipped. For the above, the focus is the following (a small sketch of the idea follows the list):

  • Read and write are separated.
  • Each component can scale out horizontally.
  • The DB is configured for HA.
  • The API layer hides the actual implementation. For instance, we can replace the datastore without changing APIs.
  • Most components are stateless by design.
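
To make the separation concrete, here is a minimal TypeScript sketch of the idea. It is an illustration only, not our actual code: the names (WritePath, ReadPath, Finding) and the in-memory map standing in for the datastore are assumptions made for the example.

```typescript
// Hypothetical sketch of a read/write split behind an API layer.
import express from "express";

interface Finding {
  id: string;
  apiEndpoint: string;
  riskScore: number;
}

// Write path: ingestion, stateless, can scale out horizontally.
class WritePath {
  constructor(private store: Map<string, Finding>) {}
  ingest(finding: Finding): void {
    // In reality this would publish to a queue or write to the datastore.
    this.store.set(finding.id, finding);
  }
}

// Read path: query/serving, also stateless, scaled independently.
class ReadPath {
  constructor(private store: Map<string, Finding>) {}
  query(minRisk: number): Finding[] {
    return [...this.store.values()].filter((f) => f.riskScore >= minRisk);
  }
}

const store = new Map<string, Finding>(); // stand-in for the HA datastore
const writePath = new WritePath(store);
const readPath = new ReadPath(store);

const app = express();
app.use(express.json());

// The API layer hides the implementation: the datastore can be swapped
// without changing these routes.
app.post("/ingest", (req, res) => {
  writePath.ingest(req.body as Finding);
  res.sendStatus(202);
});
app.get("/findings", (req, res) => {
  res.json(readPath.query(Number(req.query.minRisk ?? 0)));
});

app.listen(8080);
```

In the real system the two paths are separate, mostly stateless services that scale independently, with the HA datastore sitting behind the API layer.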

The system is secure

There is nothing too fancy here. We use Auth0 to sign users up and log them in, use an apiKey to validate incoming data on the ingestion side, and force HTTPS communication. We also built simple RBAC with three roles: admin, user, and viewer. This is enforced at each API level inside Node.js for user access permissions.
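
As an illustration, here is a minimal sketch of what per-API role enforcement can look like in a Node.js/Express app. The three roles match the ones above; the middleware, route paths, and the assumption that an earlier auth step has attached the caller's role are hypothetical, not our exact implementation.

```typescript
// Minimal sketch of per-API RBAC enforcement in an Express app.
import express, { Request, Response, NextFunction } from "express";

type Role = "admin" | "user" | "viewer";

// Assume an earlier auth middleware (e.g. one validating the Auth0 token)
// has attached the caller's role to the request.
interface AuthedRequest extends Request {
  role?: Role;
}

// Returns a middleware that only lets the listed roles through.
function requireRole(...allowed: Role[]) {
  return (req: AuthedRequest, res: Response, next: NextFunction) => {
    if (req.role && allowed.includes(req.role)) {
      return next();
    }
    res.status(403).json({ error: "forbidden" });
  };
}

const app = express();

// Viewers (and above) can read; only admins can change settings.
app.get("/api/findings", requireRole("admin", "user", "viewer"), (_req, res) => {
  res.json([]);
});
app.post("/api/settings", requireRole("admin"), (_req, res) => {
  res.sendStatus(204);
});

app.listen(3000);
```

Auth0 handles sign-up and login, and the apiKey check on the ingestion side is a separate, simpler validation that is not shown here.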

Retrospection: This worked decently well.

CI/CD

We leverage CircleCI for CI/CD. It has a nice module for building and publishing images to GCR, the Google Container Registry.

We leverage its Contexts feature to inject different env variables based on context (staging vs production), and use its "approval" feature for promoting staging to production.

Note that the Docker image is the same after the build. The difference between dev, staging, and production is applied during deployment through config. This ensures that the exact same build can be promoted from staging to production.
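
From the application's point of view, the pattern looks roughly like the sketch below: the code only reads environment variables, so the same image behaves correctly in staging and production depending on what the deployment injects. The variable names here are hypothetical, not our actual config keys.

```typescript
// Sketch: the same built image differs across environments only via
// injected env vars; nothing environment-specific is baked into the build.
interface AppConfig {
  env: string;      // e.g. "staging" or "production", set at deploy time
  dbHost: string;
  logLevel: string;
}

function loadConfig(): AppConfig {
  return {
    env: process.env.APP_ENV ?? "staging",
    dbHost: process.env.DB_HOST ?? "localhost",
    logLevel: process.env.LOG_LEVEL ?? "info",
  };
}

const config = loadConfig();
console.log(`starting in ${config.env}, db=${config.dbHost}`);
```

CircleCI Contexts supply these values for staging and production, so the promoted image never needs to be rebuilt.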

Retrospection: I believe we did an outstanding job here. Our CI/CD is clean, efficient, and easy to understand, operate, and extend. The integration among GKE, GitHub, and CircleCI is seamless. This is far better than any other CI/CD I have worked with before: Jenkins, TeamCity, etc.

I will write another article to share more details.

Monitoring/Alerts

We use DataDog, for infrastructure monitoring only, no APM. We also use the DataDog integration with GKE, which has nice out-of-the-box dashboards. DataDog alerts connect to both Slack and PagerDuty.

It is worth noting that we moved to DataDog from New Relic for two major reasons:

  • Cost: NR pricing is CPU based. Since we do not have a large number of metrics but do have many CPUs in our K8s cluster, the cost was too high compared with DataDog. DD charges about $18 per host, no matter how many CPUs it has.
  • Retention: somehow, our NR plan only allowed 1 to 2 days of data retention, which is too short. DataDog allows a year.

Retrospection: Very happy with the DataDog choice. Its pricing model is very friendly to a K8s system. It has proven itself under both regular data volume and performance testing.

The integration with Slack makes life easy.

Cost Efficiency

Retrospection: I am quite pleased with our system running on GKE. Cost is predictable and proportional to CPU usage. In the past 15 months, I have not had any surprises on billing.

Architecture V1 has enough room to evolve

Retrospection: We have achieved this goal very well. In the past year, the system has evolved a lot. We added new functionality for policies, events, API detection, spec drift, risk scoring, anomaly detection, and more. We switched to a new datastore. We added quite a few new MicroServices.

It turned out that the changes naturally fit into the Read path, the Write path, or both. The architecture stays clean and strong.

How long did it take to reach Architecture V1

Two months. Most of the work happened in November and December 2019. Thanks to Mark, who is a superstar and built out the majority of this work in his first two months at Trace Data.

Conclusion

Overall, I think we did a very good job of forming architecture V1 at Trace Data from the ground up. I hope the decision making process and retrospective thoughts are helpful to other engineering leaders when they face similar challenges.

I would love to hear your experience and thoughts.
