Architecture V1 in an Early Stage Startup

Although there are tons of articles on software architecture on the internet, most come from late-stage startups or public companies. At Trace Data, a seed-stage startup, we built our product from the ground up over the past 15 months. In this article, I would like to share our experience and a few learnings from forming architecture V1 at an early-stage startup. I hope it is an interesting read for entrepreneurs, CTOs, architects, and other engineering leaders facing similar challenges.

(Note: the architecture decisions were mostly made by Mark Lu and me together.)

What is Architecture V1

At the beginning, there is nothing. When architecture V1 is in place, we should have the following

We will discuss the choices one by one.

What did we have to start?

A prototype of the product was developed by a contractor company for Trace Data on Heroku between August and September 2019. I joined Trace Data in September 2019 and Mark joined in October 2019. We then started to think about and work on Architecture V1.

As the picture shows, the prototype was very basic and far from a V1.

Choices, Decisions and Retrospection

Platform: AWS vs GCP

We chose GCP because the team had more expertise with it at the time. The reasoning was about velocity.

Retrospection: I think we made a great choice. GCP is scalable, reliable, and cost efficient. I have heard people complain that GCP is hard to use, but we did not notice anything major.

Platform: Serverless vs K8s

We were quite interested in choosing Serverless for its benefits. There were a few hurdles though

Mark recommended K8s instead. It quickly won our hearts. K8s has a few advantages that we love.

Retrospection: I think we made a great decision. K8s probably improved our productivity by at least 20% to 30%. Mark made a great choice.

Serverless might have worked out too. But I am happy with our K8s choice.

Programming Language: consolidate to one?

On the backend side, we were using GoLang. With many Java developers in town, the question arose whether we should use both GoLang and Java or consolidate into one language, GoLang.

I argued that consolidating would probably cost the team 30% of its productivity, if not more, because developers would have to drop Java and learn GoLang. It did not sound like the right choice to me. Our CEO, James, agreed.

In the end, we use GoLang to develop the agent and tracer, basically the apps that are deployed on the customer's side. We use Java for the data ingestion, query, and API layers on our SaaS platform side.

The UI is built on Node.js, React, and TypeScript. It stays that way.

Retrospection: I think we made a great decision. We fully leveraged the team's expertise in Java. This became even clearer when another principal engineer, also a strong Java developer, joined the team in 2020.

Separate the major components

We separated the system into Read (Query) and Write paths, following the CQRS (Command Query Responsibility Segregation) pattern.
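To make the read/write split concrete, here is a minimal sketch of the idea. It is written in TypeScript for brevity, although our ingestion and query layers are in Java. The names `TraceEvent`, `IngestionService`, and `QueryService` are hypothetical and only illustrate the pattern; an in-memory array stands in for the datastore.

```typescript
// Hypothetical event shape; the real ingestion schema is not described here.
interface TraceEvent {
  apiPath: string;
  statusCode: number;
}

// Write path: accepts incoming data and appends it to storage.
class IngestionService {
  private readonly store: TraceEvent[];

  constructor(store: TraceEvent[]) {
    this.store = store;
  }

  ingest(event: TraceEvent): void {
    this.store.push(event);
  }
}

// Read path: answers queries and never mutates storage.
class QueryService {
  private readonly store: readonly TraceEvent[];

  constructor(store: readonly TraceEvent[]) {
    this.store = store;
  }

  countByPath(apiPath: string): number {
    return this.store.filter((e) => e.apiPath === apiPath).length;
  }
}

// A shared in-memory array stands in for the datastore (an assumption).
const eventStore: TraceEvent[] = [];
const writer = new IngestionService(eventStore);
const reader = new QueryService(eventStore);

writer.ingest({ apiPath: "/orders", statusCode: 200 });
writer.ingest({ apiPath: "/orders", statusCode: 500 });
writer.ingest({ apiPath: "/users", statusCode: 200 });

const orderCount = reader.countByPath("/orders");
```

Because the two services share nothing but the store, each side can be scaled, deployed, and changed on its own, which is the main reason to separate them.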

In reality, there are many other components, such as a caching layer, that I purposely skipped. For the above, the focus should be

The system is secure

There is nothing too fancy here. We use Auth0 to sign up and log in users, use an apiKey to validate incoming data on the ingestion side, and force HTTPS communication. We also built simple RBAC with three roles: admin, user, and viewer. User access permissions are enforced at the API level inside Node.js.
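As an illustration only, a role check and an apiKey check of this kind could look like the sketch below. The three roles come from the text above; the `Action` names, the role-to-action mapping, and the function names `isAllowed` and `isValidApiKey` are assumptions made for the example, not our actual permission model.

```typescript
type Role = "admin" | "user" | "viewer";

// Hypothetical actions; the real permission set is not described here.
type Action = "read" | "write" | "manage";

// Assumed mapping: each role is granted a fixed list of actions.
const PERMISSIONS: Record<Role, readonly Action[]> = {
  admin: ["read", "write", "manage"],
  user: ["read", "write"],
  viewer: ["read"],
};

// Called at the start of each API handler to enforce access.
function isAllowed(role: Role, action: Action): boolean {
  return PERMISSIONS[role].indexOf(action) !== -1;
}

// Ingestion-side check: reject data that does not carry a known apiKey.
function isValidApiKey(
  presented: string | undefined,
  knownKeys: ReadonlySet<string>
): boolean {
  return presented !== undefined && knownKeys.has(presented);
}

const viewerCanWrite = isAllowed("viewer", "write");
const adminCanManage = isAllowed("admin", "manage");
```

Keeping the mapping in one table means that adding a role or an action is a one-line change, and every API handler picks it up through the same check.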

Retrospection: This worked decently well.

CI/CD

We leverage CircleCI for CI/CD. It has a nice module for building and publishing to GCR (Google Container Registry).

We use its Context feature to inject different environment variables for staging and production, and its “approval” feature to promote staging to production.

Note that the Docker image is the same after the build. The differences among dev, staging, and production are applied at deployment time through config. This ensures that the exact same build can be promoted from staging to production.

Retrospection: I believe we did an outstanding job here. We have CI/CD that is clean, efficient, and easy to understand, operate, and extend. The integration among GKE, GitHub, and CircleCI is seamless. This is far better than any other CI/CD I have worked with before, such as Jenkins and TeamCity.
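The "same image, different config" idea can be sketched as follows. This is a hedged illustration, not our actual code: the variable names `APP_ENV`, `API_BASE_URL`, and `LOG_LEVEL`, the `AppConfig` shape, and the example URLs are all hypothetical. The point is that the application reads everything environment-specific at startup, so nothing is baked into the image.

```typescript
// Hypothetical config shape; the real settings are not described here.
interface AppConfig {
  environment: string;
  apiBaseUrl: string;
  logLevel: string;
}

// `env` is a plain key/value map; in Node.js it would typically be process.env.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const apiBaseUrl = env["API_BASE_URL"];
  if (apiBaseUrl === undefined) {
    // Fail fast: a missing value is a deployment mistake, not a build issue.
    throw new Error("API_BASE_URL must be set at deploy time");
  }
  return {
    environment: env["APP_ENV"] ?? "dev",
    apiBaseUrl,
    logLevel: env["LOG_LEVEL"] ?? "info",
  };
}

// The same code (and so the same image) yields different behavior per deployment.
const stagingConfig = loadConfig({
  APP_ENV: "staging",
  API_BASE_URL: "https://staging.example.invalid",
});
const productionConfig = loadConfig({
  APP_ENV: "production",
  API_BASE_URL: "https://api.example.invalid",
  LOG_LEVEL: "warn",
});
```

With this shape, promoting a build from staging to production changes only the injected variables, never the artifact itself.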

I will write another article to share more details.

Monitoring/Alerts

We use DataDog for infrastructure monitoring only, with no APM. We also use the DataDog integration with GKE, which has nice out-of-the-box dashboards. DataDog alerts connect to both Slack and PagerDuty.

It is worth noting that we moved to DataDog from New Relic for two major reasons

Retrospection: I am very happy with the DataDog choice. Its pricing model is very friendly to K8s systems. It has proven itself under both regular data volume and performance testing.

Integration with Slack makes life easy.

Cost Efficiency

Retrospection: I am quite pleased with our system running on GKE. Cost is predictable and proportional to CPU usage. In the past 15 months, I have not had any billing surprises.

Architecture V1 has enough room to evolve

Retrospection: We have achieved this goal very well. In the past year, the system has evolved a lot. We added new functionality for policy, events, API detection, spec drift, risk scoring, anomaly detection, and more. We switched to a new datastore. We added quite a few new microservices.

It turned out that the changes fit naturally into the Read path, the Write path, or both. The architecture stays clean and strong.

How long did it take to reach Architecture V1

Two months. Most of the work happened in November and December 2019. Thanks go to Mark, a superstar who built out the majority of it in his first two months at Trace Data.

Conclusion

Overall, I think we did a very good job of forming architecture V1 at Trace Data from the ground up. I hope the decision-making process and retrospective thoughts are helpful to other engineering leaders when they face similar challenges.

I would love to hear your experiences and thoughts.

VP Engineering