Profile photo for Robin Verlangen

I have recently developed a comparable system, so I won't be able to disclose all of the details. However, I think it all boils down to these key points:
- do everything you have to do at the right moment (for example: if you need unique users per day in a weekly report, when do you count them? Do you need geo lookups in real time?)
- pick state-of-the-art, proven technologies that are designed to scale almost limitlessly
- make sure there's no single point of failure: not only for high availability, but also for high throughput
- make sure your team is experienced in writing this kind of application. Even if you pick the right technology stack, you can still fail if your team misuses it.

EDIT: it might also be interesting to read this article about a very large RTB platform: API Rearchitecture Series - The Juicy Details

Profile photo for Dave Voorhis

The only thing you can say about big tech companies — and, for that matter, small tech companies — is that no two are alike.

Furthermore, no two projects within a given company are necessarily alike.

With that in mind, my experience of database design in companies — big, small, tech, non-tech — is that it’s utterly inconsistent, and varies from project to project.

Sometimes it’s done by DBAs.

Sometimes it’s done by Data Architects / Data Analysts or other specialists in data management.

Sometimes it’s almost entirely ad hoc, done by developers on a project as and when they need storage. (Developer A: “I need to add a column to the flergle table to handle zplots.” Developer B: “Dude, I think that violates 3NF.” Developer C: “Who is Trey Inif? Is he the overseas delivery manager?”)

Sometimes it’s some combination of the above, either before development or via review afterward.

Sometimes it’s something else entirely. (“Get Simon in Accounting to look at it… He’s really good with spreadsheets, and databases are just spreadsheets.”)

Etc.

Profile photo for Assistant
Assistant

Designing a database for an ad server involves creating a schema that efficiently tracks key metrics such as clicks, impressions, and conversions. Below is a typical design and an outline for a cron job to aggregate data by hour.

Database Design

1. Tables Overview

You might consider the following tables:

  • Ads
      • ad_id (Primary Key)
      • ad_content (Text or URL)
      • created_at (Timestamp)
      • updated_at (Timestamp)
  • Campaigns
      • campaign_id (Primary Key)
      • campaign_name (String)
      • start_date (Date)
      • end_date (Date)
      • budget (Decimal)
  • Impressions
      • impression_id (Primary Key)
      • ad_id (Foreign Key)
      • timestamp (Timestamp)
      • user_id (Optional, Foreign Key for tracking users)
      • device (String)
  • Clicks
      • click_id (Primary Key)
      • ad_id (Foreign Key)
      • timestamp (Timestamp)
      • user_id (Optional, Foreign Key)
      • device (String)
  • Conversions
      • conversion_id (Primary Key)
      • ad_id (Foreign Key)
      • timestamp (Timestamp)
      • user_id (Optional, Foreign Key)
      • value (Decimal, for tracking conversion value)

2. Relationships

  • Each ad can belong to one or more campaigns.
  • Each impression, click, and conversion is associated with a specific ad.
  • You can optionally track users and devices for more granular data.

Aggregation Strategy

To aggregate data by hour, you can create a separate table to store aggregated metrics or run queries on the existing tables. Here’s how you can approach it:

1. Aggregated Metrics Table

You might create an hourly_metrics table:

  • hourly_metrics
      • metric_id (Primary Key)
      • ad_id (Foreign Key)
      • hour (Timestamp)
      • impressions_count (Integer)
      • clicks_count (Integer)
      • conversions_count (Integer)
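
If you are using SQLite, as in the cron example below, the aggregation table sketched above might be created along these lines. This is a minimal sketch, not part of the original design: the UNIQUE constraint on (ad_id, hour) is an assumption that lets the hourly job be re-run without inserting duplicate rows.

import sqlite3

# Hypothetical DDL for the hourly_metrics table described above.
HOURLY_METRICS_DDL = """
CREATE TABLE IF NOT EXISTS hourly_metrics (
    metric_id         INTEGER PRIMARY KEY AUTOINCREMENT,
    ad_id             INTEGER NOT NULL REFERENCES Ads(ad_id),
    hour              TIMESTAMP NOT NULL,
    impressions_count INTEGER NOT NULL DEFAULT 0,
    clicks_count      INTEGER NOT NULL DEFAULT 0,
    conversions_count INTEGER NOT NULL DEFAULT 0,
    UNIQUE (ad_id, hour)  -- assumption: one row per ad per hour
);
"""

if __name__ == "__main__":
    conn = sqlite3.connect("ads_database.db")  # same example database as the script below
    conn.executescript(HOURLY_METRICS_DDL)
    conn.close()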

2. Cron Job Implementation

You can use a cron job to run an aggregation script every hour. Below is an example in Python using SQLite (swap in your own database driver as needed):

import sqlite3
from datetime import datetime, timedelta

def aggregate_metrics():
    conn = sqlite3.connect('ads_database.db')  # Replace with your database connection
    cursor = conn.cursor()

    # Get the current hour and the previous hour
    current_hour = datetime.now().replace(minute=0, second=0, microsecond=0)
    previous_hour = current_hour - timedelta(hours=1)

    # Aggregate data from the Impressions, Clicks, and Conversions tables for the
    # hour that just ended (previous_hour <= timestamp < current_hour)
    cursor.execute('''
        INSERT INTO hourly_metrics (ad_id, hour, impressions_count, clicks_count, conversions_count)
        SELECT
            Ads.ad_id,
            ? AS hour,
            COUNT(DISTINCT impression_id) AS impressions_count,
            COUNT(DISTINCT click_id) AS clicks_count,
            COUNT(DISTINCT conversion_id) AS conversions_count
        FROM
            Ads
            LEFT JOIN Impressions ON Ads.ad_id = Impressions.ad_id
                AND Impressions.timestamp >= ? AND Impressions.timestamp < ?
            LEFT JOIN Clicks ON Ads.ad_id = Clicks.ad_id
                AND Clicks.timestamp >= ? AND Clicks.timestamp < ?
            LEFT JOIN Conversions ON Ads.ad_id = Conversions.ad_id
                AND Conversions.timestamp >= ? AND Conversions.timestamp < ?
        GROUP BY Ads.ad_id
    ''', (previous_hour,
          previous_hour, current_hour,
          previous_hour, current_hour,
          previous_hour, current_hour))

    conn.commit()
    conn.close()

# Schedule this function to run every hour (e.g., via the cron entry below)
if __name__ == '__main__':
    aggregate_metrics()

Cron Job Setup

To set up the cron job, you would edit your crontab file by running crontab -e and adding a line like this:

0 * * * * /usr/bin/python3 /path/to/your/script.py

This runs the script at the start of every hour.

Conclusion

This design allows you to efficiently track and aggregate ad performance metrics. The cron job ensures that your data is regularly updated, enabling real-time analytics and reporting. Adjust the database schema and script based on your specific requirements and the database system you are using.

Profile photo for Benjamin Ross

The other answers all involve database CRUD transactions, which, for tight time-sensitive operations like serving ads, create a bottleneck that cannot be tolerated when competing for ad space for billions of customers at any given instant.

The best solution is for each ad server to already have a cache of total click counts available for every ad it serves at bid-time.

One approach is to maintain an in-memory local count on each ad server, and increment that count in real-time as the ad is served. This local count necessitates sharing count updates between hosts that serve the same ads. With thousands of ad-servers, cross-communication between hosts won’t scale. An aggregator system can be put in place to receive local counts from the ad servers, aggregate a global count, and make this global count queryable.

The Back End

To receive the local counts from the ad servers, any type of synchronous data transfer from ad server to ad aggregator would result in many 503 errors and the like. Instead, a system like Apache Kafka can receive, fan out, and stream this huge amount of data to all the aggregator hosts.

The API

Each ad server queries the API every so often to re-sync its local count with the newly updated global count. Ad clicks happen at a very high rate across all ads, but remain low per ad. The re-sync interval can be tuned to trade off between highly accurate counts and keeping the load on the aggregator hosts low.
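
As a rough illustration of that re-sync loop, here is a minimal sketch of the ad-server side. The aggregator endpoint, its JSON response shape, and the 30-second interval are assumptions for the example, not details from the original answer.

import threading
import time

import requests  # assumed HTTP client for talking to the aggregator API

AGGREGATOR_URL = "http://ad-aggregator.internal/counts"  # hypothetical endpoint

class LocalClickCounter:
    """Per-host click counts that drift between periodic global re-syncs."""

    def __init__(self, resync_seconds=30):
        self._lock = threading.Lock()
        self._local = {}    # ad_id -> clicks observed locally since the last sync
        self._global = {}   # ad_id -> last known global count from the aggregator
        self._resync_seconds = resync_seconds

    def record_click(self, ad_id):
        # Called on the serving path; purely in-memory, no database write.
        with self._lock:
            self._local[ad_id] = self._local.get(ad_id, 0) + 1

    def count_for(self, ad_id):
        # Bid-time read: global baseline plus whatever we have seen locally.
        with self._lock:
            return self._global.get(ad_id, 0) + self._local.get(ad_id, 0)

    def resync_forever(self):
        # Background loop: pull the aggregator's global counts and reset local drift.
        while True:
            fresh = requests.get(AGGREGATOR_URL, timeout=5).json()  # assumed shape: {ad_id: count}
            with self._lock:
                self._global = dict(fresh)
                self._local.clear()
            time.sleep(self._resync_seconds)

counter = LocalClickCounter()
threading.Thread(target=counter.resync_forever, daemon=True).start()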

Black Friday Problem

During times of high traffic a system like this will need to scale. Increasing the number of ad servers will increase the load on the aggregator hosts, and increasing the number of aggregator hosts will increase the fan-out load on the Kafka cluster. With all these moving pieces, there are a lot of places this system can break at the most inopportune times. Companies like Google, Amazon and Facebook spend billions collecting information about what their users like, what websites they visit, who their friends are, and what they buy (and spend even more keeping this information secure). This is all to support their targeted ads business, which brings that investment back 100-fold. Smart decisions must be made to meet scale demand with highly accurate click-counts; outages in the ads business result in extremely high loss of revenue and angry CEOs.

Profile photo for David Lewis

Designing real-time distributed counters for ad clicks involves ensuring that the system can handle a high volume of data with minimal latency and high availability. Here's how you can approach this using the keyword TBP (which we'll interpret as Topics, Brokers, and Partitions, often related to distributed messaging systems like Kafka):

1. **Topics:**

- Create dedicated Kafka topics for ad click events. Each event represents a click on an ad and includes metadata such as the ad ID, user ID, timestamp, etc.

- Design the topic schema to ensure that it can handle the necessary attributes for tracking and counting clicks.

2. **Brokers:**

- Deploy multiple Kafka brokers to handle the load of click events. The brokers are responsible for receiving, storing, and transmitting the click events.

- Ensure brokers are well-distributed across different servers or data centers to provide fault tolerance and high availability.

3. **Partitions:**

- Partition the Kafka topics to distribute the load across multiple brokers. Partitions allow parallel processing of click events, enhancing throughput and reducing latency.

- Use a partitioning key, such as the ad ID, to ensure that all events related to a specific ad are sent to the same partition. This helps in maintaining the order of events for each ad and simplifies counting.

4. **Producers:**

- Implement producers that capture ad click events in real-time from various sources (e.g., websites, mobile apps) and send them to the Kafka topics.

- Optimize producers for high throughput and low latency to ensure that click events are ingested into the system without delay.
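
A minimal producer sketch tying points 3 and 4 together, using kafka-python as an example client; the topic name, broker addresses, and event fields are assumptions for illustration.

import json
from kafka import KafkaProducer  # kafka-python, used here as an example client

# Keying each click event by ad_id routes all events for one ad to the same
# partition, preserving per-ad ordering.
producer = KafkaProducer(
    bootstrap_servers=["kafka-broker-1:9092", "kafka-broker-2:9092"],
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_click(ad_id, user_id, timestamp_ms):
    event = {"ad_id": ad_id, "user_id": user_id, "timestamp": timestamp_ms}
    producer.send("ad-clicks", key=ad_id, value=event)

publish_click("ad-123", "user-456", 1700000000000)
producer.flush()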

5. **Consumers:**

- Develop consumers that read from the Kafka topics and aggregate the click counts. Consumers can use stream processing frameworks like Apache Flink, Apache Spark Streaming, or Kafka Streams.

- Consumers can maintain in-memory state or use external storage systems (e.g., Redis, Cassandra) to store intermediate counts and ensure persistence.
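
And a matching consumer sketch, again with kafka-python; here the counts live in a plain in-memory dict, whereas a production consumer would checkpoint its state to an external store such as Redis or Cassandra, as noted above.

import json
from collections import defaultdict

from kafka import KafkaConsumer  # kafka-python, used here as an example client

consumer = KafkaConsumer(
    "ad-clicks",                               # assumed topic name
    bootstrap_servers=["kafka-broker-1:9092"],
    group_id="click-counters",                 # consumer group for parallelism
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Running per-ad click counts; purely in-memory for the sake of the sketch.
click_counts = defaultdict(int)

for message in consumer:
    event = message.value
    click_counts[event["ad_id"]] += 1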

6. **State Management:**

- Use stateful processing to maintain the counts of ad clicks in real-time. Techniques such as windowing (e.g., tumbling windows, sliding windows) can be employed to aggregate counts over specific periods.

- Implement mechanisms for state checkpointing and recovery to handle failures and ensure data consistency.
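
For tumbling windows specifically, the bucketing logic is simple enough to sketch without a full stream-processing framework; the one-minute window size and the field names below are assumptions.

from collections import defaultdict

WINDOW_MS = 60_000  # one-minute tumbling windows (assumed size)

# (ad_id, window_start_ms) -> clicks observed in that window
windowed_counts = defaultdict(int)

def add_click(ad_id, timestamp_ms):
    # Each event falls into exactly one window: the one whose start is the
    # event timestamp rounded down to a window boundary.
    window_start = (timestamp_ms // WINDOW_MS) * WINDOW_MS
    windowed_counts[(ad_id, window_start)] += 1

add_click("ad-123", 1_700_000_000_000)
add_click("ad-123", 1_700_000_030_000)  # lands in the same one-minute window
add_click("ad-123", 1_700_000_061_000)  # lands in the next window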

7. **Scalability and Fault Tolerance:**

- Ensure the system is horizontally scalable by adding more brokers, partitions, and consumers as needed.

- Implement redundancy and replication at the Kafka broker level to handle broker failures. Consumers should be designed to handle reprocessing from the last known offset in case of failures.

8. **Monitoring and Alerting:**

- Set up monitoring for Kafka brokers, topics, and consumers. Tools like Prometheus, Grafana, and Kafka's own metrics can be used to track performance, latency, and throughput.

- Configure alerting to detect anomalies such as sudden drops in click counts or consumer lag, enabling quick response to potential issues.

By leveraging Kafka's Topics, Brokers, and Partitions (TBP), you can design a robust and scalable real-time distributed counter system for ad clicks that ensures high availability, low latency, and fault tolerance.

Profile photo for Peter Ode

Examples of really good database designs? That’s a very broad question. A more specific question might be in order. For accounting and operations? For a CMS (Content Management System)? For an EMR (Electronic Medical Records) system? Multi-tenant, such as a cloud based service where a single database might host many customers, each customer having their own WebCatalog and WebStore?

Technology limitations — A database design is also limited by the database technology used: RDBMS (Relational Database Management System), ODBMS (Object Database Management System), Hierarchical Db, Network Db, NoSQL in many flavors (key-value stores, document stores, search engines, graph db, and others).

RDBMS — Most developers are familiar with SQL (Structured Query Language) capable databases, most are RDBMS but SQL can also be used with several other types. SQL is a database specific language for performing queries, data inserts and updates. Most RDBMS databases also enable changing the shape of the database via SQL.

ODBMS — I’ll answer the question by referencing the king of database technologies: Object Database Management Systems (ODBMS) that are ACID compliant. ACID (atomicity, consistency, isolation, durability) is a computer science term describing a set of features for database transactions intended to guarantee data validity despite errors, power failures and other issues.

Most of the well known RDBMS systems, such as MySQL, Oracle, SQL Server, are ACID compliant. Most of the NoSQL databases are not. For critical business systems, ACID is mandatory.

Object databases are more flexible and performant — Object databases enable much more sophisticated and performant systems than what’s possible with Relational databases, especially as the database schema complexity increases. Also, developer productivity is greatly enhanced when an ODBMS is used.

For RDBMS designs, relationships are represented by joins between tables. Most common relationships are one-to-many, then many-to-many. For example, one Customer can have many SalesOrders. Typically such joins involve key field data that is maintained in three places: (1) parent table, (2) child table, and (3) in the index for the field to speed access to child records. In the example, the CustomerNumber would be in three places. Joins are expensive (in terms of computing resources), so relational designs attempt to minimize such relationships — often limiting the design in many ways. Because of such limitations, business app developers usually first try to design an optimal relational schema, then write the app code that reads/writes to this database.

Why is Object Persistence Faster? — In an ODBMS, your database schema is represented by your class hierarchy. An ODBMS adds database behavior to your classes so each instance of an object can save itself to persistent storage (your disk drive) without first having to write code that translates an object into rows and columns (required by an RDBMS). ODBMS systems integrate with your programming language with database aware Arrays, Collections, Dictionaries… The programmer writes code as if he had unlimited RAM memory. Objects are saved to disk as objects. If an object is not in memory, the ODBMS automatically de-references a pointer and directly brings that object into RAM memory. In a RDBMS multiple disk reads, first in the index, then in the table, are required to bring the database row into memory. Then the RDBMS programmer has to write code to reassemble the object from the flat row data.

Imagine if you had to disassemble your car on your driveway before storing it in the garage. And, reassemble your car in the morning before driving to work. That’s what you’re forced to do with relational databases. With an object database, all that is automatic and database performance is orders of magnitude faster.

Real-World Object Design Example — To answer your question, I’ll reference a multi-tenant eCommerce platform that my company built back in the late 1990’s — this system is still in operation today. Tenants include different types of online stores and cloud services — all using the same object database. The platform also had tenants with electronic medical record apps. The programming language / IDE is IBM Visual Age Smalltalk, a highly productive development system shown to be 3x more productive than C#, Python, PHP, Java or JavaScript (on Node).

Here are some unique characteristics of the system, made possible because of the ODBMS (rather than a RDBMS):

<> Instead of a Customer table and Vendor table, we have a LegalEntity object with direct references to a Customer object and Vendor object and ServiceProvider, Contact, Coach, Player, Physician, Patient and others. Since object oriented relationships are practically free, when compared to the RDBMS joins, we’re free to make such database schema designs (actually implemented in our class hierarchy).

One LegalEntity can be a Company, Customer, Vendor… as needed. If we need a new type, say a Customer that can rent a car, we just add a class that might be called RentalCustomer (maybe subclassed from Customer).

<> We use the same design pattern for Product. We have a generic Product object with attributes such as Code, Name, Cost, Images (a Collection of images), Price, AlternativePrices (a Collection of alternate prices for different types of customers, common in wholesale applications)… Then we have objects that can be directly referenced (with a one-to-one relationship) to provide specialized product data and behaviors such as: SerialNumberedProduct (use for items that must track serial numbers), RentalProduct, HourlyService, Subscription, and others.

Products are easy to extend, we can re-use most of our existing code, performance is stellar.
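
A rough, language-neutral sketch of the role pattern described above, written here in Python purely for illustration (the original system is Smalltalk on an ODBMS, and the class and attribute names below are made up, not taken from it):

class LegalEntity:
    def __init__(self, name):
        self.name = name
        self.roles = {}                 # role name -> role object, direct references

    def add_role(self, role):
        self.roles[type(role).__name__] = role

class Customer:
    def __init__(self, credit_limit=0):
        self.credit_limit = credit_limit
        self.sales_orders = []          # direct references, no join table needed

class Vendor:
    def __init__(self, payment_terms="net 30"):
        self.payment_terms = payment_terms

class RentalCustomer(Customer):
    # Adding a new kind of customer is just a subclass, as described above.
    def __init__(self, credit_limit=0, license_no=None):
        super().__init__(credit_limit)
        self.license_no = license_no

# One legal entity can play several roles at once.
acme = LegalEntity("Acme Corp")
acme.add_role(Customer(credit_limit=50_000))
acme.add_role(Vendor(payment_terms="net 45"))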

The designer/programmer still needs to be concerned with database normalization and other best practices, but ODBMS systems have far fewer limitations for your database design — compared to RDBMS design considerations.

Some ODBMS platforms — Although many ODBMS systems support multiple object-oriented languages, typically Smalltalk, Java, C++, our favorite is Smalltalk. IBM Smalltalk has been spun off to Instantiations - VAST Platform and there’s VisualWorks (VisualWorks® Overview) — both used by Fortune 1000 Enterprises. A great open-source Smalltalk is Pharo (Phar.org) with several object database libraries available. I’ve used the ACID compliant OmniBase.

One of the very best ODBMS systems is Gemstone/S (Home) which supports Smalltalk and Java.

Here’s a link to some code to make a database connection; save an object instance; find an object; indexing; and garbage collection of unused instances in the database. sebastianconcept/Aggregate

Profile photo for Greg Kemnitz

My (short) experience at big tech seems to indicate that developers themselves do most of the database design.

Sometimes it’s good, sometimes it isn’t, and often the db is small enough that it doesn’t really matter (even in seemingly big companies that you’d normally think of as having huge dataworlds; not every database at “Big Tech” is measured in exabytes).

And yes, even at “Big Tech” I often find people doing stuff like fetching down most of a table and doing what amounts to joins in application code, often because developers “forget” that database query languages can do qualifying and filtering far better than your app can do in a “for loop”.
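
A minimal illustration of that point, using sqlite3 and made-up table and column names:

import sqlite3

conn = sqlite3.connect("example.db")  # made-up database and schema
cursor = conn.cursor()

# The anti-pattern: fetch the whole table, then filter in application code.
all_orders = cursor.execute("SELECT customer_id, total FROM orders").fetchall()
big_orders_app = [(cust, total) for (cust, total) in all_orders if total > 1000]

# Letting the database qualify and filter instead: far less data is moved,
# and the query planner can use an index on total.
big_orders_sql = cursor.execute(
    "SELECT customer_id, total FROM orders WHERE total > ?", (1000,)
).fetchall()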

Most people doing relational database design know enough to avoid normalization issues. Where I see problems is in a sort of misplaced cleverness: people who get excessively “cute” and think that a gross but “correct” query is a good thing to run dozens of times on a busy production system. You often have to work with them to simplify their query or break it apart to use temporary tables, etc.

I’ve worked with some developers doing schema design, and have had good experience in getting them to learn good methodologies. I don’t typically encounter database schemas until they’ve been deployed and are already sorta broken to the point where they’re causing production issues…

Profile photo for Bruce A McIntyre

The first step in designing a relational database is to understand what the desired output is going to be: what reports, queries, printouts, and forms are going to be required.
Then you need to understand how the different data elements of these outputs relate to each other.
Then you need to understand the structure of the data: which elements are dependent on others, what pieces are unique, and what pieces are related or dependent on others.
Then you need to figure out where this data is going to come from. Is it all new? Is some already somewhere else? Do users need to enter it?
Then you can look at defining "records" or tuples of data, and deciding which fields will be used to access each record (defining indexes).
Then you need to define the preferred size of each data element: how many characters for text, how many digits, logical data, raw or unstructured data.

Now you have the information needed to define the database. Create an initial database and fill it with test data so you can define the inputs and outputs (you may need lookups, data integrity checks, required data elements and optional data elements.)

Test your database to see if it meets the requirements as specified in the first steps. Then fix what doesn't work as expected.

Now if you are trying to create a different sort of database, one that is not relational, then there would be a very different path to get there.

Profile photo for Greg Kemnitz

This database isn't going to be big enough to "matter" much in needing weird performance hacks, so a straightforward ER-diagram-style normalized schema is fine.

Nowadays, any DB where tables aren't above O(10^7 recs) on a reasonably configured db server that isn't overloaded with extra apps doesn't need any weird performance hacks beyond just doing indexing correctly. If you get to 10^10 recs or more, you do, but unless every human on the planet is one of your contractors, this won't happen :)

You've got a reasonably normalized schema there - I'd probably go with Surbhi Chadha's suggestion on the id changes.

Profile photo for Gaëtan Gates Perrault

There are many DBs
To start, any "ad network" or "ad platform" is likely going to involve several different databases at several different parts of the pipeline. Think of the basic ad-serving pipeline; each of the following steps could easily represent a different type of DB:

  1. Identify ad requester (publisher) and load their data.
  2. Identify user requesting ad (cookies, uids) and load their data. (what have they seen? do they have demographics? etc.)
  3. Pass this data off to the optimization system and get a list of recommended ads to display.
  4. Load the data for the recommended ads and filter out ads as required (seen too often, blacklisted from site, out of budget, etc.)
  5. Render the ad with a pixel and write out the cookie.
  6. Process the pixel for the impression and eventually the click.
  7. Run fraud tracking on all of this stuff.
  8. Get real-time stats for your internal team.
  9. Get roll-up stats for your publishers and advertisers and support staff.


MongoDB is great for some of these, like #8 and possibly #1, #2, and #4. Fraud tracking (7), optimization (3), and roll-up stats (9) will all need some form of Map/Reduce system (like Hadoop; Mongo is insufficient here). To display roll-up stats, you probably want an SQL database, which makes it easy to slice data for basic reports.

The engine for #6 (pixels) is probably just a series of flat files; take a look at Google's Protocol Buffers for some ideas on how this data can be passed between servers.

So what part are you doing?
It's not 100% clear what part of this you are trying to do. Are you purely just farming ads around? Are you a network of networks? How tight are your timelines?

If I had to build an Ad Network with "hundreds of billions of reads and writes per day", I would start looking at what Google and Facebook are doing. Frankly, they may be the only people doing "ads" at that level. The internet has about a billion users, so if you show 100 billion ads per day, you're showing 100 ads to every user of the internet every day.

There is almost no Quora answer I can give you that can convey the complexity of making that happen.

Profile photo for Ben Darfler

I would recommend taking a look at CRDTs (Page on Psu, Page on Hal). The quick gist is that CRDTs (Conflict-free Replicated Data Types) are mathematically shown to converge to the correct state when implemented on top of an eventually consistent database such as Riak, Cassandra, Voldemort, Dynamo, etc.

Specifically, they can be used to create distributed counters (among other data structures), an overview of which can be found at Playing with Riak and CRDTs - Counters. For a counter, the rough idea is that instead of keeping one value for the counter (where conflicting increments cannot be handled), you keep one count per node in your database. The real count is the sum of all the per-node counts, and you can easily reconcile merge conflicts by just taking the max of each per-node count.
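
A minimal sketch of that idea, a grow-only counter (G-counter), in Python; the node names and the API are illustrative only:

class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}              # node_id -> that node's local count

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        # The real count is the sum of all per-node counts.
        return sum(self.counts.values())

    def merge(self, other):
        # Conflicts resolve by taking the max of each per-node count.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

# Two replicas counting clicks independently, then converging after a merge.
a, b = GCounter("node-a"), GCounter("node-b")
a.increment(); a.increment()
b.increment()
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 3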

With CRDTs as your base you can move on to other optimizations such as batching counter updates like Twitter does (Rainbird: Realtime Analytics at Twitter (Strata 2011)).

Though I personally like the mathematical underpinnings of CRDTs you could also go a completely different way and follow Facebook's example (High Scalability - High Scalability - Facebook's New Realtime Analytics System: HBase to Process 20 Billion Events Per Day)

Profile photo for Jennie Hoch

Designing a database is in fact fairly easy, but there are a few rules to stick to. It is important to know what these rules are, but more importantly is to know why these rules exist, otherwise you will tend to make mistakes!

A good database design starts with a list of the data that you want to include in your database and what you want to be able to do with the database later on. This can all be written in your own language, without any SQL. In this stage you must try not to think in tables or columns, but just think: "What do I need to know?" Don't take this too lightly, because if you find out later that you forgot something, usually you need to start all over. Adding things to your database is mostly a lot of work.

It helps produce database systems

  1. that meet the requirements of the users, and
  2. that have high performance.

The main objectives of database design are to produce logical and physical design models of the proposed database system.

The logical model concentrates on the data requirements and the data to be stored independent of physical considerations. It does not concern itself with how the data will be stored or where it will be stored physically.

The physical data design model involves translating the logical design of the database onto physical media using hardware resources and software systems such as database management systems (DBMS).

For more details, check https://www.dbdesigner.net

Profile photo for Michael Hausenblas

Let's step back a bit. Following the polyglot persistence mantra, a single (NoSQL) database won't fit the bill, given your requirements.

I recommend looking into Nathan Marz's lambda architecture, see the Big Data book (chapter 1 for free), and the slide deck A real time architecture using Hadoop and Storm as well as An example “lambda architecture” for real-time analysis of hashtags for examples.

Once you appreciate the power and flexibility of it, the choice of the databases used should be easier.

Well, there are entire books and people study majors covering this, but let’s do a quick and dirty step by step:

  1. Understand the problem you're trying to solve with your data. For example, let's say you need to create a database for a pharmacy's inventory stock: what are the questions your database must answer? Items? Grouping? Customers? Providers? Users? Employees? Expiration dates?
  2. You should always start with the outputs your database must support: for example, what reports is the software that consumes this database going to need? What validations must be enforced? What business rules should be in place?
  3. Do try to learn what Database Normalization is, in order to avoid duplicate data.
  4. Think ahead of the possible errors users will make, AND THEY WILL.
  5. Do try to create keys that are relevant to each table, think ahead of the possibilities, for example, a customers table in U.S. may contain a social security number as a key, but what would happen if a few dozen customers are foreigners? then you can’t use that column as primary key.
  6. Not NULL and Null columns are important, do not let a column go null if it is important to fill.
  7. When possible, use GUIDs instead of autonumber columns. The reason is that if you have several databases in different geographic locations that can easily become disconnected, autonumber keys will give you a big problem with sharding or replication (see the sketch after this list).
  8. Do create foreign keys; nothing bites harder in the arse than unprotected child records.
  9. If you want security on your database, create triggers and stored procedures to encapsulate data CRUD operations. Then you can prevent developers from getting direct access to your base tables, and this will prevent garbage on your tables and save both you and the developers a lot of headaches down the road.
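
The sketch promised in point 7, using Python's uuid module with sqlite3; the table and column names are made up for the pharmacy example:

import sqlite3
import uuid

conn = sqlite3.connect("pharmacy.db")  # made-up example database
conn.execute("""
    CREATE TABLE IF NOT EXISTS customers (
        customer_id TEXT PRIMARY KEY,   -- GUID, safe to generate on any node
        name        TEXT NOT NULL
    )
""")

# Two disconnected sites can both insert rows without ever colliding on the key,
# which is exactly the problem autonumber columns run into under sharding or replication.
conn.execute(
    "INSERT INTO customers (customer_id, name) VALUES (?, ?)",
    (str(uuid.uuid4()), "Jane Doe"),
)
conn.commit()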

Hope this helps.

We use Aerospike at Adfonic, as one of the other posters mentioned and I would highly recommend it for this use case. We use it primarily as a key-value store, which is one particular part of the NoSQL world.

There's an article about our general work with big data here:
Adfonic processes 50,000 mobile ads per second with big data architecture

50,000 is a lot, and we're growing quickly -- in a desktop advertising environment, you very quickly can get into hundreds of thousands per second (of course, typically this will be distributed amongst various data center locations).

100% uptime is a key requirement to consider. That means hot upgradability, completely reliable failover, all the rest. In RTB, time is money.

Profile photo for Mike West

The top data person in every company on earth right now is the DBA.

Data architect is just a DBA that knows schema architecture and more importantly how to install a database for maximum CPU, Memory and IO performance.

Let me say that again in a different way. The location of the data and log files is far more important than how the schema is designed.

Data architects often think they know more than front line DBA but that’s never the case.

Schema means how the tables are laid out. A famous architecture that's rarely followed is the third normal form. After 30 years, I've worked on about 10 databases that were designed correctly outside of Microsoft.

They are some of the most technically astute people in every company I’ve worked at.

The data engine is often vaunted as the most complicated piece of software on earth.

After working at several big tech companies I can say without hesitation some of the most technical people I’ve ever seen were the database administrators.

Profile photo for Jayaraman Sampathkumar

This is a classic orders table. A better format for the department-items columns would be like this:

ORDERS
Order_Id
Order_Date
Department_Id
Order_Status

ORDER_LINE_ITEMS
Order_Id
Item_Id
Count
Item_Unit_of_Measure

Why do you need unit of measure? For example, let's say you order 2 bundles of A4 printing paper. If the table has |A4-paper|2|, does that mean 2 sheets, 2 dozen, or 2 bundles? A well-designed table might have |A4-paper|2|bundle|.

Profile photo for Jason Dusek
--- Things that are unlikely to change even once in a person's life, de facto
--- identifying information.
CREATE TABLE person (
    nic text PRIMARY KEY,
    name text,
    date_of_birth timestamptz,
    sex text CHECK (sex IN ('m', 'f'))
);

--- Additional information about a person.
CREATE TABLE personal_information (
    nic text PRIMARY KEY REFERENCES person
        ON DELETE CASCADE ON UPDATE CASCADE
        DEFERRABLE INITIALLY DEFERRED,
    religion text,
    domicile text,
    male_guardian text,
    married boolean,
    address text,
    email text,
    computer_lit boolean
);

CREATE TABLE qualification (
    nic text NOT NULL REFERENCES person
        ON DELETE CASCADE ON UPDATE CASCADE
        DEFERRABLE INITIALLY DEFERRED,
    s_no text NOT NULL,
    qualification text NOT NULL,
    institution text,
    grade text,
    year date,
    PRIMARY KEY (nic, s_no, qualification)
);

CREATE TABLE training (
    nic text NOT NULL REFERENCES person
        ON DELETE CASCADE ON UPDATE CASCADE
        DEFERRABLE INITIALLY DEFERRED,
    s_no text NOT NULL,
    course text NOT NULL,
    institution text,
    country text,
    starting date,
    ending date,
    PRIMARY KEY (nic, s_no, course)
);
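
One thing worth illustrating in this schema is the DEFERRABLE INITIALLY DEFERRED foreign keys: inside a single transaction you can insert rows in whatever order is convenient, and the constraints are only checked at commit. A minimal sketch, assuming PostgreSQL with psycopg2 and a made-up NIC value:

import psycopg2  # assumed driver; the timestamptz/DEFERRABLE syntax above is PostgreSQL

conn = psycopg2.connect("dbname=people")  # hypothetical connection string
with conn, conn.cursor() as cur:
    # The child row goes in first; the deferred foreign key to person is not
    # checked until the transaction commits.
    cur.execute(
        "INSERT INTO qualification (nic, s_no, qualification, institution, year) "
        "VALUES (%s, %s, %s, %s, %s)",
        ("00000-0000000-0", "1", "BSc Computer Science", "Example University", "2010-01-01"),
    )
    cur.execute(
        "INSERT INTO person (nic, name, date_of_birth, sex) VALUES (%s, %s, %s, %s)",
        ("00000-0000000-0", "Jane Doe", "1985-05-01", "f"),
    )
# Leaving the `with conn` block commits the transaction (or rolls it back on error).
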
Profile photo for Clive Thomas

This is one of the most complex problems in current-day programming.

The majority of designers never learn how to do this well.

Theoretically, you simply lay out the data in tables and then perform normalization.

This, however, doesn't even begin to describe the problem.

Firstly, you need to know what data you are going to need to store and retrieve. At the start of the project (when you are designing the database schema at first) you typically don't know enough about the data requirements to be able to specify all the data items in the right way.

Secondly, you have to select the right level of normalization. Going 3rd normal form can complicate programming, sometimes to a degree which is not convenient. Stopping at 2nd normal form can sometimes result in data redundancy and duplication, with other complications down the line.

So, in practice, what you often have to do is do a first-stab schema layout, using the best knowledge of the data requirement you have at the time, and very often only go to about 2nd normal form, because you are probably going to have to revise it anyway.

Then, as the project progresses and your understanding of the data requirement improves, you iteratively redesign the schema again and again, taking some tables to 3rd normal form as you see the requirement for this, perhaps leaving others at 2nd or somewhere in between because that works out to be the best compromise, and the whole time improving the match between the schema design and the real-world problem until you have a solution which is reasonably close to optimal.

One important issue is not to become emotionally attached to any part of the schema layout at any time. Even the cutest ideas can later prove to be non-optimal, and you have to be prepared to abandon them for a better solution at any time. This is one of the hardest aspects of schema design, and one of the most common failings with most designers.

Very few designers do this well. Most database schemas are quite horrifyingly bad.

Profile photo for Christopher Smith

Cassandra has native support for Spark, so there is not necessarily a need for Hadoop. There are plenty of other systems targeting this space (Aerospike, Druid, and Couchbase come to mind immediately). Some of the BaaS solutions out there from cloud providers (like AWS Cognito) can provide a limited, but perhaps effective enough, solution for targeting. There are also packaged solutions out there like MetaMarkets, AppNexus, etc…

You can, however, decide to forgo offline learning entirely and do the work while processing your data in real time with online algorithms (using frameworks like Storm or Spark Streaming for complex routing).

You can also offload almost all the work onto the client (particularly for browser- or app-focused systems). Basically, just have a server-side piece for learning/identifying useful features and let all the state & learning be stored on the client. This has the advantage of minimizing infrastructure investment.

In general, we're well past the point where handling the load & analysis requires forging new technologies. There are lots of off-the-shelf components that can get the job done for you.

Profile photo for Andrea Tani

Thank you for the A2A

What you are looking for is a many-to-many relationship between the Department table and the Item table, which can be accomplished by the Department-Item table, as you correctly assumed.

I'd add a boolean field to it in order to determine whether or not a relationship (defined by a record) is valid; since you cannot change a record, you should be able to mark it as invalid and write a new one with the same department-item key (DepartmentID and ItemNo).

Without trying to be too complex, I think you designed this relation in the correct way, but keep in mind that more fields can be added to the relationship table (like valid-from and valid-to date fields to define a time scope); these will be business logic decisions.

I hope this helps

Profile photo for T.S. Lim

Below are my suggestions. Table and their columns.

Departments

  • ID
  • Name


Items

  • ID
  • Name


Orders

  • ID
  • Date


OrderItems

  • ID
  • Order ID
  • Department ID
  • Item ID
  • Quantity


(Table names suggested are in accordance with Ruby on Rails conventions)

Basically, your original solution is just missing the Order table.

Profile photo for Finnbogi Ragnar Ragnarsson

The most obvious thing is three tables. One for the personal information with that table linked to a qualifications table and a Trainings/Courses table.
The latter have an obvious many-to-one relationship to the first.

That said, there could be other related tables, for information you want to be coded/quantifiable, such as religion, post code, country, schools, etc.

Profile photo for Surbhi Chadha

1) Use scheme_id instead of scheme name in tbl_tender.
2) Remove scheme_id from tbl_work_order; you can access this field via tender_id (as scheme_id is present in tbl_tender).
3) Remove tender_id from the constructor_details table (again, you can access tender_id from work_order_id).

Profile photo for Quora User

Hmm, well, to start with, there are different objects and 1-to-many relations. So a single table will not do.
Simplified, you would have at least a Person table with some PersonID.
Then you might have a Qualification table with a link to person.PersonID.
Similarly, you would have a Training table, again with a link to person.PersonID.
Then you might detect a number of simplifications for input. For instance, there are only a couple of hundred countries or so; it might be useful to put those in a table and make the input a selection list. There might be a limited list of Institutes for trainings as well, unless you include homeschooling as certified.
I am somewhat puzzled by the inclusion of Religion. Besides, there are Date of Birth and Father's Name; such information, as well as Address, should be password protected.
The real question to ask would be: what do you aim to do with the data? Because that makes a lot of difference. An example: if you plan to phone the institute to check whether the provided information is valid, then it follows that you must be sure about the correct name, and you should have additional fields in your Institute table, which might include contact persons, email, or contact numbers.
If you plan to have some reports done, then it makes a lot of difference what items become fields.
Enjoy

Profile photo for Finnbogi Ragnar Ragnarsson

I doubt your final version will look like this.

There is possibly going to be financial data, and all kinds of complications thereof, that need to be stored in the database. Also, the schemes can incur additional costs from unforeseen circumstances, or the contractor may not keep his end of the bargain. Be prepared to be asked to store that data as well, along with fines, etc.

Resist solving those issues by adding columns to existing tables when the data really should be in additional tables.

Surbhi Chadha has pointed out three errors, but everything else looks reasonable.

Profile photo for Quora User

Get a report from the external adserver.
Or, you serve their iframe ad tag via your adserver & piggyback a conversion tracker.
There is no other way to track.

Profile photo for Andreth Salazar

You can use the Event Scheduler in MySQL.

23.4 Using the Event Scheduler

It has pretty good support for running SQL statements on a recurring interval or at a fixed date and time.
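
As a sketch of what that can look like, assuming the mysql-connector-python driver, a table like the hourly_metrics example earlier in this thread, and a rolling last-hour rollup (all of which are assumptions, not part of the original answer):

import mysql.connector  # assumed driver (mysql-connector-python)

conn = mysql.connector.connect(
    host="localhost", user="ads", password="secret", database="ads"  # hypothetical credentials
)
cur = conn.cursor()

# The Event Scheduler must be enabled (requires sufficient privileges).
cur.execute("SET GLOBAL event_scheduler = ON")

# A recurring event that rolls up the last hour of clicks once per hour.
cur.execute("""
    CREATE EVENT IF NOT EXISTS hourly_click_rollup
    ON SCHEDULE EVERY 1 HOUR
    DO
      INSERT INTO hourly_metrics (ad_id, `hour`, clicks_count)
      SELECT ad_id, NOW(), COUNT(*)
      FROM Clicks
      WHERE `timestamp` >= NOW() - INTERVAL 1 HOUR
      GROUP BY ad_id
""")
conn.commit()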

Profile photo for Alan McClanaghan

I was involved in the development of an ad-serving platform where an Apache plugin/module was developed for serving the ads and the logs were then crunched in real time for metrics (Hadoop/Pig). Excellent use of existing technology.

Profile photo for Vaibhav M Kite

You can use the curl utility to listen to that port while dumping the output to a file or whatever destination you want, and then simply put the whole command in cron with whatever execution frequency you want.
