rigadicomando.org

Whatever you can cat


High Scalability Architecture

This site tries to bring together all the lore, art, science, practice, and experience of building scalable websites into one place so you can learn how to build your own website with confidence. Please Start Here.
URL: http://highscalability.com
Updated: 11 hours 3 min ago

Latency is Everywhere and it Costs You Sales - How to Crush it

Thu, 2008-09-04 15:27

Update: Efficient data transfer through zero copy. Copying data kills. This excellent article explains the path data takes through the OS and how to reduce the number of copies to the big zero.
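
To make the zero-copy idea concrete, here is a minimal Python sketch. It is not taken from the linked article: the helper names `send_file` and `demo` are made up for illustration, and the local socket pair only stands in for a real client connection. It relies on the standard library's `socket.sendfile()`, which uses `os.sendfile()` where the platform provides it and otherwise falls back to ordinary `send()` calls.

```python
import os
import socket
import tempfile


def send_file(sock, path):
    """Send the whole file at `path` over a connected stream socket."""
    with open(path, "rb") as source:
        # socket.sendfile() hands the copy to os.sendfile() when available,
        # so the bytes need not pass through a user-space buffer.
        return sock.sendfile(source)


def demo(payload=b"x" * 1000):
    """Round-trip a small temporary file through a local socket pair.

    The payload is kept small on purpose: nobody reads while we send,
    so it has to fit in the socket buffer.
    """
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    sender, receiver = socket.socketpair()
    try:
        sent = send_file(sender, tmp.name)
        sender.close()  # signal end of stream to the receiver
        received = b""
        while chunk := receiver.recv(4096):
            received += chunk
    finally:
        receiver.close()
        sender.close()
        os.unlink(tmp.name)
    return sent, received
```

Whether a copy is really avoided depends on the operating system; on platforms without `os.sendfile()` the same code still works, just without the saving.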

Latency matters. Amazon found every 100ms of latency cost them 1% in sales. Google found an extra .5 seconds in search page generation time dropped traffic by 20%. A broker could lose $4 million in revenues per millisecond if their electronic trading platform is 5 milliseconds behind the competition.

The Amazon results were reported by Greg Linden in his presentation Make Data Useful. In one of Greg's slides Google VP Marissa Mayer, in reference to the Google results, is quoted as saying "Users really respond to speed." And everyone wants responsive users. Ka-ching! People hate waiting and they're repulsed by seemingly small delays.

The less interactive a site becomes the more likely users are to click away and do something else. Latency is the mother of interactivity. Though it's possible through various UI techniques to make pages subjectively feel faster, slow sites generally lead to higher customer defection rates, which lead to lower conversion rates, which result in lower sales. Yet for some reason latency isn't talked about much for web apps. We talk a lot about building high-capacity sites, but very little about how to build low-latency sites. We apparently do so at the expense of our immortal bottom line.

I wondered if latency went to zero if sales would be infinite? But alas, as Dan Pritchett says, Latency Exists, Cope!. So we can't hide the "latency problem" by appointing a Latency Czar to conduct a nice little war on latency. Instead, we need to learn how to minimize and manage latency. It turns out a lot of problems are better solved that way.

How do we recover that which is most meaningful--sales--and build low-latency systems?


MapReduce framework Disco

Thu, 2008-09-04 04:42

Disco is an open-source implementation of the MapReduce framework for distributed computing. It was started at Nokia Research Center as a lightweight framework for rapid scripting of distributed data processing tasks. The Disco core is written in Erlang. The MapReduce jobs in Disco are natively described as Python programs, which makes it possible to express complex algorithmic and data processing tasks often only in tens of lines of code.
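
To show why a MapReduce job can fit in tens of lines of Python, here is a minimal single-process sketch of the map and reduce steps for a word count. It deliberately does not use Disco's own API; `map_words`, `reduce_counts`, and `run_job` are hypothetical names, and a real framework would spread the grouping step across many machines.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: combine all the values emitted for one key."""
    return word, sum(counts)


def run_job(lines, mapper, reducer):
    """Run mapper over every input, group by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(run_job(["to be or not to be", "to scale"], map_words, reduce_counts))
```

The user only writes the two small functions; everything in `run_job` is what a framework takes over.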

What CDN would you recommend?

Thu, 2008-09-04 00:42

Update 5: When It Comes To Content Delivery Networks, What Is The "Edge"?. Dan Rayburn is on edge about the misuse of the term edge: closest location to the user does not guarantee quality, often content is not delivered from the closest location, all content is not replicated at every "edge" location. Lots of other essential information.
Update 4: David Cancel runs a great test to see if you should be Using Amazon S3 as a CDN? Conclusion: "CacheFly performed the best but only slightly better than EdgeCast. The S3 option was the worst with the Nginx/DIY option performing just over 100 ms faster." Also take a look at Part 2 - Cacheability?
Update 3: Mr. Rayburn takes A Detailed Look At Akamai's Application Delivery Product. They create a "bi-nodal overlay network" where users and servers are always within 5 to 10 milliseconds of each other. Your data center hosted app can't compete. The problem is that people (that is, me) can understand the data center model. I don't yet understand how applications as a CDN will work.
Update 2: Dan Rayburn starts an interesting series of articles on Highlights Of My Day In Cambridge With Akamai. Akamai is moving strongly into the application distribution business. That would make an interesting cloud alternative.
Update: Streamingmedia links to new CDN DF Splash that specializes in instant-on TV-quality video streaming.

A question was raised on the forum asking for a CDN recommendation. As usual there are no definitive answers, but here are three useful articles that may help your deliberations.

  • First, Tony Chang shows how to drive down response times using edge acceleration strategies.
  • Then Pingdom gives a nice overview and introduction to CDNs.
  • And last but not least, Dan Rayburn from StreamingMedia.com gives a master class in how much you should pay for your CDN, what you should be getting for your money, and how to find the right provider for your needs.

    Lots and lots of good stuff to learn, even if you didn't roll out of bed this morning pondering the deeper mysteries of content delivery networks and the Canadian dollar.


    SMACKDOWN :: Who are the Open Source Content Management System (CMS) market leaders in 2008?

    Wed, 2008-09-03 23:04

    I came across an interesting study of who leads the open source content management system market in 2008.

    The study was just released to the public and was conducted by Ric Sheves of the web development company Water & Stone.

    At 50 pages, there is a significant amount of data in this study that should be of use to developers or to anyone who is looking to commit to a web publishing system (also known as a Content Management System).

    Read the entire article about who the open source content management systems market leader is for 2008 at MyTestBox.com - web software reviews, news, tips & tricks.

    37signals Architecture

    Wed, 2008-09-03 15:58

    Update 5: Nuts & Bolts: HAproxy. Nice explanation (post, screencast) by Mark Imbriaco of why HAProxy (load balancing proxy server) is their favorite (fast, efficient, graceful configuration, queues requests when Mongrels are busy) for spreading dynamic content between Apache web servers and Mongrel application servers.
    Update 4: O'Reilly's Tim O'Brien interviews David Hansson, Rails creator and 37signals partner. He says Basecamp scales horizontally on the application and web tier and scales up for the database, using one "big ass" 128GB machine. Says: As technology moves on, hardware gets cheaper and cheaper. In my mind, you don't want to shard unless you positively have to, sort of a last resort approach.
    Update 3: The need for speed: Making Basecamp faster. Pages now load twice as fast, with CPU usage cut by a third and database time by about half. Results achieved by: analysis, caching, MySQL optimizations, and hardware upgrades.
    Update 2: customer support is handled in real-time using Campfire.
    Update: highly useful information on creating a customer billing system.

    In the giving spirit of Christmas the folks at 37signals have shared a bit about how their system works. 37signals is most famous for loosing Ruby on Rails into the world and they've used RoR to make their very popular Basecamp, Highrise, Backpack, and Campfire products. RoR takes a lot of heat for being a performance dog, but 37signals seems to handle a lot of traffic with relatively normal sounding resources. This is just an initial data dump; they promise to add more details later. As they add more I'll update it here.


    Some Facebook Secrets to Better Operations

    Wed, 2008-09-03 15:27

    Kim Nash in an interview with Jonathan Heiliger, Facebook VP of technical operations, provides some juicy details on how Facebook handles operations. Operations is one of those departments everyone runs differently as it is usually an ontogeny recapitulates phylogeny situation. With 2,000 databases, 25 terabytes of cache, 90 million active users, and 10,000 servers you know Facebook has some serious operational issues. What are some of Facebook's secrets to better operations?


    Google AppEngine - A Second Look

    Tue, 2008-09-02 16:37

    Update 3: BigTable Blues. Catherine Devlin couldn't port an application to GAE because it can't do basic filtering and can't search 5,000 records without timing out: "Querying from 5000 records - too much for the mighty BigTable, apparently."
    Update 2: Having doubts about AppEngine. Excellent and surprisingly civil debate on if GAE is a viable delivery platform for real applications. Concerns swirl over poor performance, lack of a roadmap, perpetual beta status, poor support, and a quota system as torture chamber model of scalability. GAE is obviously part of Google's grand plan (browser, gears, android, etc) to emasculate Microsoft, so the future looks bright, but is GAE a good choice now?
    Update: Here are a few experience reports of developers using GAE. Diwaker Gupta likes how easy it is to get started and the good documentation. He doesn't like all the limits and the poor performance. James also likes the ease of use but finds the data model takes some getting used to and is concerned the API limits won't scale for a real site. He doesn't like how external connections are handled and wants a database where the schema is easier to manage. These posts mirror some of my own concerns. GAE is scalable for Google, but it may not be scalable for my application.

    It's been a few days now since GAE (Google App Engine) was released and we had our First Look. It's high time for a retrospective. Too soon? Hey, this is Internet time baby. So how is GAE doing? I did get an invite so hopefully I'll have a more experience grounded take a little later. I don't know Python and being the more methodical type it may take me a while. To perform our retrospective we'll take a look at the three sources of information available to us: actual applications in the AppGallery, blogspew, and developer issues in the forum.

    The result: a cautious thumbs up. The biggest issue so far seems to be the change in mindset needed by developers to use GAE. BigTable is not MySQL. The runtime environment is not a VM. A service based approach is not the same as using libraries. A scalable architecture is not the same as one based on optimizing speed. A different approach is needed, but as of yet Google doesn't give you all the tools you need to fully embrace the red pill vision.

    I think this quote by Brandon Smith in a thread on how to best implement sessions in GAE nicely sums up the new perspective:

    Consider the lack of your daddy's sessions a feature. It's what will make your app scale on Google's infrastructure.

    In other words: when in Rome. But how do we know what the Romans do when the Romans do what they do?


    Paper: GargantuanComputing—GRIDs and P2P

    Sat, 2008-08-30 17:03

    I found the discussion of the available bandwidth of tree vs higher dimensional virtual networks topologies quite, to quote Spock, fascinating:

    A mathematical analysis by Ritter (2002) (one of the original developers
    of Napster) presented a detailed numerical argument demonstrating that the
    Gnutella network could not scale to the capacity of its competitor, the
    Napster network. Essentially, that model showed that the Gnutella network is
    severely bandwidth-limited long before the P2P population reaches a million
    peers. In each of these previous studies, the conclusions have overlooked the
    intrinsic bandwidth limits of the underlying topology in the Gnutella network:
    a Cayley tree (Rains and Sloane 1999) (see Sect. 9.4 for the definition).

    Trees are known to have lower aggregate bandwidth than higher dimensional
    topologies, e.g., hypercubes and hypertori. Studies of interconnection
    topologies in the literature have tended to focus on hardware implementations
    (see, e.g., Culler et al. 1996; Buyya 1999), which are generally limited
    by the cost of the chips and wires to a few thousand nodes. P2P networks,
    on the other hand, are intended to support from hundreds of thousands to
    millions of simultaneous peers, and since they are implemented in software,
    hyper-topologies are relatively unfettered by the economics of hardware.

    In this chapter, we analyze the scalability of several alternative topologies
    and compare their throughput up to 2–3 million peers. The virtual hypercube
    and the virtual hypertorus offer near-linear scalable bandwidth subject to
    the number of peer TCP/IP connections that can be simultaneously kept
    open.
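
The tree versus hypercube claim in the excerpt can be illustrated with a small Python sketch. This is not the paper's model: it simply builds a complete binary tree and a hypercube of about a thousand nodes, splits each roughly in half, and counts the links that cross the split as a stand-in for aggregate bandwidth between the halves. All function names are made up for the example.

```python
def hypercube_edges(dimension):
    """Edges of a hypercube with 2**dimension nodes; neighbours differ in one bit."""
    nodes = range(1 << dimension)
    return [(u, u ^ (1 << bit))
            for u in nodes
            for bit in range(dimension)
            if u < (u ^ (1 << bit))]  # list each undirected edge once


def binary_tree_edges(depth):
    """Edges of a complete binary tree with 2**depth - 1 nodes, heap-numbered from 1."""
    last = (1 << depth) - 1
    return [(child // 2, child) for child in range(2, last + 1)]


def left_subtree(depth):
    """Nodes at or below the root's left child (node 2): roughly half the tree."""
    last = (1 << depth) - 1
    nodes = set()
    for node in range(2, last + 1):
        top = node
        while top > 3:  # climb until we reach one of the root's children
            top //= 2
        if top == 2:
            nodes.add(node)
    return nodes


def crossing_links(edges, side):
    """Number of links with exactly one endpoint inside `side`."""
    return sum(1 for u, v in edges if (u in side) != (v in side))


if __name__ == "__main__":
    d = 10
    tree_cut = crossing_links(binary_tree_edges(d), left_subtree(d))
    cube_cut = crossing_links(hypercube_edges(d), set(range(1 << (d - 1))))
    print("tree links across the split:     ", tree_cut)
    print("hypercube links across the split:", cube_cut)
```

In the tree, all traffic between the two halves funnels through the root's single link, while the hypercube has one crossing link for every pair of nodes that differ only in the top bit.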

    Product: ScaleOut StateServer is Memcached on Steroids

    Fri, 2008-08-29 16:39

    ScaleOut StateServer is an in-memory distributed cache across a server farm or compute grid. Unlike middleware vendors, StateServer aims at being a very good data cache; it doesn't try to handle job scheduling as well.

    StateServer is what you might get when you take Memcached and merge in all the value added distributed caching features you've ever dreamed of. True, Memcached is free and ScaleOut StateServer is very far from free, but for those looking for a satisfying out-of-the-box experience, StateServer may be just the caching solution you are looking for. Yes, "solution" is one of those "oh my God I'm going to pay through the nose" indicator words, but it really applies here. Memcached is a framework whereas StateServer has already prepackaged most features you would need to add through your own programming efforts.

    Why use a distributed cache? Because it combines the holy quadrinity of computing: better performance, linear scalability, high availability, and fast application development. Performance is better because data is accessed from memory instead of through a database to a disk. Scalability is linear because as more servers are added data is transparently load balanced across the servers, so there is automated in-memory sharding. Availability is higher because multiple copies of data are kept in memory and the entire system reroutes on failure. Application development is faster because there's only one layer of software to deal with, the cache, and its API is simple. All the complexity is hidden from the programmer, which means all a developer has to do is get and put data.

    StateServer follows the RAM is the new disk credo. Memory is assumed to be the system of record, not the database. If you want data to be stored in a database and have the two kept in sync, then you'll have to add that layer yourself. All the standard memcached techniques should work as well for StateServer. Consider however that a database layer may not be needed. Reliability is handled by StateServer because it keeps multiple data copies, reroutes on failure, and has an option for geographical distribution for another layer of added safety. Storing to disk wouldn't make you any safer.

    Via email I asked them a few questions. The key question was how they stack up against Memcached. As that is surely one of the more popular challenges they would get in any sales cycle, I was very curious about their answer. And they did a great job differentiating themselves. What did they say?


    Paper: The End of an Architectural Era (It’s Time for a Complete Rewrite)

    Thu, 2008-08-28 22:12

    Update 2: H-Store: A Next Generation OLTP DBMS is the project implementing the ideas in this paper: The goal of the H-Store project is to investigate how these architectural and application shifts affect the performance of OLTP databases, and to study what performance benefits would be possible with a complete redesign of OLTP systems in light of these trends. Our early results show that a simple prototype built from scratch using modern assumptions can outperform current commercial DBMS offerings by around a factor of 80 on OLTP workloads.
    Update: interesting related thread on Lambda the Ultimate.

    A really fascinating paper bolstering many of the anti-RDBMS threads that have popped up on the intertube lately. The spirit of the paper is found in the following excerpt:

    In summary, the current RDBMSs were architected for the business data processing market in a time of different user interfaces and different hardware characteristics. Hence, they all include the following System R architectural features:
    * Disk oriented storage and indexing structures
    * Multithreading to hide latency
    * Locking-based concurrency control mechanisms
    * Log-based recovery


    Product: Amazon's SimpleDB

    Thu, 2008-08-28 17:17

    Update 33: Amazon announces Elastic Block Store (EBS), which provides lots of normal looking disk along with value added features like snapshots and snapshot copying. But databases may find EBS too slow. RightScale tells us Why Amazon’s Elastic Block Store Matters.
    Update 32: You can now get all attributes for a property when querying. Previously only the ID was returned and the attributes had to be returned in separate calls. This makes the programmer's job a lot simpler. Artificial levels of parallelization code can now be dumped.
    Update 31: Amazon fixes a major hole in SimpleDB by adding the ability to sort query results. Previously developers had to sort results by hand which was a non-starter for many. Now you can do basic top 10 type queries with ease.
    Update 30: Amazon SimpleDB - A distributed, highly-scalable, light-weight, query-able, attribute store by Sebastian Stadil. It introduces the CAP theorem and the basics of SimpleDB. Sebastian does a lot of great work in the AWS world and in what must be his limited free time, runs the AWS Meetup group.


    Useful Cloud Computing Blogs

    Tue, 2008-08-26 15:04

    Can't get enough cloud computing? Then you must really be a glutton for punishment! But just in case, here are some cloud computing resources, collected from various sources, that will help you transform into a Tesla silently flying solo down the diamond lane.

    Meta Sources

  • Cloud Computing Email List: An often lively email list discussing cloud computing.
  • Cloud Computing Blogs & Resources. An excellent and big list of cloud resources.
  • Cloud Computing Portal: A community edited database for making the vendor selection process easier.
  • List of Cloud Platforms, Providers, and Enablers.
  • datacenterknowledge.com's Recap: More than 70 Industry Blogs: A nice set of blogs for: Data Center, Web Hosting, Content Delivery Network (CDN), Cloud Computing
  • Cloud Computing Wiki: A cloud computing wiki started by participants of the cloud email list.

    Specific Blogs

  • James Urquhart's The Wisdom of Clouds: Cloud Computing and Utility Computing for the Enterprise and the Individual. James writes great articles and has a regular can't-miss links-style post summarizing much of what you need to know in the cloud world.

    Many more below the fold.


    An Unorthodox Approach to Database Design : The Coming of the Shard

    Mon, 2008-08-25 14:59

    Update: Dan Pritchett shares some excellent Sharding Lessons: Size Your Shards, Use Math on Shard Counts, Carefully Consider the Spread, Plan for Exceeding Your Shards

    Once upon a time we scaled databases by buying ever bigger, faster, and more expensive machines. While this arrangement is great for big iron profit margins, it doesn't work so well for the bank accounts of our heroic system builders who need to scale well past what they can afford to spend on giant database servers. In an extraordinary two-article series, Dathan Pattishall explains his motivation for a revolutionary new database architecture--sharding--that he began thinking about even before he worked at Friendster, and fully implemented at Flickr. Flickr now handles more than 1 billion transactions per day, responding in less than a few seconds, and can scale linearly at a low cost.

    What is sharding and how has it come to be the answer to large website scaling problems?
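
As a rough illustration of the basic mechanism, here is a minimal Python sketch of hash-based shard selection. It is an assumption for illustration only, not a description of how Flickr or anyone else named here does it: `shard_for` and `fraction_moved` are hypothetical helpers, and plain modulo placement is just the simplest possible scheme. The second function shows why the shard count needs some math up front: with modulo placement, changing the count relocates a large share of the keys.

```python
import zlib


def shard_for(key, shard_count):
    """Pick a shard by hashing the key and taking it modulo the shard count."""
    digest = zlib.crc32(str(key).encode("utf-8"))
    return digest % shard_count


def fraction_moved(keys, old_count, new_count):
    """Share of keys that land on a different shard after a resize."""
    keys = list(keys)
    moved = sum(1 for key in keys
                if shard_for(key, old_count) != shard_for(key, new_count))
    return moved / len(keys)


if __name__ == "__main__":
    user_ids = range(10_000)
    print("user 42 lives on shard", shard_for(42, 4))
    print("moved going 4 -> 5 shards:", fraction_moved(user_ids, 4, 5))
    print("moved going 4 -> 8 shards:", fraction_moved(user_ids, 4, 8))
```

Doubling the shard count moves fewer keys than adding a single shard, which is one reason to choose counts and growth steps deliberately rather than one server at a time.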


    A Scalable, Commodity Data Center Network Architecture

    Sun, 2008-08-24 20:16

    Looks interesting...

    Abstract:
    Today’s data centers may contain tens of thousands of computers with significant aggregate bandwidth requirements. The network architecture typically consists of a tree of routing and switching elements with progressively more specialized and expensive equipment moving up the network hierarchy. Unfortunately, even when deploying the highest-end IP switches/routers, resulting topologies may only support 50% of the aggregate bandwidth available at the edge of the network, while still incurring tremendous cost. Nonuniform bandwidth among data center nodes complicates application design and limits overall system performance.
    In this paper, we show how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements. Similar to how clusters of commodity computers have largely replaced more specialized SMPs and MPPs, we argue that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today’s higher-end solutions. Our approach requires no modifications to the end host network interface, operating system, or applications; critically, it is fully backward compatible with Ethernet, IP, and TCP.

    Wuala - P2P Online Storage Cloud

    Mon, 2008-08-18 11:39

    How do you design a reliable distributed file system when the expected availability of the individual nodes is only ~1/5? That is the case for P2P systems. Dominik Grolimund, the founder of the Swiss startup Caleido, will show you how! They have launched Wuala, the social online storage service which scales as new nodes join the P2P network.
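
To get a feel for what ~1/5 node availability implies, here is a back-of-the-envelope Python sketch. It assumes plain full-copy replication and independent node failures, which is a simplification chosen for illustration and not a description of how Wuala actually stores data; the function names are hypothetical.

```python
def availability(replicas, node_availability):
    """Chance that at least one of `replicas` independent copies is online."""
    return 1.0 - (1.0 - node_availability) ** replicas


def replicas_needed(node_availability, target):
    """Smallest number of full copies that reaches the target availability."""
    if not 0.0 < node_availability <= 1.0:
        raise ValueError("node_availability must be in (0, 1]")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    replicas = 1
    while availability(replicas, node_availability) < target:
        replicas += 1
    return replicas


if __name__ == "__main__":
    for target in (0.9, 0.99, 0.999):
        print(target, "->", replicas_needed(0.2, target), "copies")
```

Under these assumptions, naive replication on nodes that are online only a fifth of the time needs dozens of copies for three nines, which is why such systems have to be smarter about redundancy than simply copying files around.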

    The goal of Wua.la is to provide distributed online storage that is:

    • large
    • scalable
    • reliable
    • secure

    by harnessing the idle resources of participating computers.

    This challenge is an old dream of computer science. In fact as Andrew Tanenbaum wrote in 1995:
    "The design of a world-wide, fully transparent distributed filesystem for simultaneous use by millions of mobile and frequently disconnected users is left as an exercise for the reader"

    After three years of research and development at ETH Zurich, the Swiss Federal Institute of Technology, on a distributed storage system, Caleido is ready to unveil the result: Wuala. Wuala is a new way of storing, sharing, and publishing files on the internet. It enables its users to trade parts of their local storage for online storage and it allows us to provide a better service for free. In this Google Tech Talk, Dominik will explain what Wuala is and how it works, and he will also show a demo.


    Strategy: Drop Memcached, Add More MySQL Servers

    Sun, 2008-08-17 19:02

    Update 2: Michael Galpin in Cache Money and Cache Discussions likes memcached for its expiry policy, complex graph data, and process data, but says MySQL has many advantages: SQL, Uniform Data Access, Write-through, Read-through, Replication, Management, Cold starts, LRU eviction.
    Update: Dormando asks Should you use memcached? Should you just shard mysql more?. The idea of caching is the most important part of caching as it transports you beyond a simple CRUD worldview. Plan for caching and sharding by properly abstracting data access methods. Brace for change. Be ready to shard, be ready to cache. React and change to what you push out which is actually popular, vs over planning and wasting valuable time.

    Feedster's François Schiettecatte wonders if Fotolog's 21 memcached servers wouldn't be better used to further shard data by adding more MySQL servers? He mentions Feedster was able to drop memcached once they partitioned their data across more servers. The algorithm: partition until all data resides in memory and then you may not need an additional memcached layer.

    Parvesh Garg goes a step further and asks why people think they should be using MySQL at all?

    Related Articles

  • The Death of Read Replication by Brian Aker. Caching layers have replaced read replication. Cache can't fix a broken database layer. Partition the data that feeds the cache tier: "Keep your front end working through the cache. Keep all of your data generation behind it."
  • Read replication with MySQL by François Schiettecatte. Read replication is dead and it should be used only for backup purposes. Take the memory used for caching and give it to your database servers.
  • Replication++, Replication 2.0, Replication.Next by Ronald Bradford. What should read replication be used for?
  • Replication, caching, and partitioning by Greg Linden. Caching is overdone because it adds complexity, adds latency on a cache miss, and uses cluster resources inefficiently. Hitting disk is the problem. Shard more and get your data in memory.

    Strategy: Serve Pre-generated Static Files Instead Of Dynamic Pages

    Sat, 2008-08-16 16:53

    Pre-generating static files is an oldie but a goodie, and as Thomas Brox Røst says, it's probably an underused strategy today. At one time this was the dominant technique for structuring a web site. Then the age of dynamic web sites arrived and we spent all our time worrying about how to make the database faster and add more caching to recover the speed we had lost in the transition from static to dynamic.

    Static files have the advantage of being very fast to serve. Read from disk and display. Simple and fast. Especially when caching proxies are used. The issue is how do you bulk generate the initial files, how do you serve the files, and how do you keep the changed files up to date? This is the process Thomas covers in his excellent article "Serving static files with Django and AWS - going fast on a budget", where he explains how he converted 600K previously dynamic pages to static pages for his site Eventseer.net, a service for tracking academic events.

    Eventseer.net was experiencing performance problems as search engines crawled their 600K dynamic pages. As a solution you could imagine scaling up, adding more servers, adding sharding, etc etc, all somewhat complicated approaches. Their solution was to convert the dynamic pages to static pages in order to keep search engines from killing the site. As an added bonus non logged-in users experienced a much faster site and were more likely to sign up for the service.
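
The bulk-generate and keep-up-to-date steps can be sketched in a few lines of Python. This is a standard-library illustration only, not the Django and AWS setup Thomas describes: the template, the record layout, and the `pregenerate` helper are all assumptions made up for the example.

```python
import tempfile
from pathlib import Path
from string import Template

# Hypothetical page template; a real site would use its own templating system.
PAGE = Template("<html><head><title>$title</title></head><body>$body</body></html>")


def render(record):
    """Turn one database-style record into a complete HTML page."""
    return PAGE.substitute(title=record["title"], body=record["body"])


def pregenerate(records, out_dir):
    """Write one static file per record, skipping files that are already current.

    Returns the names of the files that were (re)written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for slug, record in records.items():
        html = render(record).encode("utf-8")
        target = out / f"{slug}.html"
        if target.exists() and target.read_bytes() == html:
            continue  # page unchanged, keep the existing file
        target.write_bytes(html)
        written.append(target.name)
    return written


if __name__ == "__main__":
    events = {"conf-1": {"title": "Conf 1", "body": "Call for papers"}}
    with tempfile.TemporaryDirectory() as site:
        print(pregenerate(events, site))  # first run writes every page
        print(pregenerate(events, site))  # second run finds nothing to do
```

The first run is the bulk generation; later runs touch only pages whose content changed, and a web server or caching proxy can then serve the output directory directly.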

    The article does a good job explaining what they did, so I won't regurgitate it all here, but I will cover the highlights and comment on some additional potential features and alternate implementations...


    Product: Terracotta - Open Source Network-Attached Memory

    Thu, 2008-08-14 16:11

    Terracotta is Network Attached Memory (NAM) for Java VMs. It provides up to a terabyte of virtual heap for Java applications that spans hundreds of connected JVMs.

    NAM is best suited for storing what they call scratch data. Scratch data is defined as object oriented data that is critical to the execution of a series of Java operations inside the JVM, but may not be critical once a business transaction is complete.

    The Terracotta Architecture has three components:

    1. Client Nodes - Each client node is a member of the cluster and runs on a standard JVM.
    2. Server Cluster - A Java process that provides the clustering intelligence. The current Terracotta implementation operates in an Active/Passive mode.
    3. Storage, used as:
      • Virtual Heap storage - objects are paged out of the client nodes into the server; if the server heap fills up, objects are paged onto disk.
      • Lock Arbiter - To ensure that there is no possibility of the classic "split-brain" problem, Terracotta relies on the disk infrastructure to provide a lock.
      • Shared Storage - to transmit the object state from the active to passive, objects are persisted to disk, which then shares the state to the passive server(s).
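
    The virtual heap idea in the list above (keep hot objects in memory, page the rest out to disk, fault them back in on access) can be illustrated with a small single-process sketch. This is not Terracotta's implementation or API, and it is written in Python rather than Java purely for brevity. `SpillingStore`, its `capacity` parameter, and the pickle-file layout are assumptions made for the example, and keys are assumed to be strings that are safe to use as file names.

```python
import pickle
import tempfile
from collections import OrderedDict
from pathlib import Path


class SpillingStore:
    """Keeps at most `capacity` objects in memory; least recently used ones spill to disk."""

    def __init__(self, capacity, spill_dir):
        self.capacity = capacity
        self.spill_dir = Path(spill_dir)
        self.spill_dir.mkdir(parents=True, exist_ok=True)
        self.memory = OrderedDict()  # ordered from least to most recently used

    def _path(self, key):
        return self.spill_dir / f"{key}.pkl"

    def put(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)
        self._path(key).unlink(missing_ok=True)  # drop any stale disk copy
        while len(self.memory) > self.capacity:
            # Page the least recently used object out to disk.
            old_key, old_value = self.memory.popitem(last=False)
            self._path(old_key).write_bytes(pickle.dumps(old_value))

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        path = self._path(key)
        if not path.exists():
            raise KeyError(key)
        value = pickle.loads(path.read_bytes())
        self.put(key, value)  # fault the object back into memory
        return value


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = SpillingStore(capacity=2, spill_dir=tmp)
        for name in ("a", "b", "c"):
            store.put(name, {"name": name})
        print(sorted(p.name for p in Path(tmp).iterdir()))  # the spilled objects
        print(store.get("a"))  # read back from disk
```

    Note that this sketch relies on serialization to move objects to disk, whereas the article points out that Terracotta's field-level replication removes that requirement; the example only shows the paging behaviour.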

    JVM-level clustering can turn single-node, multi-threaded apps into distributed, multi-node apps, often with no code changes. This is possible by plugging in to the Java Memory Model in order to maintain key Java semantics of pass-by-reference, thread coordination and garbage collection across the cluster. Terracotta enables this using only declarative configuration with minimal impact to existing code and provides fine-grained field-level replication which means your objects no longer need to implement Java serialization.

    Ari Zilka, the founder and CTO of Terracotta, gave a video session organized by Skills Matter. In it he shows how Terracotta works and how you can start clustering your POJO-based Web applications (based on Spring, Struts, Wicket, RIFE, EHCache, Quartz, Lucene, DWR, Tomcat, JBoss, Jetty, Geronimo, etc.).

    Strategy: Limit The New, Not The Old

    Tue, 2008-08-12 19:34

    One of the most popular and effective scalability strategies is to impose limits (GAE Quotas, Fotolog, Facebook) as a means of protecting a website against service-destroying traffic spikes. Twitter will reportedly limit the number of followers to 2,000 in order to thwart follow spam. This may also allow Twitter to make some bank by going freemium and charging for adding more followers.

    Agree or disagree with Twitter's strategies, the more interesting aspect for me is how do you introduce new policies into an already established ecosystem?

    One approach is the big bang. Introduce all changes at once and let everyone adjust. If users don't like it they can move on. The hope is, however, most users won't be impacted by the changes and that those who are will understand it's all for the greater good of their beloved service. Casualties are assumed, but the damage will probably be minor.

    Now in Twitter's case the people with the most followers tend to be opinion leaders who shape much of the blognet echo chamber. Pissing these people off may not be your best bet.

    What to do? Shegeeks.net makes a great proposal: Limit The New, Not The Old. The idea is to only impose the limits on new accounts, not the old. Old people are happy and new people understand what they are getting into.
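
    In code, the proposal comes down to making the limit a function of when the account was created. The following is a minimal sketch, not anything Twitter is known to run: the `Account` class, the `POLICY_START` cutoff date, and the function names are all hypothetical, and only the 2,000 figure comes from the report above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

POLICY_START = date(2008, 8, 12)  # hypothetical date the new policy takes effect
NEW_ACCOUNT_LIMIT = 2000          # the reported follower limit


@dataclass
class Account:
    name: str
    created: date
    followers: int = 0


def follower_limit(account: Account) -> Optional[int]:
    """Return the limit that applies to this account, or None if it is grandfathered."""
    if account.created < POLICY_START:
        return None  # old accounts keep the old, unlimited rules
    return NEW_ACCOUNT_LIMIT


def can_add_follower(account: Account) -> bool:
    """True if the account may gain one more follower under the policy."""
    limit = follower_limit(account)
    return limit is None or account.followers < limit


if __name__ == "__main__":
    veteran = Account("veteran", date(2007, 3, 1), followers=50_000)
    newcomer = Account("newcomer", date(2008, 9, 1), followers=2000)
    print(can_add_follower(veteran))   # grandfathered account
    print(can_add_follower(newcomer))  # new account at the limit
```

    Keeping the cutoff in one place makes it easy to change the policy later, for example by raising the limit for paying accounts.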

    The reason I like this suggestion so much is that it has deep historical roots, all the way back to the fall of the Roman republic and the rise of the empire due to the agrarian reform laws passed in 133 BC. In ancient Rome property and power, as they tend to do, became concentrated in the hands of a few wealthy land owners. Let's call them the nobility. The greatness that was Rome was founded on an agrarian society. People made modest livings on small farms. As power concentrated, small farmers were kicked off the land and forced to move to the city. Slaves worked the land while citizens remained unemployed. And cities were no place to make a life. Civil strife broke out. Pliny said that "it was the large estates which destroyed Italy."


    Distributed Computing & Google Infrastructure

    Tue, 2008-08-12 12:27

    A couple of videos about distributed computing with direct reference to Google infrastructure.
    You will get acquainted with:

    --MapReduce, the software framework implemented by Google to support parallel computations over large (greater than 100 terabyte) data sets on commodity hardware
    --GFS and the way it stores its data in 64 MB chunks
    --Bigtable, which is the simple implementation of a non-relational database at Google
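
    For readers who have not seen the programming model before, here is a toy, single-process sketch of the map, shuffle, and reduce steps using the classic word count example, plus the arithmetic for splitting a file into 64 MB chunks. It is not Google's implementation: the real framework distributes these phases across many machines, and all function names here (`map_phase`, `shuffle`, `reduce_phase`, `word_count`, `chunk_count`) are made up for the illustration.

```python
from collections import defaultdict

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size mentioned above


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


def chunk_count(file_size_bytes, chunk_size=CHUNK_SIZE):
    """Number of fixed-size chunks needed to hold a file (ceiling division)."""
    return -(-file_size_bytes // chunk_size)


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog"]))
    print(chunk_count(200 * 1024 * 1024))  # chunks needed for a 200 MB file
```

    Because each `map_phase` call sees only one document and each `reduce_phase` call sees only one key, both phases can be spread over many workers, which is what makes the model suit commodity hardware.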

    Cluster Computing and MapReduce Lectures 1-5.

    All content on this site is distributed under a Creative Commons License; each individual author is responsible for their own posts.
